Has anyone incorporated an LLM pipeline in a kedro...
# questions
e
Has anyone incorporated an LLM pipeline in a Kedro project yet? I'd like to try using OpenAI to process a collection of JSON documents, and I'd love to see a working example or hear about any lessons learned. Thanks kindly!
y
Talk to @Debanjan Banerjee 😄
e
Ok, thank you. Hopefully he looks here 😛
d
Hey Emilio, sorry, I don't check this channel often unless I'm mentioned. Let me know what you want to do and we can discuss how we are doing this! 🙂
e
Thank you kindly @Debanjan Banerjee. What we're looking to accomplish is to use a self-hosted LLM that can parse public posts on a few RSS feeds/Google Groups, do some minor feature engineering, and then produce a report based on the latest posts. Some posts are pure questions, some are answers, and some are just salutations like "thank you". We'd like to classify the posts and update the metadata about each post. So far, I've got all the posts as JSON documents in various Mongo collections. Each day, we want to generate a report based on the latest posts we've ingested, highlight particular posts (based on date or a new answer), and then run similarity searches for previously ingested posts that might be relevant. This is my first Kedro project, so I'm just learning how all this works. I have built a couple of retrieval-augmented PoCs using a notebook and a self-hosted GPT4All model, but nothing I can put into a Kedro pipeline. Thanks so much for any wisdom you can share!
d
Hey @Emilio Gagliardi, thanks! Here is what we have done:
credentials.yml
• OpenAI API key
parameters.yml
• set limits on tokens, max tokens per response, etc.
• set limits on timeouts, etc.
node.py
• create a node where the OpenAI API is called in batches
• create a prompt
• set up guardrails (these can move to utils to make it more modular)
• load credentials using ConfigLoader
• invoke the API like below
• run batches
So, long story short, we are using the node abstraction to invoke the API in batches. Surely there are better ways to do this, but we are invoking the below and it has worked well for us.
Apologies, I cannot share the prompt with you, as it is highly specific to what we are trying to do, but feel free to ask questions on this thread for us and the community to answer 🙂
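For readers following along, here is a minimal sketch of the node layout described in this message. It is not the code from the thread, which was not shared. The function names, the prompt text, the label set, and the `batch_size` / `max_tokens` parameter keys are all hypothetical, and the real OpenAI call is replaced by a `complete` callable so the sketch runs offline.

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical label set for the post-classification use case.
ALLOWED_LABELS = {"question", "answer", "salutation"}


def batched(items: List[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_prompt(doc: dict) -> str:
    # Hypothetical prompt; the real prompt in the thread was not shared.
    return (
        "Classify this post as question, answer, or salutation:\n"
        f"{doc['text']}"
    )


def guard(raw: str) -> str:
    """Simple guardrail: accept only known labels, otherwise 'unknown'."""
    label = raw.strip().lower()
    return label if label in ALLOWED_LABELS else "unknown"


def classify_posts(
    docs: List[dict],
    params: Dict[str, int],
    complete: Callable[[str, int], str],
) -> List[dict]:
    """Node-style function: documents and parameters in, labelled documents out.

    `complete(prompt, max_tokens)` is a stand-in for the real API call.
    """
    results = []
    for batch in batched(docs, params["batch_size"]):
        for doc in batch:
            raw = complete(build_prompt(doc), params["max_tokens"])
            results.append({**doc, "label": guard(raw)})
    return results


if __name__ == "__main__":
    # Fake model so the example runs without network access or an API key.
    def fake_complete(prompt: str, max_tokens: int) -> str:
        return " Question " if prompt.rstrip().endswith("?") else "n/a"

    posts = [{"text": "How do I add a node?"}, {"text": "thank you"}]
    print(classify_posts(posts, {"batch_size": 1, "max_tokens": 5}, fake_complete))
```

In a real project the limits would come from parameters.yml and `complete` would wrap the actual client call; keeping it as an argument here just makes the node easy to test without the API.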
e
Thank you so much.
1. How do you handle prompt templates? They're one piece I don't have any experience with.
2. Why do you have to call ConfigLoader manually?
3. Are you using any framework like LangChain or LlamaIndex?
d
Hey Emilio
• Prompt templates: not using any prompt templates for now. We were looking into DataSette, but it turned out to be killing a mosquito with a sword 🙂. We have hard-coded our prompts after robust testing in the ChatGPT interface.
• You can either set your API key as an env variable, which openai reads implicitly, or use Kedro's credentials.yml. I personally don't trust the implicit route much, as it feels like less control in my hands, so I use credentials.yml. It's at the dev's discretion; no concerns from the OpenAI API side.
• We are exploring LangChain. In the product I'm using this in right now we have a single OpenAI use case, but we are very swiftly moving to about 7 GenAI use cases 😛 so managing them via LangChain will save us a lot of effort, I think. So yes, it's on the radar.
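To make the first two points concrete, here is a small sketch, not taken from the thread, of a hard-coded prompt plus an explicit API key lookup that prefers a credentials dict and only falls back to the environment. The `openai` / `api_key` layout of the credentials dict and the prompt wording are assumptions; `OPENAI_API_KEY` is the environment variable the openai library reads by default.

```python
import os
from typing import Mapping, Optional

# Hypothetical hard-coded prompt, kept as a plain string instead of a template library.
PROMPT = (
    "Classify the following post as question, answer, or salutation.\n"
    "Post: {post_text}"
)


def render_prompt(post_text: str) -> str:
    """Fill the hard-coded prompt with one post's text."""
    return PROMPT.format(post_text=post_text)


def resolve_api_key(
    credentials: Mapping[str, Mapping[str, str]],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Prefer the key from the credentials dict; fall back to the environment.

    Assumes credentials look like {"openai": {"api_key": "..."}}.
    """
    env = os.environ if env is None else env
    key = (credentials.get("openai") or {}).get("api_key") or env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("No OpenAI API key in credentials or OPENAI_API_KEY")
    return key


if __name__ == "__main__":
    creds = {"openai": {"api_key": "sk-example"}}  # placeholder, not a real key
    print(resolve_api_key(creds, env={}))
    print(render_prompt("thank you"))
```

Passing the key explicitly keeps the choice visible in the project's configuration, which is the "more control" trade-off described above.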