# questions
j
Hi, I am curious about the best practice for using Kedro. Currently, my application involves initializing a vector store and adding documents with their corresponding embeddings to it, which isn't something that fits neatly into a standalone function in nodes.py. The following is how the code is written.
```python
import chromadb


class VectorStore:
    def __init__(self, client_path, embedding_func) -> None:
        self.collections = None
        self.client = chromadb.PersistentClient(path=client_path)
        self.embedding_func = embedding_func

    def create_collections(self, collection_name):
        # embedding_function must be passed by keyword; a positional second
        # argument would be interpreted as the collection metadata
        self.collections = self.client.create_collection(
            collection_name, embedding_function=self.embedding_func
        )
        return self.collections

    def add_docs(self, collections, embeddings, metadatas, ids):
        collections.add(
            embeddings=embeddings,
            metadatas=metadatas,
            ids=ids,
        )
```
However, putting this inside nodes.py doesn't seem ideal, because I still have other classes (like a model class), and I believe mixing everything inside nodes is an anti-pattern. But writing a standalone wrapper function in nodes.py like the one below seems redundant.
```python
def create_collections(collections, collection_name):
    return collections.create_collections(collection_name)
```
So my question is: what is the best way to separate classes and nodes while avoiding code redundancy?
d
This feels like a hook or dataset rather than within your pipeline logic
See spark or dask docs for inspiration?
j
so cool to see people using vector stores with Kedro @jackson 🙌🏼 as @datajoely says, I'd take the initialization of the vector store to a plugin using some of the Kedro hooks, for example `after_context_created` or `after_catalog_created`: https://docs.kedro.org/en/stable/hooks/introduction.html
I see you're using Chroma, right? Any thoughts on how it compares to, say, Weaviate or others?
j
Thanks for the answer! @datajoely @Juan Luis. I appreciate the suggestion; I hadn't yet considered integrating it with Kedro hooks. Will definitely give it a try. As for the vector database, I'm currently using Chroma, but I believe the choice depends heavily on the specific requirements and scale of your application. In our situation, we were looking for something open-source, easy to grasp, and able to be self-hosted. This steered us away from certain options like Pinecone. We've experimented with Chroma, Weaviate, and pgvector in conjunction with PostgreSQL. Both Chroma and Weaviate have proven capable of meeting our needs (in fact we use both, for different projects). However, we noticed that non-dedicated vector databases like pgvector and Elasticsearch didn't perform as well in search operations when compared to Chroma and Weaviate.
d
This is a super cool space for all Kedro users, so please shout if you have any questions we’ll be super keen to help you think through this 🙂
n
+1 for hooks/datasets, though I'm not 100% sure where each function should go. In general, connection setup should go in hooks, and `add_docs` could be the analog of a dataset's `save` method.