# questions
j
Hi, I am curious about the best practice for using Kedro. Currently, my application involves initializing a vector store and adding documents with their corresponding embeddings to it, which isn't something that fits neatly into a standalone function in nodes.py. The following is how the code is written.
```python
import chromadb


class VectorStore:
    def __init__(self, client_path, embedding_func) -> None:
        self.collections = None
        self.client = chromadb.PersistentClient(path=client_path)
        self.embedding_func = embedding_func

    def create_collections(self, collection_name):
        # embedding_function must be passed by keyword; a positional second
        # argument would be interpreted as the collection metadata
        self.collections = self.client.create_collection(
            collection_name, embedding_function=self.embedding_func
        )
        return self.collections

    def add_docs(self, collections, embeddings, metadatas, ids):
        collections.add(
            embeddings=embeddings,
            metadatas=metadatas,
            ids=ids,
        )
```
However, putting this inside nodes.py doesn't seem ideal, because I still have other classes (like a model class), and I believe mixing everything inside nodes is an anti-pattern. But writing a standalone wrapper function in nodes.py like the one below seems redundant.
```python
def create_collections(collections, collection_name):
    return collections.create_collections(collection_name)
```
So my question is: what is the best way to separate classes and nodes while avoiding code redundancy?
d
This feels like a hook or dataset rather than within your pipeline logic
See spark or dask docs for inspiration?
j
so cool to see people using vector stores with Kedro @jackson 🙌🏼 as @datajoely says, I'd take the initialization of the vector store to a plugin using some of the Kedro hooks, for example `after_context_created` or `after_catalog_created`: https://docs.kedro.org/en/stable/hooks/introduction.html
I see you're using Chroma, right? Any thoughts on how it compares to, say, Weaviate or others?
j
Thanks for the answer! @datajoely @Juan Luis. I appreciate the suggestion; I hadn't yet considered integrating it with Kedro hooks. Will definitely give it a try. As for the vector database, I'm currently using Chroma, but I believe the choice depends heavily on the specific requirements and scale of your application. In our situation, we were looking for something open-source, easy to grasp, and able to be self-hosted. This steered us away from certain options like Pinecone. We've experimented with Chroma, Weaviate, and pgvector in conjunction with PostgreSQL. Both Chroma and Weaviate have proven capable of meeting our needs (in fact we use both, for different projects). However, we noticed that non-dedicated vector databases like pgvector and Elasticsearch didn't perform as well in search operations when compared to Chroma and Weaviate.
d
This is a super cool space for all Kedro users, so please shout if you have any questions we’ll be super keen to help you think through this 🙂
n
+1 for hooks/datasets, though I'm not 100% sure where each function should go. In general, connection setup should go in hooks, and `add_docs` could be the analog of a dataset's `save` method.