# questions
Looking for ideas for a kedronic implementation of a vector store using LangChain, specifically with the FAISS implementation. I'm expecting the `_load` method to be pretty simple. Something like:
```python
# Inside my class FAISSDataSet:
def _load(self) -> FAISS:
    return FAISS.load_local(self.filepath, self.embeddings)
```
I have another dataset to handle loading the embeddings (see thread). Specifically the `OpenAIEmbeddings` class in langchain. However, I'm not exactly sure how I'd get this into my hypothetical `FAISSDataSet`. Ideally, I could version control the vector store.
• `OpenAIEmbeddings` can be pickled, but my credentials are going to be changing constantly (and it feels icky to pickle something with secrets in it).
• So I'm considering taking in the API credentials to both datasets.
• Then I can do something hacky like save everything but the credentials of the embedding inside `_save` and then swap them in on a `_load`, reusing `PickleDataSet` to handle versioning.
Maybe I'm missing something though. Does this seem logical?
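The strip-and-swap idea can be sketched with plain pickle and a stand-in class, independent of Kedro and LangChain (`DummyEmbeddings` and both helper functions are hypothetical names for illustration, not from any library):

```python
import pickle


class DummyEmbeddings:
    """Stand-in for OpenAIEmbeddings: holds a secret plus harmless config."""

    def __init__(self, openai_api_key: str, model: str):
        self.openai_api_key = openai_api_key
        self.model = model


def save_without_secrets(obj: DummyEmbeddings) -> bytes:
    # Blank out the credential before pickling so it never hits disk.
    key, obj.openai_api_key = obj.openai_api_key, ""
    try:
        return pickle.dumps(obj)
    finally:
        obj.openai_api_key = key  # restore the in-memory copy


def load_with_secrets(blob: bytes, openai_api_key: str) -> DummyEmbeddings:
    # Swap the *current* credential back in at load time.
    obj = pickle.loads(blob)
    obj.openai_api_key = openai_api_key
    return obj


emb = DummyEmbeddings(openai_api_key="sk-secret", model="text-embedding-ada-002")
blob = save_without_secrets(emb)
assert b"sk-secret" not in blob  # the secret was never serialized
restored = load_with_secrets(blob, "sk-new-key")
print(restored.model, restored.openai_api_key)  # text-embedding-ada-002 sk-new-key
```

This is the same contract the datasets below implement: the pickle on disk carries no secrets, and whatever credentials the runtime currently has get injected on load.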
Here's my `OpenAIEmbeddingsDataSet`, for those who are interested.
```python
from typing import Any, Dict, NoReturn

from kedro.io import AbstractDataSet
from kedro.io.core import DatasetError  # `DataSetError` on older Kedro versions
from langchain.embeddings import OpenAIEmbeddings


class OpenAIEmbeddingsDataSet(AbstractDataSet[None, OpenAIEmbeddings]):
    """OpenAI Embeddings dataset.

    Must be a dataset to access credentials at runtime.
    """

    def __init__(self, credentials: Dict[str, str], **kwargs):
        """
        Args:
            credentials: must contain `openai_api_base` and `openai_api_key`.
            **kwargs: keyword arguments passed to the `OpenAIEmbeddings` class.
        """
        self.openai_api_base = credentials["openai_api_base"]
        self.openai_api_key = credentials["openai_api_key"]
        self.kwargs = kwargs

    def _describe(self) -> dict[str, Any]:
        return {**self.kwargs}

    def _save(self, data: None) -> NoReturn:
        raise DatasetError(f"{self.__class__.__name__} is a read-only data set type")

    def _load(self) -> OpenAIEmbeddings:
        return OpenAIEmbeddings(
            openai_api_base=self.openai_api_base,
            openai_api_key=self.openai_api_key,
            **self.kwargs,
        )
```
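For context, a hypothetical sketch of how the two datasets might be wired into a Kedro catalog (the module path `my_project.datasets` and all file paths are illustrative, not from the thread). Note that the `credentials` key on `PickleDataSet` is normally meant for filesystem credentials, so routing OpenAI keys through it is part of the hack:

```yaml
# conf/base/catalog.yml -- hypothetical wiring
openai_embeddings:
  type: my_project.datasets.OpenAIEmbeddingsDataSet
  credentials: openai  # resolved from conf/local/credentials.yml
  model: text-embedding-ada-002

faiss_store:
  type: my_project.datasets.FAISSDataSet
  filepath: data/06_models/faiss_store.pkl
  credentials: openai
  versioned: true

# conf/local/credentials.yml (kept out of version control)
# openai:
#   openai_api_base: https://api.openai.com/v1
#   openai_api_key: <your-key>
```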
If anyone wants my dumb hacky solution, here it is:
```python
from kedro.extras.datasets.pickle import PickleDataSet
from langchain.vectorstores import FAISS


class FAISSDataSet(PickleDataSet):
    """Saves and loads a FAISS vector store."""

    # TODO: find a better way to do this dataset.
    #   - Using anything but an API will also serialize the embedding model,
    #     which will be too big.

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        self.openai_api_base = kwargs["credentials"]["openai_api_base"]
        self.openai_api_key = kwargs["credentials"]["openai_api_key"]

    def _load(self) -> FAISS:
        faiss = super()._load()

        # TODO: this assumes we're using an OpenAIEmbeddings embedding function.
        faiss.embedding_function.__self__.openai_api_base = self.openai_api_base
        faiss.embedding_function.__self__.openai_api_key = self.openai_api_key

        return faiss
```
See the TODOs as well.
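The `__self__` trick in `_load` works because the store holds a *bound method* of the embeddings object, and a bound method's `__self__` is the instance it is bound to, so mutating it rewrites the credentials the store will use. A minimal stand-alone illustration (both classes are stand-ins, not the real LangChain types):

```python
class Embeddings:
    """Stand-in for OpenAIEmbeddings (illustrative only)."""

    def __init__(self, openai_api_key: str):
        self.openai_api_key = openai_api_key

    def embed_query(self, text: str) -> list:
        return [float(len(text))]  # dummy vector


class Store:
    """Stand-in for FAISS: keeps only the bound method, not the object."""

    def __init__(self, embedding_function):
        self.embedding_function = embedding_function


store = Store(Embeddings("stale-key").embed_query)

# A bound method's __self__ is the Embeddings instance, so this updates the
# credential the store's embedding_function will actually use:
store.embedding_function.__self__.openai_api_key = "fresh-key"
print(store.embedding_function.__self__.openai_api_key)  # fresh-key
```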
Great to see others working with LLM tooling! I hadn't even thought of using datasets for my vector embeddings. I use LlamaIndex and was treating those resources outside of Kedro functionality. I'm still learning all this stuff... I was learning how to use chromadb, so I just call them directly in my code.