# questions
o
Does Kedro work with DVC or any Data Version Control solution?
j
There's an external plugin here: https://github.com/FactFiber/kedro-dvc. Could you give it a try and let us know if it still works?
o
Thanks @Juan Luis, I'm looking at the repo; it looks a bit scary
The use case is: how do you reuse datasets across Kedro runs in different experiments?
While Kedro dictates a very well-defined folder hierarchy, I wouldn't want to kill my disk space, and I would want to be able to bootstrap/execute a pipeline given an external data store.
Perhaps by implementing a concrete subclass of AbstractDataSet?
with a (process/function) call to DVC or some external data store to read/load the data
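A rough sketch of that idea, assuming Kedro's pre-0.19 `AbstractDataSet` API; the class name and the `dvc pull`/`dvc add` strategy are illustrative, not taken from the kedro-dvc plugin:

```python
import subprocess
from pathlib import Path
from typing import Any, Dict

import pandas as pd
from kedro.io import AbstractDataSet


class DvcTrackedParquetDataSet(AbstractDataSet):
    """Hypothetical dataset that materialises its file via DVC on demand."""

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self) -> pd.DataFrame:
        # If the file is not on disk, ask DVC to pull it from the remote cache.
        if not self._filepath.exists():
            subprocess.run(["dvc", "pull", str(self._filepath)], check=True)
        return pd.read_parquet(self._filepath)

    def _save(self, data: pd.DataFrame) -> None:
        data.to_parquet(self._filepath)
        # Register the new file contents with DVC.
        subprocess.run(["dvc", "add", str(self._filepath)], check=True)

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": str(self._filepath)}
```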
n
What do you want to reuse exactly?
Re: external data store, almost all datasets are implemented with fsspec, so they work on most object stores, e.g. S3, GCS, etc.
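For illustration, a minimal fsspec sketch (the bucket and path are made up): fsspec infers the filesystem from the URL protocol, which is why the same dataset code works on local disk, S3, GCS, and so on.

```python
import fsspec

# The "s3://" prefix selects the S3 filesystem; a plain path would use local disk.
with fsspec.open("s3://my-bucket/data/raw.parquet", "rb") as f:
    raw_bytes = f.read()
```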
o
I have an ETL process that transforms data and writes it to a Data Warehouse (DWH). Then I generate .parquet files (by running SQL queries against the DWH) and dump them to disk. These data sets will be used as input to a Kedro pipeline.
n
You can read these parquet files with ParquetDataSet already
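A minimal usage sketch; note the import path assumes Kedro before 0.19 (newer versions use `kedro_datasets.pandas.ParquetDataset`), and the filepath is illustrative:

```python
from kedro.extras.datasets.pandas import ParquetDataSet

# Point the dataset at the parquet dump produced from the DWH.
dataset = ParquetDataSet(filepath="data/01_raw/dwh_extract.parquet")
df = dataset.load()
```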
o
Nice 🙂 What if you need an abstraction layer on top of it? Say the file was archived / deleted (due to retention policy and disk space constraints), but you know (through metadata) how the data was generated.
So you could re-generate it (the .parquet) by running the SQL query.
So think of a LazyParquetDataSet class (possibly regenerate and then load from disk).
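A hedged sketch of what such a `LazyParquetDataSet` could look like; the constructor arguments (`sql`, `connection_factory`) are assumptions for illustration, not an existing Kedro API:

```python
from pathlib import Path
from typing import Any, Callable, Dict

import pandas as pd
from kedro.io import AbstractDataSet


class LazyParquetDataSet(AbstractDataSet):
    """Hypothetical: regenerate an archived/deleted file from its SQL recipe."""

    def __init__(self, filepath: str, sql: str, connection_factory: Callable):
        self._filepath = Path(filepath)
        self._sql = sql
        self._connection_factory = connection_factory  # returns a DB connection

    def _load(self) -> pd.DataFrame:
        if not self._filepath.exists():
            # The file fell victim to the retention policy: re-run the query.
            with self._connection_factory() as conn:
                df = pd.read_sql(self._sql, conn)
            self._filepath.parent.mkdir(parents=True, exist_ok=True)
            df.to_parquet(self._filepath)
        return pd.read_parquet(self._filepath)

    def _save(self, data: pd.DataFrame) -> None:
        data.to_parquet(self._filepath)

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": str(self._filepath), "sql": self._sql}
```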
n
that’s one possible solution. I have implemented a cache layer, where it will look for data locally, if not it will fetch from a remote storage instead (it also check with the metadata md5 to avoid unnecessary IO). You can do the same instead of fetching from a remote storage, you can re-trigger a SQL pipeline to generate that data. Then you need to be aware that part of your data pipeline is no longer within Kedro, and you may lost the ability to reproduce experiments. i.e. how do you know if your regenerated dataset doesn’t change?
Kedro offers some basic data versioning features. If you are using S3 or a similar object store, it usually has a built-in versioning feature already and you can take advantage of that (also for retention and clean-up).
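For reference, a sketch of Kedro's built-in dataset versioning via the Python API (pre-0.19 naming; the filepath is illustrative): each save lands in a timestamped subfolder under the filepath, and load picks the latest, or an explicitly pinned, version.

```python
from kedro.io import Version
from kedro.extras.datasets.pandas import ParquetDataSet

dataset = ParquetDataSet(
    filepath="data/01_raw/dwh_extract.parquet",
    version=Version(load=None, save=None),  # None = load latest / save new timestamp
)
```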
o
So checksumming (e.g. md5/sha256) the dataset is one way to achieve that, by tightly coupling the Kedro pipeline code with this hash.
But that feels like reinventing the wheel / reinventing DVC, and I don't like that
n
this isn’t necessary implemented on a pipeline level, you can mix-in this caching layer with datasets
o
What I actually want is to be able to:
1. Version data properly (think Git for data + a nice browser)
2. Integrate Kedro with this external data store, and couple a pipeline to a particular dataset
n
For 1, Kedro doesn't offer a Git-based versioning approach; you may need to try kedro-dvc or build a plugin to integrate Kedro with it.
o
Gotcha! What is the best strategy to implement something that addresses these two needs? Follow your design, or the S3DataSet class?
n
I think they are two separate problems. For connecting an external data store, it should work already: https://docs.kedro.org/en/stable/data/data_catalog.html#specify-the-location-of-the-dataset
For 1, I am not familiar with any Git-based data versioning tool; DVC is probably the only one doing this?
o
Gotcha, so I could specify to Kedro that the dataset is located outside my (Git) repository, right?
Through this:
• Local or Network File System: file:// - the local file system is the default in the absence of any protocol; it also permits relative paths.
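As an illustration, a Python-API sketch of pointing a catalog entry at data living outside the Git repository (the absolute path is made up, and the import assumes Kedro before 0.19):

```python
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import ParquetDataSet

catalog = DataCatalog(
    {
        # Absolute path outside the repo; no protocol means local filesystem.
        "dwh_extract": ParquetDataSet(
            filepath="/mnt/shared_datastore/exports/dwh_extract.parquet"
        )
    }
)
df = catalog.load("dwh_extract")
```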
@Nok Lam Chan what about the coupling and data integrity? Given an experiment that you ran last week on some dataset, how do you ensure that the data was not tampered with the next time you want to run it?
n
If data versioning is important to you, you should create versions for your input data too.
It's much simpler if you just don't change your data; when you have a new set of data, just create a new version of it.
And if you really need the guarantee, just lock the data and make sure no one can write/delete it.
But this falls outside of Kedro; Kedro isn't a data management system.
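As an illustration of that locking/versioning point (outside Kedro itself), a boto3 sketch that enables S3 bucket versioning; the bucket name is made up, and S3 Object Lock or IAM deny policies would be the stricter options:

```python
import boto3

s3 = boto3.client("s3")
# Once versioning is on, overwrites create new object versions instead of
# destroying the old data, and deletes leave a recoverable delete marker.
s3.put_bucket_versioning(
    Bucket="my-data-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)
```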