# questions
Does Kedro work with DVC or any Data Version Control solution?
There's an external plugin here; could you give it a try and let us know if it still works?
Thanks @Juan Luis, I’m looking at the repo, looks a bit scary
The use case is: How do you reuse datasets with Kedro runs across different experiments?
While Kedro dictates a very well defined folder hierarchy, I wouldn’t want to kill my disk space, and I would want to be able to bootstrap / execute a pipeline given an external data store.
Perhaps implementing a concrete subclass of AbstractDataSet,
with a (process/function) call to DVC or some external data store to read/load the data?
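That idea could be sketched roughly like this. The class name, paths, and the decision to shell out to the `dvc` CLI are all illustrative assumptions; in a real project this would subclass `kedro.io.AbstractDataSet`, but it is kept dependency-free here:

```python
import subprocess
from pathlib import Path


class DVCTrackedDataSet:
    """Hypothetical dataset wrapper that delegates fetching to the DVC CLI.

    In a real Kedro project this would subclass kedro.io.AbstractDataSet
    and implement _load/_save/_describe; kept plain here for illustration.
    """

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self) -> bytes:
        # If the tracked file is missing locally, pull it from the DVC remote.
        if not self._filepath.exists():
            subprocess.run(["dvc", "pull", str(self._filepath)], check=True)
        return self._filepath.read_bytes()

    def _save(self, data: bytes) -> None:
        self._filepath.write_bytes(data)
        # Record the new version with DVC.
        subprocess.run(["dvc", "add", str(self._filepath)], check=True)

    def _describe(self) -> dict:
        return {"filepath": str(self._filepath)}
```

Registering such a class in the catalog would let nodes consume DVC-tracked files like any other dataset, while DVC remains the source of truth for versions.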
What do you want to reuse exactly?
Re: external data stores, almost all datasets are implemented with a common filesystem abstraction layer, so they work on most object stores, e.g. S3, GCS, etc.
I have an ETL process that transforms data and writes it to a Data Warehouse (DWH). Then I generate .parquet files (by running SQL queries against the DWH) and dump them to disk. These data sets will be used as input to a Kedro pipeline.
You can read these parquet files with one of the built-in datasets.
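For example, a catalog.yml entry could look like this (assuming the `pandas.ParquetDataSet` that ships with Kedro; the dataset name and path are illustrative):

```yaml
dwh_extract:
  type: pandas.ParquetDataSet
  filepath: data/01_raw/dwh_extract.parquet
```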
Nice 🙂 What if you need an abstraction layer on top of it? Say the file was archived / deleted (due to retention policy and disk space constraints), but you know (through metadata) how the data was generated.
So you could re-generate it (the .parquet) by running the SQL query.
So think of a dataset class that could possibly regenerate the data and then load it from disk.
That's one possible solution. I have implemented a cache layer: it looks for the data locally, and if it's missing it fetches from remote storage instead (it also checks the metadata MD5 to avoid unnecessary IO). You can do the same, except instead of fetching from remote storage, you re-trigger a SQL pipeline to generate that data. Then you need to be aware that part of your data pipeline is no longer within Kedro, and you may lose the ability to reproduce experiments, i.e. how do you know your regenerated dataset hasn't changed?
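A minimal sketch of that caching pattern, using only the standard library. The function names and the bytes-based interface are assumptions for illustration; `regenerate` stands in for whatever re-runs the SQL query or fetches from remote storage:

```python
import hashlib
from pathlib import Path


def md5sum(path: Path) -> str:
    """Hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_with_cache(local_path, expected_md5, regenerate):
    """Return the data from the local cache if its MD5 matches the recorded
    metadata; otherwise call `regenerate` (e.g. re-run the SQL extract)
    and refresh the cache."""
    local_path = Path(local_path)
    if local_path.exists() and md5sum(local_path) == expected_md5:
        return local_path.read_bytes()  # cache hit: skip the expensive fetch
    data = regenerate()
    local_path.write_bytes(data)
    return data
```

Note this sketch does not verify that the *regenerated* data still matches `expected_md5`, which is exactly the reproducibility gap mentioned above.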
Kedro offers a basic data versioning feature. If you are using S3 or a similar object store, it usually has a built-in versioning feature already, and you can take advantage of that (also for retention and clean-up).
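Kedro's built-in versioning is enabled per dataset in the catalog with the `versioned` flag; each save then goes to a new timestamped location under the filepath (entry names here are illustrative):

```yaml
model_input:
  type: pandas.ParquetDataSet
  filepath: data/05_model_input/model_input.parquet
  versioned: true
```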
So checksumming the dataset (e.g. MD5 / SHA-256) is one way to achieve that, by tightly coupling the Kedro pipeline code to this hash.
But this feels like reinventing the wheel / reinventing DVC and I don’t like that
This isn't necessarily implemented at the pipeline level; you can mix in this caching layer with datasets.
What I actually want is to be able to:
1. Version data properly (think Git for data + a nice browser)
2. Integrate Kedro with this external data store, and couple a pipeline to a particular dataset
For 1., Kedro doesn't offer a Git-based versioning approach; you may need to try an external tool
or build a plugin to integrate Kedro with it.
Gotcha! What is the best strategy to implement something that addresses these 2 needs? Follow your design or the S3DataSet class?
I think they are two separate problems; for connecting an external data store, it should work already.
For 1., I am not familiar with any Git-based data versioning tool; DVC is probably the only one doing this?
Gotcha, so I could specify to Kedro that the dataset is located outside my (Git) repository, right?
Through this:
• Local or Network File System: the local file system is the default in the absence of any protocol; it also permits relative paths.
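So the same dataset type can point at local or remote storage just by changing the protocol in `filepath` (bucket and credentials names below are illustrative):

```yaml
local_extract:
  type: pandas.ParquetDataSet
  filepath: data/01_raw/extract.parquet        # no protocol -> local, relative path allowed

remote_extract:
  type: pandas.ParquetDataSet
  filepath: s3://my-bucket/extracts/extract.parquet
  credentials: dev_s3
```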
@Nok Lam Chan what about the coupling and data integrity? Given an experiment that you ran last week on some dataset, how do you ensure that the data was not tampered with the next time you want to run it?
If data versioning is important to you, you should create versions for your input data too.
It's much simpler: you just don't change your data; if you have a new set of data, just create a new version of it.
And if you really need the guarantee, just lock the data to make sure no one can write to or delete it.
But this falls outside of Kedro; Kedro isn't a data management system.