Hello I am using kedro with dvc for data version control The Kedro #questions

Rafał Nowak

06/14/2023, 4:49 PM

Hello, I am using kedro with dvc for data version control. The dvc setup is based on `gto`, which depends on `semver >= 3`. Unfortunately I cannot install `kedro-viz`, since `kedro-viz 6.3.0` depends on `semver < 3`. Is there any reason why `kedro-viz` is limited to `semver < 3`? The current `semver` release is `3.0.1`. Could anyone from the kedro-viz team relax this dependency limitation?
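A minimal sketch of why the two pins in the message above cannot be satisfied together. It uses only the standard library; the helper names and the sample version list are hypothetical and chosen for illustration, not taken from either package.

```python
# Illustrative only: the two constraints come from the message above,
# the helper names and candidate versions are assumptions.

def parse(version: str) -> tuple[int, ...]:
    """Turn '3.0.1' into (3, 0, 1) for a simple numeric comparison."""
    return tuple(int(part) for part in version.split("."))

def gto_allows(version: str) -> bool:
    # Constraint reported for gto: semver >= 3
    return parse(version) >= (3,)

def kedro_viz_allows(version: str) -> bool:
    # Constraint reported for kedro-viz 6.3.0: semver < 3
    return parse(version) < (3,)

# Sample candidate versions (assumed for the example).
candidates = ["2.13.0", "3.0.0", "3.0.1"]
compatible = [v for v in candidates if gto_allows(v) and kedro_viz_allows(v)]
print(compatible)  # empty: no version satisfies both pins
```

Because the two ranges do not overlap, a resolver has nothing to pick, which is why one of the pins would have to be relaxed.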

Nok Lam Chan

06/15/2023, 10:56 AM

Sorry for the late response. Any chance you know why they did this? https://github.com/iterative/gto/pull/348 looks like it’s just an automated bump, not necessarily breaking anything. Ccing @Rashida Kanchwala @Ravi Kumar Pilla: do you know why we pin `semver <3`?

Nok Lam Chan

06/15/2023, 10:57 AM

On the other hand, I am quite curious how you are using Kedro with DVC. Would you mind sharing a bit about your setup and workflow? Are you using Kedro versioning or just DVC? Cc: @Juan Luis

Rafał Nowak

06/15/2023, 11:01 AM

Hi @Nok Lam Chan, thanks for the reply. I am using dvc just to store the binary files so that they stay close to my repo. I could use Git LFS, but I prefer to version binary files with DVC. GTO is a very simple “dataset registry” for my case. It uses git tags to tag the artifact name + version, so I can quickly `git checkout` the proper revision (providing the artifact name + version), get the MD5 hash of the binary file (a dataset in a tar) and run `dvc pull`, which is pretty close to `git pull` 🙂 Everything has a nice CLI with `dvc`; see dvc.org. I do not know why they are using semver >= 3.

Rafał Nowak

06/15/2023, 11:02 AM

I wish I could use kedro’s versioned dataset, but actually `kedro` does not provide any dataset registry.

Juan Luis

06/15/2023, 11:03 AM

@Rafał Nowak dataset registry = being able to browse past versions of a dataset?

Nok Lam Chan

06/15/2023, 11:03 AM

What do you mean by “data registry”? What are you using it for?

Rafał Nowak

06/15/2023, 11:07 AM

For me a dataset registry is for:

1. Prepare the data file (possibly a binary file, for example a dataset created by a kedro pipeline -> stored locally on disk)
2. Send the data file to the remote storage
3. Register the data file somehow -> GTO adds a git tag to the most recent Git commit -> for example `dataset@v0.3.5`

`DVC` is able to get the registered artifact `dataset` given any version. So `dvc get dataset@v0.3.5` is cloning the repo into a temporary dir -> checking out the `dvc` hash -> downloading the proper file from remote storage -> moving the file to your current directory 🙂 DVC can work with storage such as S3, GCP, Google Drive 🙂 … and many more that I do not use.
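The retrieval step described above could be sketched roughly as follows. This is an illustration under assumptions: the tag layout `<artifact>@v<version>` is inferred from the `dataset@v0.3.5` example, the repository URL and file path are placeholders, and the helper names are hypothetical. It only builds a `dvc get <url> <path> --rev <tag>` command line and does not run it.

```python
import re

# Tag layout inferred from the example in the message: "<artifact>@v<version>".
TAG_PATTERN = re.compile(r"^(?P<name>[^@]+)@v(?P<version>\d+\.\d+\.\d+)$")

def parse_tag(tag: str) -> tuple[str, str]:
    """Split a tag like 'dataset@v0.3.5' into ('dataset', '0.3.5')."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not an '<artifact>@v<version>' tag: {tag!r}")
    return match["name"], match["version"]

def dvc_get_command(repo_url: str, path: str, tag: str) -> list[str]:
    """Build (but do not run) a 'dvc get' call pinned to the tagged revision."""
    parse_tag(tag)  # validate the tag before building the command
    return ["dvc", "get", repo_url, path, "--rev", tag]

# Placeholder repository URL and data path, assumed for the example.
command = dvc_get_command(
    "https://example.com/my-repo.git", "data/dataset.tar", "dataset@v0.3.5"
)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run` on a machine where `dvc` is installed and the remote storage is configured.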

Rafał Nowak

06/15/2023, 11:08 AM

so they are using