# questions
r
Hello, I am using Kedro with DVC for data version control. DVC is based on `gto`, which depends on `semver >= 3`. Unfortunately, I cannot install `kedro-viz`, since `kedro-viz 6.3.0` depends on `semver < 3`. Is there any reason why `kedro-viz` is limited to `semver < 3`? The current `semver` release is `3.0.1`. Could anyone from the kedro-viz team relax this dependency constraint?
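To make the conflict concrete, here is a minimal sketch in plain Python (standard library only) that checks a few `semver` release numbers against both constraints. The candidate list and the helper names `parse` and `satisfies` are illustrative assumptions; this is not how pip resolves dependencies, just a demonstration that the two ranges do not overlap.

```python
# Illustrative only: shows that no version can satisfy both pins at once.

def parse(version: str) -> tuple:
    """Turn '3.0.1' or '3' into a 3-part tuple of ints, padding with zeros."""
    parts = [int(p) for p in version.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)

def satisfies(version: str, op: str, bound: str) -> bool:
    """Check one constraint; only the two operators from the thread are handled."""
    v, b = parse(version), parse(bound)
    if op == ">=":
        return v >= b
    if op == "<":
        return v < b
    raise ValueError(f"unsupported operator: {op}")

gto_requirement = (">=", "3")        # from the thread: gto needs semver >= 3
kedro_viz_requirement = ("<", "3")   # from the thread: kedro-viz 6.3.0 needs semver < 3

# Hypothetical candidate list, chosen for illustration.
candidates = ["2.13.0", "3.0.0", "3.0.1"]

compatible = [
    v for v in candidates
    if satisfies(v, *gto_requirement) and satisfies(v, *kedro_viz_requirement)
]
print(compatible)
```

The list of compatible versions stays empty for any candidate, since a version cannot be both at least 3.0.0 and below 3.0.0.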
n
Sorry for the late response. Any chance you know why they did this? https://github.com/iterative/gto/pull/348 looks like just an automated bump that does not necessarily break anything. Cc'ing @Rashida Kanchwala @Ravi Kumar Pilla: do you know why we pin `semver < 3`?
On the other hand, I am quite curious how you are using Kedro with DVC. Would you mind sharing a bit about your setup and workflow? Are you using Kedro versioning or just DVC? Cc: @Juan Luis
r
Hi @Nok Lam Chan, thanks for the reply. I am using DVC just to store the binary files so that they stay close to my repo. I could use Git LFS, but I prefer to version binary files with DVC. GTO is a very simple “dataset registry” for my case. It uses git tags to label artifacts as name + version, so given the artifact name and version I can quickly `git checkout` the proper revision, get the MD5 hash of the binary file (a dataset in a tar archive), and run `dvc pull`, which is pretty close to `git pull` 🙂 Everything has a nice CLI with `dvc`; see dvc.org. I do not know why they are using `semver >= 3`.
I wish I could use Kedro’s versioned datasets, but `kedro` does not actually provide any dataset registry.
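The fetch-by-tag workflow described above could be scripted roughly as follows. This is a sketch only: the helper name `fetch_commands` is hypothetical, the tag format `name@vX.Y.Z` is taken from the example later in the thread, and the commands are printed rather than executed.

```python
import shlex

def fetch_commands(artifact: str, version: str) -> list:
    """Build the two commands from the workflow: check out the tagged
    revision, then let DVC download the files that revision points to."""
    tag = f"{artifact}@v{version}"  # assumed tag format, e.g. dataset@v0.3.5
    return [
        ["git", "checkout", tag],
        ["dvc", "pull"],
    ]

# Print the commands instead of running them; pass each list to
# subprocess.run(cmd, check=True) inside a real repository to execute.
for cmd in fetch_commands("dataset", "0.3.5"):
    print(shlex.join(cmd))
```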
j
@Rafał Nowak dataset registry = being able to browse past versions of a dataset?
n
What do you mean by “data registry”? What are you using it for?
r
For me, a dataset registry is for:
1. Preparing a data file (possibly a binary file, for example a dataset created by a kedro pipeline and stored locally on disk).
2. Sending the data file to the remote storage.
3. Registering the data file somehow. GTO adds a git tag to the most recent git commit, for example `dataset@v0.3.5`.
DVC is able to get the registered artifact `dataset` for any given version. So `dvc get dataset@v0.3.5` clones the repo into a temporary dir, checks out the `dvc` hash, downloads the proper file from remote storage, and moves the file to your current directory 🙂 DVC can work with storage such as S3, GCP, Google Drive 🙂 … and many more that I do not use.
So they are using `semver`, for example, to create nice version names 😉 like `X.Y.Z-alpha.5`, and to compare them, bump minor, bump patch, bump major, etc.
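As a rough illustration of what version handling on tags like `dataset@v0.3.5` involves, here is a standard-library sketch that splits a tag into name and version, compares versions using Semantic Versioning precedence rules, and bumps them. The `Version` class and `split_tag` helper are hypothetical names written for this example; they are not the API of the `semver` package or of GTO.

```python
import re
from functools import total_ordering

TAG_RE = re.compile(r"^(?P<name>[^@]+)@v(?P<version>.+)$")
VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?$")

def split_tag(tag: str) -> tuple:
    """Split 'dataset@v0.3.5' into ('dataset', '0.3.5')."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a name@vX.Y.Z tag: {tag!r}")
    return match["name"], match["version"]

@total_ordering
class Version:
    def __init__(self, major, minor, patch, prerelease=()):
        self.major, self.minor, self.patch = major, minor, patch
        self.prerelease = tuple(prerelease)

    @classmethod
    def parse(cls, text: str) -> "Version":
        match = VERSION_RE.match(text)
        if match is None:
            raise ValueError(f"not a semantic version: {text!r}")
        major, minor, patch, pre = match.groups()
        return cls(int(major), int(minor), int(patch), pre.split(".") if pre else ())

    def _key(self):
        # Numeric pre-release identifiers sort below alphanumeric ones,
        # and a version with a pre-release sorts below the plain release.
        pre = tuple((0, int(p), "") if p.isdigit() else (1, 0, p) for p in self.prerelease)
        return (self.major, self.minor, self.patch, 0 if self.prerelease else 1, pre)

    def __eq__(self, other):
        return self._key() == other._key()

    def __lt__(self, other):
        return self._key() < other._key()

    # Each bump drops any pre-release part (a simplification for this sketch).
    def bump_major(self):
        return Version(self.major + 1, 0, 0)

    def bump_minor(self):
        return Version(self.major, self.minor + 1, 0)

    def bump_patch(self):
        return Version(self.major, self.minor, self.patch + 1)

    def __str__(self):
        base = f"{self.major}.{self.minor}.{self.patch}"
        return base + ("-" + ".".join(self.prerelease) if self.prerelease else "")

name, version_text = split_tag("dataset@v0.3.5")
current = Version.parse(version_text)
print(f"{name}@v{current.bump_patch()}")  # next patch tag for the same artifact
```

In practice the `semver` package already provides parsing, comparison, and bumping, which is presumably why GTO depends on it; the sketch only shows the rules involved.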
j
Thanks a lot for the insight, @Rafał Nowak. It's not the first time someone has used DVC alongside Kedro: https://kedro-org.slack.com/archives/C03RKP2LW64/p1683296725893669?thread_ts=1683296578.127849&cid=C03RKP2LW64 We should definitely look more into this.