Is there some flag in `kedro viz` to disable dataset checking? Kedro #questions

marrrcin

12/18/2023, 9:48 AM

Is there some flag in `kedro viz` to disable dataset checking? I just want to see the pipeline structure of a project, but the project itself has data catalog entries pointing to directories / files that I don't have access to locally (in S3). Right now it fails on the `.exists()` call in the datasets.

Nok Lam Chan

12/18/2023, 9:57 AM

Are you on the latest version of viz? I don't recall it checking datasets

marrrcin

12/18/2023, 9:58 AM

Yes, 6.7.0

marrrcin

12/18/2023, 9:59 AM

It goes into `kedro_viz.server.populate_data` -> `data_access_manager.resolve_dataset_factory_patterns` -> `catalog.exists(dataset_name)`

Simon Wolf

12/18/2023, 9:59 AM

Hey, I had the same problem, so I would also be interested. In some other cases the data check took too long and kedro-viz was cancelled after a 60s timeout. In that case it would also be good to be able to deactivate the check. My interim solution was my own dataset class in which I adapted the `_exists` function...
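A minimal sketch of the kind of workaround Simon describes, assuming a custom dataset that overrides `_exists`. The classes `BaseDataset`, `RemoteTableDataset` and `NoCheckRemoteTableDataset`, as well as the example path, are hypothetical stand-ins so the snippet runs without Kedro installed; in a real project the subclass would extend whichever Kedro dataset class the catalog entry uses, whose public `exists()` delegates to `_exists()`.

```python
class BaseDataset:
    """Hypothetical stand-in for a Kedro dataset: exists() delegates to _exists()."""

    def exists(self) -> bool:
        return self._exists()

    def _exists(self) -> bool:
        raise NotImplementedError


class RemoteTableDataset(BaseDataset):
    """Hypothetical dataset whose real existence check needs remote access."""

    def __init__(self, path: str) -> None:
        self.path = path

    def _exists(self) -> bool:
        # Stands in for a call that fails (or hangs) without access to remote storage.
        raise PermissionError(f"no access to {self.path}")


class NoCheckRemoteTableDataset(RemoteTableDataset):
    """Same dataset, but the existence check never touches remote storage."""

    def _exists(self) -> bool:
        # Assumption: reporting True is acceptable when only the pipeline
        # structure is needed and the data itself is never loaded.
        return True


dataset = NoCheckRemoteTableDataset("s3://example-bucket/table")
print(dataset.exists())
```

The trade-off is that the catalog entry has to point at the custom class, so this is a per-dataset workaround rather than a global switch.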

🙈 1

marrrcin

12/18/2023, 10:01 AM

I imagine sth like `kedro viz --skip-datasets` would be nice

👍 1

Simon Wolf

12/18/2023, 10:06 AM

Another interim solution would be to go back to kedro-viz version 6.6.1. If I remember correctly, the data check did not yet exist in that version.

👍🏼 1

marrrcin

12/18/2023, 10:08 AM

That’s actually super convenient, thanks @Simon Wolf

marrrcin

12/18/2023, 10:13 AM

FYI @Nero Okwa

Nero Okwa

12/18/2023, 10:42 AM

Interesting, I don't recall this either. FYI @Rashida Kanchwala

Nero Okwa

12/18/2023, 10:44 AM

@Simon Wolf please provide more context on the other cases where the data check took too long and Kedro-Viz was cancelled.

Simon Wolf

12/18/2023, 11:05 AM

I use a custom PartitionedDataset to load tables from Databricks Hive storage. In this case the `_list_partitions` function takes a while because there are many tables in the database: executing the Spark function to list all tables is slow, probably also because the tables are distributed across the cluster. For a PartitionedDataset, the `_exists()` function normally just calls `bool(self._list_partitions())`, so if `_list_partitions()` takes too long, the whole thing doesn't work. I could also imagine this being a problem in other scenarios (using the standard PartitionedDataset or other single dataset classes) if the data is not stored locally and there are therefore delays when loading/listing files...
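To illustrate the latency issue described above, here is a small sketch, assuming a partitioned dataset whose `_exists()` is `bool(self._list_partitions())` as Simon describes. `SlowPartitionedDataset`, its `listing_seconds` parameter and the table names are hypothetical; the sleep only stands in for a slow remote listing.

```python
import time


class SlowPartitionedDataset:
    """Hypothetical stand-in: existence is derived from a full partition listing."""

    def __init__(self, partitions: list[str], listing_seconds: float) -> None:
        self._partitions = partitions
        self._listing_seconds = listing_seconds
        self.listing_calls = 0

    def _list_partitions(self) -> list[str]:
        # Stands in for a slow remote call, e.g. listing every table in a database.
        self.listing_calls += 1
        time.sleep(self._listing_seconds)
        return list(self._partitions)

    def _exists(self) -> bool:
        # The existence check pays the full cost of the listing.
        return bool(self._list_partitions())

    def exists(self) -> bool:
        return self._exists()


dataset = SlowPartitionedDataset(["table_a", "table_b"], listing_seconds=0.2)
start = time.perf_counter()
found = dataset.exists()
elapsed = time.perf_counter() - start
print(f"exists={found}, listings={dataset.listing_calls}, elapsed={elapsed:.2f}s")
```

Every `exists()` call triggers one complete listing, so any tool that probes each catalog entry once inherits the full listing time per partitioned dataset.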

Ankita Katiyar

12/18/2023, 12:00 PM

I don’t know if it’s been released yet but this has been resolved - https://github.com/kedro-org/kedro-viz/issues/1645

👍🏽 1

👍 2

Rashida Kanchwala

12/18/2023, 12:18 PM

Yes we will be releasing the fix for it hopefully today in Kedro-viz 7.0

marrrcin

12/18/2023, 12:19 PM

It will probably not cover the latency problem though

Rashida Kanchwala

12/18/2023, 6:59 PM

True. Looping @Ravi Kumar Pilla

Ravi Kumar Pilla

12/18/2023, 7:18 PM

The check was introduced in Kedro-Viz 6.7.0 to discover dataset factory patterns, but I see the latency problem here. A quick solution could be, as @marrrcin suggested, a flag to disable this check. However, that would also disable dataset factory pattern discovery. This needs further discussion with the team and I will have a look at it. Thank you
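The coupling Ravi mentions can be pictured with a toy model. This is not Kedro or Kedro-Viz code: `ToyCatalog`, `discover` and the `skip_checks` switch are hypothetical, and glob matching via `fnmatch` is a simplification of how real dataset factory patterns are written. The only assumption taken from the thread is that patterns get discovered as a side effect of calling `exists()` on each dataset name.

```python
from fnmatch import fnmatch


class ToyCatalog:
    """Hypothetical catalog: exists() resolves a pattern AND checks storage."""

    def __init__(self, patterns: list[str]) -> None:
        self.patterns = patterns
        self.resolved: set[str] = set()
        self.storage_checks = 0

    def exists(self, name: str) -> bool:
        if any(fnmatch(name, pattern) for pattern in self.patterns):
            self.resolved.add(name)  # side effect: the pattern is materialized
        self.storage_checks += 1  # stands in for a possibly slow remote check
        return name in self.resolved


def discover(catalog: ToyCatalog, dataset_names: list[str], skip_checks: bool) -> None:
    """Walk the pipeline's dataset names; skip_checks is a hypothetical switch."""
    if skip_checks:
        return  # no storage access, but also no pattern resolution
    for name in dataset_names:
        catalog.exists(name)


names = ["raw_cars", "raw_planes", "model_input"]

with_checks = ToyCatalog(patterns=["raw_*"])
discover(with_checks, names, skip_checks=False)

without_checks = ToyCatalog(patterns=["raw_*"])
discover(without_checks, names, skip_checks=True)

print(sorted(with_checks.resolved), with_checks.storage_checks)
print(sorted(without_checks.resolved), without_checks.storage_checks)
```

In this model, skipping the checks removes all storage access but leaves the pattern-based datasets unresolved, which is the trade-off under discussion.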

👍🏽 1

Ravi Kumar Pilla

12/18/2023, 11:00 PM

Hi all, thank you for your patience. We had an internal discussion with the team and decided to drop the dataset factory pattern discovery implementation for Kedro-Viz 7.0.0. This removes the dataset existence check and will resolve the issues mentioned in this thread. However, it will also remove support for dataset factory patterns from experiment tracking. We will add this to our backlog and work on it.

👀 1

👍 1
