#questions

marrrcin

12/18/2023, 9:48 AM
Is there some flag in `kedro viz` to disable dataset checking? I just want to see the pipeline structure of a project, but the project itself has data catalog entries pointing to directories/files (in S3) that I don't have access to locally. Right now it fails on the `.exists()` call in the datasets.

Nok Lam Chan

12/18/2023, 9:57 AM
Are you on the latest version of viz? I don't recall it checking datasets

marrrcin

12/18/2023, 9:58 AM
Yes, 6.7.0
It goes into `kedro_viz.server.populate_data` -> `data_access_manager.resolve_dataset_factory_patterns` -> `catalog.exists(dataset_name)`

Simon Wolf

12/18/2023, 9:59 AM
Hey, I had the same problem, so I would also be interested. In some other cases the data check took too long and kedro-viz was cancelled after the 60s timeout. In those cases it would also be good to be able to deactivate the check. My interim solution was my own dataset class in which I adapted the `_exists` function...
🙈 1

marrrcin

12/18/2023, 10:01 AM
I imagine something like `kedro viz --skip-datasets` would be nice
👍 1

Simon Wolf

12/18/2023, 10:06 AM
Another interim solution would be to go back to kedro-viz version 6.6.1. If I remember correctly, the data check did not yet exist in that version.
👍🏼 1

marrrcin

12/18/2023, 10:08 AM
That's actually super convenient, thanks @Simon Wolf
FYI @Nero Okwa

Nero Okwa

12/18/2023, 10:42 AM
Interesting, I don't recall this either. FYI @Rashida Kanchwala
@Simon Wolf please provide more context on the other cases where the data check took too long and Kedro-Viz was cancelled.

Simon Wolf

12/18/2023, 11:05 AM
I use a custom PartitionedDataset to load tables from Databricks Hive storage. In this case the `_list_partitions` function takes a while because there are many tables in the database: executing the Spark function to list all tables is slow, probably also because the tables are distributed across the cluster. For a PartitionedDataset, the `_exists()` function normally just calls `bool(self._list_partitions())`, so if `_list_partitions()` takes too long, the whole thing doesn't work. I could also imagine this being a problem in other scenarios (using the standard PartitionedDataset or other single dataset classes) if the data is not stored locally and there are therefore delays when loading/listing files...
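
For readers following along, here is a minimal sketch of the kind of workaround described above (overriding `_exists` so the existence check never triggers a slow listing). It is not the actual code used in the thread. `SlowPartitionedStandIn` and `NoListingOnExists` are hypothetical names, and the stand-in only imitates the `_exists()` -> `bool(self._list_partitions())` behaviour mentioned in the message; it does not import Kedro. Returning `True` unconditionally is an assumption that only makes sense when the dataset is being used for visualisation rather than for real runs.

```python
import time


class SlowPartitionedStandIn:
    """Hypothetical stand-in for a partitioned dataset (NOT the real Kedro class)."""

    def __init__(self, listing_delay: float = 0.2):
        self.listing_delay = listing_delay
        self.listing_calls = 0  # counts how often the expensive listing runs

    def _list_partitions(self) -> list:
        # Simulates an expensive remote listing, e.g. enumerating many tables.
        self.listing_calls += 1
        time.sleep(self.listing_delay)
        return ["table_a", "table_b"]

    def _exists(self) -> bool:
        # Mirrors the behaviour described in the thread:
        # the dataset "exists" if the partition listing is non-empty.
        return bool(self._list_partitions())

    def exists(self) -> bool:
        # Public entry point that delegates to the overridable hook.
        return self._exists()


class NoListingOnExists(SlowPartitionedStandIn):
    """Hypothetical subclass that skips the listing during existence checks."""

    def _exists(self) -> bool:
        # Assumption: for visualisation purposes we claim the data exists
        # instead of paying for a full listing.
        return True
```

With the stand-in, `exists()` pays the full listing cost on every call, while the subclass answers immediately and never calls `_list_partitions()`. In a real project the same override would go on the custom dataset class itself, and whether `True` or `False` is the safer answer depends on how the caller uses the result.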

Ankita Katiyar

12/18/2023, 12:00 PM
I don't know if it's been released yet but this has been resolved - https://github.com/kedro-org/kedro-viz/issues/1645
👍🏽 1
👍 2

Rashida Kanchwala

12/18/2023, 12:18 PM
Yes, we will be releasing the fix for it, hopefully today, in Kedro-Viz 7.0

marrrcin

12/18/2023, 12:19 PM
It will probably not cover the latency problem though

Rashida Kanchwala

12/18/2023, 6:59 PM
True. Looping @Ravi Kumar Pilla

Ravi Kumar Pilla

12/18/2023, 7:18 PM
The check was introduced in Kedro-Viz 6.7.0 to discover dataset factory patterns, but I see the latency problem here. A quick solution could be, as @marrrcin suggested, a flag to disable this discovery. However, that would also disable dataset factory pattern discovery. This needs further discussion with the team, and I will have a look at it. Thank you.
👍🏽 1
Hi all, thank you for your patience. We had an internal discussion with the team and decided to drop the dataset factory pattern discovery implementation for Kedro-Viz 7.0.0. This removes the dataset existence check and will resolve the issues mentioned in this thread. However, it will also remove support for dataset factory patterns from experiment tracking. We will add this to our backlog and work on it.
👀 1
👍 1