Hello guys, I have a problem with kedro node definition (Kedro #questions)

Adrien

01/11/2024, 4:58 PM

Hello guys, I have a problem with kedro node definition. In order to preprocess my data, I use a Dask cluster in one of my nodes. My problem: for each parallel processing task, I need the output path, which is not accessible inside the node function. Has anyone solved this problem?

datajoely

01/11/2024, 5:04 PM

it’s less of a problem and more that kedro intentionally separates business logic from IO logic

datajoely

01/11/2024, 5:04 PM

we don’t really support conditional flow based on filepath outputs

datajoely

01/11/2024, 5:05 PM

there is a belief that the combinatorial complexity leads to headaches, which is why we steer people away from it

Adrien

01/11/2024, 5:05 PM

Ok ok, but how do you handle distributed computing with huge clusters?

datajoely

01/11/2024, 5:05 PM

are you using the dask.ParquetDataSet?

datajoely

01/11/2024, 5:06 PM

https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-2.0.0/api/kedro_datasets.dask.ParquetDataset.html

datajoely

01/11/2024, 5:06 PM

https://docs.kedro.org/en/stable/deployment/dask.html

Adrien

01/11/2024, 5:07 PM

I'm using a custom webdataset for I/O efficiency, but this type of dataset is not supported by kedro 😞

Adrien

01/11/2024, 5:07 PM

I'm parallel processing small audio files

datajoely

01/11/2024, 5:08 PM

Okay but I’m still not sure why the filepaths need to be part of the node logic

datajoely

01/11/2024, 5:08 PM

why do regular catalog entries not work?
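
To illustrate the separation of business logic from IO logic discussed above, here is a minimal sketch of a node that never sees a filepath. It assumes the output is declared in the catalog as a partitioned dataset (Kedro's PartitionedDataset accepts a dict mapping partition ids to data, or to callables that are evaluated at save time). The names `preprocess_clip` and `preprocess_audio` are hypothetical, the byte reversal is only a placeholder for real audio processing, and the sketch does not show Dask scheduling or the webdataset format mentioned in the thread.

```python
from typing import Callable, Dict


def preprocess_clip(raw: bytes) -> bytes:
    """Hypothetical placeholder for the real audio preprocessing step."""
    return raw[::-1]


def preprocess_audio(
    clips: Dict[str, Callable[[], bytes]],
) -> Dict[str, Callable[[], bytes]]:
    """Node function: maps each partition id to a lazily computed output.

    No filepath appears here. The dataset configured in the catalog is
    assumed to decide where each partition id is written.
    """

    def make_task(load: Callable[[], bytes]) -> Callable[[], bytes]:
        # Bind `load` per partition so each callable processes its own clip.
        return lambda: preprocess_clip(load())

    return {partition_id: make_task(load) for partition_id, load in clips.items()}


if __name__ == "__main__":
    # In-memory stand-ins for the load callables a partitioned input would provide.
    fake_inputs = {"a.wav": lambda: b"abc", "b.wav": lambda: b"xyz"}
    for partition_id, compute in preprocess_audio(fake_inputs).items():
        print(partition_id, compute())
```

With this shape the partition id plays the role of the output name, and the base path stays in the catalog entry rather than in the node.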

Adrien

01/11/2024, 5:09 PM

I was not aware of the Dask deployment tutorial, thanks for sharing! I'll read it and come back to you
