Kedro #questions

Hugo Acosta

10/01/2024, 2:55 PM

Hello everyone, I am working on a dynamic pipeline that generates one file for each year in a list, so the catalog entry would be:

data_{year}:
  type: pandas.ExcelDataset
  filepath: reports/folder/data_{year}.xlsx
  save_args:
    index: False

Then I have another pipeline that aggregates all of these files, loading them as a PartitionedDataset with this entry:

partitioned_data:
  type: partitions.PartitionedDataset
  path: reports/folder
  dataset:
    type: pandas.ExcelDataset

The main problem with my approach is that even though these two entries refer to the same data, they are in fact different catalog entries, so Kedro runs the second pipeline before the dynamic one. I would appreciate your input on this issue. Thanks a lot!
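
To illustrate the problem described above: Kedro works out execution order by matching the dataset names that nodes output against the names other nodes take as input. The sketch below only compares the names from the two catalog entries; the list of years is a hypothetical placeholder, since the thread does not give the actual values.

```python
# Hypothetical list of years; the thread does not state the real values.
years = [2021, 2022, 2023]

# Dataset names written by the dynamic (per-year) pipeline.
yearly_outputs = {f"data_{year}" for year in years}

# Dataset name read by the aggregation pipeline.
aggregation_inputs = {"partitioned_data"}

# No name is shared, so nothing links the two pipelines in the graph,
# even though both entries point at files in reports/folder.
shared_names = yearly_outputs & aggregation_inputs
print(shared_names)  # set()
```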

Nok Lam Chan

10/01/2024, 3:10 PM

Hi @Hugo Acosta, thanks for the question.

"The main problem with my approach is that even though these two entries refer to the same data, they are in fact different entries, so Kedro runs the second pipeline before the dynamic one."

Is it possible to use a partitioned dataset instead of a dynamic pipeline in this case? I understand why this happens: if you visualise this pipeline with kedro viz, it will show up as disconnected, so Kedro doesn't know that the first pipeline needs to be executed before the other. The other option is to create a dummy input/output pair to ensure the dependency is resolved correctly.
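
A minimal sketch of the dummy input/output idea, under stated assumptions: each per-year node returns an extra flag output, and the aggregation node lists every flag as an input, which gives Kedro the missing edges in the graph. The names raw_{year}, done_{year}, aggregated, build_report, aggregate and the list of years are all hypothetical; only data_{year} and partitioned_data come from the catalog entries above. The flags are not declared in the catalog, so Kedro would keep them in memory.

```python
from typing import Any, Callable, List, Tuple


def build_report(raw: Any) -> Tuple[Any, bool]:
    """Per-year node: return the report plus a dummy 'done' flag."""
    report = raw  # placeholder for the real per-year logic (assumption)
    return report, True


def aggregate(partitions: dict, *done_flags: bool) -> List[str]:
    """Aggregation node; done_flags are ignored and exist only for ordering."""
    processed = []
    for partition_id in sorted(partitions):
        load_partition: Callable[[], Any] = partitions[partition_id]
        load_partition()  # placeholder: load and process one file at a time
        processed.append(partition_id)
    return processed


def create_pipeline(years=(2021, 2022, 2023)):
    # Imported here so the node functions above can be exercised without Kedro.
    from kedro.pipeline import node, pipeline

    yearly_nodes = [
        node(
            build_report,
            inputs=f"raw_{year}",  # hypothetical input dataset
            outputs=[f"data_{year}", f"done_{year}"],
            name=f"build_report_{year}",
        )
        for year in years
    ]
    aggregate_node = node(
        aggregate,
        # The flags force this node to run after every per-year node.
        inputs=["partitioned_data", *[f"done_{year}" for year in years]],
        outputs="aggregated",  # hypothetical output dataset
        name="aggregate_reports",
    )
    return pipeline([*yearly_nodes, aggregate_node])
```

With this wiring, kedro viz should show the per-year nodes connected to the aggregation node through the done_{year} flags rather than as two separate graphs.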

Hugo Acosta

10/01/2024, 3:54 PM

Thanks a lot for the quick answer! I am a bit concerned that loading the files as a partitioned dataset, instead of looping through them, will cause memory issues. Could you elaborate a bit on your suggestion?

Nok Lam Chan

10/01/2024, 4:06 PM

Which suggestion are you referring to?

Hugo Acosta

10/01/2024, 4:12 PM

My concern is that by using a partition dataset instead of a dynamic pipeline I will encounter memory issues, since the data files are kinda heavy, so I wanted to know your take on this.

Nok Lam Chan

10/01/2024, 4:14 PM

https://docs.kedro.org/en/stable/data/partitioned_and_incremental_datasets.html For a partitioned dataset, you could use lazy loading/lazy saving to help with the memory issue. If you prefer the dynamic pipeline approach, that's totally fine, but as mentioned you would need a dummy input/output to control the execution order.
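
A rough sketch of what lazy loading and lazy saving look like from a node's point of view. A PartitionedDataset hands the node a dictionary that maps each partition ID to a load function, so each file is read only when its function is called; for saving, a node can return a dictionary of callables so each partition is built only when it is written. The function names and the build_report helper are hypothetical.

```python
from functools import partial
from typing import Any, Callable, Dict, Iterable


def count_rows_per_partition(
    partitions: Dict[str, Callable[[], Any]]
) -> Dict[str, int]:
    """Lazy loading: read one partition at a time instead of all at once."""
    row_counts = {}
    for partition_id in sorted(partitions):
        frame = partitions[partition_id]()  # the file is loaded only here
        row_counts[partition_id] = len(frame)
        del frame  # drop the reference before loading the next partition
    return row_counts


def defer_reports(
    years: Iterable[int], build_report: Callable[[int], Any]
) -> Dict[str, Callable[[], Any]]:
    """Lazy saving: return callables so each report is built at write time.

    The keys become file names under the dataset path (plus any
    configured filename_suffix).
    """
    return {f"data_{year}": partial(build_report, year) for year in years}
```

Using partial rather than a lambda inside the comprehension binds each year at the time the dictionary is built, so every callable produces the report for its own year.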

Nok Lam Chan

10/01/2024, 4:15 PM

Side note: https://github.com/kedro-org/kedro/discussions/3758 There has been some discussion about adding custom execution order; feel free to comment if this is of interest to you.
