# questions
n
Hi, I'm having trouble combining yielding nodes and overwriting partitioned datasets. I am using a PartitionedDataset that wraps a custom dataset, like this:
partitioned_audio_dataset:
  type: partitions.PartitionedDataset
  path: data/output/mainfolder
  dataset:
    type: my_kedro_project.datasets.audio_dataset.SoundDataset
  filename_suffix: ".WAV"
The node whose output corresponds to this catalog entry yields several dictionaries with keys of the form "subfolder_name/file.wav", so that when the node is done the output main folder should look like this:
mainfolder:
      subfolder_1
      subfolder_2
      subfolder_3
      ....
      subfolder_n
and inside each subfolder_i there must be several .wav files. This works fine, but the problem is when I run the node a second time: I would like the option to overwrite instead of adding new files to each subfolder. I thought the overwrite parameter of partitioned datasets would help, but it does not quite work as desired when yielding. If I change the catalog entry to
partitioned_audio_dataset:
  type: partitions.PartitionedDataset
  path: data/output/mainfolder
  overwrite: True
  dataset:
    type: my_kedro_project.datasets.audio_dataset.SoundDataset
  filename_suffix: ".WAV"
then the main folder looks like this:
mainfolder:
      subfolder_n
with only a single WAV file in subfolder_n, because each time the node yields, the previously yielded files and folders are deleted. Is there a way I can use the overwrite parameter of PartitionedDataset when yielding and still obtain the desired folder structure?
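A minimal sketch of the kind of generator node described above; source_clips (a mapping of subfolder name to candidate paths), load_clip and the sampling logic are hypothetical stand-ins, and only the shape of the yielded dictionaries matters:
import random
from pathlib import Path

def generate_audio_partitions(source_clips: dict):
    # source_clips and load_clip are hypothetical stand-ins for illustration.
    for subfolder, clip_paths in source_clips.items():
        chosen = random.sample(clip_paths, k=min(3, len(clip_paths)))
        for clip_path in chosen:
            audio = load_clip(clip_path)  # hypothetical loader
            # Each yield is a single-entry dict {partition_key: data};
            # "subfolder/name" keys become nested folders under the
            # PartitionedDataset's path, with filename_suffix (".WAV") appended.
            yield {f"{subfolder}/{Path(clip_path).stem}": audio}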
👀 1
r
Hi @Nicolas Betancourt Cardona, yes, overwrite will remove existing partitions. Let me have a look and I will get back to you on this. Thank you.
Does your node output consistent partition keys? I think if the partition keys are the same in both runs, it should overwrite the same file.
This should be the default behavior. I see you are using a custom dataset type. Does your custom dataset have a save method that prevents overwriting of data?
n
I want overwrite to remove existing partitions, and this works fine when my node returns an output. The problem is when I yield it. If I return { path_1: file_1, path_2: file_2, ..., path_n: file_n }, then the partitioned dataset is overwritten as desired. If I yield each of the {path_i: file_i} one by one, then I end up with only the last one, because it deletes everything each time it yields.
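A rough sketch of the two shapes being compared, with make_partitions as a hypothetical helper:
# Returning: the PartitionedDataset receives all partitions in one save call,
# so overwrite removes old partitions once and then writes everything.
def build_partitions(inputs):
    return {key: data for key, data in make_partitions(inputs)}

# Yielding: each yield triggers its own save call, so with overwrite: True
# every save first wipes the partitions written by the previous yields,
# leaving only the last one.
def build_partitions_lazily(inputs):
    for key, data in make_partitions(inputs):
        yield {key: data}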
👀 1
r
Using overwrite: True in your catalog config will remove previous partitions. What happens when you do not use overwrite?
n
When I don't use it I get more files than desired in each subfolder_i:
mainfolder:
      subfolder_1:
        old_file_1
        old_file_2
        ...
        old_file_k
        new_file_1
        new_file_2
        ...
        new_file_m
      subfolder_2
      subfolder_3
      ....
      subfolder_n
I want Kedro to overwrite this folder, but in such a way that I can yield and the resulting folder looks like this:
mainfolder:
      subfolder_1:
        new_file_1
        new_file_2
        ...
        new_file_m
      subfolder_2
      subfolder_3
      ....
      subfolder_n
I want the old files to be deleted and only the new files, coming from the multiple yields of the last run, to be kept. I'm sorry if I was not clear enough in my initial comment.
r
Okay, based on the information:
1. I think overwrite: True will not serve your purpose, so you need to avoid using it.
2. If the file names / partition keys yielded by the node are consistent (i.e., the same file names in both runs), the underlying dataset's save method should overwrite by default (as with kedro-datasets' CSVDataset, for example).
3. If you have a custom dataset implementation, you need to make sure its save method does not create a new file, as in Kedro the save operation is delegated to the underlying dataset implementation.
I am not sure how yield differs from return when it comes to saving datasets; I do not think Kedro does any magic here. I will loop someone in to help you if this does not resolve your issue. @Ankita Katiyar, any quick suggestion here? Thank you.
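For point 3, a minimal sketch of a custom dataset whose save simply writes to the partition's path, so re-saving the same key replaces the file. This assumes a soundfile-backed implementation and is illustrative only; the real SoundDataset may differ:
from pathlib import Path
import soundfile as sf  # assumed audio I/O backend
from kedro.io import AbstractDataset  # AbstractDataSet in older Kedro versions

class SoundDataset(AbstractDataset):
    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self):
        data, samplerate = sf.read(self._filepath)
        return data, samplerate

    def _save(self, data) -> None:
        audio, samplerate = data
        self._filepath.parent.mkdir(parents=True, exist_ok=True)
        # Writing to the same path overwrites any existing file, so repeated
        # runs with the same partition key replace the partition in place.
        sf.write(self._filepath, audio, samplerate)

    def _describe(self):
        return {"filepath": self._filepath}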
👍 1
n
Regarding your second point, maybe the problem is that the files I'm storing in the subfolders are selected at random, and the file names are rarely the same from one run to another. That's why with overwrite: False I end up with both the old and the new files. On the other hand, with overwrite: True the immediately preceding yielded output is deleted, and that's why I end up with only the last file of the last subfolder. So I agree with you that the overwrite parameter won't do for me. I'm wondering if there is a way to inform Kedro that I'm yielding for the first time, so that only on the first yield it deletes the content of mainfolder, and on all subsequent yields it keeps the files of mainfolder.
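One way to sketch that idea without relying on overwrite is to clear the output folder inside the node itself, before the first yield. The hard-coded path and make_partitions below are illustrative assumptions, not Kedro behaviour:
import shutil
from pathlib import Path

def generate_audio_partitions(source_clips: dict):
    # Clear the previous run's output once, before the first yield, then let
    # each subsequent yield add its partition without interference.
    output_root = Path("data/output/mainfolder")  # illustrative; could come from parameters
    if output_root.exists():
        shutil.rmtree(output_root)
    for key, data in make_partitions(source_clips):  # make_partitions is hypothetical
        yield {key: data}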
👀 1
r
There might be some workarounds, but I think having a consistent partition key would help you resolve this issue. For now, I could think of having a function that gives you a consistent key (something like a hash) for a specific input. Even if the files are selected at random, the saved file names would be consistent, so the save would overwrite when the same file is processed in the next run and create a new file only when it was not previously saved.
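A rough sketch of such a key function, assuming the key is derived from the source path; the name and details are illustrative:
import hashlib

def partition_key(source_path: str, subfolder: str) -> str:
    # Deterministic key derived from the input path: the same input always
    # maps to the same partition, so a re-run overwrites it instead of
    # adding a new file next to the old one.
    digest = hashlib.sha1(source_path.encode("utf-8")).hexdigest()[:12]
    return f"{subfolder}/{digest}"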
If this is not urgent, I would have someone else suggest a different/better approach here. Thanks for your patience 🙂
n
It is not urgent, I would be happy to hear your suggestions 🙂