Nicolas Betancourt Cardona
04/02/2025, 2:24 PM
partitioned_audio_dataset:
  type: partitions.PartitionedDataset
  path: data/output/mainfolder
  dataset:
    type: my_kedro_project.datasets.audio_dataset.SoundDataset
  filename_suffix: ".WAV"
The node whose output corresponds to this catalog entry yields several dictionaries with keys of the form "subfolder_name/file.wav", so that when the node is done the output main folder should look like this:
mainfolder:
  subfolder_1
  subfolder_2
  subfolder_3
  ....
  subfolder_n
and inside each subfolder_i there must be several .wav files. This is working fine, but the problem is when I run the node a second time. I would like the possibility to overwrite instead of adding new files to each subfolder. I thought the overwrite parameter of partitioned datasets would help, but I think it does not quite work as desired when yielding. If I change the catalog entry to
partitioned_audio_dataset:
  type: partitions.PartitionedDataset
  path: data/output/mainfolder
  overwrite: True
  dataset:
    type: my_kedro_project.datasets.audio_dataset.SoundDataset
  filename_suffix: ".WAV"
then the main folder looks like this:
mainfolder:
  subfolder_n
with only a single WAV file in subfolder_n, because each time the node yields it deletes the previously yielded files and folders. Is there a way I can use the overwrite parameter of the partitioned dataset when yielding and still obtain the desired folder structure?
Ravi Kumar Pilla
04/02/2025, 4:37 PM
overwrite will remove existing partitions. Let me have a look and will get back on this. Thank you
Ravi Kumar Pilla
04/02/2025, 5:03 PM
Ravi Kumar Pilla
04/02/2025, 5:06 PM
Nicolas Betancourt Cardona
04/02/2025, 5:45 PM
overwrite to remove existing partitions, and this works fine when my node returns an output. The problem is when I yield it.
If I return { path_1: file_1, path_2: file_2, ..., path_n: file_n }, then the partitioned dataset is overwritten as desired.
If I yield each of the {path_i: file_i} one by one, then I end up with only the last one, because it deletes everything each time it yields.
Ravi Kumar Pilla
04/02/2025, 5:51 PM
Nicolas Betancourt Cardona
04/02/2025, 6:05 PM
mainfolder:
  subfolder_1:
    old_file_1
    old_file_2
    ...
    old_file_k
    new_file_1
    new_file_2
    ...
    new_file_m
  subfolder_2
  subfolder_3
  ....
  subfolder_n
I want Kedro to overwrite this folder but in such a way that I can yield and that the resulting folder looks like this:
mainfolder:
  subfolder_1:
    new_file_1
    new_file_2
    ...
    new_file_m
  subfolder_2
  subfolder_3
  ....
  subfolder_n
I want the old files to be deleted and to preserve only the new files coming from the multiple yields of the last run. I'm sorry I was not clear enough in my initial comment
Ravi Kumar Pilla
04/02/2025, 6:14 PM
1. overwrite: True will not serve your purpose, so you need to avoid using it.
2. If the node yields the same partition keys on each run (i.e., the same file names), the underlying dataset's save method should overwrite by default (as it does if you are using the kedro-datasets CSVDataset).
3. If you have a custom dataset implementation, you need to make sure the save method does not create a new file, as in Kedro the save operation is delegated to the underlying dataset implementation.
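Point 3 can be illustrated with a minimal sketch, assuming a custom dataset that stores raw bytes at a fixed filepath. The class name OverwritingBytesDataset is hypothetical and is not the SoundDataset from this thread; a real implementation would subclass Kedro's AbstractDataset. What matters here is only that save writes to the same path every time, so an existing file is replaced rather than duplicated:

```python
from pathlib import Path


class OverwritingBytesDataset:
    """Hypothetical stand-in for a custom dataset that stores raw bytes."""

    def __init__(self, filepath: str) -> None:
        self._filepath = Path(filepath)

    def save(self, data: bytes) -> None:
        # Create the parent folder if it does not exist yet.
        self._filepath.parent.mkdir(parents=True, exist_ok=True)
        # write_bytes truncates an existing file, so saving again to the
        # same path replaces the old content instead of adding a new file.
        self._filepath.write_bytes(data)

    def load(self) -> bytes:
        return self._filepath.read_bytes()
```

With a save method like this, old files only survive a second run if the new run uses different partition keys.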
I am not sure how yield is different from return when it comes to saving datasets. I do not think Kedro does any magic here. I will loop someone in here to help you if this does not resolve your issue. @Ankita Katiyar any quick suggestion here? Thank you
Nicolas Betancourt Cardona
04/02/2025, 6:22 PM
With overwrite: False I'm ending up with both the old and the new files.
On the other hand, when overwrite: True the immediately previous yielded output is deleted, and that's why I end up with only the last file of the last subfolder. So I agree with you that the overwrite parameter won't do for me.
I'm wondering if there is a way to inform Kedro that I'm yielding for the first time, so that it deletes the content of mainfolder only on the first yield and keeps the files of mainfolder for all the other yields.
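One possible workaround, sketched under assumptions: keep overwrite: False and let the node itself clear the output folder once, before its first yield. The function name clear_then_yield and its arguments are hypothetical, and the node would need to be told the dataset path (for example through a parameter). It relies on the fact that the code before the loop in a generator runs only once, when the first item is requested:

```python
import shutil
from pathlib import Path
from typing import Dict, Iterable, Iterator, Tuple


def clear_then_yield(
    output_dir: str,
    items: Iterable[Tuple[str, bytes]],
) -> Iterator[Dict[str, bytes]]:
    """Empty output_dir once, then yield one {partition_key: data} dict per item."""
    root = Path(output_dir)
    if root.exists():
        # Executed a single time, right before the first yield.
        shutil.rmtree(root)
    for key, data in items:
        # Each yielded dict holds one partition, e.g. "subfolder_1/new_file_1".
        yield {key: data}
```

Since this puts a filesystem side effect inside the node, the same one-time deletion could arguably live in a before_node_run hook instead, so that the node stays free of I/O.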
Ravi Kumar Pilla
04/02/2025, 6:27 PM
Ravi Kumar Pilla
04/02/2025, 6:29 PM
Nicolas Betancourt Cardona
04/02/2025, 8:01 PM