# questions
n
Hi, I'm having trouble combining yielding nodes and overwriting partitioned datasets. I am using a PartitionedDataset that wraps a custom dataset, like this:
partitioned_audio_dataset:
  type: partitions.PartitionedDataset
  path: data/output/mainfolder
  dataset:
    type: my_kedro_project.datasets.audio_dataset.SoundDataset
  filename_suffix: ".WAV"
The node whose output corresponds to this catalog entry yields several dictionaries with keys of the form "subfolder_name/file.wav", so that when the node is done the output main folder should look like this:
mainfolder:
      subfolder_1
      subfolder_2
      subfolder_3
      ....
      subfolder_n
and inside each subfolder_i there must be several .wav files. This works fine, but the problem is when I run the node a second time: I would like the option to overwrite instead of adding new files to each subfolder. I thought the overwrite parameter of partitioned datasets would help, but it does not quite work as desired when yielding. If I change the catalog entry to
partitioned_audio_dataset:
  type: partitions.PartitionedDataset
  path: data/output/mainfolder
  overwrite: True
  dataset:
    type: my_kedro_project.datasets.audio_dataset.SoundDataset
  filename_suffix: ".WAV"
then the main folder looks like this:
mainfolder:
      subfolder_n
with only a single WAV file in subfolder_n, because each time the node yields, the previously yielded files and folders are deleted. Is there a way I can use the overwrite parameter of PartitionedDataset when yielding and still obtain the desired folder structure?
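A minimal sketch of the kind of generator node described above; source_clips (a mapping of subfolder name to candidate paths), load_clip and the sampling logic are hypothetical stand-ins, and only the shape of the yielded dictionaries matters:
import random
from pathlib import Path

def generate_audio_partitions(source_clips: dict):
    # source_clips and load_clip are hypothetical stand-ins for illustration.
    for subfolder, clip_paths in source_clips.items():
        chosen = random.sample(clip_paths, k=min(3, len(clip_paths)))
        for clip_path in chosen:
            audio = load_clip(clip_path)  # hypothetical loader
            # Each yield is a single-entry dict {partition_key: data};
            # "subfolder/name" keys become nested folders under the
            # PartitionedDataset's path, with filename_suffix (".WAV") appended.
            yield {f"{subfolder}/{Path(clip_path).stem}": audio}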
👀 1
r
Hi @Nicolas Betancourt Cardona, yes, overwrite will remove existing partitions. Let me have a look and I will get back to you on this. Thank you.
Does your node output consistent partition keys? I think if the partition keys are the same in both runs, it should overwrite the same file.
This should be the default behavior. I see you are using a custom dataset type. Does your custom dataset have a save method that prevents overwriting of data?
n
I want overwrite to remove existing partitions, and this works fine when my node returns an output. The problem is when I yield it. If I return { path_1: file_1, path_2: file_2, ..., path_n: file_n }, then the partitioned dataset is overwritten as desired. If I yield each of the {path_i: file_i} one by one, then I end up with only the last one, because it deletes everything each time it yields.
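A rough sketch of the two shapes being compared, with make_partitions as a hypothetical helper:
# Returning: the PartitionedDataset receives all partitions in one save call,
# so overwrite removes old partitions once and then writes everything.
def build_partitions(inputs):
    return {key: data for key, data in make_partitions(inputs)}

# Yielding: each yield triggers its own save call, so with overwrite: True
# every save first wipes the partitions written by the previous yields,
# leaving only the last one.
def build_partitions_lazily(inputs):
    for key, data in make_partitions(inputs):
        yield {key: data}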
👀 1
r
Using overwrite: True in your catalog config will remove previous partitions. What happens when you do not use overwrite?
n
When I don't use it I get more files than desired in each subfolder_i:
mainfolder:
      subfolder_1:
        old_file_1
        old_file_2
        ...
        old_file_k
        new_file_1
        new_file_2
        ...
        new_file_m
      subfolder_2
      subfolder_3
      ....
      subfolder_n
I want Kedro to overwrite this folder, but in such a way that I can yield and the resulting folder looks like this:
mainfolder:
      subfolder_1:
        new_file_1
        new_file_2
        ...
        new_file_m
      subfolder_2
      subfolder_3
      ....
      subfolder_n
I want the old files to be deleted and only the new files, coming from the multiple yields of the last run, to be kept. I'm sorry if I was not clear enough in my initial comment.
r
Okay, based on the information:
1. I think overwrite: True will not serve your purpose, so you need to avoid using it.
2. If the file names / partition keys yielded by the node are consistent (i.e., the same file names in both runs), the underlying dataset's save method should overwrite by default (as with kedro-datasets' CSVDataset, for example).
3. If you have a custom dataset implementation, you need to make sure its save method does not create a new file, as in Kedro the save operation is delegated to the underlying dataset implementation.
I am not sure how yield differs from return when it comes to saving datasets; I do not think Kedro does any magic here. I will loop someone in to help you if this does not resolve your issue. @Ankita Katiyar, any quick suggestion here? Thank you.
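For point 3, a minimal sketch of a custom dataset whose save simply writes to the partition's path, so re-saving the same key replaces the file. This assumes a soundfile-backed implementation and is illustrative only; the real SoundDataset may differ:
from pathlib import Path
import soundfile as sf  # assumed audio I/O backend
from kedro.io import AbstractDataset  # AbstractDataSet in older Kedro versions

class SoundDataset(AbstractDataset):
    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self):
        data, samplerate = sf.read(self._filepath)
        return data, samplerate

    def _save(self, data) -> None:
        audio, samplerate = data
        self._filepath.parent.mkdir(parents=True, exist_ok=True)
        # Writing to the same path overwrites any existing file, so repeated
        # runs with the same partition key replace the partition in place.
        sf.write(self._filepath, audio, samplerate)

    def _describe(self):
        return {"filepath": self._filepath}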
👍 1
n
Regarding your second point, maybe the problem is that the files I'm storing in the subfolders are selected at random, and the file names are rarely the same from one run to another. That's why with overwrite: False I end up with both the old and the new files. On the other hand, with overwrite: True the immediately preceding yielded output is deleted, and that's why I end up with only the last file of the last subfolder. So I agree with you that the overwrite parameter won't do for me. I'm wondering if there is a way to inform Kedro that I'm yielding for the first time, so that only on the first yield it deletes the content of mainfolder, and on all subsequent yields it keeps the files of mainfolder.
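One way to sketch that idea without relying on overwrite is to clear the output folder inside the node itself, before the first yield. The hard-coded path and make_partitions below are illustrative assumptions, not Kedro behaviour:
import shutil
from pathlib import Path

def generate_audio_partitions(source_clips: dict):
    # Clear the previous run's output once, before the first yield, then let
    # each subsequent yield add its partition without interference.
    output_root = Path("data/output/mainfolder")  # illustrative; could come from parameters
    if output_root.exists():
        shutil.rmtree(output_root)
    for key, data in make_partitions(source_clips):  # make_partitions is hypothetical
        yield {key: data}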
👀 1
r
There might be some workarounds, but I think having a consistent partition key would help you resolve this issue. For now, I could think of having a function that gives you a consistent key (something like a hash) for a specific input. Even if the files are selected at random, the saved file names would be consistent, so the save would overwrite when the same file is processed in the next run and create a new file only when it was not previously saved.
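A rough sketch of such a key function, assuming the key is derived from the source path; the name and details are illustrative:
import hashlib

def partition_key(source_path: str, subfolder: str) -> str:
    # Deterministic key derived from the input path: the same input always
    # maps to the same partition, so a re-run overwrites it instead of
    # adding a new file next to the old one.
    digest = hashlib.sha1(source_path.encode("utf-8")).hexdigest()[:12]
    return f"{subfolder}/{digest}"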
If this is not urgent, I would have someone else suggest a different/better approach here. Thanks for your patience 🙂
n
It is not urgent, I would be happy to hear your suggestions 🙂