# questions
n
Hi everyone, I'm working on a Kedro project where I want to automatically label thousands of audio files, apply transformations to them, and then store them in a folder of folders, each subfolder corresponding to one label. I want that folder of folders to be a catalog entry in my yml file. I followed this Kedro tutorial and created my own custom dataset for saving/loading .wav files in the Kedro catalog. I am also able to create `PartitionedDataset` catalog entries in `catalog.yml` such as:
audio_folder:
  type: partitions.PartitionedDataset
  dataset: my_kedro_project.datasets.audio_dataset.SoundDataset
  path: data/output/audios/
  filename_suffix: ".WAV"
The next level of abstraction I need is a catalog entry corresponding to a folder containing folders such as the `audio_folder` above. Here is my attempt, but I'm having an issue with the `_save` method:
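import os
from pathlib import PurePosixPath

import fsspec
from kedro.io.core import get_protocol_and_path
from kedro_datasets.partitions import PartitionedDataset

from my_kedro_project.datasets.audio_dataset import SoundDataset


class AudioFolderDataset(PartitionedDataset):
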
    def __init__(self, main_folder_path: str):
        """Creates a new instance of SoundDataset to load / save audio data for given filepath.

        Args:
            filepath: The location of the audio file to load / save data.
        """
        protocol, mainfolderpath = get_protocol_and_path(main_folder_path)
        self._protocol = protocol
        self._mainfolderpath = PurePosixPath(mainfolderpath)
        self._fs = fsspec.filesystem(self._protocol)

    def _load(self, subfolders_dictionary):
        # loading code
        ...

    def _save(self, subfolders_dictionary):
        # Note: the result of normpath is discarded, so this line has no effect.
        os.path.normpath(self._mainfolderpath)
        for subfolder_name in subfolders_dictionary.keys():
            # On Windows, os.path.join inserts "\" between the two parts.
            subfolder_path = os.path.join(self._mainfolderpath, subfolder_name)

            # One PartitionedDataset per label subfolder.
            partitioned_dataset = PartitionedDataset(
                path=subfolder_path,
                dataset=SoundDataset,
                filename_suffix=".WAV",
            )

            partitioned_dataset.save(subfolders_dictionary[subfolder_name])

    def _describe(self):
        # describe code
        ...
The problem is that I'm working on Windows, but it seems that `PartitionedDataset` assumes my system separator is `/` instead of `\`. When I print the path in the `_save` method of the `SoundDataset` class I get `folder\\subfolder/file.WAV`, which of course leads to an error. Is there a way to change this default behaviour?
n
Hey @Nicolas Betancourt Cardona, I guess you are the author of the same Stack Overflow thread?
I think it's mostly due to your implementation using `os.path.join` etc. If you use `pathlib.Path`, you should be able to handle these paths properly regardless of your OS.
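A minimal sketch of the difference, not taken from the thread. `ntpath` is the Windows flavour of `os.path`, so it reproduces the mixed-separator result on any OS. The sketch uses `PurePosixPath` and `as_posix()` rather than a plain `Path`, because `str()` of a path on Windows still uses backslashes. The folder and file names are made up for illustration.

```python
import ntpath
from pathlib import PurePosixPath

main_folder = "folder"        # hypothetical main folder
subfolder_name = "subfolder"  # hypothetical label subfolder

# os.path.join on Windows behaves like ntpath.join and inserts a backslash.
windows_joined = ntpath.join(main_folder, subfolder_name)

# Appending a partition name with "/" afterwards (what the question reports
# seeing inside SoundDataset) then yields mixed separators.
mixed = windows_joined + "/" + "file.WAV"

# Joining with PurePosixPath keeps forward slashes on every OS.
posix_joined = (PurePosixPath(main_folder) / subfolder_name).as_posix()
consistent = posix_joined + "/" + "file.WAV"

print(mixed)
print(consistent)
```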
On the other hand, I see you have a `PartitionedDataset` inside your implementation. This feels a bit weird, since you are inheriting from `PartitionedDataset` at the same time. I would approach this differently, since you mentioned that a folder of files is considered a single "Dataset":
1. Keep `PartitionedDataset` if it's flexible enough for you; otherwise extend it to iterate over folders however you need.
2. Implement your own AudioDataset that loads a single folder as one piece of data.
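One possible reading of option 1, sketched under an assumption that is not confirmed in this thread: that `PartitionedDataset` accepts partition IDs containing `/` and maps them to subfolders under `path`. The helper name `flatten_partitions` and the sample data are hypothetical.

```python
from typing import Any, Dict


def flatten_partitions(nested: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Turn {label: {name: data}} into {"label/name": data}."""
    flat: Dict[str, Any] = {}
    for label, files in nested.items():
        for name, data in files.items():
            # Always join with "/" so the key does not depend on the OS.
            flat[f"{label}/{name}"] = data
    return flat


# Placeholder payloads stand in for real audio data.
nested = {
    "dog": {"clip_001": b"...", "clip_002": b"..."},
    "cat": {"clip_003": b"..."},
}
flat = flatten_partitions(nested)
print(sorted(flat))
```

If the assumption holds, a node could return `flat` straight to the existing `audio_folder` catalog entry instead of building one dataset per subfolder.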
n
@Nok Lam Chan Hey, yes, I'm the author. Thank you for your answer. After trying different approaches, extending `AbstractDataset` to load and save folders of folders using dictionaries of dictionaries did what I needed. Thank you for your help.
👍🏼 1
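The solution itself is not posted in this thread. Below is a rough sketch of the dictionary-of-dictionaries idea, with hypothetical helper names (`save_nested`, `load_nested`) and plain bytes standing in for audio data. A custom `AbstractDataset` subclass could delegate to helpers like these from its `_save` and `_load` methods.

```python
import tempfile
from pathlib import Path
from typing import Dict


def save_nested(root: Path, data: Dict[str, Dict[str, bytes]], suffix: str = ".WAV") -> None:
    """Write {label: {name: payload}} as root/label/name<suffix>."""
    for label, files in data.items():
        subfolder = root / label
        subfolder.mkdir(parents=True, exist_ok=True)
        for name, payload in files.items():
            (subfolder / f"{name}{suffix}").write_bytes(payload)


def load_nested(root: Path, suffix: str = ".WAV") -> Dict[str, Dict[str, bytes]]:
    """Read root/label/name<suffix> back into {label: {name: payload}}."""
    result: Dict[str, Dict[str, bytes]] = {}
    for subfolder in sorted(p for p in root.iterdir() if p.is_dir()):
        result[subfolder.name] = {
            f.name[: -len(suffix)]: f.read_bytes()
            for f in sorted(subfolder.iterdir())
            if f.is_file() and f.name.endswith(suffix)
        }
    return result


# Round trip in a temporary directory with made-up labels and payloads.
sample = {"dog": {"clip_001": b"abc"}, "cat": {"clip_002": b"xyz"}}
with tempfile.TemporaryDirectory() as tmp:
    save_nested(Path(tmp), sample)
    restored = load_nested(Path(tmp))

print(restored == sample)
```

Building every path with `pathlib` keeps the separators consistent on Windows, which is the point raised earlier in the thread.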
n
Perfect! If you don't mind sharing your solution, it would be great if you could self-answer in that thread.
n
I will right away, since it finally worked. Thanks!
🙏🏼 1