Good morning,
I'm very new to Kedro and machine learning in general, so sorry if I say something stupid. I'm starting a new classification project in which I want to train on a very large dataset. I want the images to be loaded on the fly and managed by TensorFlow as needed, to avoid saturating the RAM, so I will create a tf.data pipeline. My dataset currently consists of the following:
• metadata.csv: contains 2 columns: label and img_path (the path to the corresponding .png file, relative to the location of metadata.csv)
• img/*.png: a subfolder containing all the images
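For concreteness, here are a couple of hypothetical rows of metadata.csv (the labels and filenames are made up):

```
label,img_path
cat,img/0001.png
dog,img/0002.png
```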
In Kedro, I added the dataset under data/01_raw and created a new entry in catalog.yml pointing to the CSV file using the pandas.CSVDataset loader. In my pipeline node, I get the dataset content and start building the tf.data pipeline. I want to map a function (with Dataset.map) that loads each image from disk using tf.io.read_file. But the path I have for each example is only relative to the metadata.csv file, so to load the image I would need to make it absolute, or something like that. So I'm wondering how I can retrieve, from inside the Kedro node, the path to the metadata.csv file so that I can prepend it to the relative image path. I thought of adding a parameter, but that seems a bit silly, as it would duplicate the path between parameters.yml and catalog.yml... Is there a better solution? Or should I restructure my dataset differently?
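To make the problem concrete, here is a minimal sketch in plain Python (the folder name `my_dataset` and the DataFrame contents are made up; in the real project the node receives the DataFrame from the catalog, not a hard-coded one):

```python
from pathlib import Path
import pandas as pd

# Hypothetical stand-in for what pandas.CSVDataset hands the node:
metadata = pd.DataFrame({
    "label": ["cat", "dog"],
    "img_path": ["img/0001.png", "img/0002.png"],  # relative to metadata.csv
})

# This is exactly the piece of information the node does not have: the
# directory containing metadata.csv. Hard-coding it here duplicates what
# catalog.yml already knows.
csv_dir = Path("data/01_raw/my_dataset")

# Prepend the csv's directory so the paths can actually be opened later
# (e.g. by tf.io.read_file inside Dataset.map).
abs_paths = [(csv_dir / p).as_posix() for p in metadata["img_path"]]
```

Inside the actual node, `abs_paths` is what I would feed into `tf.data.Dataset.from_tensor_slices` before mapping the image-loading function over it, but `csv_dir` is the value I don't want to duplicate in parameters.yml.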
Thanks for your opinions,