#questions

Marc Gris

11/23/2023, 7:50 AM
Hi everyone, I hope you’re all doing well! As often, I’m struggling a little with something that, I am sure, is completely trivial 😅

I’m experimenting with a `PartitionedDataset` of `ImageDataset` to handle the classic use case of downloading thousands of images from different locations and centralizing them in S3.

I first went with the “eager” option and created a node that returned a dict mapping each filename to an `Image`. I quickly realized that this was not ideal: if a single download failed, the node and pipeline would fail, and all the images that had been successfully downloaded were lost.

I therefore switched to the “lazy” option and created a node that returned a dict mapping each filename to a callable that returns an `Image`. But here again I am facing an issue: in case of failure, the pipeline crashes with a `DatasetError: Saving 'None' to a 'Dataset' is not allowed`.

Granted, there is a silver lining: what has already been downloaded is not lost. But still, not great.

So, what is the “proper” / “kedro-ic” way of doing such tasks?

Many thanks in advance for your help,
Regards,
Marc
👀 1

Nok Lam Chan

11/23/2023, 8:08 AM
Hi Marc, what is the desired behavior without Kedro? Do you want to skip the failing download or do you want it to retry until success?

Marc Gris

11/23/2023, 8:25 AM
Hi @Nok Lam Chan Thanks for your message. Ideally… Retry x times and in case of “final” failure: skip & log 🙂
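For reference, the "retry x times, then skip & log" behavior can be sketched in plain Python, independent of Kedro. This is only an illustration under assumptions: `with_retries`, `fetch`, `name`, `attempts`, and `delay` are hypothetical names, not part of any Kedro API, and `fetch` stands in for whatever zero-argument function downloads one image.

```python
import logging
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def with_retries(
    fetch: Callable[[], T],
    name: str,
    attempts: int = 3,
    delay: float = 0.0,
) -> Callable[[], Optional[T]]:
    """Wrap a zero-argument loader so it retries, then gives up with None.

    `fetch` is a hypothetical download function; `name` is only used in logs.
    """

    def wrapped() -> Optional[T]:
        for attempt in range(1, attempts + 1):
            try:
                return fetch()
            except Exception as exc:  # broad on purpose: any failure triggers a retry
                logger.warning(
                    "Attempt %d/%d failed for %s: %s", attempt, attempts, name, exc
                )
                if attempt < attempts:
                    time.sleep(delay)
        # Final failure: log it and signal "skip" by returning None.
        logger.error("Giving up on %s after %d attempts", name, attempts)
        return None

    return wrapped
```

A node could then return something like `{filename: with_retries(loader, filename) for ...}`. Note that a loader returning `None` is exactly what triggers the `DatasetError` quoted earlier in the thread, so whatever saves the partitions still has to skip those entries.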

Nok Lam Chan

11/23/2023, 8:30 AM
The easiest way to do this is to roll your own `PartitionedDataset` or `ImageDataSet`.
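As a rough idea of what such a custom dataset's save step might do, here is a framework-free sketch of a save loop that resolves lazy partitions and skips the ones that produced nothing. It does not reproduce Kedro's internals: `save_partitions` and `save_one` are hypothetical names, and `save_one` stands in for "write this one partition to its destination" (for example, saving one image to S3).

```python
import logging
from typing import Any, Callable, Dict, List

logger = logging.getLogger(__name__)


def save_partitions(
    partitions: Dict[str, Any],
    save_one: Callable[[str, Any], None],
) -> List[str]:
    """Save each partition, skipping those whose data resolves to None.

    Returns the ids of the skipped partitions so the caller can report them.
    """
    skipped: List[str] = []
    for partition_id, data in sorted(partitions.items()):
        if callable(data):
            data = data()  # lazy partition: resolve it only now
        if data is None:
            # Assumption: None means "download failed for good", so skip and log.
            logger.warning("Skipping partition %s: nothing to save", partition_id)
            skipped.append(partition_id)
            continue
        save_one(partition_id, data)
    return skipped
```

In a custom dataset this logic would presumably live in the method that performs the save, so that one failed download neither aborts the run nor discards the partitions already written.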

Marc Gris

11/23/2023, 8:34 AM
OK. I’ll go that way then 🙂 👍🏼 Thx Nok.
🙂 1