# questions
Sen
Hi, all. I have a question about how nodes/pipelines read their input datasets. Take the catalog configuration at the following link as an example: I assume a Kedro pipeline reads data from the CSV file stored in Amazon S3 when you specify inputs=["cars"] in the node configuration. If multiple different nodes take "cars" as an input dataset, does the pipeline reuse that dataset from memory, or does it read from Amazon S3 every time a node needs it? https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html#load-multiple-datasets-with-similar-configuration-using-yaml-anchors

And if it does re-read the same dataset from the data source every time it runs the various nodes, is it possible to keep the dataset in memory after the first read (from whatever the data source is, an Amazon S3 CSV file in this case) and reuse it from memory, so that you don't have to read from the data source multiple times and can possibly shorten processing time?
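For context, I'm picturing a catalog entry roughly like this (the bucket path and credentials key are placeholders, not the exact values from the linked docs page):

```yaml
# Rough sketch of the kind of entry I mean -- a CSV stored on S3,
# loaded by any node that lists "cars" in its inputs.
cars:
  type: pandas.CSVDataset
  filepath: s3://my-bucket/data/01_raw/cars.csv
  credentials: dev_s3
  load_args:
    sep: ','
```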
Ravi Kumar Pilla
Hi Sen, I am not sure if the datasets are cached after the initial load. Let me get back to you on this. Thank you
I could not find any docs about caching datasets for reuse, but I think we do reuse the same datasets within a single pipeline run. Also, we have https://docs.kedro.org/en/stable/api/kedro.io.CachedDataset.html. @Ankita Katiyar, have you come across any doc that mentions dataset reuse within a pipeline run?
Ankita Katiyar
I believe they are read from and saved to S3 each time; to avoid this, CachedDataset is the right call, like you said ^
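Roughly something like this in the catalog (filepath and credentials are just placeholders), so the S3 file is read only on the first load and later loads come from memory:

```yaml
# Sketch only -- wraps the S3 CSV dataset in a CachedDataset so subsequent
# nodes that take "cars" as input reuse the in-memory copy.
cars:
  type: CachedDataset
  dataset:
    type: pandas.CSVDataset
    filepath: s3://my-bucket/data/01_raw/cars.csv
    credentials: dev_s3
```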
Sen
@Ravi Kumar Pilla @Ankita Katiyar Thank you both for the replies. I'll play around with it!