# questions

Using the latest version of `0.18`, I found out that when you save intermediate data, Kedro loads the data from file storage again instead of using the dataset already in memory. This wastes precious I/O operations in my pipeline run. Is there a specific reason why it was implemented this way?
As far as I know, it's always been this way. This implements the logic you want.
@Matthias Roels I never liked this default:
• Cache validation / idempotent pipelines are hard: with some datasets, like pandas CSV, loading right after saving gives you a different dataframe from the one you had in memory. Using a stronger data type should avoid that.
• In general you save only the data you need, and saving itself incurs some I/O anyway (this could be mitigated with async etc.).
• `CachedDataset` resolves this partly, but it was regarded as an experimental feature and designed with interactive workflows in mind.
This is not ideal in my opinion, but hopefully it explains things a little bit.
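The CSV round-trip issue mentioned above is easy to demonstrate with plain pandas (using an in-memory buffer in place of a real file):

```python
# Demonstrates why cache validation is hard with weakly-typed formats:
# a DataFrame saved to CSV and loaded back is not identical to the
# in-memory original, because read_csv re-infers dtypes from text.
import io

import pandas as pd

df = pd.DataFrame(
    {
        "id": ["001", "002"],  # zero-padded string IDs
        "when": pd.to_datetime(["2023-01-01", "2023-01-02"]),
    }
)

buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
roundtrip = pd.read_csv(buf)

print(df["id"].tolist())         # ['001', '002']
print(roundtrip["id"].tolist())  # [1, 2] — leading zeros lost, strings became ints
```

The datetime column suffers the same fate (it comes back as plain strings), which is why a stronger format like Parquet avoids the problem.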
Please open an issue/discussion thread if you have opinions on this. I'd love to see more of a push for performance optimisation.