Hello,
I'm evaluating Kedro for my company; it is currently one of the closest matches to what we need. But I have a question about something very common in our workflow, and I'm not sure how we would implement it in Kedro.
Some of our pipelines start with something like this:
- Download a dataset (between 20 and 100 GB)
- Create a local index of the data in a temporary folder (with Lucene, for example) using a bash command
- Use the index to extract a dataset, again using a bash command
- Remove the temporary local index
- Use the extracted dataset in the subsequent steps (from that point on, Kedro seems to handle our needs)
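To make the shape of this concrete, here is a minimal sketch of steps 2–4 as a single node-style function. Everything here is hypothetical: the actual index/extract commands are placeholders passed in as templates, and the output filename is invented for illustration.

```python
import subprocess
import tempfile
from pathlib import Path


def extract_with_temp_index(raw_data_dir: str, index_cmd: str, extract_cmd: str) -> str:
    """Build a temporary local index, extract a dataset from it,
    and clean the index up automatically on exit.

    index_cmd / extract_cmd are shell command templates, e.g.
    "lucene-index --input {data} --output {index}" (placeholder, not a real CLI).
    """
    out_path = Path(raw_data_dir) / "extracted.parquet"  # hypothetical output file
    with tempfile.TemporaryDirectory() as index_dir:
        # Step 2: build the index in a temp folder
        subprocess.run(
            index_cmd.format(data=raw_data_dir, index=index_dir),
            shell=True, check=True,
        )
        # Step 3: use the index to extract the dataset
        subprocess.run(
            extract_cmd.format(index=index_dir, out=out_path),
            shell=True, check=True,
        )
    # Step 4: leaving the `with` block deletes the temp index.
    # Return a *path* (not the data itself) for the downstream steps.
    return str(out_path)
```

The key point is the return value: we would like the node to hand a local path to the rest of the graph, rather than loading 20–100 GB into memory.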
It is somewhat similar to this Dagster concept:
https://docs.dagster.io/tutorial/assets/non-argument-deps
To summarize, we need:
- Performing operations outside the graph, using the local filesystem
- Instead of loading the data in memory and letting Kedro serialize it to store it on S3 (for example), being able to give Kedro a local path where the data is stored, and letting Kedro pick up that local path and upload it to S3
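For that second point, what we imagine is a dataset whose save step receives a local file path and transfers the file, instead of receiving an in-memory object to serialize. A stdlib-only sketch of the idea (the class name is made up, and `shutil.copy` stands in for the S3 upload; in Kedro this would presumably be a custom dataset implementation doing the transfer via fsspec/boto3):

```python
import shutil
from pathlib import Path


class PathBackedDataset:
    """Sketch: save() takes a local file path and transfers the file
    to a 'remote' location; load() hands back a path, not parsed data."""

    def __init__(self, remote_path: str):
        self._remote_path = Path(remote_path)

    def save(self, local_path: str) -> None:
        # In practice this would be an S3 upload; shutil.copy is a
        # local stand-in for the transfer step.
        self._remote_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(local_path, self._remote_path)

    def load(self) -> str:
        # Downstream consumers get a path rather than deserialized data.
        return str(self._remote_path)
```

Is something along these lines possible with Kedro's dataset abstraction, or is there an idiomatic alternative?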
Thanks!