Hello,
I'm evaluating Kedro for my company; it is currently one of the closest matches to what we need. But I have a question about something very common in our workflow, and I'm not sure how we would implement it in Kedro.
Some of our pipelines start with something like this:
- Download a dataset (between 20 and 100 GB)
- Create a local index of the data in a temporary folder (with Lucene, for example) using a bash command
- Use the index to extract a dataset, again using a bash command
- Remove the temporary local index
- Use the extracted dataset in the subsequent steps (from that point on, Kedro seems to handle our needs)
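To make the shape of this concrete, here is a minimal sketch of steps 2–4 as a single node-style function. Everything here is hypothetical: the actual index/extract commands are placeholders passed in as templates, and the output filename is invented for illustration.

```python
import subprocess
import tempfile
from pathlib import Path


def extract_with_temp_index(raw_data_dir: str, index_cmd: str, extract_cmd: str) -> str:
    """Build a temporary local index, extract a dataset from it,
    and clean the index up automatically on exit.

    index_cmd / extract_cmd are shell command templates, e.g.
    "lucene-index --input {data} --output {index}" (placeholder, not a real CLI).
    """
    out_path = Path(raw_data_dir) / "extracted.parquet"  # hypothetical output file
    with tempfile.TemporaryDirectory() as index_dir:
        # Step 2: build the index in a temp folder
        subprocess.run(
            index_cmd.format(data=raw_data_dir, index=index_dir),
            shell=True, check=True,
        )
        # Step 3: use the index to extract the dataset
        subprocess.run(
            extract_cmd.format(index=index_dir, out=out_path),
            shell=True, check=True,
        )
    # Step 4: leaving the `with` block deletes the temp index.
    # Return a *path* (not the data itself) for the downstream steps.
    return str(out_path)
```

The key point is the return value: we would like the node to hand a local path to the rest of the graph, rather than loading 20–100 GB into memory.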
It is somewhat similar to this Dagster concept:
https://docs.dagster.io/tutorial/assets/non-argument-deps
To summarize, we need:
- Performing operations outside the graph, using the local filesystem
- Instead of loading the data in memory and letting Kedro serialize it to store it on S3 (for example), being able to give Kedro a local path where the data is stored, and letting Kedro pick up that local path and upload it to S3
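For that second point, what we imagine is a dataset whose save step receives a local file path and transfers the file, instead of receiving an in-memory object to serialize. A stdlib-only sketch of the idea (the class name is made up, and `shutil.copy` stands in for the S3 upload; in Kedro this would presumably be a custom dataset implementation doing the transfer via fsspec/boto3):

```python
import shutil
from pathlib import Path


class PathBackedDataset:
    """Sketch: save() takes a local file path and transfers the file
    to a 'remote' location; load() hands back a path, not parsed data."""

    def __init__(self, remote_path: str):
        self._remote_path = Path(remote_path)

    def save(self, local_path: str) -> None:
        # In practice this would be an S3 upload; shutil.copy is a
        # local stand-in for the transfer step.
        self._remote_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(local_path, self._remote_path)

    def load(self) -> str:
        # Downstream consumers get a path rather than deserialized data.
        return str(self._remote_path)
```

Is something along these lines possible with Kedro's dataset abstraction, or is there an idiomatic alternative?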
Thanks!