Hello all!
I have files constantly being pushed to an S3 bucket, and I need to preprocess those files regularly and concatenate them with the ones that have already been preprocessed.
I found IncrementalDataset in the docs, and it seems like a step in the right direction since it maintains a checkpoint of the files already processed. I can create a node that runs on a schedule to pick up the new files and concatenate them into a parquet file. However, every time it runs, it concatenates the new files into a parquet file that overwrites the previous one. Is it possible to give the output a new name on each run (e.g. using a timestamp prefix), so that I can use a PartitionedDataset as an input to my data science pipeline to gather all those files?
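Here is a rough sketch of what I'm picturing, in case it helps clarify the question. The dataset names (`raw_incoming`, `preprocessed_chunks`), the S3 paths, and the catalog entries in the comments are just placeholders, and I'm not sure whether the timestamped-key / `confirms` combination below is the idiomatic way to do this:

```python
# Assumed catalog.yml entries (placeholder names and paths):
#
#   raw_incoming:
#     type: partitions.IncrementalDataset
#     path: s3://my-bucket/incoming/
#     dataset: pandas.CSVDataset
#
#   preprocessed_chunks:
#     type: partitions.PartitionedDataset
#     path: s3://my-bucket/preprocessed/
#     dataset: pandas.ParquetDataset
#     filename_suffix: ".parquet"

from datetime import datetime, timezone

import pandas as pd
from kedro.pipeline import node, pipeline


def preprocess_new_files(new_partitions: dict[str, pd.DataFrame]) -> dict[str, pd.DataFrame]:
    """Concatenate whatever arrived since the last checkpoint and return it
    under a timestamped partition key, so each scheduled run adds a new
    parquet file instead of overwriting the previous one."""
    if not new_partitions:
        # Guard in case nothing new has arrived since the last checkpoint.
        return {}
    combined = pd.concat(new_partitions.values(), ignore_index=True)
    # The dict key becomes the partition name under preprocessed_chunks.
    partition_key = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    return {partition_key: combined}


preprocessing_pipeline = pipeline(
    [
        node(
            preprocess_new_files,
            inputs="raw_incoming",
            outputs="preprocessed_chunks",
            confirms="raw_incoming",  # advance the IncrementalDataset checkpoint
        )
    ]
)
```

The data science pipeline would then take `preprocessed_chunks` as a regular PartitionedDataset input and load all the accumulated parquet files.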
I feel like this is roughly what Kedro versioning is for, except that versioning creates a new folder on each run, so I wouldn't be able to read the results back with a PartitionedDataset.
How would a Kedro expert implement such a pipeline? Thanks!