Patrick Deutschmann01/27/2023, 1:19 PM
FlorianGD01/27/2023, 1:20 PM
You can declare the dataset twice in your catalog and add a node like so in your pipeline:
```yaml
data_local:
  type: pandas.CSVDataSet
  filepath: data/03_primary/data.csv

data_remote:
  type: pandas.CSVDataSet
  filepath: s3://my-bucket/data/03_primary/data.csv
  credentials: creds
```
```python
from kedro.pipeline import node, pipeline

def create_pipeline():
    return pipeline(
        [
            ...,  # your existing pipeline nodes
            node(
                lambda df: (df, df),
                inputs="data_before_save",
                outputs=["data_local", "data_remote"],
            ),
        ]
    )
```
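[Editor's note: one caveat not raised in the thread — Kedro's `ParallelRunner` uses multiprocessing and cannot pickle lambdas, so if you run with it, swap the lambda for a named function. A minimal sketch, with the same illustrative dataset names as above:]

```python
def duplicate(df):
    # Returning the data twice makes Kedro save it to both catalog
    # entries listed in the node's outputs.
    return df, df

# Then in the pipeline definition:
# node(duplicate, inputs="data_before_save",
#      outputs=["data_local", "data_remote"])
```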
Deepyaman Datta01/27/2023, 1:49 PM
You can make the duplicated save cheaper by running the pipeline with the `--async` flag (reads inputs or writes outputs for a single node in parallel, using threads). If you do this a lot, your DAG starts looking a bit ugly, so it's also possible to do this using hooks. The idea there is that, from a logic perspective, you don't care that the data gets written to both local and cloud storage, and that should be handled on the backend for specified nodes without making your DAG look different. Finally, if your requirement is a bit different (e.g. write everything to an Azure Blob Storage container and replicate it in another container), it's usually more efficient to do this outside of Kedro (i.e. run a process to copy the data after the pipeline runs).
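[Editor's note: a minimal sketch of the hooks idea, not from the thread itself. It assumes Kedro 0.18's hook API and reuses the example dataset names from above; the class name and replica mapping are made up:]

```python
from kedro.framework.hooks import hook_impl

class ReplicateOutputsHooks:
    # Hypothetical mapping: node output -> extra catalog entry to write to.
    # Both names must exist in the catalog (e.g. data_remote above).
    REPLICAS = {"data_local": "data_remote"}

    @hook_impl
    def after_node_run(self, node, catalog, outputs):
        # After each node runs, re-save selected outputs to their remote
        # twin, so the DAG itself only ever mentions the local dataset.
        for name, data in outputs.items():
            if name in self.REPLICAS:
                catalog.save(self.REPLICAS[name], data)
```

[Register it in `settings.py` with `HOOKS = (ReplicateOutputsHooks(),)`. The async mode mentioned above is just `kedro run --async`, and the out-of-band copy could be as simple as an `aws s3 sync` (or `azcopy`) job scheduled after the pipeline run.]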
Patrick Deutschmann01/27/2023, 1:52 PM
Thanks, I'll check out `--async` and hooks as well.