Patrick Deutschmann
01/27/2023, 1:19 PM
FlorianGD
01/27/2023, 1:20 PM
In catalog.yml:

```yaml
data_local:
  type: pandas.CSVDataSet
  filepath: data/03_primary/data.csv

data_remote:
  type: pandas.CSVDataSet
  filepath: s3://my-bucket/data/03_primary/data.csv
  credentials: creds
```
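The creds key referenced above would live in a credentials file such as conf/local/credentials.yml. A minimal sketch with placeholder values (the key/secret field names follow fsspec's s3fs conventions, which Kedro forwards the credentials dict to; your storage may need different fields):

```yaml
# conf/local/credentials.yml -- placeholder values, never commit real keys
creds:
  key: YOUR_AWS_ACCESS_KEY_ID
  secret: YOUR_AWS_SECRET_ACCESS_KEY
```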
and add a node like so in pipeline.py:

```python
from kedro.pipeline import node, pipeline

def create_pipeline():
    return pipeline([
        ...,  # your pipeline start
        node(lambda df: (df, df), inputs="data_before_save",
             outputs=["data_local", "data_remote"]),
    ])
```
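The lambda simply returns its input twice, so Kedro saves the same object under both catalog entries. One practical note: a named function keeps the node picklable (lambdas cannot be pickled, which matters e.g. under ParallelRunner). A minimal framework-free sketch of the fan-out, with `duplicate` a hypothetical name:

```python
def duplicate(df):
    """Identity fan-out: return the input twice so a pipeline framework
    can persist it to two outputs (e.g. data_local and data_remote)."""
    return df, df

# Equivalent node: node(duplicate, inputs="data_before_save",
#                       outputs=["data_local", "data_remote"])
local_out, remote_out = duplicate([1, 2, 3])
assert local_out is remote_out  # same object, saved to two destinations
```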
Deepyaman Datta
01/27/2023, 1:49 PM
You can save data_local and data_remote in parallel by running the pipeline with --async (reads inputs or writes outputs for a single node in parallel, using threads).
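Conceptually, --async does something like the following for a node's outputs; this is a simplified standard-library stand-in (the real logic lives inside Kedro's runner), with `save_csv` a hypothetical stand-in for a dataset's save method:

```python
import csv
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def save_csv(path, rows):
    """Stand-in for a dataset save(); I/O-bound, so threads overlap well."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return path

rows = [["a", "b"], [1, 2]]
tmp = Path(tempfile.mkdtemp())
targets = [tmp / "local.csv", tmp / "remote.csv"]  # imagine one is on S3

# Write both outputs concurrently instead of one after the other.
with ThreadPoolExecutor() as pool:
    saved = list(pool.map(lambda p: save_csv(p, rows), targets))

assert all(p.exists() for p in saved)
```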
If you do this a lot, your DAG starts looking a bit ugly, so it's possible to do this using hooks. The idea here would be that you don't care from a logic perspective that it's getting written to local and cloud storage, and that should be handled on the backend for specified nodes without making your DAG look different.
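The hook idea could hang off Kedro's after_dataset_saved hook spec, which receives at least the dataset name and the just-saved data (the exact signature varies across Kedro versions); in a real project the method would be decorated with @hook_impl from kedro.framework.hooks and the class registered in settings.py. A framework-free sketch, with `mirror_paths` a hypothetical mapping:

```python
import csv
import tempfile
from pathlib import Path

class MirrorOnSaveHook:
    """After selected datasets are saved, write the same data to a second
    location -- without the extra output ever appearing in the DAG."""

    def __init__(self, mirror_paths):
        # dataset name -> extra path to write to (hypothetical mapping)
        self.mirror_paths = mirror_paths

    def after_dataset_saved(self, dataset_name, data):
        """Mirrors the shape of Kedro's after_dataset_saved hook spec."""
        target = self.mirror_paths.get(dataset_name)
        if target is None:
            return  # dataset not selected for mirroring: no-op
        target = Path(target)
        target.parent.mkdir(parents=True, exist_ok=True)
        with open(target, "w", newline="") as f:
            csv.writer(f).writerows(data)

# Demo: only "data_local" is configured for mirroring.
mirror = Path(tempfile.mkdtemp()) / "mirror" / "data.csv"
hook = MirrorOnSaveHook({"data_local": mirror})
hook.after_dataset_saved("data_local", [["a", "b"], [1, 2]])
hook.after_dataset_saved("other_ds", [["ignored"]])  # not in the map
assert mirror.exists()
```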
Finally, if your requirement was a bit different (e.g. write everything to Azure blob storage bucket and replicate in another bucket), it's usually more efficient to do this outside of Kedro (i.e. run a process to copy the data after the pipeline runs).
Patrick Deutschmann
01/27/2023, 1:52 PM
Thanks, I'll look into --async and hooks as well.
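Deepyaman's last suggestion (replicate outside of Kedro once the run is done) can be as simple as a copy step after the pipeline finishes. A local stand-in sketch; with real cloud storage you would typically reach for the provider's tooling (e.g. aws s3 sync or azcopy) instead of shutil:

```python
import shutil
import tempfile
from pathlib import Path

def replicate(src_dir, dst_dir):
    """Copy every pipeline output from src_dir into dst_dir after the run."""
    shutil.copytree(src_dir, dst_dir, dirs_exist_ok=True)

# Demo with temp directories standing in for the two storage locations.
src = Path(tempfile.mkdtemp())
(src / "data.csv").write_text("a,b\n1,2\n")
dst = Path(tempfile.mkdtemp()) / "replica"
replicate(src, dst)
assert (dst / "data.csv").read_text() == "a,b\n1,2\n"
```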