# questions
p
Hey everyone! I’m new to Kedro, and I first want to thank all the contributors. You’ve genuinely built a fantastic tool! Is it possible to save outputs to multiple data sets? For instance, I’d like to write my feature data both to the local file system and to, say, an Azure blob storage. Thanks 😊
f
Hello, I think the easiest is something along these lines, in catalog.yml:
data_local:
    type: pandas.CSVDataSet
    filepath: data/03_primary/data.csv

data_remote:
    type: pandas.CSVDataSet
    filepath: s3://my-bucket/data/03_primary/data.csv
    credentials: creds
and add a node like so in pipeline.py:
from kedro.pipeline import node, pipeline

def create_pipeline():
    return pipeline([...,  # your pipeline start
        node(lambda df: (df, df), inputs="data_before_save", outputs=["data_local", "data_remote"])])
d
What @FlorianGD suggests is good, because you can parallelize the writes to data_local and data_remote by running the pipeline with --async (which loads the inputs and saves the outputs of a single node in parallel, using threads). If you do this a lot, though, your DAG starts looking a bit cluttered, so it's also possible to handle it with hooks. The idea there is that, from a logic perspective, you don't care that the data is written to both local and cloud storage; that should be handled behind the scenes for the specified datasets, without changing how your DAG looks. Finally, if your requirement is a bit different (e.g. write everything to an Azure blob storage container and replicate it in another container), it's usually more efficient to do that outside of Kedro (i.e. run a process that copies the data after the pipeline run).
p
@FlorianGD Your solution works like a charm, thanks 🤗 @Deepyaman Datta Great, I’ll try out --async and hooks as well.