Dawid Bugajny
04/03/2023, 12:23 PM
Nok Lam Chan
04/03/2023, 12:42 PM
marrrcin
04/03/2023, 12:50 PMLet’s assume that a request has to fire two nodes and there is data which is output from the first node and input in the second one.
If there would be few requests in one time, there is a chance that one request will use data from another (in case of using only one file to save data between nodes).If I follow that correctly, then you have the following node graph: (node A) -> output data -> (node B) The (node B) will not execute until node A will finish. What do you mean by “If there would be few requests in one time”?
Dawid Bugajny
04/03/2023, 1:02 PM
marrrcin
04/03/2023, 1:05 PM
Dawid Bugajny
04/03/2023, 1:06 PM
marrrcin
04/03/2023, 1:07 PM
Dawid Bugajny
04/03/2023, 1:19 PM
marrrcin
04/03/2023, 2:23 PM
dsl.RUN_ID_PLACEHOLDER can be passed to the Kedro pipeline as an environment variable. Your catalog will look more or less like this:
intermediate_data:
  type: <type of ds>
  filepath: s3://<bucket>/${run_id}/output/file.dat
You will need to inject the run_id from an environment variable, using TemplatedConfigLoader + globals, like this:
https://kedro-org.slack.com/archives/C03RKP2LW64/p1676881692806879?thread_ts=1676863743.414139&cid=C03RKP2LW64
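The TemplatedConfigLoader + globals setup described above can be sketched roughly like this (a sketch only, assuming a Kedro 0.18.x project; the env var name KEDRO_RUN_ID is an illustrative assumption, not from the thread):

```python
# src/<your_package>/settings.py -- sketch, assuming Kedro 0.18.x
import os

from kedro.config import TemplatedConfigLoader

CONFIG_LOADER_CLASS = TemplatedConfigLoader
CONFIG_LOADER_ARGS = {
    # Entries in globals_dict become available as ${...} placeholders
    # in conf/*/catalog.yml, so ${run_id} resolves at load time.
    "globals_dict": {
        # KEDRO_RUN_ID is a hypothetical variable name; set it from
        # dsl.RUN_ID_PLACEHOLDER (or your own id) when launching the run.
        "run_id": os.environ.get("KEDRO_RUN_ID", "local-run"),
    },
}
```

With this in place, each run writes its intermediate data under its own ${run_id} prefix, so concurrent requests no longer overwrite each other's files.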
(If you don’t want to use dsl.RUN_ID_PLACEHOLDER, you can just as well generate a random UUID on your own.)
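Generating such a run id yourself is a one-liner with the standard library (KEDRO_RUN_ID is an illustrative env var name, matching nothing specific in the thread):

```python
import os
import uuid

# Reuse the orchestrator-provided id if present, otherwise generate one.
# KEDRO_RUN_ID is a hypothetical variable name used for illustration.
run_id = os.environ.get("KEDRO_RUN_ID") or uuid.uuid4().hex

# Export it so the config loader (globals_dict) can pick it up.
os.environ["KEDRO_RUN_ID"] = run_id
print(run_id)
```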
---
Another option would be to use a versioned dataset for everything (if you can) - then every run of the Kedro pipeline will get its own unique timestamp and the datasets will be saved into subfolders with that date.
Dawid Bugajny
04/03/2023, 2:46 PM
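For reference, the versioned-dataset option mentioned above is a one-line change per catalog entry; a sketch (dataset type, bucket, and file names are placeholders):

```yaml
intermediate_data:
  type: pandas.CSVDataSet
  filepath: s3://<bucket>/output/file.csv
  versioned: true
# Each `kedro run` then saves under its own timestamped subfolder,
# e.g. .../file.csv/<run timestamp>/file.csv
```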