# questions
d
Hi, I have a problem with creating an API and a deployment with Kubeflow. https://docs.kedro.org/en/0.18.1/deployment/kubeflow.html says: "All node input/output DataSets must be configured in catalog.yml and refer to an external location (e.g. AWS S3); you cannot use the MemoryDataSet in your workflow." Let's assume that a request has to fire two nodes and there is data which is the output of the first node and the input of the second one. If a few requests came in at the same time, there is a chance that one request would use data from another (if only one file is used to pass data between the nodes). What would be the best solution for this problem?
n
I am no expert on this, but you may find some inspiration in this community plug-in: https://pypi.org/project/kedro-kubeflow/ @marrrcin, who built this plug-in, may have a better answer 🙂
👍 1
m
I’m not the only one who built it, but yeah - we can help with that plugin if you encounter any issues 🙂
thankyou 1
Let’s assume that a request has to fire two nodes and there is data which is the output of the first node and the input of the second one.
If a few requests came in at the same time, there is a chance that one request would use data from another (if only one file is used to pass data between the nodes).
If I follow that correctly, then you have the following node graph: (node A) -> output data -> (node B). Node B will not execute until node A has finished. What do you mean by “a few requests at the same time”?
d
Yes, node B wouldn't be executed before node A ends. Maybe I will write an example with timestamps:

Request 1:
node A -> start: 120000, end: 120005
saving output data -> start: 120007, end: 120010
node B -> start: 120012, end: 120015

Request 2:
node A -> start: 120000, end: 120005
saving output data -> start: 120007, end: 120011
node B -> start: 120013, end: 120015

I assume that there is some time between saving the data and executing the next node. In this case node B from request 1 would use the data saved by request 2.

EDIT: there were wrong timestamps
m
But what are the “requests” here? Do you mean pipeline executions in KFP?
d
I mean HTTP requests, which fire the Kedro pipeline.
m
How does Kubeflow fit in here then?
d
It requires using MemoryDataset, which is not supported by Kubeflow. I want a request from FastAPI to start the Kubeflow pipeline, but it can't do that because of the MemoryDataset (otherwise there is the problem which I mentioned).
m
1. Do you really need FastAPI just to trigger the KFP pipeline? KFP has its own API that can be invoked (e.g. to start a pipeline run) - see the sketch after this message.
2. You can store the intermediate data in a location that is run-isolated, e.g. based on the dsl.RUN_ID_PLACEHOLDER, which can be passed to the Kedro pipeline as an environment variable. Your catalog will look more or less like this:
intermediate_data:
    type: <type of ds>
    filepath: s3://<bucket>/${run_id}/output/file.dat
You will need to inject the run_id from an environment variable, using TemplatedConfigLoader + globals, like this: https://kedro-org.slack.com/archives/C03RKP2LW64/p1676881692806879?thread_ts=1676863743.414139&cid=C03RKP2LW64 - there is also a settings.py sketch after this message. (If you don’t want to use dsl.RUN_ID_PLACEHOLDER, you can just as well generate a random UUID on your own.)
---
Another option would be to use versioned datasets for everything (if you can) - then every run of the Kedro pipeline gets its own unique timestamp and the datasets are saved into subfolders with that date.
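A minimal sketch of that settings.py part, assuming Kedro 0.18.x and assuming the run id is exposed to the pipeline pods in an environment variable named KEDRO_RUN_ID (the variable name is just an example, use whatever you pass dsl.RUN_ID_PLACEHOLDER under):

# src/<package_name>/settings.py
import os

from kedro.config import TemplatedConfigLoader

CONFIG_LOADER_CLASS = TemplatedConfigLoader
CONFIG_LOADER_ARGS = {
    # Values in globals_dict become ${...} placeholders in catalog.yml,
    # so ${run_id} in the filepath above resolves to this value.
    "globals_dict": {"run_id": os.environ.get("KEDRO_RUN_ID", "local")},
}

With that in place, each run writes its intermediate data under its own s3://<bucket>/<run id>/... prefix, so concurrent runs do not overwrite each other's files.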
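And for point 1, a rough sketch of triggering a run directly through the KFP SDK instead of wrapping it in FastAPI; the host URL, package path and run name below are placeholders, not values from this thread:

import kfp

# Adjust the host to your Kubeflow Pipelines endpoint.
client = kfp.Client(host="http://ml-pipeline-ui.kubeflow:80")

# pipeline.yaml is a compiled pipeline definition (e.g. produced by the
# kedro-kubeflow plugin); arguments holds any pipeline parameters.
result = client.create_run_from_pipeline_package(
    "pipeline.yaml",
    arguments={},
    run_name="kedro-run-from-sdk",
)
print(result.run_id)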
👍 1
d
Thank you very much.