# questions
d
Hi, I have a problem with creating an API and a deployment with Kubeflow. https://docs.kedro.org/en/0.18.1/deployment/kubeflow.html says: "All node input/output DataSets must be configured in catalog.yml and refer to an external location (e.g. AWS S3); you cannot use the MemoryDataSet in your workflow." Let's assume that a request has to fire two nodes and there is data which is the output of the first node and the input of the second one. If a few requests came in at the same time, there is a chance that one request would use data from another (if only one file is used to pass data between the nodes). What would be the best solution for this problem?
n
I am no expert on this, but you may find some inspiration in this community plug-in: https://pypi.org/project/kedro-kubeflow/ @marrrcin, who built this plug-in, may have a better answer 🙂
👍 1
m
I’m not the only one who built it, but yeah - we can help with that plugin if you encounter any issues 🙂
thankyou 1
Let’s assume that a request has to fire two nodes and there is data which is the output of the first node and the input of the second one.
If a few requests came in at the same time, there is a chance that one request would use data from another (if only one file is used to pass data between the nodes).
If I follow that correctly, then you have the following node graph: (node A) -> output data -> (node B). Node B will not execute until node A has finished. What do you mean by “a few requests at the same time”?
d
Yes, node B wouldn't be executed before node A ends. Maybe I will write an example with timestamps:

Request 1:
node A -> start: 120000, end: 120005
saving output data -> start: 120007, end: 120010
node B -> start: 120012, end: 120015

Request 2:
node A -> start: 120000, end: 120005
saving output data -> start: 120007, end: 120011
node B -> start: 120013, end: 120015

I assume that there is some time between saving the data and executing the next node. In this case node B from request 1 would use the data saved by request 2.

EDIT: there were wrong timestamps
m
But what are the “requests” here? Do you mean pipeline executions in KFP?
d
I mean HTTP requests, which fire the Kedro pipeline.
m
How does Kubeflow fit in here then?
d
It requires using MemoryDataset, which is not supported by Kubeflow. I want a request from FastAPI to start the Kubeflow pipeline, but it can't do that because of the MemoryDataset (otherwise there is the problem which I mentioned).
m
1. Do you really need FastAPI just to trigger the KFP pipeline? KFP has its own API that can be invoked (e.g. to start a pipeline run) - see the sketch after this message.
2. You can store the intermediate data in a location that is run-isolated, e.g. based on the dsl.RUN_ID_PLACEHOLDER, which can be passed to the Kedro pipeline as an environment variable. Your catalog will look more or less like this:
intermediate_data:
    type: <type of ds>
    filepath: s3://<bucket>/${run_id}/output/file.dat
You will need to inject the run_id from an environment variable, using TemplatedConfigLoader + globals, like this: https://kedro-org.slack.com/archives/C03RKP2LW64/p1676881692806879?thread_ts=1676863743.414139&cid=C03RKP2LW64 - there is also a settings.py sketch after this message. (If you don’t want to use dsl.RUN_ID_PLACEHOLDER, you can just as well generate a random UUID on your own.)
---
Another option would be to use versioned datasets for everything (if you can) - then every run of the Kedro pipeline gets its own unique timestamp and the datasets are saved into subfolders with that date.
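A minimal sketch of that settings.py part, assuming Kedro 0.18.x and assuming the run id is exposed to the pipeline pods in an environment variable named KEDRO_RUN_ID (the variable name is just an example, use whatever you pass dsl.RUN_ID_PLACEHOLDER under):

# src/<package_name>/settings.py
import os

from kedro.config import TemplatedConfigLoader

CONFIG_LOADER_CLASS = TemplatedConfigLoader
CONFIG_LOADER_ARGS = {
    # Values in globals_dict become ${...} placeholders in catalog.yml,
    # so ${run_id} in the filepath above resolves to this value.
    "globals_dict": {"run_id": os.environ.get("KEDRO_RUN_ID", "local")},
}

With that in place, each run writes its intermediate data under its own s3://<bucket>/<run id>/... prefix, so concurrent runs do not overwrite each other's files.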
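And for point 1, a rough sketch of triggering a run directly through the KFP SDK instead of wrapping it in FastAPI; the host URL, package path and run name below are placeholders, not values from this thread:

import kfp

# Adjust the host to your Kubeflow Pipelines endpoint.
client = kfp.Client(host="http://ml-pipeline-ui.kubeflow:80")

# pipeline.yaml is a compiled pipeline definition (e.g. produced by the
# kedro-kubeflow plugin); arguments holds any pipeline parameters.
result = client.create_run_from_pipeline_package(
    "pipeline.yaml",
    arguments={},
    run_name="kedro-run-from-sdk",
)
print(result.run_id)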
👍 1
d
Thank you very much.