# plugins-integrations
v
Hello everyone, this is regarding the kubeflow plugin. I wanted to gain some information about how kedro nodes are executed by kubeflow. Does kubeflow run each node in a separate container? Or in separate pods? Or are all of the nodes executed in the same container?
@Artur Dobrogowski any idea on this?
a
afaik the default is separate pods
I added a feature to group nodes during the translation process so that some of them run together in the same pod where it makes sense, but I don't remember if I ported it to the kubeflow plugin
👍 1
v
@Artur Dobrogowski is the kubeflow plugin not maintained anymore? Is it a bad choice to pick kubeflow for deploying a kedro pipeline in 2024?
a
it is on life support for now, as in not actively maintained, but we try our best to update the versions of dependencies and work on it when we have more resources, which for now we don't have
We've kinda put it on lower priority since kubeflow as an ecosystem has been declining in popularity
👍 1
v
Would you recommend some other orchestrator for deploying our kedro pipelines?
a
it really depends on your ecosystem/constraints
which cloud, on prem or not, what you already have in place
if you are not limited in any way (don't have anything in place yet) then it would depend on the planned scale of your project
still, among open-source and self-managed options, kubeflow and airflow with kubernetes seem like the best choices for kedro for now
regarding node grouping - I haven't ported it yet (https://github.com/getindata/kedro-kubeflow/issues/262), but a draft of it is present in other plugins, so I could maybe find some time to do it or guide someone else willing to put work into it
The main problem with this plugin and its maintenance for us is:
• we're short on people and resources to actively maintain it right now
• we've reclaimed the resources for the kubeflow cluster and need some time to rebuild the environment to test changes properly; the plan is to do it in GitHub Actions with a kubeflow cluster in minikube spun up on the fly
If someone else is willing to help keep it alive, that is welcome and I can help with guidance as time allows
v
So we have a self-managed kubeflow service already running on AWS EKS and we are thinking of using the kubeflow plugin to publish our kedro pipelines. I do have a couple of questions regarding running kedro pipelines on kubeflow:
1. Do I need to have a working knowledge of kubeflow for running kedro pipelines on kubeflow?
2. Each of our nodes generates some output files, and these are intermediate files that will be used by some other node in the pipeline. When I run these pipelines locally I can either treat them as MemoryDatasets or save them in data/, and all the nodes can access these files easily. What is a suggested strategy to handle this when the pipelines are deployed on kubeflow, given that the nodes will not all be running in the same pod?
a
1. probably not, but it helps to see the whole picture and know how to optimize the workflow
2. the intermediate files need to be saved in buckets/cloud storage, or with the grouping feature you can group the nodes together and keep them as memory datasets
👍 1
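For reference, a minimal catalog sketch of the first option (persisting an intermediate output to a bucket instead of a MemoryDataset); the bucket, path and dataset name are placeholders:
```yaml
# conf/base/catalog.yml - hypothetical entry; the dataset class name may differ
# depending on your kedro-datasets version (e.g. pandas.ParquetDataSet in older releases)
intermediate_features:
  type: pandas.ParquetDataset
  filepath: s3://my-bucket/data/02_intermediate/features.parquet
  versioned: true
```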
I can try to see what resources we have and maybe bump the priority in porting the grouping feature
And yeah, without it I'm aware that it's a huge pain to have everything run in separate pods
as it's at odds with the principle you'd like to keep in kedro of having nodes simple and atomic
you don't want to pull a docker image and spin up a pod just to add 2 numbers together or extract some params
😅 1
v
@Artur Dobrogowski so yes, I am also trying to save the outputs on S3. It works out of the box, versioning them as well, so whenever the next node fetches the same dataset it gets the latest version. One quick question here: in one of our use cases multiple users will be creating pipeline runs from the kubeflow UI. I wanted to understand how to save them on S3 so that each run uses its own generated intermediate files and not the ones generated by some other run. Think of it like we have launched parallel runs for a pipeline - r1, r2, r3 - how can we ensure they do not mix the intermediate files?
a
I am really not familiar with kubeflow ui, or it has been so long since I was last using it so I forget what it looks like
but I'd be surprised if there are no options to pass any parameters or environment variables to the runs
v
Yeah you can ignore the UI for now. But yeah, we can surely pass parameters from the kubeflow UI, not sure about env variables though.
a
you could use an env variable to set the user and use this value in the paths that are generated in the catalog
I'd need to confirm if this is parametrizable in kedro
v
got it, will oc.env help me access these env variables in kedro, which will be set by kubeflow somehow?
a
one moment
👍 1
yes
yeah it works
👍 1
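To make that concrete, a sketch of a catalog entry using such an env variable in the path; KEDRO_RUN_USER is a hypothetical variable that kubeflow (or your run setup) would set on the pod, and depending on your kedro version you may need to register oc.env for the catalog via CONFIG_LOADER_ARGS custom_resolvers:
```yaml
# conf/base/catalog.yml - hypothetical; "default" is the fallback when the env var is unset
intermediate_features:
  type: pandas.ParquetDataset
  filepath: s3://my-bucket/runs/${oc.env:KEDRO_RUN_USER,default}/features.parquet
```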
so let me know what you figure out about how to pass params, as I said I'm rusty on this topic and would be happy to know as well
v
So basically I can use some env variable to have a different folder structure in S3 to make sure that nodes running for run r1 use their own files and don't interfere with the other runs r2, r3.
a
yes
v
so let me know what you figure out about how to pass params, as I said I'm rusty on this topic and would be happy to know as well
On this, someone in the channel mentioned that once we define the params in parameters.yml, they reflect on the kubeflow UI and take the default values defined in the yaml. Users can edit them; I will test it and let you know for sure.
@Artur Dobrogowski one thing that I want, maybe you can find some time to confirm: I am looking for a unique run_id for each run, as this will help me to solve many problems, like:
1. The one we just discussed: I can use this unique run id in kedro to have different S3 folders for each run and store the intermediate files there.
2. This unique run_id will be used to track various metrics for that run, and we might dump these metrics (time taken by each run, whether it was a success or not, etc.) corresponding to the unique run_id.
Even if kubeflow generates a unique run_id, I am not sure if that will be passed as some env variables to our kedro pipeline. Like I am looking to somehow use that unique run_id in hooks and catalogs to achieve many things.
a
you can always generate your own
Here's an example of how you can do/test it with omegaconf alone (take only the generate_uuid function from this code):
```python
import uuid
from omegaconf import OmegaConf

# Define a custom resolver to generate a random UUID
def generate_uuid():
    return str(uuid.uuid4())

# Register the resolver with OmegaConf
OmegaConf.register_new_resolver("uuid", generate_uuid)

# Example usage
config = OmegaConf.create({
    "id": "${uuid:}",
})

# Access the config to generate a random UUID
print(config.id)  # Each time config.id is accessed, it generates a new UUID
```
in kedro's settings you can add this:
```python
# settings.py - generate_uuid is the function defined above (define or import it here)
CONFIG_LOADER_ARGS = {
    "custom_resolvers": {
        "random_uid": generate_uuid,
    }
}
```
and then enjoy in configs
```yaml
${random_uid:}
```
if you need to generate it once and then re-use the same value in the current session, then the simplest solution would be to add a cache decorator to the generate_uuid function or just do the caching manually
but before doing that I'd make double sure that you can't use the kubeflow id, as it would be better to have them be consistent and common
👍 1
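A minimal sketch of that caching idea, assuming the stdlib functools cache is acceptable:
```python
import uuid
from functools import lru_cache

@lru_cache(maxsize=1)
def generate_uuid() -> str:
    # The first call generates the UUID; later calls in the same process return
    # the cached value, so every ${random_uid:} resolves to the same id.
    return str(uuid.uuid4())
```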
v
Suppose I want to access this random_uid in hooks, will ${random_uid:} work in a hooks implementation as well?
Like, can this custom resolver resolve ${random_uid:} used anywhere in the source code of the kedro project?
a
in hooks no, but you can call the function directly
v
can we use something like this in hooks.py?
```python
from pathlib import Path

from kedro.config import OmegaConfigLoader
from kedro.framework.project import settings

# Instantiate an `OmegaConfigLoader` instance with the location of your project configuration.
project_path = Path.cwd()  # or however you determine the project root
conf_path = str(project_path / settings.CONF_SOURCE)
conf_loader = OmegaConfigLoader(conf_source=conf_path)
```
a
eh why would you do that
when you can just call generate_uuid()
this magic is for letting your config loading execute some python code at load time
why would you want to go to config magic when running python code
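i.e. something along these lines, assuming generate_uuid lives somewhere importable in your project and is cached as discussed above (the import path here is hypothetical):
```python
# hooks.py - sketch, not the plugin's API
from kedro.framework.hooks import hook_impl

from my_project.settings import generate_uuid  # hypothetical import path


class RunIdHooks:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        run_id = generate_uuid()  # same cached value the ${random_uid:} resolver returns
        print(f"run_id: {run_id}")
```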
v
oh, but how do we persist this uuid across the kedro session? I might be using this uuid across nodes and in many other hooks
a
@cached decorator could work I think
👍 1
I am not 100% sure whether the config resolving happens in a separate process; in that case it would need some more care to keep it consistent, but in general that should be simple
v
just wanted to understand why custom config resolvers do not work in hooks
Can't we override the files where the custom resolver should do the magic?
a
do they not work?
and by hooks do you mean kedro hooks?
I am confused
the omegaconf resolver syntax is only resolved by omegaconf in config files (yamls) - in params, the data catalog and others. Hooks are python classes, not yaml files - so you should call the python function behind the config resolver directly
v
Ok so you actually mean that we cannot use resolvers to put dynamic values in some python files
Got it Thanks.
a
eh why not
in python files you just use a function
v
Understood
a
resolvers are meant to enable usage of said function in CONFIGS not in python files
and they use the same function underneath
so you can
v
I need to first understand when we should use resolvers and why we really need them. But don't we have some way of persisting variables or objects in the kedro session, something which we can generate in a before-pipeline-run hook and then use in nodes.py and other hooks as well? Caching is definitely one solution that you mentioned.
a
I don't think I can explain it any clearer 😄
🙌 1
resolvers are a must if you want to have dynamic paths for your artifacts in data catalog
sort of
there is another option using dataset factories, but they rely on namespaces, which should also be static, so yeah, resolvers are the only option for being dynamic
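For example, a sketch of such a dynamic path using the resolver from earlier (bucket and dataset name are placeholders; generate_uuid should be cached so all entries share the same value within a run):
```yaml
# conf/base/catalog.yml - hypothetical entry with a per-run folder
intermediate_features:
  type: pandas.ParquetDataset
  filepath: s3://my-bucket/runs/${random_uid:}/features.parquet
```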
v
resolvers are a must if you want to have dynamic paths for your artifacts in data catalog
Yeah that's something I learned recently.
But don't we have some way of persisting variables or objects in the kedro session, something which we can generate in a before-pipeline-run hook and then use in nodes.py and other hooks as well?
Like some global config, a python dict kind of thing, which can be retrieved at any point in the entire kedro session
a
Technically you can do it, but that's a much more ugly and convoluted solution in my opinion
😂 1
I mean you can add custom code to edit kedro session and add anything to it or dynamically overwrite read configs... but why do that when you have legal mechanisms to achieve it
and nothing stops you from making the resolver just reach for some set field in your common python config dict
but you need to be aware of the order of events happening in kedro
and reading & resolving configs is pretty early on
you would need to populate that dict at import time or in a hook that happens before the configs are loaded
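A sketch of that "resolver reaching into a python dict" idea; RUN_CONTEXT and KUBEFLOW_RUN_ID are hypothetical names, and the dict is populated at import time so it is ready before the configs are resolved:
```python
# settings.py - hypothetical; populated at import time, before configs are loaded
import os

RUN_CONTEXT = {
    "run_id": os.environ.get("KUBEFLOW_RUN_ID", "local"),  # hypothetical env var
}

def run_context(key: str) -> str:
    # Resolver body: looks the key up in the shared dict
    return str(RUN_CONTEXT[key])

CONFIG_LOADER_ARGS = {
    "custom_resolvers": {
        "run_context": run_context,
    }
}
```
which you could then reference in yaml as ${run_context:run_id}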
v
So I will summarise now:
1. First of all, I need to look into how we can utilise the run_id generated by kubeflow in the kedro pipeline. But if I want to use it across the entire session, i.e. in all the hooks and all the nodes, this kubeflow run_id should be set as an env variable.
2. If the first approach is not viable, we can generate our own unique ids in the kedro session as we just discussed.
Correct me if I made some mistake
a
1. an env variable is just the easiest way to communicate it, there might be other options too - yes
2. yes
v
can you think of some other options as well which I could explore to utilise the kubeflow run_id?
a
idk how this kubeflow run_id is handled, but you could perhaps try to use the kubeflow API to get the current run id, or maybe it's available in some templating syntax to fill command params - I'm just speculating here, this would require some googling for me
v
i see , interesting . 😊
One quick question: if kubeflow is able to pass the run_id through run params, that effectively means through the params stored in parameters.yml as well. So we can definitely retrieve these params in nodes, but can we also retrieve them in kedro hooks?
a
yes, in kedro hooks you can run a hook at the step after the catalog is loaded, read it manually from the catalog/params, and then retrieve it at another hook point
I'm not sure if this would work with data catalog templating at this moment, catalog is a bit special
you need to ask in #C03RKP2LW64 - can you access params or runtime params in catalog.yaml?
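Going back to reading params in hooks, a sketch of that pattern, assuming the run id arrives as a runtime param (e.g. kedro run --params run_id=<kubeflow-run-id>); the hook spec names are the standard kedro ones, but the "run_id" param itself is hypothetical:
```python
# hooks.py - capture a param once the context exists, reuse it at a later hook point
from kedro.framework.hooks import hook_impl


class TrackRunIdHooks:
    def __init__(self):
        self._run_id = None

    @hook_impl
    def after_context_created(self, context):
        # context.params merges parameters.yml with any runtime --params overrides
        self._run_id = context.params.get("run_id", "unknown")

    @hook_impl
    def after_pipeline_run(self, run_params, pipeline, catalog):
        print(f"finished run {self._run_id}")
```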
v
Sure, I can ask this question there.
@Artur Dobrogowski if you see, a kubeflow.yml is generated when we do a kedro kubeflow init. Couple of questions here:
1. Is this configuration used only once to publish/upload a pipeline, and if we make changes to this config will we have to run the upload_pipeline command again?
2. Does the upload_pipeline command always publish a new pipeline on kubeflow, or is there a way to simply publish a new version of an existing pipeline on kubeflow?
3. Can we reconfigure these configs for different runs from the kubeflow UI once the pipeline is published on kubeflow? Because if that is not the case, someone will always have to re-run the upload_pipeline command.
a
I'll reply tomorrow, I've got to quit for today
👍 1
v
Sure @Artur Dobrogowski, please carry on.
Attaching the kubeflow UI for a published kedro pipeline
a
1. yes - it's a local state for the plugin to know how to translate the pipeline to kubeflow
2. there should be a way to overwrite it, not sure
3. afaik once you want a new version of the pipeline you need to re-run the translation process. I can see there is a run parameters section in the kubeflow UI; I'll add a ticket to investigate using it for more flexible parametrization of existing pipelines.
Also as a side note, if your main case is for different users to have their own versions then you can use kedro-envs for that instead of fiddling with dynamic configs and resolvers.
@marrrcin can you maybe take a look and confirm my answers?
🤯 1
v
Also as a side note, if your main case is for different users to have their own versions then you can use kedro-envs for that instead of fiddling with dynamic configs and resolvers.
Can you please elaborate on this?