# plugins-integrations
v
Hello everyone, this is regarding the kubeflow plugin. I wanted to gain some information about how kedro nodes are executed by kubeflow. Does kubeflow run each node in a separate container? Or in separate pods? Or are all of the nodes executed in the same container?
@Artur Dobrogowski any idea on this?
a
afaik the default is separate pods
I added a feature to group nodes during the translation process so that some of them run together in the same pod where it makes sense, but I don't remember if I ported it to the kubeflow plugin
👍 1
v
@Artur Dobrogowski is the kubeflow plugin not maintained anymore? Is it a bad choice to pick kubeflow for deploying a kedro pipeline in 2024?
a
it is on life support for now, as in not actively maintained, but we try our best to update the versions of dependencies and work on it when we have more resources, which for now we don't have
We've kinda put it on lower priority since kubeflow as an ecosystem has been declining in popularity
👍 1
v
Would you recommend some other orchestrator for deploying our kedro pipelines?
a
it really depends on your ecosystem/constraints
which cloud, on prem or not, what you already have in place
if you are not limited in any way (don't have anything in place yet) then it would depend on the planned scale of your project
still, among open-source and self-managed options, kubeflow and airflow with kubernetes seem like the best choices for kedro for now
regarding node grouping - I haven't ported it yet (https://github.com/getindata/kedro-kubeflow/issues/262), but a draft of it is present in other plugins, so I could maybe find some time to do it or guide someone else willing to put work into it
The main problem with this plugin and its maintenance for us is:
• we're short on people and resources to actively maintain it right now
• we've reclaimed the resources for the kubeflow cluster and need some time to rebuild the environment to test changes properly; the plan is to do it in GitHub Actions with a kubeflow cluster in minikube spun up on the fly
If someone else is willing to help keep it alive, that is welcome and I can help with guidance as time allows
v
So we have a self-managed kubeflow service already running on AWS EKS and we are thinking of using the kubeflow plugin to publish our kedro pipelines. I do have a couple of questions regarding running kedro pipelines on kubeflow:
1. Do I need to have a working knowledge of kubeflow for running kedro pipelines on kubeflow?
2. Each of our nodes generates some output files, and these are intermediate files that will be used by some other node in the pipeline. When I run these pipelines locally I can either treat them as MemoryDatasets or save them in data/, and all the nodes can access these files easily. What is a suggested strategy to handle this when the pipelines are deployed on kubeflow, given that the nodes will not all be running in the same pod?
a
1. probably not, but it helps to see the whole picture and know how to optimize the workflow
2. the intermediate files need to be saved in buckets/cloud storage, or with the grouping feature you can group the nodes together and keep them as memory datasets
👍 1
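For reference, a minimal catalog sketch of the first option (persisting an intermediate output to a bucket instead of a MemoryDataset); the bucket, path and dataset name are placeholders:
```yaml
# conf/base/catalog.yml - hypothetical entry; the dataset class name may differ
# depending on your kedro-datasets version (e.g. pandas.ParquetDataSet in older releases)
intermediate_features:
  type: pandas.ParquetDataset
  filepath: s3://my-bucket/data/02_intermediate/features.parquet
  versioned: true
```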
I can try to see what resources we have and maybe bump the priority in porting the grouping feature
And yeah, without it I'm aware that it's a huge pain to have everything run in separate pods
as it's at odds with the principle you'd like to keep in kedro of having nodes simple and atomic
you don't want to pull a docker image and spin up a pod just to add 2 numbers together or extract some params
😅 1
v
@Artur Dobrogowski so yes, I am also trying to save the outputs on S3. It works out of the box, versioning them as well, so whenever the next node fetches the same dataset it gets the latest version. One quick question here: in one of our use cases multiple users will be creating pipeline runs from the kubeflow UI. I wanted to understand how to save them on S3 so that each run uses its own generated intermediate files and not the ones generated by some other run. Think of it like we have launched parallel runs for a pipeline - r1, r2, r3 - how can we ensure they do not mix the intermediate files?
a
I am really not familiar with kubeflow ui, or it has been so long since I was last using it so I forget what it looks like
but I'd be surprised if there are no options to pass any parameters or environment variables to the runs
v
Yeah you can ignore the UI for now. But yeah, we can surely pass parameters from the kubeflow UI, not sure about env variables though.
a
you could use an env variable to set the user and use this value in the paths that are generated in the catalog
I'd need to confirm if this is parametrizable in kedro
v
got it, will oc.env help me access these env variables in kedro, which will be set by kubeflow somehow?
a
one moment
👍 1
yes
yeah it works
👍 1
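To make that concrete, a sketch of a catalog entry using such an env variable in the path; KEDRO_RUN_USER is a hypothetical variable that kubeflow (or your run setup) would set on the pod, and depending on your kedro version you may need to register oc.env for the catalog via CONFIG_LOADER_ARGS custom_resolvers:
```yaml
# conf/base/catalog.yml - hypothetical; "default" is the fallback when the env var is unset
intermediate_features:
  type: pandas.ParquetDataset
  filepath: s3://my-bucket/runs/${oc.env:KEDRO_RUN_USER,default}/features.parquet
```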
so let me know what you figure out about how to pass params, as I said I'm rusty on this topic and would be happy to know as well
v
So basically I can use some env variable to have a different folder structure in S3 to make sure that nodes running for run r1 use their own files and don't interfere with the other runs r2, r3.
a
yes
v
so let me know what you figure out about how to pass params, as I said I'm rusty on this topic and would be happy to know as well
On this, someone in the channel mentioned that once we define the params in parameters.yml, they reflect on the kubeflow UI and take the default values defined in the yaml. Users can edit them; I will test it and let you know for sure.
@Artur Dobrogowski one thing that I want, maybe you can find some time to confirm: I am looking for a unique run_id for each run, as this will help me to solve many problems, like:
1. The one we just discussed: I can use this unique run id in kedro to have different S3 folders for each run and store the intermediate files there.
2. This unique run_id will be used to track various metrics for that run, and we might dump these metrics (time taken by each run, whether it was a success or not, etc.) corresponding to the unique run_id.
Even if kubeflow generates a unique run_id, I am not sure if that will be passed as some env variables to our kedro pipeline. Like I am looking to somehow use that unique run_id in hooks and catalogs to achieve many things.
a
you can always generate your own
Here's an example of how you can do/test it with omegaconf alone (take only the generate_uuid function from this code):
```python
import uuid
from omegaconf import OmegaConf

# Define a custom resolver to generate a random UUID
def generate_uuid():
    return str(uuid.uuid4())

# Register the resolver with OmegaConf
OmegaConf.register_new_resolver("uuid", generate_uuid)

# Example usage
config = OmegaConf.create({
    "id": "${uuid:}",
})

# Access the config to generate a random UUID
print(config.id)  # Each time config.id is accessed, it generates a new UUID
```
in kedro's settings you can add this:
```python
# settings.py - generate_uuid is the function defined above (define or import it here)
CONFIG_LOADER_ARGS = {
    "custom_resolvers": {
        "random_uid": generate_uuid,
    }
}
```
and then enjoy in configs
```yaml
${random_uid:}
```
if you need to generate it once and then re-use the same value in the current session, then the simplest solution would be to add a cache decorator to the generate_uuid function or just do the caching manually
but before doing that I'd make double sure that you can't use the kubeflow id, as it would be better to have them be consistent and common
👍 1
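A minimal sketch of that caching idea, assuming the stdlib functools cache is acceptable:
```python
import uuid
from functools import lru_cache

@lru_cache(maxsize=1)
def generate_uuid() -> str:
    # The first call generates the UUID; later calls in the same process return
    # the cached value, so every ${random_uid:} resolves to the same id.
    return str(uuid.uuid4())
```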
v
Suppose I want to access this random_uid in hooks, will ${random_uid:} work in a hooks implementation as well?
Like, can this custom resolver resolve ${random_uid:} used anywhere in the source code of the kedro project?
a
in hooks no, but you can call the function directly
v
can we use something like this in hooks.py?
```python
from pathlib import Path

from kedro.config import OmegaConfigLoader
from kedro.framework.project import settings

# Instantiate an `OmegaConfigLoader` instance with the location of your project configuration.
project_path = Path.cwd()  # or however you determine the project root
conf_path = str(project_path / settings.CONF_SOURCE)
conf_loader = OmegaConfigLoader(conf_source=conf_path)
```
a
eh why would you do that
when you can just call generate_uuid()
this magic is for letting your config loading execute some python code at load time
why would you want to go to config magic when running python code
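i.e. something along these lines, assuming generate_uuid lives somewhere importable in your project and is cached as discussed above (the import path here is hypothetical):
```python
# hooks.py - sketch, not the plugin's API
from kedro.framework.hooks import hook_impl

from my_project.settings import generate_uuid  # hypothetical import path


class RunIdHooks:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        run_id = generate_uuid()  # same cached value the ${random_uid:} resolver returns
        print(f"run_id: {run_id}")
```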
v
oh, but how do we persist this uuid across the kedro session? I might be using this uuid across nodes and in many other hooks
a
@cached decorator could work I think
👍 1
I am not 100% sure whether the config resolving happens in a separate process; in that case it would need some more care to keep it consistent, but in general that should be simple
v
just wanted to understand why custom config resolvers do not work in hooks
Can't we override the files where the custom resolver should do the magic?
a
do they not work?
and by hooks do you mean kedro hooks?
I am confused
the omegaconf resolver syntax is only resolved by omegaconf in config files (yamls) - in params, the data catalog and others. Hooks are python classes, not yaml files - so you should call the python function behind the config resolver directly
v
Ok so you actually mean that we cannot use resolvers to put dynamic values in some python files
Got it Thanks.
a
eh why not
in python files you just use a function
v
Understood
a
resolvers are meant to enable usage of said function in CONFIGS not in python files
and they use the same function underneath
so you can
v
I need to first understand when we should use resolvers and why we really need them. But don't we have some way of persisting variables or objects in the kedro session, something which we can generate in a before-pipeline-run hook and then use in nodes.py and other hooks as well? Caching is definitely one solution that you mentioned.
a
I don't think I can explain it any clearer 😄
🙌 1
resolvers are a must if you want to have dynamic paths for your artifacts in data catalog
sort of
there is another option using dataset factories, but they rely on namespaces, which should also be static, so yeah, resolvers are the only option for being dynamic
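For example, a sketch of such a dynamic path using the resolver from earlier (bucket and dataset name are placeholders; generate_uuid should be cached so all entries share the same value within a run):
```yaml
# conf/base/catalog.yml - hypothetical entry with a per-run folder
intermediate_features:
  type: pandas.ParquetDataset
  filepath: s3://my-bucket/runs/${random_uid:}/features.parquet
```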
v
resolvers are a must if you want to have dynamic paths for your artifacts in data catalog
Yeah that's something I learned recently.
But don't we have some way of persisting variables or objects in the kedro session, something which we can generate in a before-pipeline-run hook and then use in nodes.py and other hooks as well?
Like some global config, a python dict kind of thing, which can be retrieved at any point in the entire kedro session
a
Technically you can do it, but that's a much more ugly and convoluted solution in my opinion
😂 1
I mean you can add custom code to edit kedro session and add anything to it or dynamically overwrite read configs... but why do that when you have legal mechanisms to achieve it
and nothing stops you from making the resolver just reach for some set field in your common python config dict
but you need to be aware of the order of events happening in kedro
and reading & resolving configs is pretty early on
you would need to populate that dict at import time or in a hook that happens before the configs are loaded
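A sketch of that "resolver reaching into a python dict" idea; RUN_CONTEXT and KUBEFLOW_RUN_ID are hypothetical names, and the dict is populated at import time so it is ready before the configs are resolved:
```python
# settings.py - hypothetical; populated at import time, before configs are loaded
import os

RUN_CONTEXT = {
    "run_id": os.environ.get("KUBEFLOW_RUN_ID", "local"),  # hypothetical env var
}

def run_context(key: str) -> str:
    # Resolver body: looks the key up in the shared dict
    return str(RUN_CONTEXT[key])

CONFIG_LOADER_ARGS = {
    "custom_resolvers": {
        "run_context": run_context,
    }
}
```
which you could then reference in yaml as ${run_context:run_id}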
v
So I will summarise now:
1. First of all, I need to look into how we can utilise the run_id generated by kubeflow in the kedro pipeline. But if I want to use it across the entire session, i.e. in all the hooks and all the nodes, this kubeflow run_id should be set as an env variable.
2. If the first approach is not viable, we can generate our own unique ids in the kedro session as we just discussed.
Correct me if I made some mistake
a
1. an env variable is just the easiest way to communicate it, there might be other options too - yes
2. yes
v
can you think of some other options as well which I could explore to utilise the kubeflow run_id?
a
idk how this kubeflow run_id is handled, but you could perhaps try to use the kubeflow API to get the current run id, or maybe it's available in some templating syntax to fill command params - I'm just speculating here, this would require some googling for me
v
i see , interesting . 😊
One quick question: if kubeflow is able to pass the run_id through run params, that effectively means through the params stored in parameters.yml as well. So we can definitely retrieve these params in nodes, but can we also retrieve them in kedro hooks?
a
yes, in kedro hooks you can run a hook at the step after the catalog is loaded, read it manually from the catalog/params, and then retrieve it at another hook point
I'm not sure if this would work with data catalog templating at this moment, catalog is a bit special
you need to ask in #C03RKP2LW64 - can you access params or runtime params in catalog.yaml?
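Going back to reading params in hooks, a sketch of that pattern, assuming the run id arrives as a runtime param (e.g. kedro run --params run_id=<kubeflow-run-id>); the hook spec names are the standard kedro ones, but the "run_id" param itself is hypothetical:
```python
# hooks.py - capture a param once the context exists, reuse it at a later hook point
from kedro.framework.hooks import hook_impl


class TrackRunIdHooks:
    def __init__(self):
        self._run_id = None

    @hook_impl
    def after_context_created(self, context):
        # context.params merges parameters.yml with any runtime --params overrides
        self._run_id = context.params.get("run_id", "unknown")

    @hook_impl
    def after_pipeline_run(self, run_params, pipeline, catalog):
        print(f"finished run {self._run_id}")
```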
v
Sure, I can ask this question there.
@Artur Dobrogowski if you see, a kubeflow.yml is generated when we do a kedro kubeflow init. Couple of questions here:
1. Is this configuration used only once to publish/upload a pipeline, and if we make changes to this config will we have to run the upload_pipeline command again?
2. Does the upload_pipeline command always publish a new pipeline on kubeflow, or is there a way to simply publish a new version of an existing pipeline on kubeflow?
3. Can we reconfigure these configs for different runs from the kubeflow UI once the pipeline is published on kubeflow? Because if that is not the case, someone will always have to re-run the upload_pipeline command.
a
I'll reply tomorrow, I've got to quit for today
👍 1
v
Sure @Artur Dobrogowski, please carry on.
Attaching the kubeflow UI for a published kedro pipeline
a
1. yes - it's a local state for the plugin to know how to translate the pipeline to kubeflow
2. there should be a way to overwrite it, not sure
3. afaik once you want a new version of the pipeline you need to re-run the translation process. I can see there is a run parameters section in the kubeflow UI; I'll add a ticket to investigate using it for more flexible parametrization of existing pipelines.
Also as a side note, if your main case is for different users to have their own versions then you can use kedro-envs for that instead of fiddling with dynamic configs and resolvers.
@marrrcin can you maybe take a look and confirm my answers?
🤯 1
v
Also as a side note, if your main case is for different users to have their own versions then you can use kedro-envs for that instead of fiddling with dynamic configs and resolvers.
Can you please elaborate on this?