# plugins-integrations
u
Hi, we recently built a Kedro plugin for efficient intermediate data sharing in Kedro pipelines and adapted it to Kubernetes deployments with Argo Workflows. We have written a blog post to introduce this plugin to the community: https://medium.com/cncf-vineyard/efficient-data-sharing-in-data-science-pipelines-on-kubernetes-bb42d36c739 Feel free to reach out if you have any questions about the post, or if you need faster intermediate data sharing in Kedro pipelines, especially on Kubernetes!
👍🏽 1
👍🏾 1
❤️ 4
🔥 3
n
Thank you, this is amazing!
d
@Deepyaman Datta @marrrcin you’d probably like this 🙂
👌 2
j
amazing @何涛! 👏🏼
🙏 1
u
Thank you! Really happy to see it could help in real-world cases. Any feedback would be highly appreciated! If you folks are facing the same challenges, feel free to reach out!
m
@何涛 It seems really cool at first glance! Looking forward to trying it out! I have a few questions: 1. Why is there a separate docker init / build handling implemented in the plugin? 2. What happens if the k8s scheduler decides to run 2 Kedro nodes that pass data between each other on separate k8s nodes? 3. Have you benchmarked non-CSV datasets too? Maybe it would be worth comparing it to our implementation of cloudpickle+zstd (I guess ours will be slower, but I wonder by how much 😄 — rough sketch below) https://github.com/getindata/kedro-sagemaker/blob/dbd78fd6c1781cc9e8cf046e14b3ab96faf63719/kedro_sagemaker/datasets.py#L126 4. Any suggestions for running it outside of k8s? It would solve a lot of problems in orchestrators such as Vertex AI / Azure ML, but then I guess it will boil down to the network communication between the Vineyard instance and the nodes in the managed ML services, so the gain compared to GCS/ABFS/S3 might not be that high in those scenarios 🤔
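For reference, our approach roughly boils down to cloudpickle plus zstd compression; a simplified sketch (not the exact dataset code, which adds I/O handling around this) looks like:
```python
# Simplified sketch of the cloudpickle+zstd serialization used for comparison;
# the actual kedro-sagemaker dataset wraps this with storage I/O.
import cloudpickle
import zstandard as zstd

def _serialize(data) -> bytes:
    # pickle the object, then compress the resulting bytes
    return zstd.ZstdCompressor().compress(cloudpickle.dumps(data))

def _deserialize(blob: bytes):
    # decompress, then unpickle
    return cloudpickle.loads(zstd.ZstdDecompressor().decompress(blob))
```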
u
1. Why is there a separate docker init / build handling implemented in the plugin?
init creates a Dockerfile from the current source repo, and build runs the docker build.
2. What happens if the k8s scheduler decides to run 2 Kedro nodes that pass data between each other on separate k8s nodes?
Inside the dataset implementation, we use
client.get(..., fetch=True)
which means that when the required data doesn't reside in the local instance, a migration is triggered between the two vineyard server instances.
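A minimal sketch of what the load path does with the vineyard Python client (only client.get(..., fetch=True) comes from the actual implementation; the other names here are illustrative):
```python
# Minimal sketch of the load path, assuming the local vineyardd socket is
# resolved the usual way (e.g. via the VINEYARD_IPC_SOCKET environment variable).
import vineyard

client = vineyard.connect()  # connect to the local vineyardd instance

def load_shared_object(object_id):
    # fetch=True triggers a migration when the object lives on a
    # different vineyardd instance (e.g. another k8s node)
    return client.get(object_id, fetch=True)
```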
u
3. Have you benchmarked non-CSV datasets too?
I will give it a try. But I think most of the data sharing overhead comes from the network I/O rather than the serialization/deserialization. I will try it ASAP and update the docs.
4. I guess it will boil down to the network communication between the Vineyard instance and the nodes in the managed ML services, so the gain compared to GCS/ABFS/S3 might not be that high in those scenarios
Vineyard is more suitable for cases where you can deploy your worker pod/process and the vineyardd server pod/process on the same host (as it uses memory mapping for data sharing). In cases where the vineyardd server and the worker that operates on the data are not co-located, or not even in the same cluster, the gain won't be that high, and vineyard doesn't have many advantages over other key-value store engines. That's part of why we originally thought Kubernetes was a good fit, as it orchestrates all jobs in the same cluster.
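For illustration, co-located sharing through the vineyard Python client roughly looks like the sketch below (simplified, not the plugin's dataset code; the object id is what the Kedro nodes would exchange):
```python
# Simplified illustration of co-located sharing: the object is placed in
# vineyardd's shared memory and read back without a network copy.
import numpy as np
import vineyard

client = vineyard.connect()

object_id = client.put(np.arange(1_000_000))  # stored in shared memory
arr = client.get(object_id)                   # mapped back into the reader's process
```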
💛 1
m
Awesome, makes sense! Thanks for your replies 🙂 Looking forward to trying this myself, as well as to your benchmarks! 🦜
u
For 1. there’s https://github.com/kedro-org/kedro-plugins/tree/main/kedro-docker
Actually, the current docker handling is borrowed from that repo (we should have left a reference and acknowledgment but forgot to do that). We only added some vineyard-related commands in the initial version, and now it should be retired, as that customization has been removed and only an extra Python requirement is needed.
👍 1
Looking forward to trying this myself, as well as to your benchmarks!
Thanks!
n
@何涛 it would be great to do a user interview with you to understand your experience of building a Kedro plugin, so we can improve that experience and produce a post for the Kedro blog.
m
@何涛 any update on benchmarks? 🙂
u
any update on benchmarks? 🙂
Hi @marrrcin we have just updated the mini benchmark to include the
CloudpickleDataset
and there is indeed a huge improvement compared with the CSV dataset, and vineyard is still better thanks to its use of shared memory: https://v6d.io/tutorials/data-processing/accelerate-data-sharing-in-kedro.html#performance. The catalog configuration has been uploaded to https://github.com/v6d-io/v6d/blob/main/python/vineyard/contrib/kedro/benchmark/mlops/argo-cloudpickle-benchmark.yml. Hope the observations above are helpful to you!
🥳 2
m
I’ve just seen it on GitHub, really cool that you’ve tested it!
🥳 1
u
Hi @Nero Okwa,
it would be great to do a user interview with you to understand your experience of building a Kedro plugin, so we can improve that experience and produce a post for the Kedro blog.
It sounds really great to have such a chance to talk about our work!
👍🏽 1
j
ping @Jo Stichbury for the blog part 😊
u
Thanks!