何涛
07/19/2023, 10:05 AM
Nero Okwa
07/19/2023, 10:28 AM
datajoely
07/19/2023, 10:31 AM
Juan Luis
07/19/2023, 11:14 AM
何涛
07/19/2023, 11:55 AM
marrrcin
07/19/2023, 1:03 PM
何涛
07/19/2023, 1:16 PM
> 1. Why is there a separate docker init / build handling implemented in the plugin?
`init` creates a Dockerfile from the current source repo, and `build` runs the docker build.
> 2. What happens if the k8s scheduler decides to run 2 Kedro nodes that pass the data between each other on separate k8s nodes?
Inside the dataset implementation, we use `client.get(..., fetch=True)`, which means that when the required data does not reside in the local instance, a migration is triggered between the two vineyard server instances.
marrrcin
07/19/2023, 1:18 PM
何涛
07/19/2023, 1:21 PM
> 3. Have you benchmarked non-CSV datasets too?
I will give it a try. I think most of the data sharing overhead comes from network I/O rather than from serialization/deserialization, but I will run it ASAP and update the docs.
> 4. I guess it will boil down to the network communication between the instance of Vineyard and the nodes in the managed ML services, so the gain when compared to GCS/ABFS/S3 might not be that high in those scenarios
Vineyard is more suitable for cases where you can deploy your worker pod/process and the vineyardd server pod/process on the same host (as it uses memory mapping for data sharing). When the vineyardd server and the worker that operates on the data are not co-located, or not even in the same cluster, the gain won't be that high, and vineyard doesn't have much of an advantage over other key-value store engines. That's part of why we originally thought Kubernetes was a good fit, as it orchestrates all jobs in the same cluster.
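To illustrate why co-location matters, here is a minimal sketch of same-host sharing through a memory-mapped segment. It uses only Python's standard `multiprocessing.shared_memory` module and is not vineyard's API or implementation; the payload and variable names are made up for illustration, and the "consumer" attaches from the same process only to keep the example short (normally it would be another process on the same host).

```python
from multiprocessing import shared_memory

# Hypothetical payload standing in for a Kedro node's output.
payload = b"kedro-node-output" * 1000

# Producer: create a shared-memory segment and copy the payload in once.
segment = shared_memory.SharedMemory(create=True, size=len(payload))
segment.buf[:len(payload)] = payload

# Consumer: attach to the same segment by name. No bytes travel over
# the network; both handles map the same memory.
attached = shared_memory.SharedMemory(name=segment.name)
first_bytes = bytes(attached.buf[:17])

# A write through the producer's mapping is visible through the
# consumer's mapping, which shows the two share the same memory.
segment.buf[0:1] = b"K"
seen_by_consumer = bytes(attached.buf[:5])

# Clean up: close both handles, then remove the segment.
attached.close()
segment.close()
segment.unlink()
```

When the consumer runs on a different host, this kind of mapping is not possible and the bytes have to be transferred over the network first, which is the case where the gain shrinks.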
marrrcin
07/19/2023, 1:23 PM
何涛
07/19/2023, 1:26 PM
> For 1. there’s https://github.com/kedro-org/kedro-plugins/tree/main/kedro-docker
Actually, the current docker stuff is borrowed from that repo (we should have left a reference and acknowledgment but forgot to do that). We only added some vineyard-related commands in the initial version, and it should now be retired, since that customization has been removed and only an extra Python requirement is needed.
> Looking forward to trying this myself as well as for your benchmarks!
Thanks!
Nero Okwa
07/20/2023, 9:58 AM
marrrcin
07/21/2023, 1:32 PM
何涛
07/25/2023, 1:38 PM
> any update on benchmarks? 🙂
Hi @marrrcin, we have just updated the mini benchmark to include the `CloudpickleDataset`. There is indeed a huge improvement compared with the CSV dataset, and vineyard is still better thanks to the efficiency of memory: https://v6d.io/tutorials/data-processing/accelerate-data-sharing-in-kedro.html#performance.
The catalog configuration has been uploaded to https://github.com/v6d-io/v6d/blob/main/python/vineyard/contrib/kedro/benchmark/mlops/argo-cloudpickle-benchmark.yml.
Hope the observation above is helpful to you!
marrrcin
07/25/2023, 1:39 PM
何涛
07/25/2023, 1:39 PM
> it would be great to do a user interview with you to understand your experience building a kedro plugin, towards improving this experience, and producing a post for the kedro blog.
It sounds really great to have such a chance to talk about our work!
Juan Luis
07/25/2023, 1:40 PM
何涛
07/25/2023, 1:41 PM