何涛 07/19/2023, 10:05 AM
Nero Okwa 07/19/2023, 10:28 AM
datajoely 07/19/2023, 10:31 AM
Juan Luis 07/19/2023, 11:14 AM
何涛 07/19/2023, 11:55 AM
marrrcin 07/19/2023, 1:03 PM
何涛 07/19/2023, 1:16 PM
1. Why is there separate docker init / build handling implemented in the plugin?
   init creates a Dockerfile from the current source repo, and build runs the docker build.
2. What happens if the k8s scheduler decides to run 2 Kedro nodes that pass data between each other on separate k8s nodes?
   Inside the dataset implementation, we use
   which means that when the required data does not reside in the local instance, a migration is triggered between the two vineyard server instances.
marrrcin 07/19/2023, 1:18 PM
何涛 07/19/2023, 1:21 PM
3. Have you benchmarked non-CSV datasets too?
   I will give it a try. But I think most of the data-sharing overhead comes from network I/O rather than from serialization/deserialization. I will try it ASAP and update the docs.
4. I guess it will boil down to the network communication between the instance of Vineyard and the nodes in the managed ML services, so the gain compared to GCS/ABFS/S3 might not be that high in those scenarios.
   Vineyard is better suited to cases where you can deploy your worker pod/process and the vineyardd server pod/process on the same host, as it uses memory mapping for data sharing. When the vineyardd server and the worker that operates on the data are not co-located, or are not even in the same cluster, the gain won't be that high, and vineyard doesn't have much of an advantage over other key-value store engines. That's part of why we originally thought Kubernetes was a good fit, as it orchestrates all jobs in the same cluster.
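To make the co-location point concrete, here is a minimal sketch of sharing data through memory that two processes on the same host can both attach to. It uses Python's standard `multiprocessing.shared_memory` module as a stand-in, not vineyard's API; the helper names `publish` and `read` are hypothetical and exist only for this illustration.

```python
from multiprocessing import shared_memory


def publish(payload: bytes) -> shared_memory.SharedMemory:
    """Producer side: copy the payload once into a named shared-memory segment."""
    segment = shared_memory.SharedMemory(create=True, size=len(payload))
    segment.buf[:len(payload)] = payload
    return segment


def read(name: str, size: int) -> bytes:
    """Consumer side: attach to the segment by name and read `size` bytes.

    The size is passed explicitly because the OS may round the segment
    up to a whole page, so `segment.size` can exceed the payload length.
    """
    segment = shared_memory.SharedMemory(name=name)
    try:
        return bytes(segment.buf[:size])
    finally:
        segment.close()


if __name__ == "__main__":
    data = b"intermediate dataset"
    segment = publish(data)
    try:
        # A consumer on the same host needs only the segment name,
        # so no network transfer or file write is involved.
        assert read(segment.name, len(data)) == data
    finally:
        segment.close()
        segment.unlink()  # free the segment once every reader is done
```

Attaching by name only works for processes on the same machine, which is why the benefit shrinks once the worker and the data server sit on different hosts and the data has to cross the network.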
marrrcin 07/19/2023, 1:23 PM
何涛 07/19/2023, 1:26 PM
Looking forward to trying this myself as well as for your benchmarks!
Thanks!
Nero Okwa 07/20/2023, 9:58 AM
marrrcin 07/21/2023, 1:32 PM
何涛 07/25/2023, 1:38 PM
any update on benchmarks? 🙂
Hi @marrrcin, we have just updated the mini benchmark to include the
and there is indeed a huge improvement compared with the CSV dataset, and vineyard still performs better thanks to its memory efficiency: https://v6d.io/tutorials/data-processing/accelerate-data-sharing-in-kedro.html#performance. The catalog configuration has been uploaded to https://github.com/v6d-io/v6d/blob/main/python/vineyard/contrib/kedro/benchmark/mlops/argo-cloudpickle-benchmark.yml. Hope the observations above are helpful to you!
marrrcin 07/25/2023, 1:39 PM
何涛 07/25/2023, 1:39 PM
it would be great to do a user interview with you to understand your experience building a kedro plugin, towards improving this experience, and producing a post for the kedro blog.
It would be great to have the chance to talk about our work!
Juan Luis 07/25/2023, 1:40 PM
何涛 07/25/2023, 1:41 PM