Cyril Verluise
07/19/2023, 5:55 PM
Nok Lam Chan
07/19/2023, 5:56 PM
Cyril Verluise
07/19/2023, 5:58 PM
Nok Lam Chan
07/19/2023, 6:00 PM
Cyril Verluise
07/19/2023, 6:03 PM
Deepyaman Datta
07/20/2023, 3:55 AM
PartitionedDataset (with lazy I/O to avoid having them all in memory). If you want to be lazy across nodes, use the generator functions; this should be compatible with the PartitionedDataset stuff?
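To make the lazy I/O point concrete, here is a minimal sketch, assuming the node's input is what a partitioned dataset hands over on load: a dict mapping partition IDs to load callables. The node names (`summarise_partitions`, `summarise_one_by_one`), the `fake_loader` helper, and the partition data are hypothetical stand-ins, so the snippet runs without Kedro installed. The first node returns callables so that the work is deferred until save time; the second is the generator variant, yielding one partition at a time. Whether yielded chunks are saved incrementally depends on your Kedro version and dataset, so treat that part as an assumption to verify.

```python
from typing import Any, Callable, Dict, Iterator, List


def summarise_partitions(
    partitions: Dict[str, Callable[[], Any]],
) -> Dict[str, Callable[[], Any]]:
    """Lazy in, lazy out: nothing is loaded or computed inside this function."""

    def make_lazy_result(load: Callable[[], Any]) -> Callable[[], Any]:
        # Bind `load` per partition; the work only runs when the callable is invoked.
        return lambda: len(load())

    return {name: make_lazy_result(load) for name, load in partitions.items()}


def summarise_one_by_one(
    partitions: Dict[str, Callable[[], Any]],
) -> Iterator[Dict[str, Any]]:
    """Generator variant: load and yield a single partition at a time."""
    for name in sorted(partitions):
        yield {name: len(partitions[name]())}


# Hypothetical stand-in for the dict a partitioned dataset would pass to the node.
loaded: List[str] = []


def fake_loader(name: str, rows: List[int]) -> Callable[[], List[int]]:
    def load() -> List[int]:
        loaded.append(name)  # record when a partition is actually read
        return rows

    return load


fake_partitions = {
    "part_a": fake_loader("part_a", [1, 2, 3]),
    "part_b": fake_loader("part_b", [4, 5]),
}

# Building the lazy outputs does not trigger any load.
lazy_outputs = summarise_partitions(fake_partitions)
```

Calling `lazy_outputs["part_a"]()` is what finally reads that partition, so at most one partition needs to be in memory at a time.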
The Argo deployment doc is outdated in the sense that it needs review, not in the sense that it's no longer recommended. Kedro isn't really opinionated about your choice of workflow orchestrator. In this case, it sounds like you could even consider using Dask.
Cyril Verluise
07/20/2023, 7:23 AM
marrrcin
07/20/2023, 7:56 AM
Cyril Verluise
07/20/2023, 11:11 AM
Nok Lam Chan
07/20/2023, 11:45 AM
Cyril Verluise
07/20/2023, 12:05 PM
Nok Lam Chan
07/20/2023, 12:47 PM
Best practice for multi files data input/output
In general I would say go for PartitionedDataSet. If you are using something like Spark that has a native partitioning feature, use that, as it's likely to give better performance.
kedro distributed approach
As Deepyaman mentioned, Kedro is not opinionated about the choice of orchestrator. There is no single answer; it depends on several dimensions:
1. Is it for performance purposes only? If so, maybe you don't even need an orchestrator; a cluster solution like Dask or Ray may be enough.
2. Is it an I/O-intensive workflow, like a Spark/data processing job, or a computation-intensive one (e.g. training a model over multiple GPUs)?
3. Do you have existing infrastructure? If you have a Kubernetes cluster sitting there, then you will likely want a K8s-based orchestrator solution, maybe Kubeflow or Argo. I don't have much experience with either of these two.
If the goal is simply performance, I would try to avoid multiple machines for as long as I can. A cluster solution adds complexity, and sometimes a simple solution is good enough. Once you start spinning up jobs across a cluster, debugging can get harder, and there is also overhead (which can be significant depending on the nature of the job, since data needs to be passed between different machines/processes).
Cyril Verluise
07/20/2023, 2:33 PM