# questions
c
Hey there! Hope you are all doing well. Some open-ended questions for you 🤗

ℹ️ Context: Let's say that the input/output of a pipeline is a collection of files (e.g. images) and that I want to apply the same function to all of these files. I don't want to load/dump all the files at once, for memory reasons. My understanding is that this doesn't fit the off-the-shelf Kedro approach; feel free to correct me if I'm wrong.

📄 📄 📄 Best practice for multi-file data input/output: What's the best practice for handling such cases in Kedro? I have seen workarounds in the past with internal functions loading/dumping blobs and faking the input/output for Kedro with an empty file. I'm wondering if there is anything you would recommend.

🤖 🤖 🤖 Kedro distributed approach: Is there any recommended approach if I want to distribute the above processing over multiple machines? I was considering Argo Workflows, but I see that the Kedro doc on Argo is deprecated. Does that mean it's no longer the recommended approach? If yes, what would be recommended? Thanks a lot in advance!
๐Ÿ‘๐Ÿผ 1
n
Would a generator (yield) help?
๐Ÿ‘ 1
๐Ÿ‘€ 1
c
thanks! looking into it
n
These are all very good questions btw! I will take some time to respond tomorrow!
c
thanks!
d
+1 to what @Nok Lam Chan said. If you want to process multiple files with the same logic natively in Kedro, use PartitionedDataset (with lazy I/O to avoid having them all in memory). If you want to be lazy across nodes, use generator functions; this should be compatible with the PartitionedDataset stuff? The Argo deployment doc is outdated in the sense that it needs review, not in the sense that it's not recommended. Kedro isn't really opinionated about your choice of workflow orchestrator. In this case, it sounds like you could even consider using Dask.
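(A minimal sketch of the lazy PartitionedDataset pattern described above. The dataset names, paths, and the exact `type`/class strings in the commented catalog entries are illustrative assumptions and vary between Kedro / kedro-datasets versions, so verify them against your installed version.)

```python
# Assumed catalog entries (names/paths are examples only):
#
#   raw_images:
#     type: partitions.PartitionedDataset
#     path: data/01_raw/images
#     dataset: pillow.ImageDataset
#
#   processed_images:
#     type: partitions.PartitionedDataset
#     path: data/02_intermediate/images
#     dataset: pillow.ImageDataset
#     filename_suffix: ".png"

from typing import Any, Callable, Dict


def transform(image: Any) -> Any:
    # placeholder for the per-file logic you want to apply
    return image


def process_images(
    partitions: Dict[str, Callable[[], Any]]
) -> Dict[str, Callable[[], Any]]:
    """Apply the same function to every file without loading them all at once.

    A PartitionedDataset input arrives as {partition_id: load_function}, so
    nothing is read from disk until a load_function is called. Returning
    callables keeps the saving side lazy too: each partition is loaded,
    transformed, and written one at a time.
    """
    return {
        partition_id: (lambda load=load: transform(load()))
        for partition_id, load in partitions.items()
    }
```

In a pipeline this would be wired up as something like `node(process_images, inputs="raw_images", outputs="processed_images")`.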
thankyou 1
c
thanks a lot Deepyaman and Nok!
cc @Roberto P. Palomares
🥳 1
m
c
thx!
n
@Cyril Verluise can I ask for clarification: what do you mean by multi-file input/output?
c
hey, sure. This is what Kedro calls a PartitionedDataset: basically, a folder with a given number of files, where each file is the processing unit within the node. Sorry for the lack of precision in the vocabulary
n
I see, then I guess this is somewhat answered. I'll add my own comments here:
Best practice for multi-file data input/output
In general I would say go for PartitionedDataset. If you are using something like Spark that has a native partitioning feature, use that, as it's likely to have better performance.
Kedro distributed approach
As Deepyaman mentioned, Kedro is not opinionated about the choice of orchestrator. There is no single answer; it depends on multiple dimensions:
1. Is it for performance purposes only? If so, maybe you don't even need an orchestrator; a cluster solution like Dask or Ray may be enough.
2. Is it an I/O-intensive workflow like a Spark / data processing job, or compute-intensive (e.g. training a model over multiple GPUs)?
3. Do you have existing infrastructure? If you have a Kubernetes cluster sitting there, then you will likely want a K8s-based orchestrator, maybe Kubeflow or Argo; I don't have much experience with either of those two.
If the goal is simply performance, then I would try to avoid multiple machines as long as I can. Cluster solutions add complexity; sometimes a simple solution is good enough. When you start to spin jobs across a cluster, debugging can be harder, and there is also overhead (which can be significant depending on the nature of the job, as data needs to be passed between different machines/processes).
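(And a rough sketch of the generator (yield) idea suggested earlier in the thread, for reference: the node yields one partition at a time so the output can be saved incrementally. Whether generator nodes are supported, and how PartitionedDataset merges repeated saves, depends on your Kedro version; the names below are illustrative, so treat this as a starting point rather than a definitive implementation.)

```python
from typing import Any, Callable, Dict, Iterator


def transform(image: Any) -> Any:
    # placeholder for the actual per-file processing
    return image


def process_images_streaming(
    partitions: Dict[str, Callable[[], Any]]
) -> Iterator[Dict[str, Any]]:
    """Generator node: keep only one file in memory at a time.

    Each yielded {partition_id: data} chunk is handed to the output dataset's
    save as soon as it is produced, so earlier partitions can be released
    before the next one is loaded.
    """
    for partition_id, load_partition in partitions.items():
        yield {partition_id: transform(load_partition())}
```

The wiring is the same as for a regular node, e.g. `node(process_images_streaming, inputs="raw_images", outputs="processed_images")`.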
➕ 2
🤗 1
c
Thanks a lot! That's super useful. Agree on the "simpler is better". Really like the Dask/Ray push! It's great to have all these community insights!
K 1