# questions
c
Hey there! Hope you are all doing well. Some open-ended questions for you 🤗

ℹ️ Context: Let's say that the input/output of a pipeline is a collection of files (e.g. images) and that I want to apply the same function to all of these files. I don't want to load/dump all the files at once, for memory reasons. My understanding is that this doesn't fit the off-the-shelf Kedro approach; feel free to correct me if I'm wrong.

📄 📄 📄 Best practice for multi-file data input/output: What's the best practice for handling such cases in Kedro? I have seen workarounds in the past with internal functions loading/dumping blobs and faking the input/output for Kedro with an empty file. I'm wondering if there is anything you would recommend.

🤖 🤖 🤖 Kedro distributed approach: Is there any recommended approach if I want to distribute the above processing over multiple machines? I was considering Argo Workflows, but I see that the Kedro doc on Argo is deprecated. Does that mean it's no longer the recommended approach? If yes, what would be recommended? Thanks a lot in advance!
๐Ÿ‘๐Ÿผ 1
n
Would a generator (yield) help?
๐Ÿ‘ 1
๐Ÿ‘€ 1
c
thanks! looking into it
n
These are all very good questions btw! I will take some time to respond tomorrow!
c
thanks!
d
+1 to what @Nok Lam Chan said. If you want to process multiple files with the same logic natively in Kedro, use PartitionedDataset (with lazy I/O to avoid having them all in memory). If you want to be lazy across nodes, use generator functions; this should be compatible with the PartitionedDataset stuff? The Argo deployment doc is outdated in the sense that it needs review, not in the sense that it's not recommended. Kedro isn't really opinionated about your choice of workflow orchestrator. In this case, it sounds like you could even consider using Dask.
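(A minimal sketch of the lazy PartitionedDataset pattern described above. The dataset names, paths, and the exact `type`/class strings in the commented catalog entries are illustrative assumptions and vary between Kedro / kedro-datasets versions, so verify them against your installed version.)

```python
# Assumed catalog entries (names/paths are examples only):
#
#   raw_images:
#     type: partitions.PartitionedDataset
#     path: data/01_raw/images
#     dataset: pillow.ImageDataset
#
#   processed_images:
#     type: partitions.PartitionedDataset
#     path: data/02_intermediate/images
#     dataset: pillow.ImageDataset
#     filename_suffix: ".png"

from typing import Any, Callable, Dict


def transform(image: Any) -> Any:
    # placeholder for the per-file logic you want to apply
    return image


def process_images(
    partitions: Dict[str, Callable[[], Any]]
) -> Dict[str, Callable[[], Any]]:
    """Apply the same function to every file without loading them all at once.

    A PartitionedDataset input arrives as {partition_id: load_function}, so
    nothing is read from disk until a load_function is called. Returning
    callables keeps the saving side lazy too: each partition is loaded,
    transformed, and written one at a time.
    """
    return {
        partition_id: (lambda load=load: transform(load()))
        for partition_id, load in partitions.items()
    }
```

In a pipeline this would be wired up as something like `node(process_images, inputs="raw_images", outputs="processed_images")`.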
thankyou 1
c
thanks a lot Deepyaman and Nok!
cc @Roberto P. Palomares
🥳 1
m
c
thx!
n
@Cyril Verluise can I ask for clarification: what do you mean by multi-file input/output?
c
hey, sure. This is what Kedro calls a PartitionedDataset: basically, a folder with a given number of files, where each file is the processing unit within the node. Sorry for the lack of precision in the vocabulary
n
I see, then I guess this is somewhat answered. I'll add my own comments here:
Best practice for multi-file data input/output
In general I would say go for PartitionedDataset. If you are using something like Spark that has a native partitioning feature, use that, as it's likely to have better performance.
Kedro distributed approach
As Deepyaman mentioned, Kedro is not opinionated about the choice of orchestrator. There is no single answer; it depends on multiple dimensions:
1. Is it for performance purposes only? If so, maybe you don't even need an orchestrator; a cluster solution like Dask or Ray may be enough.
2. Is it an I/O-intensive workflow like a Spark / data processing job, or compute-intensive (e.g. training a model over multiple GPUs)?
3. Do you have existing infrastructure? If you have a Kubernetes cluster sitting there, then you will likely want a K8s-based orchestrator, maybe Kubeflow or Argo; I don't have much experience with either of those two.
If the goal is simply performance, then I would try to avoid multiple machines as long as I can. Cluster solutions add complexity; sometimes a simple solution is good enough. When you start to spin jobs across a cluster, debugging can be harder, and there is also overhead (which can be significant depending on the nature of the job, as data needs to be passed between different machines/processes).
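(And a rough sketch of the generator (yield) idea suggested earlier in the thread, for reference: the node yields one partition at a time so the output can be saved incrementally. Whether generator nodes are supported, and how PartitionedDataset merges repeated saves, depends on your Kedro version; the names below are illustrative, so treat this as a starting point rather than a definitive implementation.)

```python
from typing import Any, Callable, Dict, Iterator


def transform(image: Any) -> Any:
    # placeholder for the actual per-file processing
    return image


def process_images_streaming(
    partitions: Dict[str, Callable[[], Any]]
) -> Iterator[Dict[str, Any]]:
    """Generator node: keep only one file in memory at a time.

    Each yielded {partition_id: data} chunk is handed to the output dataset's
    save as soon as it is produced, so earlier partitions can be released
    before the next one is loaded.
    """
    for partition_id, load_partition in partitions.items():
        yield {partition_id: transform(load_partition())}
```

The wiring is the same as for a regular node, e.g. `node(process_images_streaming, inputs="raw_images", outputs="processed_images")`.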
➕ 2
🤗 1
c
Thanks a lot! That's super useful. Agree on the "simpler is better". Really like the Dask/Ray push! It's great to have all these community insights!
K 1