# questions
b
Hi team, how can I do parallel IO with Kedro? I have a larger-than-memory partitioned dataset and I'd like to run each partition through a node in parallel. Can I utilise ParallelRunner for this? Thank you 😁
I have found these docs on a custom Dask runner https://docs.kedro.org/en/stable/deployment/dask.html but it does not seem like a straightforward thing to drop into an otherwise sequential pipeline
Sorry to bug you, but any ideas @datajoely @Nok Lam Chan? ChatGPT recommends using Dask within the node function, but I don't know how that sits with the 'nodes as pure functions' paradigm.
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client


def process_single(df: pd.DataFrame) -> pd.DataFrame:
    # per-partition pandas transformation
    df = process_thing(df=df)
    return df


def process_batch(parts: dd.DataFrame) -> dd.DataFrame:  # this is the node
    # spin up a local Dask cluster inside the node, then lazily map the
    # pandas function over every partition
    dask_client = Client()
    batch = parts.map_partitions(process_single, meta=parts._meta)
    return batch
d
yeah that’s not the way we’d do it
are the instructions on the linked docs not clear?
essentially we tend to push the remote execution engine connection point to the DataSets, Hooks or Runner. The nodes shouldn’t be aware of IO
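For illustration, a rough sketch of that separation, not from the thread: the node stays a pure Dask transformation, and a hypothetical hook owns the Client connection instead of the node. The hook class and its wiring here are made up (process_thing and process_single are the user's functions from the snippet above); the full approach is in the linked Dask deployment docs.

import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client
from kedro.framework.hooks import hook_impl


def process_single(df: pd.DataFrame) -> pd.DataFrame:
    # per-partition pandas logic, same as in the snippet above
    return process_thing(df=df)


def process_batch(parts: dd.DataFrame) -> dd.DataFrame:
    # the node is a pure function: it only describes the transformation,
    # with no Client, file paths or save calls inside it
    return parts.map_partitions(process_single, meta=parts._meta)


class DaskClientHooks:
    # hypothetical hook that owns the Dask connection for the whole run,
    # so the execution engine is wired up outside the nodes
    @hook_impl
    def before_pipeline_run(self):
        self._client = Client()  # or Client("tcp://<scheduler>:8786") for a real cluster

    @hook_impl
    def after_pipeline_run(self):
        self._client.close()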
b
The parallel runner docs or the dask runner docs?
d
dask runner
I would say, on the regular ParallelRunner, have you tried returning generators?
lazy saving may be the answer here
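A minimal sketch of the lazy load/save pattern with a PartitionedDataset, assuming the node receives the usual dict of partition id to load callable and that process_thing is the user's own per-partition function from the snippet above:

from typing import Callable, Dict

import pandas as pd


def process_partitions(
    partitions: Dict[str, Callable[[], pd.DataFrame]]
) -> Dict[str, Callable[[], pd.DataFrame]]:
    # PartitionedDataset loads lazily: each value is a callable that only
    # reads its partition when invoked
    out: Dict[str, Callable[[], pd.DataFrame]] = {}
    for partition_id, load_partition in partitions.items():
        # returning callables (rather than DataFrames) enables lazy saving,
        # so only one partition is materialised in memory at a time;
        # binding the loop variable as a default arg avoids late binding
        out[partition_id] = lambda load=load_partition: process_thing(df=load())
    return out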
b
Ok, thanks for the link, I will take a look 🙏 To answer your questions: the Dask runner docs aren't super clear on how that runner would let me map over partitions of a Dask dataframe and process them in parallel, but I don't know if that is asking too much of it. And I have not tried returning generators, but I have seen them mentioned in other threads, so I can take a look as well and report back 👍
d
the returning generators won't run in parallel, but they should hopefully allow you to process larger-than-memory partitions
if you use Dask or Spark you’re delegating to those engines for task splitting and scale out
the last idea is to use Polars which should be even simpler
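A minimal Polars sketch, assuming Parquet partitions on disk and made-up paths and column names, showing the lazy scan plus streaming collect that keeps memory bounded on a single machine:

import polars as pl

# lazy scan: nothing is read yet, Polars just builds a query plan over the files
lazy = (
    pl.scan_parquet("data/01_raw/partitions/*.parquet")
    .filter(pl.col("value") > 0)
    .group_by("key")
    .agg(pl.col("value").sum())
)

# streaming collect processes the data in batches rather than loading it all,
# which is what makes larger-than-memory inputs workable without a cluster
out = lazy.collect(streaming=True)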
There isn’t a single solution that fits all cases, which is why you see tons of dataframe libraries: pandas, Polars, Dask, Spark, etc. It’s a spectrum of problems.
"I have a larger-than-memory partitioned dataset."
It also depends on how much larger it is: can you solve it by processing it in chunks? In general, I'd only bring in a distributed system when you are running out of options on a single machine.
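A plain single-machine chunking sketch, assuming a CSV source, hypothetical file paths, and the same process_thing function as above, just to show what "processing it in chunks" without any distributed engine could look like:

import pandas as pd

# read and process one slice at a time, writing results as you go,
# so the full dataset is never held in memory
chunks = pd.read_csv("data/01_raw/big_file.csv", chunksize=1_000_000)
for i, chunk in enumerate(chunks):
    processed = process_thing(df=chunk)
    processed.to_parquet(f"data/02_intermediate/part_{i:05d}.parquet")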
"lazy saving may be the answer here"
I agree with this. I am working on an example here, but it will take some time.
b
Nice, thank you for the links. I think I am prematurely optimising. Thanks for your help!