# questions
b
Hi team, how can I do parallel IO with Kedro? I have a larger-than-memory partitioned dataset and I'd like to run each partition through a node in parallel. Can I utilise ParallelRunner for this? Thank you 😁
I have found these docs on a custom Dask runner https://docs.kedro.org/en/stable/deployment/dask.html but it does not seem like a straightforward thing to drop into an otherwise sequential pipeline
Sorry to bug you, but any ideas @datajoely @Nok Lam Chan? ChatGPT recommends using Dask within the node function, but I don't know how that sits with the 'nodes as pure functions' paradigm.
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client


def process_single(df: pd.DataFrame) -> pd.DataFrame:
    # per-partition pandas transformation
    df = process_thing(df=df)
    return df


def process_batch(parts: dd.DataFrame) -> dd.DataFrame:  # this is the node
    # spin up a local Dask cluster inside the node, then lazily map the
    # pandas function over every partition
    dask_client = Client()
    batch = parts.map_partitions(process_single, meta=parts._meta)
    return batch
d
yeah that’s not the way we’d do it
are the instructions on the linked docs not clear?
essentially we tend to push the remote execution engine connection point to the DataSets, Hooks or Runner. The nodes shouldn’t be aware of IO
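For illustration, a rough sketch of that separation, not from the thread: the node stays a pure Dask transformation, and a hypothetical hook owns the Client connection instead of the node. The hook class and its wiring here are made up (process_thing and process_single are the user's functions from the snippet above); the full approach is in the linked Dask deployment docs.

import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client
from kedro.framework.hooks import hook_impl


def process_single(df: pd.DataFrame) -> pd.DataFrame:
    # per-partition pandas logic, same as in the snippet above
    return process_thing(df=df)


def process_batch(parts: dd.DataFrame) -> dd.DataFrame:
    # the node is a pure function: it only describes the transformation,
    # with no Client, file paths or save calls inside it
    return parts.map_partitions(process_single, meta=parts._meta)


class DaskClientHooks:
    # hypothetical hook that owns the Dask connection for the whole run,
    # so the execution engine is wired up outside the nodes
    @hook_impl
    def before_pipeline_run(self):
        self._client = Client()  # or Client("tcp://<scheduler>:8786") for a real cluster

    @hook_impl
    def after_pipeline_run(self):
        self._client.close()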
b
The parallel runner docs or the dask runner docs?
d
dask runner
I would say, on the regular ParallelRunner, have you tried returning generators?
lazy saving may be the answer here
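A minimal sketch of the lazy load/save pattern with a PartitionedDataset, assuming the node receives the usual dict of partition id to load callable and that process_thing is the user's own per-partition function from the snippet above:

from typing import Callable, Dict

import pandas as pd


def process_partitions(
    partitions: Dict[str, Callable[[], pd.DataFrame]]
) -> Dict[str, Callable[[], pd.DataFrame]]:
    # PartitionedDataset loads lazily: each value is a callable that only
    # reads its partition when invoked
    out: Dict[str, Callable[[], pd.DataFrame]] = {}
    for partition_id, load_partition in partitions.items():
        # returning callables (rather than DataFrames) enables lazy saving,
        # so only one partition is materialised in memory at a time;
        # binding the loop variable as a default arg avoids late binding
        out[partition_id] = lambda load=load_partition: process_thing(df=load())
    return out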
b
Ok, thanks for the link, I will take a look 🙏 To answer your questions: the Dask runner docs aren't super clear on how that runner would let me map over partitions of a Dask dataframe and process them in parallel, but I don't know if that is asking too much of it. And I have not tried returning generators, but I have seen them mentioned in other threads, so I can take a look as well and report back 👍
d
the returning generators won't run in parallel, but they should hopefully allow you to process larger-than-memory partitions
if you use Dask or Spark you’re delegating to those engines for task splitting and scale out
the last idea is to use Polars which should be even simpler
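A minimal Polars sketch, assuming Parquet partitions on disk and made-up paths and column names, showing the lazy scan plus streaming collect that keeps memory bounded on a single machine:

import polars as pl

# lazy scan: nothing is read yet, Polars just builds a query plan over the files
lazy = (
    pl.scan_parquet("data/01_raw/partitions/*.parquet")
    .filter(pl.col("value") > 0)
    .group_by("key")
    .agg(pl.col("value").sum())
)

# streaming collect processes the data in batches rather than loading it all,
# which is what makes larger-than-memory inputs workable without a cluster
out = lazy.collect(streaming=True)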
There isn’t a single solution that fits all cases, which is why you see tons of dataframe libraries: pandas, Polars, Dask, Spark, etc. It’s a spectrum of problems.
"I have a larger-than-memory partitioned dataset."
It also depends on how much larger it is: can you solve it by processing it in chunks? In general, I'd only bring in a distributed system when you are running out of options on a single machine.
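A plain single-machine chunking sketch, assuming a CSV source, hypothetical file paths, and the same process_thing function as above, just to show what "processing it in chunks" without any distributed engine could look like:

import pandas as pd

# read and process one slice at a time, writing results as you go,
# so the full dataset is never held in memory
chunks = pd.read_csv("data/01_raw/big_file.csv", chunksize=1_000_000)
for i, chunk in enumerate(chunks):
    processed = process_thing(df=chunk)
    processed.to_parquet(f"data/02_intermediate/part_{i:05d}.parquet")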
"lazy saving may be the answer here"
I agree with this. I am working on an example here, but it will take some time.
b
Nice, thank you for the links. I think I am prematurely optimising. Thanks for your help!