Baden Ashford
07/31/2023, 11:16 AM
import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client

def process_single(df: pd.DataFrame) -> pd.DataFrame:
    df = process_thing(df=df)  # process_thing is defined elsewhere
    return df

def process_batch(parts: dd.DataFrame) -> dd.DataFrame:  # this is the node
    dask_client = Client()
    # map_partitions applies process_single lazily to each pandas partition
    batch = parts.map_partitions(process_single, meta=parts._meta)
    return batch
datajoely
08/01/2023, 9:45 AM

Baden Ashford
08/01/2023, 9:48 AM

datajoely
08/01/2023, 9:49 AM

Baden Ashford
08/01/2023, 9:55 AM

datajoely
08/01/2023, 9:58 AM

Nok Lam Chan
08/01/2023, 10:35 AM
> I have a larger than memory partitioned dataset.
It also depends on how much larger it is; can you solve it by processing it in chunks? In general, I'll only bring in a distributed system when you have run out of options on a single machine.
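
For context, a minimal sketch of the chunk-by-chunk approach in plain pandas; the file paths and chunk size are assumptions for illustration, and process_thing stands in for the user's actual transformation:

import pandas as pd

def process_thing(df: pd.DataFrame) -> pd.DataFrame:
    return df  # placeholder for the user's real transformation

# Hypothetical paths and chunk size, for illustration only.
reader = pd.read_csv("data/01_raw/large_input.csv", chunksize=500_000)
for i, chunk in enumerate(reader):
    processed = process_thing(df=chunk)
    processed.to_parquet(f"data/02_intermediate/part-{i:05d}.parquet")

Each chunk is written out as its own parquet file, so peak memory stays around the size of one chunk rather than the whole dataset.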
> lazy saving may be the answer here
I agree with this. I am working on an example here, but it will take some time.
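
A minimal sketch of the lazy-saving pattern with Kedro's PartitionedDataSet, assuming both the node's input and output are partitioned datasets of pandas DataFrames and that process_thing is the user's transformation from the snippet above. Because the node returns a dict of callables rather than DataFrames, each partition is loaded and written one at a time instead of being held in memory together:

from typing import Callable, Dict

import pandas as pd

def process_partitions(
    partitions: Dict[str, Callable[[], pd.DataFrame]],
) -> Dict[str, Callable[[], pd.DataFrame]]:
    # A PartitionedDataSet loads as a dict of partition_id -> loader callables.
    # Returning callables keeps the save side lazy as well.
    result: Dict[str, Callable[[], pd.DataFrame]] = {}
    for partition_id, load_partition in partitions.items():
        def _process(load: Callable[[], pd.DataFrame] = load_partition) -> pd.DataFrame:
            return process_thing(df=load())  # process_thing as in the snippet above
        result[partition_id] = _process
    return result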
Baden Ashford
08/01/2023, 12:04 PM