# questions
g
Hi Everyone, I'm hoping that you'll be able to answer my question. My current use case is: I have a pipeline that does something like the following. I have massive data sets that contain information on 'trades':
• node1: extracts data that depends on a timestamp range given by params in the parameters.yml config file
• node2: given this instance of the data, extracts the user_ids of who 'traded' during this timestamp range
• nodes 3, 4, 5, etc.: the subsequent nodes would then depend on which user_ids were found in the datasets
    ◦ user_id 123 has its own node
    ◦ user_id 789 has its own node
    ◦ etc.
This last step is where I'm running into my issues 🫠 I suppose the best and most succinct way of asking this question is: can one node be used to create subsequent nodes dynamically? If yes, could anyone explain how to do this, or potentially point me to some documentation? If no, I would be very grateful if anyone has a workaround that I might be able to use. Thanks for any help in advance 🙂
n
Hooks are an option, but we try to avoid dynamic pipelines in general, as they're hard to reason about and challenging for reproducibility.
How are these "user nodes" different from each other?
g
So the logic for each user would be the same, i.e.:
• count number of trades
• aggregate value of trades
• etc.
• dump output to a separate CSV titled "user_123.csv", as an example.
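Since the per-user logic is identical, it can live in one plain function rather than one node per user. A minimal sketch, assuming each trade is a dict with `user_id` and `value` fields (hypothetical field names; adjust to the real schema):

```python
from collections import defaultdict


def summarise_user_trades(trades):
    """Group trades by user and apply the same summary logic to each.

    `trades` is assumed to be an iterable of dicts with 'user_id' and
    'value' keys (assumed schema, not from the thread). Returns one
    summary per user, keyed like the target filenames.
    """
    by_user = defaultdict(list)
    for trade in trades:
        by_user[trade["user_id"]].append(trade)

    summaries = {}
    for user_id, user_trades in by_user.items():
        summaries[f"user_{user_id}"] = {
            # count number of trades for this user
            "n_trades": len(user_trades),
            # aggregate value of this user's trades
            "total_value": sum(t["value"] for t in user_trades),
        }
    return summaries
```

The `user_<id>` keys line up with the `"user_123.csv"` naming mentioned above, which becomes relevant with the PartitionedDataSet approach discussed below.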
m
Maybe just use PartitionedDataSet?
g
I'm not too familiar with it but I can read up on that now
n
Agree with marrrcin here, PartitionedDataSet seems like a good fit here.
m
If the logic is the same but you want to save each user to its own dataset, then that's the way to go. Your node3 should process the users one by one and output a dictionary with user_ids as keys and data as values. Then in the catalog, when you use PartitionedDataSet, each key from the returned dict will be saved to a separate file.
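For reference, a catalog entry for this could look roughly like the following sketch (the dataset name, path, and CSV dataset type are illustrative assumptions, not from the thread):

```yaml
# conf/base/catalog.yml (illustrative)
user_summaries:
  type: PartitionedDataSet
  path: data/07_model_output/user_summaries  # assumed output folder
  dataset:
    type: pandas.CSVDataSet
  filename_suffix: ".csv"
```

With this, a node returning `{"user_123": df_123, "user_789": df_789, ...}` and wired to the `user_summaries` output would write `user_123.csv`, `user_789.csv`, and so on under that path.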
g
Ok brilliant. I can read up on this now in the docs. One last quick question: if I was to use partitioned datasets, would it still be possible to use parallel runners for each user?
m
Not directly. You can use multiprocessing.Pool or concurrent.futures within the node.
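A minimal sketch of the concurrent.futures approach inside a single node, assuming the per-user work is already factored into its own function (`count_trades` here is a hypothetical stand-in for the real logic):

```python
from concurrent.futures import ThreadPoolExecutor


def count_trades(user_trades):
    """Stand-in for the real per-user logic (count, aggregate, etc.)."""
    return len(user_trades)


def summarise_users_in_parallel(trades_by_user, max_workers=4):
    """Fan the per-user work out across workers inside one node.

    `trades_by_user` maps user_id -> that user's trades. Returns a dict
    shaped for a PartitionedDataSet output (one key per file).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves input order, so results line up with the user_ids
        results = pool.map(count_trades, trades_by_user.values())
    return {f"user_{uid}": res for uid, res in zip(trades_by_user, results)}
```

For CPU-bound per-user work (e.g. heavy pandas aggregation), `concurrent.futures.ProcessPoolExecutor` or `multiprocessing.Pool` would likely be the better fit, since threads won't parallelise pure-Python computation.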
g
Ahh ok that I have a bit of experience with from other programs.
Thanks again for your quick response! Really really appreciate it! 🙇