# questions
g
Hi Everyone, I'm hoping that you'll be able to answer my question. My current use case is: I have a pipeline that does something like the following. I have massive data sets that contain information on 'trades':
• node1: extracts data that depends on a timestamp range given by params in the parameters.yml config file
• node2: given this instance of the data, extracts the user_ids of who 'traded' during this timestamp range
• nodes 3, 4, 5, etc.: the subsequent nodes would then depend on which user_ids were found in the datasets
    ◦ user_id 123 has its own node
    ◦ user_id 789 has its own node
    ◦ etc.
This last step is where I'm running into my issues 🫠 I suppose the best and most succinct way of asking this question is: can one node be used to create subsequent nodes dynamically? If yes, could anyone explain how to do this, or potentially point me to some documentation? If no, I would be very grateful if anyone has a workaround that I might be able to use. Thanks for any help in advance 🙂
n
Hooks are an option, but we try to avoid dynamic pipelines in general, as they're hard to reason about and challenging for reproducibility.
How are these "user nodes" different from each other?
g
So the logic for each user would be the same, i.e.:
• count number of trades
• aggregate value of trades
• etc.
• dump output to a separate CSV titled "user_123.csv", as an example.
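Since the per-user logic is identical, it can live in one plain function rather than one node per user. A minimal sketch, assuming each trade is a dict with `user_id` and `value` fields (hypothetical field names; adjust to the real schema):

```python
from collections import defaultdict


def summarise_user_trades(trades):
    """Group trades by user and apply the same summary logic to each.

    `trades` is assumed to be an iterable of dicts with 'user_id' and
    'value' keys (assumed schema, not from the thread). Returns one
    summary per user, keyed like the target filenames.
    """
    by_user = defaultdict(list)
    for trade in trades:
        by_user[trade["user_id"]].append(trade)

    summaries = {}
    for user_id, user_trades in by_user.items():
        summaries[f"user_{user_id}"] = {
            # count number of trades for this user
            "n_trades": len(user_trades),
            # aggregate value of this user's trades
            "total_value": sum(t["value"] for t in user_trades),
        }
    return summaries
```

The `user_<id>` keys line up with the `"user_123.csv"` naming mentioned above, which becomes relevant with the PartitionedDataSet approach discussed below.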
m
Maybe just use PartitionedDataSet?
g
I'm not too familiar with it but I can read up on that now
n
Agree with marrrcin here, PartitionedDataSet seems like a good fit here.
m
If the logic is the same but you want to save each user to its own dataset, then that's the way to go. Your node3 should process the users one by one and output a dictionary with user_ids as keys and data as values. Then in the catalog, when you use PartitionedDataSet, each key from the returned dict will be saved to a separate file.
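For reference, a catalog entry for this could look roughly like the following sketch (the dataset name, path, and CSV dataset type are illustrative assumptions, not from the thread):

```yaml
# conf/base/catalog.yml (illustrative)
user_summaries:
  type: PartitionedDataSet
  path: data/07_model_output/user_summaries  # assumed output folder
  dataset:
    type: pandas.CSVDataSet
  filename_suffix: ".csv"
```

With this, a node returning `{"user_123": df_123, "user_789": df_789, ...}` and wired to the `user_summaries` output would write `user_123.csv`, `user_789.csv`, and so on under that path.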
g
Ok brilliant. I can read up on this now in the docs. One last quick question: if I was to use partitioned datasets, would it still be possible to use parallel runners for each user?
m
Not directly. You can use multiprocessing.Pool or concurrent.futures within the node.
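A minimal sketch of the concurrent.futures approach inside a single node, assuming the per-user work is already factored into its own function (`count_trades` here is a hypothetical stand-in for the real logic):

```python
from concurrent.futures import ThreadPoolExecutor


def count_trades(user_trades):
    """Stand-in for the real per-user logic (count, aggregate, etc.)."""
    return len(user_trades)


def summarise_users_in_parallel(trades_by_user, max_workers=4):
    """Fan the per-user work out across workers inside one node.

    `trades_by_user` maps user_id -> that user's trades. Returns a dict
    shaped for a PartitionedDataSet output (one key per file).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves input order, so results line up with the user_ids
        results = pool.map(count_trades, trades_by_user.values())
    return {f"user_{uid}": res for uid, res in zip(trades_by_user, results)}
```

For CPU-bound per-user work (e.g. heavy pandas aggregation), `concurrent.futures.ProcessPoolExecutor` or `multiprocessing.Pool` would likely be the better fit, since threads won't parallelise pure-Python computation.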
g
Ahh ok that I have a bit of experience with from other programs.
Thanks again for your quick response! Really really appreciate it! 🙇