# questions
q
I have a question about using Kedro in a non-ML setting. Specifically, I am trying to use Kedro in a data analysis (no learning) and statistical modeling/simulation pipeline. A simplified view of the use case is something like:

1. Set input parameters
2. Run 10,000 instances of a Monte Carlo simulation
3. Calculate statistics on that run
4. Save data

So far so good: Kedro defines these operations really nicely and keeps things tidy, along with visualizations and saving data for experiments. (Keep in mind that in reality, steps 2 and 3 are probably 6-8 nodes long, split across two or three pipelines in Kedro.)

Now, the problem is that I need to explore a large space of input parameters, e.g. sweep an input parameter in 100 steps of log space from 1e-6 to 1e-4. So the (simplified) workflow now becomes:

1. For input param1 in logspace(1e-6, 1e-4, 100):
2. Set input param
3. Run 10,000 instances
4. Calculate statistics
5. Save data
6. Go to 1 until done

I know Kedro wasn't built for this, but I want to highlight that the Kedro way is very amenable to general statistical modeling and simulation efforts that don't include ML. My question is: what's the "Kedro canonical" way to do that? From initial attempts I can see one of two options:

1. Instantiate 100 modular pipelines programmatically, then run through all of them in some way (ideally with a parallel runner); a rough sketch of what I mean is below.
2. Write my own for loop with threading, and pass a changed context, pipeline, or catalog to a SequentialRunner within each thread/loop iteration.

(Keep in mind this is also a simplified example; I probably have two or three variables that I want to loop over in similar ways, upping the total number of pipelines to run to something like 10,000+.)
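For option 1, what I had in mind is roughly the sketch below (untested; the modular-pipeline API and the parameter-remapping format vary between Kedro versions, and `base_pipeline` / `param1` are stand-ins for my real pipeline and parameter):

```python
# Rough, untested sketch of option 1: build 100 namespaced copies of the
# simulation pipeline, each wired to its own swept parameter.
import numpy as np
from kedro.pipeline import Pipeline
from kedro.pipeline.modular_pipeline import pipeline as modular_pipeline

def build_sweep(base_pipeline: Pipeline):
    copies, sweep_values = [], {}
    for i, value in enumerate(np.logspace(-6, -4, 100)):
        copies.append(
            modular_pipeline(
                base_pipeline,
                namespace=f"sweep_{i}",                # isolates datasets per copy
                parameters={"param1": f"param1_{i}"},  # each copy reads its own param
            )
        )
        sweep_values[f"params:param1_{i}"] = value
    # The per-copy parameter values still have to be fed into the catalog
    # somehow (e.g. catalog.add_feed_dict(sweep_values, replace=True)).
    return sum(copies, Pipeline([])), sweep_values
```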
K 1
This is probably similar to @Chris Rabotin's question from the other day...
👍 1
Might also be possible to implement something suggested here?
c
Yes, that's definitely a setup identical to what I am trying to do. How do you currently run these 10k simulations? Do you generate the cases in a node? I guess I could generate them in a node and store them in a dataframe. From the message that you quote, I don't understand how the namespaces fit in there: what is their added value?
j
hi folks, I cross-referenced these conversations in https://github.com/kedro-org/kedro/issues/1606. It turns out running Kedro with different parameter specs is currently not very well supported by our open source offering. However, we are collecting use cases, and I will raise the priority of this so we can start looking at it. Feel free to upvote that issue (react with a 👍🏼 to the top comment).
🥳 1
c
Ah, yes, using experiment pipelines might be a solution as well; I have yet to try them out. In my case I have a reproducible way of generating synthetic data given the seed, so maybe I can store that seed in the experiment tracking? Thanks @Juan Luis for linking the issue, I'll try out different approaches.
👍🏼 1
q
@Chris Rabotin The 10k simulations all sit in one node and are output as a NumPy array to the next node, which calculates statistics on the full array. I'm now wondering if a custom runner might be the way to go. I hadn't considered that... but I think that might make more sense and allow the loop to live within the `_run` function. Since the run function has access to the data catalog, I could add a memory dataset that can be introspected in the custom `_run` function to instantiate loops. @Juan Luis, would that make more sense in the short term? I'm not quite sure how the GitHub issue you posted would allow the kind of workflow I'm imagining.
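Something like this is what I'm picturing - rough and untested; the exact `_run` signature differs between Kedro versions, and `sweep_params` is just a name I made up for a MemoryDataset holding the parameter sets:

```python
# Untested sketch of a "sweep" runner. The real AbstractRunner._run signature
# differs across Kedro versions, so *args/**kwargs just passes it through.
from kedro.runner import SequentialRunner

class SweepRunner(SequentialRunner):
    """Run the same pipeline once per parameter set stored in the catalog."""

    def _run(self, pipeline, catalog, *args, **kwargs):
        # "sweep_params" would be a MemoryDataset holding a list of parameter
        # overrides, e.g. [{"param1": 1e-6}, {"param1": 1.2e-6}, ...]
        for overrides in catalog.load("sweep_params"):
            catalog.add_feed_dict(
                {f"params:{k}": v for k, v in overrides.items()},
                replace=True,
            )
            super()._run(pipeline, catalog, *args, **kwargs)
```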
j
hi @quantumtrope, a custom runner sounds like an idea worth trying - the conversation on #1606 is long and there are several similar use cases, so different ideas were proposed. dynamically instantiating modular pipelines is just another one of them.
q
I wrote a quick function that updates the DataCatalog with new parameters, and then uses a SequentialRunner with that updated catalog to run through things. I'm ignoring, for the moment, the fact that some intermediate CSV files are being overwritten; versioned datasets combined with experiment tracking would probably help with that issue.
c
How do you update the data catalog? Do you change the YAML itself or do you use the DataCatalog functionality of Kedro?
q
Basically I grab the existing parameters with `catalog.load("parameters")`, update the part of the dictionary that I need to, generate a dictionary matching the expected format of the feed_dict, and then call `catalog.add_feed_dict(new_params, replace=True)`.
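In code it looks roughly like this (illustrative sketch; `catalog` and `pipeline` are assumed to already be in scope, `param1` is a stand-in name, and `runner.run` may need extra arguments depending on the Kedro version):

```python
# Illustrative sketch of the loop described above. `catalog` and `pipeline`
# are assumed to already be in scope (e.g. from a Kedro session or notebook);
# runner.run may need extra arguments (e.g. a hook manager) in some versions.
import numpy as np
from kedro.runner import SequentialRunner

runner = SequentialRunner()
for value in np.logspace(-6, -4, 100):
    params = catalog.load("parameters")        # existing parameters dict
    params["param1"] = value                   # update the part I need
    new_params = {"parameters": params, "params:param1": value}
    catalog.add_feed_dict(new_params, replace=True)
    runner.run(pipeline, catalog)              # re-run with the updated catalog
```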
c
Interesting. How do you plug into it this way? Is that with a hook with the "click" package? My usage of Kedro is very limited for the time being; I only use the newbie features.
q
We're a Jupyter-facing outfit, so it's from a Jupyter notebook with the Kedro extension loaded.
👍 1