# questions
l
My team and I are building demand and supply forecasting models for fresh produce at weekly and daily intervals (different models: daily supply, daily demand, weekly demand, weekly supply), and our models also sit at different granularities (think product level vs. product-and-SKU level) - see the attached picture for clarity. In total we have 8 different but very similar modelling problems. The idea is to build a model, or set of models, for each of these scenarios and for each client.

Right now we have Jupyter notebooks to train the initial models (call it m0), and then kedro pipelines for refitting and getting predictions. In our current pipelines we've also made the mistake of representing pipelines as nodes, e.g. feature engineering as a node rather than a pipeline, we are not (yet) following best practices in terms of data layering (https://web.archive.org/web/20250215132726/https://towardsdatascience.com/the-importance-of-layered-thinking-in-data-engineering-a09f685edc71/), and we are not storing intermediate outputs. We would like to migrate those nodes to actual pipelines and make the nodes more atomic (each mapping to a function with a smaller scope). We currently intend to migrate our Jupyter notebooks to, hopefully, just one pipeline that can handle the four scenarios I presented earlier, with its different components (preprocess, fe, train, evaluations) as pipelines.

What I'm really not clear about is: if we don't manage to abstract the logic away into one pipeline that covers all the scenarios, would we end up with, say, 4 pipelines (preprocess, fe, train, evaluations) x 8 problem scenarios (32 atomic pipelines and 8 "problem" pipelines)? If so, how is it recommended to manage that? At that point, is each of these in a separate project/repo? Is Alloy designed for this sort of scenario?

Also, let's say our end-to-end pipeline only ever really runs once, or once in a while: to train an initial model on historical data, on data drift events, or when new features are added and we want the changes to apply to all of our clients. But then we have some pipelines that run more often, such as getting predictions after loading a model that's already fit, or refitting to the latest data and getting predictions. Should these live in the same repo? Should they all be part of one big pipeline, with some pipelines skipped conditionally?

Sorry for the big question, but I've had all of these in my mind for a long time. In this scenario it almost feels sensible to come up with custom scripts and our own architecture for ML pipelines rather than try to fit the kedro mould, but I might be missing something!
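A minimal sketch of what that kind of composition could look like in Kedro. All function, dataset and parameter names here are hypothetical stand-ins, not the team's actual code:

```python
# A sketch only: component pipelines composed into one end-to-end pipeline.
from kedro.pipeline import Pipeline, node, pipeline


def clean_raw_data(raw):
    return raw  # stand-in for real preprocessing


def build_features(clean):
    return clean  # stand-in for real feature engineering


def train_model(features, model_options):
    return {"options": model_options}  # stand-in for a real fit


def evaluate_model(model, features):
    return {"n_rows_scored": len(features)}  # stand-in for real evaluation


def create_preprocess_pipeline() -> Pipeline:
    return pipeline([node(clean_raw_data, "raw_sales", "clean_sales", name="clean_raw_data")])


def create_feature_pipeline() -> Pipeline:
    return pipeline([node(build_features, "clean_sales", "model_input", name="build_features")])


def create_train_pipeline() -> Pipeline:
    return pipeline(
        [node(train_model, ["model_input", "params:model_options"], "model", name="train_model")]
    )


def create_evaluation_pipeline() -> Pipeline:
    return pipeline([node(evaluate_model, ["model", "model_input"], "metrics", name="evaluate_model")])


def create_end_to_end_pipeline() -> Pipeline:
    # Pipelines compose with +, so "nodes become pipelines" just means the
    # end-to-end pipeline is the sum of its four components.
    return (
        create_preprocess_pipeline()
        + create_feature_pipeline()
        + create_train_pipeline()
        + create_evaluation_pipeline()
    )
```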
h
Someone will reply to you shortly. In the meantime, this might help:
d
Do you think you can ever get to a place where the only difference between pipelines is configuration?
l
That’s hopefully where we want to get to, yes, but I’m concerned about configuration becoming a bit unwieldy, especially as we might have per-client configuration
d
The reason I say this is that there is a pattern where you can essentially create instances of the same pipelines with different inputs/outputs/parameters: https://docs.kedro.org/en/stable/nodes_and_pipelines/namespaces.html
👌🏼 1
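A minimal sketch of that namespacing pattern, following the linked docs. The scenario names, datasets and stub function are hypothetical:

```python
# A sketch of namespaced instances of one shared pipeline.
from kedro.pipeline import Pipeline, node, pipeline


def train_model(model_input, model_options):
    return {"options": model_options}  # stand-in for a real fit


base_pipeline = pipeline(
    [node(train_model, ["model_input", "params:model_options"], "model", name="train_model")]
)


def create_scenario_pipelines() -> Pipeline:
    scenarios = ["daily_demand", "daily_supply", "weekly_demand", "weekly_supply"]
    instances = [
        # namespace prefixes the free inputs/outputs/params, so each instance reads
        # daily_demand.model_input, params:daily_demand.model_options, etc. - i.e.
        # the per-scenario (or per-client) differences live entirely in catalog/config.
        pipeline(base_pipeline, namespace=scenario)
        for scenario in scenarios
    ]
    return sum(instances[1:], instances[0])
```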
n
The shareable unit is always function - node - pipeline. If the logic is very similar, then maybe you can refactor the configuration so that a node/function can be reused between the pipelines. If the logic is different anyway, it deserves its own node, and the maintenance effort is inevitable.
👌🏼 1
Or do you mean you need conditional logic to create a pipeline? That should also be doable.
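A minimal sketch of that kind of conditional pipeline creation. The flag and the promo-features node are made-up examples of a per-client tweak:

```python
# A sketch of conditional pipeline assembly: a Kedro pipeline is an ordinary
# Python object, so plain if/else can decide what goes in.
from kedro.pipeline import Pipeline, node, pipeline


def build_features(clean):
    return clean  # stand-in


def build_promo_features(clean):
    return clean  # stand-in


def create_feature_pipeline(include_promo_features: bool = False) -> Pipeline:
    nodes = [node(build_features, "clean_sales", "base_features", name="build_features")]
    if include_promo_features:
        # only added for clients that actually have promotions data
        nodes.append(
            node(build_promo_features, "clean_sales", "promo_features", name="build_promo_features")
        )
    return pipeline(nodes)
```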
l
> Or do you mean you need conditional logic to create a pipeline? That should also be doable.
In some instances, different clients might need slightly tweaked configurations
What about these two sets of questions?
> What I’m really not clear about is: if we don’t manage to abstract the logic away into one pipeline that covers all the scenarios, would we end up with, say, 4 pipelines (preprocess, fe, train, evaluations) x 8 problem scenarios (32 atomic pipelines and 8 “problem” pipelines)? If so, how is it recommended to manage that? At that point, is each of these in a separate project/repo? Is Alloy designed for this sort of scenario?
> Also, let’s say our end-to-end pipeline only ever really runs once, or once in a while: to train an initial model on historical data, on data drift events, or when new features are added and we want the changes to apply to all of our clients. But then we have some pipelines that run more often, such as getting predictions after loading a model that’s already fit, or refitting to the latest data and getting predictions. Should these live in the same repo? Should they all be part of one big pipeline, with some pipelines skipped conditionally?
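For reference, one way this could look while staying in a single repo: a minimal sketch of a pipeline_registry.py where the rarely-run training pipeline and the frequently-run inference pipeline are registered separately. Dataset, node and pipeline names are hypothetical stand-ins:

```python
# A sketch of a pipeline_registry.py where the rarely-run end-to-end training
# pipeline and the frequently-run inference pipeline live in the same project.
from kedro.pipeline import Pipeline, node, pipeline


def fit_model(model_input, model_options):
    return {"options": model_options}  # stand-in for the end-to-end training logic


def predict(model, latest_data):
    return latest_data  # stand-in for loading a fitted model and scoring


def register_pipelines() -> dict[str, Pipeline]:
    training = pipeline(  # runs rarely: initial fit, drift events, new features
        [node(fit_model, ["model_input", "params:model_options"], "model", name="fit_model")]
    )
    inference = pipeline(  # runs often: score latest data with the persisted model
        [node(predict, ["model", "latest_data"], "predictions", name="predict")]
    )
    return {
        "training": training,
        "inference": inference,
        "__default__": training + inference,
    }
```

Each registered pipeline can then be scheduled independently, e.g. `kedro run --pipeline=inference` on the frequent cadence and `kedro run --pipeline=training` only on drift events or feature changes.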