# questions
Galen Seilis:
Has someone written any checklists or guides on best practices for using Kedro? It feels like there are a lot of choices to make about organizing nodes, pipelines, data sources, and parameters. I'm reading Patrick Viafore's *Robust Python*, which gives some general Python-level advice, but it doesn't mention practices around Kedro (as far as I have read).
Nok Lam Chan:
Regarding structure: `kedro pipeline create` is Kedro's answer to how pipelines should be structured.
What specific choice are you making? It may be easier to answer if you can name A vs. B.
Galen Seilis:
@Nok Lam Chan In some sense, what I know more about is what I need Kedro to do; I am less familiar with how to make choices about structuring a Kedro project for these different activities. Here are some examples of my use cases (and some common tools):
• data mining
◦ frequent pattern mining with mlxtend
◦ process mining via PM4Py
• statistical/causal inference
◦ bespoke Bayesian models with PyMC or TensorFlow Probability
◦ Facebook's Prophet
• machine learning
◦ scikit-learn
◦ Keras/TensorFlow
• discrete event simulations
◦ queueing networks with Ciw or queueing-tool
◦ miscellaneous with SimPy
Nok Lam Chan:
I can't comment on all of these at once, but, for example, using Kedro for SimPy doesn't make much sense to me.
Of course you can use Kedro's ConfigLoader to manage the configuration and a pipeline to handle the preprocessing for a simulation, but structuring Kedro nodes to run the simulation isn't going to add much value.
Galen Seilis:
@Nok Lam Chan I agree that for entirely self-contained scenarios there is little value in using Kedro just to call a SimPy simulation. But some of my simulations involve learning statistical models from real data, and those statistical models are then used in the simulation product. So Kedro could conveniently pull the data onto a server, clean and process the data, train models on the processed data, and pass the trained models to the simulation product along with the sim config; the simulation product then runs and outputs results, and the results are processed and visualized. Does that make sense?
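The flow described above is really just a chain of functions with clear inputs and outputs, which is all Kedro nodes are. A minimal sketch, with every name and the trivial "model" being hypothetical stand-ins:

```python
# Sketch of the data -> model -> simulation -> report flow described above,
# as plain Python functions. All names and logic here are hypothetical
# stand-ins; in a real project each function would be a Kedro node.

def clean_data(raw_rows):
    """Drop records with missing values (stand-in for real cleaning)."""
    return [r for r in raw_rows if None not in r.values()]

def train_model(clean_rows):
    """Fit a trivial 'model': the mean of a numeric field."""
    values = [r["value"] for r in clean_rows]
    return {"mean": sum(values) / len(values)}

def run_simulation(model, sim_config):
    """Feed the trained model's output into the simulation product."""
    return {"throughput": model["mean"] * sim_config["n_servers"]}

def summarize(results):
    """Post-process simulation output for reporting."""
    return f"throughput={results['throughput']:.1f}"

raw = [{"value": 2.0}, {"value": 4.0}, {"value": None}]
report = summarize(run_simulation(train_model(clean_data(raw)), {"n_servers": 3}))
print(report)  # → throughput=9.0
```

Because each stage only consumes the previous stage's output, the same functions could later be wired into a Kedro pipeline without changing their bodies.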
Nok Lam Chan:
This makes perfect sense, but you also need to consider whether all of this belongs in a single Kedro monorepo. Could it be a Kedro forecasting pipeline that gets packaged standalone, with your simulation program using it as an external library? If you would like to keep Kedro as the entry point for convenience, you can still wrap the simulation program as a single node in Kedro so you have a full pipeline.
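One way to read the "single node" suggestion: the whole simulation program stays an ordinary function in its own package, and Kedro only sees its inputs and outputs. A hedged sketch with hypothetical names; the `node(...)` registration is shown in a comment so the function itself stays free of Kedro dependencies:

```python
# Sketch of wrapping an external simulation program as one Kedro node.
# The function below stands in for the entry point of a separately
# packaged simulation library; all names are hypothetical.

def simulate(trained_model, sim_config):
    """Run the whole simulation as one black box.

    Kedro does not need to know what happens inside; it only passes the
    trained model and config in and stores the results it gets back.
    """
    runs = []
    for seed in range(sim_config["n_runs"]):
        # Stand-in for a real SimPy/Ciw run with a different seed each time.
        runs.append(trained_model["rate"] * (1 + seed))
    return {"runs": runs}

# In the Kedro project this would be registered as a single node, e.g.:
#
#   from kedro.pipeline import node
#   node(simulate, inputs=["trained_model", "params:sim_config"],
#        outputs="simulation_results")
```

This keeps the simulation engine independently testable and versionable, while Kedro still provides the full pipeline around it.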
Galen Seilis:
@Nok Lam Chan Those are excellent questions. I don't think it is necessary for these processes to all be implemented in the same Kedro project. Building much of the functionality as separate modular pieces and then composing them in a Kedro project makes sense to me.
Yury Fedotov:
@Galen Seilis Hi! I'd think of it this way.
• Kedro is basically a framework for organizing a finite sequence of steps with clear inputs and outputs into a maintainable, well-structured Python codebase that you can run and extend in a standardized way.
• What those steps are, and what their inputs and outputs are, is not that important. As long as they are Python objects, Kedro works with them. Be it a simulation thing or an ML model, for me those are all just Python objects that I do something with, and I use Kedro `nodes` and `pipelines` to sequence that, and the `catalog` to manage `load/save` logic.
@Nok Lam Chan also curious if you agree with that.
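That framing — pure functions sequenced by a pipeline, with a catalog owning load/save — can be illustrated in a few lines. This is an illustration of the concept only, not Kedro's internals; the runner, step functions, and dataset names are all hypothetical:

```python
# Illustration (not Kedro's actual internals) of the framing above:
# nodes are pure functions, a "catalog" owns load/save, and a pipeline
# is just an ordered list of (function, inputs, output) entries.

def run_pipeline(pipeline, catalog):
    """Execute each step, reading inputs from and writing outputs to the catalog."""
    for func, inputs, output in pipeline:
        catalog[output] = func(*(catalog[name] for name in inputs))
    return catalog

# Two hypothetical steps with clear inputs and outputs.
def double(xs):
    return [2 * x for x in xs]

def total(xs):
    return sum(xs)

catalog = {"raw": [1, 2, 3]}  # in Kedro, datasets would be declared in catalog.yml
pipeline = [
    (double, ["raw"], "doubled"),
    (total, ["doubled"], "grand_total"),
]
run_pipeline(pipeline, catalog)
print(catalog["grand_total"])  # → 12
```

Because the functions never touch I/O themselves, swapping where `raw` comes from (memory, CSV, database) only changes the catalog, not the steps — which is exactly the separation Kedro standardizes.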
Nok Lam Chan:
@Yury Fedotov I think it depends on how you look at Kedro. From that perspective you are treating Kedro more or less like an orchestrator. In that sense, yes, I think you can use Kedro to run this sequence of jobs; my point is that there will not be much benefit to developing, e.g., your simulation engine in Kedro. Kedro will probably just call the simulation as an external dependency, and it should have its own package.