# questions
m
I need some advice here. I have a big project with E2E pipelines where I need to build a fact/dim model, features, a model, and some reporting graphs. For the first part, I need Spark to do the processing, until I have a nice set of features. Then I need pandas, scikit-learn, etc. to build a model, create insight plots, etc. What’s the best way to split up a Kedro project into 2 parts (a part requiring Spark and a part that doesn’t)?
n
What’s the challenge here? Did you run into any dependency issues?
m
Not really, it’s mostly just to reduce container size
n
Cc @datajoely may have some more to say about this
So essentially you need to separate the dependencies per pipeline if they are run separately.
👍 1
d
Do your pipelines (like, each entity under the pipelines/ directory) contain either Spark or pandas/sklearn code, and not both? In that case, +1 to what @Nok Lam Chan said, and just specify dependencies per pipeline.
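For example, one way to keep the dependencies apart (pipeline names and layout here are just placeholders, not something Kedro generates for you by default) is a requirements file per pipeline folder, with each container image installing only its own:

```
src/my_project/pipelines/
├── feature_engineering/        # Spark-heavy pipeline
│   ├── nodes.py
│   ├── pipeline.py
│   └── requirements.txt        # pyspark, ...
└── model_training/             # pandas/sklearn pipeline
    ├── nodes.py
    ├── pipeline.py
    └── requirements.txt        # pandas, scikit-learn, matplotlib, ...
```

The Spark image then installs only feature_engineering/requirements.txt and the ML image only model_training/requirements.txt, so neither container carries the other’s dependencies.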
m
Well, it’s either Spark/pandas or scikit-learn/pandas. But it’s not only about dependencies. These pipelines are huge (think 200-300 nodes each?), so it’s hard to keep a good overview of how everything is linked. So my idea was to start something like a monorepo, possibly with multiple Kedro projects. But then I’m not sure how to organise it (sharing parts of the catalog, params, and code).
👍 1
n
Taking the modular approach, the catalog/params can all be broken down to the pipeline level. You may need a common module to share code across pipelines.
Micropkg basically treats each pipeline as its own Python package, so it’s more or less the same as a monorepo. But I think that also depends on how large the project is and whether it’s going to be reused.
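As a rough sketch (the folder and pipeline names are made up, and the exact config file patterns depend on your Kedro version), the per-pipeline catalog/params split plus a shared module could look like:

```
conf/base/
├── catalog_spark.yml           # datasets used by the Spark pipelines
├── catalog_ml.yml              # datasets used by the pandas/sklearn pipelines
└── parameters/
    ├── feature_engineering.yml
    └── model_training.yml
src/my_project/
├── common/                     # shared helpers imported by both sets of pipelines
└── pipelines/
    ├── feature_engineering/
    └── model_training/
```

If you go the micropkg route, a single pipeline can then be packaged on its own with something like `kedro micropkg package pipelines.feature_engineering`.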
m
How would micropkg then work with Spark? Can you enable/disable hooks based on which pipelines run?
n
Would it be possible to initiate the hook in before_pipeline_run, so that you only instantiate the SparkHook for the Spark pipelines?
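Something like this could work as a starting point (the SPARK_PIPELINES set and the Spark config are placeholders you’d adapt to your project; it only relies on the standard before_pipeline_run hook spec):

```python
from kedro.framework.hooks import hook_impl

# Hypothetical: names of the registered pipelines that actually need Spark
SPARK_PIPELINES = {"fact_dim", "feature_engineering"}


class ConditionalSparkHooks:
    """Only spins up a SparkSession for pipelines that need it."""

    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        pipeline_name = run_params.get("pipeline_name") or "__default__"
        if pipeline_name not in SPARK_PIPELINES:
            return  # pandas/sklearn pipelines never touch Spark

        # Import lazily so a pandas-only container doesn't need pyspark installed
        from pyspark import SparkConf
        from pyspark.sql import SparkSession

        spark_conf = SparkConf().setAppName(pipeline_name)
        spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()
        spark.sparkContext.setLogLevel("WARN")
```

You’d register it in settings.py as usual, e.g. HOOKS = (ConditionalSparkHooks(),). Because pyspark is imported inside the method, the non-Spark image never needs it on the path.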