# questions
m
I need some advice here. I have a big project with E2E pipelines where I need to build a fact/dim model, features, a model, and some reporting graphs. For the first part, I need Spark to do the processing, until I have a nice set of features. Then I need pandas, scikit-learn, etc. to build a model, create insight plots, etc. What’s the best way to split up a Kedro project into 2 parts (a part requiring Spark and a part that doesn’t)?
n
What’s the challenge here? Did you run into any dependency issues?
m
Not really, it’s mostly just to reduce container size
n
Cc @datajoely may have some more to say about this
So essentially you need to separate the dependencies per pipeline if they are run separately.
👍 1
d
Do your pipelines (like, each entity under the pipelines/ directory) contain either Spark or pandas/sklearn code, and not both? In that case, +1 to what @Nok Lam Chan said, and just specify dependencies per pipeline.
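For example, one way to keep the dependencies apart (pipeline names and layout here are just placeholders, not something Kedro generates for you by default) is a requirements file per pipeline folder, with each container image installing only its own:

```
src/my_project/pipelines/
├── feature_engineering/        # Spark-heavy pipeline
│   ├── nodes.py
│   ├── pipeline.py
│   └── requirements.txt        # pyspark, ...
└── model_training/             # pandas/sklearn pipeline
    ├── nodes.py
    ├── pipeline.py
    └── requirements.txt        # pandas, scikit-learn, matplotlib, ...
```

The Spark image then installs only feature_engineering/requirements.txt and the ML image only model_training/requirements.txt, so neither container carries the other’s dependencies.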
m
Well, it’s either Spark/pandas or scikit-learn/pandas. But it’s not only about dependencies. These pipelines are huge (think 200-300 nodes each?), so it’s hard to keep a good overview of how everything is linked. So my idea was to start something like a monorepo, possibly with multiple Kedro projects. But then I’m not sure how to organise it (sharing parts of the catalog, params, and code).
👍 1
n
Taking the modular approach, the catalog/params can all be broken down to the pipeline level. You may need a common module to share code across pipelines.
Micropkg basically treats each pipeline as its own Python package, so it’s more or less the same as a monorepo. But I think that also depends on how large the project is and whether it’s going to be reused.
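As a rough sketch (the folder and pipeline names are made up, and the exact config file patterns depend on your Kedro version), the per-pipeline catalog/params split plus a shared module could look like:

```
conf/base/
├── catalog_spark.yml           # datasets used by the Spark pipelines
├── catalog_ml.yml              # datasets used by the pandas/sklearn pipelines
└── parameters/
    ├── feature_engineering.yml
    └── model_training.yml
src/my_project/
├── common/                     # shared helpers imported by both sets of pipelines
└── pipelines/
    ├── feature_engineering/
    └── model_training/
```

If you go the micropkg route, a single pipeline can then be packaged on its own with something like `kedro micropkg package pipelines.feature_engineering`.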
m
How would micropkg then work with Spark? Can you enable/disable hooks based on which pipelines run?
n
Would it be possible to initiate the hook in before_pipeline_run, so that you only instantiate the SparkHook for the Spark pipelines?
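Something like this could work as a starting point (the SPARK_PIPELINES set and the Spark config are placeholders you’d adapt to your project; it only relies on the standard before_pipeline_run hook spec):

```python
from kedro.framework.hooks import hook_impl

# Hypothetical: names of the registered pipelines that actually need Spark
SPARK_PIPELINES = {"fact_dim", "feature_engineering"}


class ConditionalSparkHooks:
    """Only spins up a SparkSession for pipelines that need it."""

    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        pipeline_name = run_params.get("pipeline_name") or "__default__"
        if pipeline_name not in SPARK_PIPELINES:
            return  # pandas/sklearn pipelines never touch Spark

        # Import lazily so a pandas-only container doesn't need pyspark installed
        from pyspark import SparkConf
        from pyspark.sql import SparkSession

        spark_conf = SparkConf().setAppName(pipeline_name)
        spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()
        spark.sparkContext.setLogLevel("WARN")
```

You’d register it in settings.py as usual, e.g. HOOKS = (ConditionalSparkHooks(),). Because pyspark is imported inside the method, the non-Spark image never needs it on the path.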