# questions
e
Alright fellow Kedroids - let's start a thread (if there isn't already one) on mono repo vs. individual pipeline repos. I know folks have opinions here! 🙂
🧵 1
K 1
j
one repo per project and N pipelines per project? that would be my take, what do you think @Ed Henry? 🍿
👍 1
e
I'm torn because my pipelines are growing to the point where dependency hell is starting to show. So now I'm left with deciding if mono-repos are worth the overhead moving forward.
d
Kedro projects are inherently monorepos, where dependencies should be defined at the pipeline micropackage level. Dependency hell occurs when you don't define requirements per pipeline micropackage and instead throw them all into a top-level requirements file. The only reason individual pipeline repos help is that they force you to separate out the requirements, but that's not to say you can't properly manage them in a monorepo.
Kedro's built-in support for managing the project as a monorepo is immature, because it tends towards what's easy from the new data scientist's perspective (throw your requirements in one file) instead of what's correct from a software engineering perspective (manage loosely-bound requirements for each pipeline, and make sure they're properly resolvable).
The above are my opinions, as you asked :), and do not reflect the views of the maintainer team.
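A minimal sketch of what "requirements per pipeline micropackage" could look like on disk; the pipeline and package names here are hypothetical, and the exact layout depends on how you wire the per-pipeline files into your build:

```text
src/my_project/pipelines/
├── data_processing/
│   ├── pipeline.py
│   ├── nodes.py
│   └── requirements.txt   # only what this pipeline needs, e.g. pandas, s3fs
└── modeling/
    ├── pipeline.py
    ├── nodes.py
    └── requirements.txt   # e.g. scikit-learn, only for this pipeline
```

The idea is that each pipeline carries its own loosely-pinned requirements, and the project-level environment is produced by resolving all of them together, so conflicts surface at resolution time instead of silently bleeding between pipelines.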
👍 1
e
I agree with almost everything you said! Except when you start to abstract modules outside of pipelines to be reusable across pipelines; that's where I'm hitting the dependency hell. I'm using pipelines to do more than just data engineering-focused tasks, and I have other modules in the project repo that are then used throughout various pipelines. For example, we generate text embeddings for downstream tasks, and the models we use for creating embeddings, and their respective modules containing the implementation, are used across pipelines. Top-level requirements or not, the dependencies of the model modules are bleeding between pipelines. Maybe this is where I create the separation? Just thinking out loud. :) Another area where it's starting to crop up is across various data repositories and the interfaces from which we need to consume data having different requirements, such as `s3fs`, etc.
m
Have you tried any monorepo managers before, @Ed Henry? e.g. Bazel, Pants
👍 2
d
> Except when you start to abstract modules outside of pipelines to be reusable across pipelines, and that's where I'm hitting the dependency hell.
This is a good point. In my past experience solving this problem, I added a `lib` directory inside of the project, where we built packages that could be used as dependencies by multiple pipelines. For example, `lib/embedding_generator` could be used by `pipelines/some_modeling_pipeline` and `pipelines/another_modeling_pipeline`. This did require a more complex CI/build process, where libs were built before pipelines. Also explored the use of monorepo tools (e.g. Pants, Bazel) for this, though never got around to it. My design was heavily influenced by https://medium.com/opendoor-labs/our-python-monorepo-d34028f2b6fa, where instead of `projects` you have Kedro `pipelines`. This was also done before Kedro had a more generic micropackaging workflow, so now perhaps Kedro could also manage the `libs`.
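For reference, the kind of layout described above might look roughly like this; the package and pipeline names are illustrative, not taken from any actual project:

```text
repo/
├── lib/
│   └── embedding_generator/
│       ├── pyproject.toml            # built and published first in CI
│       └── embedding_generator/      # shared implementation code
├── pipelines/
│   ├── some_modeling_pipeline/
│   │   └── requirements.txt          # depends on embedding_generator
│   └── another_modeling_pipeline/
│       └── requirements.txt          # depends on embedding_generator
```

CI builds the `lib/` packages first (e.g. publishing them to an internal index), and the pipeline builds then resolve against those artifacts, which is what makes the shared code a proper versioned dependency rather than an import that drags its requirements into every pipeline.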
👍 1
e
This is all great info! Thanks all! I would love to hear others' opinions, as well!
j
I have a somewhat diverging opinion from @Deepyaman Datta: I don't think Kedro projects are inherently monorepos, but that's where the "dependency hell" problems arise from! So I agree on the symptoms, not on the root cause (but maybe this is an unimportant distinction). Kedro projects in 0.18, and much more so in the upcoming 0.19, assume that 1 Python library = 1 Kedro project, hence the dependencies are mapped once per project.
We do have a micropackaging workflow https://docs.kedro.org/en/stable/nodes_and_pipelines/micro_packaging.html but, if anything, it's more for packaging than for development. Regardless, it's somewhat inconsistent with other parts of the library and needs some love, and users complain about it; we are not seeing a lot of usage! `kedro micropkg *` has fewer than 300 hits in our telemetry, out of 2.41M events, hence ~0.01%.
I hear you on the dependency issues though @Ed Henry, I think it's happening to other users with big projects (I recall @Marc Gris has spoken up about this in the recent past too). My recommendation would be, for now, to split the pipelines across different Kedro projects and connect them through a common catalog. But it's an area where we don't have lots of good recommendations to make; if you come up with nice usage patterns, we'd be glad to add those to our docs.
👍 1
m
Hi everyone, @Ed Henry: I feel your pain. In my humble and uneducated opinion: it would be fantastic if Kedro could (a bit like Airflow) handle/support dynamic venv activation at the node level! (not just the micro-pipeline level). Regards, M
👍 1
d
> we do have a micropackaging workflow https://docs.kedro.org/en/stable/nodes_and_pipelines/micro_packaging.html but if anything, it's more for packaging than for development. regardless, it's somewhat inconsistent with other parts of the library and needs some love, as well as users complaining about it - we are not seeing a lot of usage! `kedro micropkg *` has less than 300 hits in our telemetry, out of 2.41M events, hence 0.01 %.
As you point out, I agree it needs some love. The usage is a bit of a chicken-and-egg problem; there is significant packaging/unpackaging of Kedro pipelines, etc. going on in some places, but they don't use `kedro micropkg` because it doesn't work as well as just using native Python packaging. But the people who end up integrating custom Python packaging into their Kedro projects/monorepos/tooling are also people who are relative experts when it comes to Python packaging (compared to 95+% of Kedro users).
👍🏼 1
👍 1
i
On the face of it, I really like the idea of micropackaged pipelines; the bottleneck for us adopting this paradigm is the need for a very robust CI/CD system.
At the moment we operate with one repo, one project. Within that repo we have the full E2E process, from data acquisition to model predictions and reporting.
y
I think that the mono-vs-poly-repo question has little to do with dependency hell.
• The mono-vs-poly-repo decision should be taken considering such criteria as collaboration simplicity, CI and linting consistency, and the scope of typical changes (e.g. if changes to one part of the software often require changes in another, they should likely be a monorepo).
• Dependency hell, however, is more a question of properly isolating virtual environments for different parts of the software.
• You can perfectly well have 10 Kedro projects in one repo, each requiring totally different dependencies, and have a separate venv for each. In the same repo you can also have packages that are not related to those Kedro projects, and have separate venvs for them too.
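A sketch of that last point, multiple isolated projects coexisting in one repo; the project names and venv locations are hypothetical:

```text
repo/
├── ingestion_project/       # Kedro project A
│   ├── requirements.txt
│   └── .venv/               # e.g. python -m venv ingestion_project/.venv
├── modeling_project/        # Kedro project B, different dependencies
│   ├── requirements.txt
│   └── .venv/
└── shared_utils/            # plain Python package, its own venv if needed
    └── pyproject.toml
```

Each project's environment is created from its own requirements file, so conflicting pins (say, different `pandas` versions) never have to resolve against each other, while the repo still gives you one place for collaboration, CI, and consistent linting.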
👍🏼 1
👍 2