# random
How do Git and Kedro play ball together? We have a classification data science pipeline written in Python and hosted in a GitHub repository. While I get the concept of a Kedro project and having a workspace per data model, I don't get how to sync the code across projects/workspaces/experiments. Should Kedro tasks (and pipelines) be thin wrappers that import my existing Python code, or not? What are the best practices if you already have an existing code base and Git repository with your code? Thanks!
👍🏼 1
Should Kedro tasks (and pipelines) be thin wrappers that import my existing Python code, or not? What are the best practices if you already have an existing code base and Git repository with your code?
Sure, you can do that, if you have an existing package you are publishing/can import. Else, you can use Kedro as a way of organizing that existing codebase.
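For what it's worth, the "thin wrapper" pattern could look roughly like this sketch. The helper name `clean_text` and the dataset names are made up, and the Kedro wiring is shown in comments so the snippet stays self-contained:

```python
# Sketch of the "thin wrapper" pattern: Kedro nodes delegate to an
# existing, Git-versioned codebase instead of duplicating its logic.
# `clean_text` stands in for a helper from your shared package.

def clean_text(raw: str) -> str:
    """Existing business-logic helper (lives in your shared repo/package)."""
    return raw.strip().lower()

def preprocess_node(raw_column: list[str]) -> list[str]:
    """Thin Kedro node: no logic of its own, it just delegates."""
    return [clean_text(value) for value in raw_column]

# In the Kedro project itself, the wiring would look roughly like:
#   from kedro.pipeline import Pipeline, node
#   pipeline = Pipeline([node(preprocess_node, "raw_data", "clean_data")])
```

This way a bug fix to `clean_text` lives in one place, and every pipeline that imports it picks it up.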
While I get the concept of Kedro project and having a workspace per data model, I don’t get how do I sync the code across projects/workspaces/experiments.
I don't totally understand your question here. Maybe this will help, else you can clarify: Kedro has a separation between logic (the directory with pipelines) and configuration (the directory with specifics around data loading + parameters). You can define a reusable pipeline (e.g. for classification) as a Kedro project, package it (this will only package the logic), and deploy it in different contexts with different configuration. Alternatively, you can pull reusable pipelines into different Kedro projects.
Hi @Deepyaman Datta, thanks for the reply! Here's an example: let's say that I have data preprocessing helpers that apply certain business logic to clean the data, and I want to apply a bug fix to all of my existing experiments, i.e. different Kedro folders. What is the best practice here?
1. Backport the changes and apply them manually to each experiment?
2. Clone/fork an existing experiment that you'd like to fix and apply the fix on top of the newly forked experiment?
3. Maintain a single source of truth, i.e. repo/src/, and do relative Python imports from all experiments/folders to repo/src?
What I don't get is how you pull reusable pipelines into different Kedro projects. Is it by duplicating the code (copy-pasting across the src folders of different Kedro projects), or by having shared core code that the Kedro projects import?
I think best practice would never advocate duplicating code (unless you needed to make substantial changes on top of it)? So I would personally opt for option 3, having a shared single source of truth. That said, I'm a C++ programmer, not a Pythonista, so perhaps @Juan Luis can confirm.
Thanks @Jo Stichbury! The only concern with option #3 is reproducibility.
If you use relative imports you may change the semantics of the past (old experiments), so when you come back a year from now to reproduce the results, you may lose point-in-time correctness for the code.
The equivalent of this discussion in the classic software world is having either: 1. A single branch in Git, or 2. A main branch plus different release branches.
In option #1 you can create semi-isolation by separating code into different folders and doing relative imports. In option #2 you effectively have different copies (e.g. releases) of the same code, and if you want to apply a change from, say, the main branch to all other releases, you either cherry-pick or backport the change to the relevant release branches.
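As a concrete illustration of option #2, backporting a fix to a release branch typically looks like this (the branch name and commit SHA are placeholders):

```text
git checkout release/1.x       # switch to the release branch
git cherry-pick <sha-of-fix>   # replay the fix commit from main onto it
```

Each release branch you care about gets the same cherry-pick, which is exactly the manual overhead being discussed.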
Can you make packages and version them, aligned with release tags in your "library" repo? Then use requirements to set specific versions to ensure you always have the correct version for each experiment?
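Concretely, that pinning could look like this in each experiment's requirements file (the package name and versions below are made up for illustration):

```text
# experiment-A/requirements.txt
my-ds-helpers==1.2.0   # the exact helpers release this experiment was run with

# experiment-B/requirements.txt
my-ds-helpers==1.3.1   # a newer release that includes the backported bug fix
```

Rebuilding either environment then reproduces the code exactly as it was at experiment time.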
Sounds to me like, if you're reusing these helpers in different experiments and you want to keep evolving them while also keeping reproducibility in the past (a totally fair use case in my opinion), you either do it manually (lots of copy-pasting) or have a separate repo with proper versioning, so in a given experiment you can always pin a specific version of said helpers. That repo can follow a trunk-based Git branching model so you can evolve the helpers while also backporting fixes. I would go for the first approach if the helpers are narrowly scoped, small, and not expected to evolve much (after all, a bit of copy-pasting is sometimes better than introducing the wrong abstractions or too much complexity early on). If the helpers grow in scope, then I would go for full reusable library code and apply the best practices of any open source third-party library.
Nice! I appreciate the answers and they make total sense to me
🚀 2
What you're suggesting is essentially to either (1) copy-and-paste (which may not be a biggie for a relatively small code base), or (2) apply SemVer (semantic versioning) to my data science core code and treat it like any other 3rd-party dependency.
The former is quick-and-dirty and works when you don't have a lot of code, in which case the abstractions may be overkill and complex to design and maintain.
Whereas the latter ensures reproducibility and only requires you to update the reference (pointer) to the wheel / the requirements.txt of each Kedro project.
that would be my recommendation indeed!
The same question can be asked about what happens when numpy or pandas ship a bug fix, right?
And your code is actually no different. I recently had a discussion with a colleague about whether polyrepo or monorepo is the right way to go. Given that all the FAANG companies use monorepos, I was wondering if that's the way to go, as it simplifies backward compatibility and CI/CD maintenance.
His argument was for an extreme polyrepo approach, where each repo has its own SemVer and a commitment to fulfill an API contract, with Cypress/Postman tests around that (at least for web purposes).
I've never worked in behemoth-sized companies of the likes of FAANG but if I had to pick a side, I think a polyrepo approach + a set of common practices enforced across the company to reduce cognitive load would be my strategy of choice
K 1
@Ofir, I can tell you how I do things and it works well for me. I use a custom starter with boilerplate that's specific to my business, like preprocessing/cleaning. This starter pins the specific versions of the libraries I'm using at the time (so you can rebuild your environment for reproducibility). Whenever I create a pipeline that I'll use across projects, I use micro-packaging and set the destination to my starter, so when updates are made to that pipeline, it won't affect projects that were already created and are using that pipeline unless I update them. With this process there's no copy/pasting, and I don't run into issues where something was updated and breaks my project.
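For reference, the micro-packaging workflow described here is driven by Kedro's `micropkg` CLI commands (on older Kedro versions these were `kedro pipeline package` / `pull`); roughly, with an illustrative pipeline module name:

```text
kedro micropkg package pipelines.preprocessing   # build a distributable archive of that pipeline
kedro micropkg pull <path-to-sdist>              # pull the packaged pipeline into another project
```

Check `kedro micropkg --help` for the exact options available in your Kedro version.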
I agree with what @Jo Stichbury and @Juan Luis said entirely, and that, at the far end of reusing pipelines of different versions across an organization, it basically turns into the same way you manage versions of any package, internally or externally. I've only personally seen the details of how Kedro is used at this scale when I was at QuantumBlack, but they took the same approach (SemVer-ed packages pushed to an internal PyPI repo and consumed by many teams, for pipelines like data mockers and classification). I don't think a monorepo helps when you want to use multiple versions of a package, but it has uses in plenty of other cases.
@Brandon Meek the starter and micro-packaging sound interesting, I'll definitely check those out, thanks! It might be nice to submit a PR with such a sample to Kedro's GitHub, or to add a "Using Kedro in an existing code base" section to the documentation, whichever makes more sense and is accepted by the community and maintainers.
👍 1
@Deepyaman Datta thanks! Nice to know the SemVer approach is also used across teams within the same organization.