# questions
j
Is there any guidance on transitioning an existing project to Kedro? I have a data science project which is currently relatively unstructured. I'm attempting to transition to Kedro by creating a directory in the project for the Kedro project to live in and then moving the pipeline over one step at a time.
s
We have plans to add guides on transitioning to Kedro from existing projects. But if you need help with anything in the meantime, please don't hesitate to reach out.
d
I think the best strategy today is to generate an empty Kedro project and transition one pipeline at a time
👍 1
j
An issue with that approach is that this project is mostly one pipeline. It's more scientist code than engineer code and so I wanted to avoid having to transition the whole pipeline all at once. Ideally I wanted to create one node at a time in the same repo so that the old and new pipelines can share data
d
So a kedro pipeline can be any granularity you want
so if you want to merge the data processing part at the beginning you run
kedro pipeline create data_processing
and start working there
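As a rough sketch of what could go into the generated `data_processing` pipeline: the function `clean_records` and the dataset names `raw_records` and `clean_records` are hypothetical stand-ins for one step of the legacy scripts, not something from this thread. The Kedro import is placed inside `create_pipeline` on purpose, so the old scripts can keep importing the same function while the migration is in progress.

```python
from typing import Dict, List


def clean_records(raw_records: List[Dict]) -> List[Dict]:
    """Hypothetical legacy step: drop records whose 'value' is missing."""
    return [r for r in raw_records if r.get("value") is not None]


def create_pipeline(**kwargs):
    # Imported here so clean_records stays importable by the old scripts
    # even where Kedro is not installed.
    from kedro.pipeline import node, pipeline

    return pipeline(
        [
            node(
                func=clean_records,
                inputs="raw_records",     # hypothetical catalog entry
                outputs="clean_records",  # hypothetical catalog entry
                name="clean_records_node",
            )
        ]
    )
```

Because the node is a plain function, the existing script and the new pipeline can both call it and read or write the same files while the remaining steps are still run by hand.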
j
And then split them out into individual nodes?
d
well just the data processing part in this example
j
So then IIUC
Right now there is a bunch of manual process involved in running this pipeline. We have a folder of Python scripts which ingest the previous step of data output and then call into the next step of data processing, and these need to be called by hand. This project is in the middle of transitioning from client number 1 to clients number 2 and 3, so this is quickly not going to work as we need to start generalizing the data processing pipeline to different client data. One of our big goals in transitioning to Kedro is to be able to get rid of this manual process and have more control over the pipeline. So you're suggesting that we make each script its own Kedro pipeline and move them over one by one with their code dependencies?
d
I think that makes sense
but my push is to move the smallest, most self-contained part first
and see if it works
j
That's the goal
but I think that means the Kedro project needs to be in the same repo
d
Whilst a bit ugly, that shouldn’t be an issue
👍 1
j
What would the advantage be of making each script its own pipeline versus a node in a growing single pipeline?
d
So I think in general that’s the right way to write Kedro anyway. A pipeline can be a very small unit of logic. If you look at the demo viz on https://demo.kedro.org/ you can see each of those ‘mega nodes’ is a modular pipeline with its own namespace, e.g. Ingestion, Feature Engineering
this was adapted from a demo project I made aggggeeeessss ago https://github.com/datajoely/modular-spaceflights
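A minimal sketch of the namespace idea, assuming a recent Kedro release where `pipeline()` accepts `namespace` and `inputs` arguments: the function `engineer_features`, the dataset names, and the namespace `feature_engineering` are all hypothetical. Wrapping a small pipeline in a namespace is what lets Kedro-Viz collapse it into a single 'mega node'.

```python
from typing import Dict, List


def engineer_features(clean_records: List[Dict]) -> List[Dict]:
    """Hypothetical step: add a squared copy of each record's 'value'."""
    return [{**r, "value_squared": r["value"] ** 2} for r in clean_records]


def create_pipeline(**kwargs):
    # Imported here so the function above is usable without Kedro installed.
    from kedro.pipeline import node, pipeline

    base = pipeline(
        [
            node(
                func=engineer_features,
                inputs="clean_records",
                outputs="features",
                name="engineer_features_node",
            )
        ]
    )
    # The namespace prefixes node and dataset names; listing "clean_records"
    # under inputs keeps that dataset un-prefixed so it can be shared with
    # an upstream pipeline.
    return pipeline(base, namespace="feature_engineering", inputs="clean_records")
```

With this wiring the output dataset would be addressed as `feature_engineering.features`, while `clean_records` keeps its original name.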
j
How angrily were you using Kedro?
d
What do you mean in this context? I guess I was trying to show a representative way a practitioner would use it after using it in the real world for a couple of years
j
Just joking. It says "This project is designed to be a realistic example of what Kedro looks like when used in anger."
💢 1
This all makes sense
Thank you for the guidance
d
oh haha
again a long time since I wrote that
let’s call it a 6/10
j
lol
Well thank you. I'll come back when I have more questions. One of the reasons we settled on Kedro was the level of support on this channel, so your effort is appreciated
❤️ 1
👍 1
🚀 3
n
Hey! I am very interested in this topic but I am off today. I'll get back to this tomorrow. Meanwhile, if you have questions, please fire away 🚀