# questions
Hi Team. Loving the look of Kedro as a way to enforce modularity in work done by my actuarial team. I've done the spaceflights tutorial, and now I'm implementing my first real pipeline, and I have two general "advice" questions. My current aim is modularity and best practice in a data processing pipeline (but with a lower learning curve for data scientists than full-on ETL pipelines). We're talking about "small data", on the order of 100k rows and 50 columns. I'm not so worried about the model-building part yet. The questions are:

1. I'd like to split the data prep pipeline into modular steps (some generic, e.g. "format all Boolean fields" or "replace nulls", and some quite specific edge cases, e.g. "fix this particular named field"). My instinct is that each of these steps should be a node, but I note that in the tutorials "prep this whole dataset" is quite often one node. Is that just to keep the tutorial simple, or am I going down a bad path?
2. I have tended in recent times to be a strict "no pandas in-place operations" purist (i.e. use `assign` to create a new column rather than `df['new_column_name'] = ...`). I'm seeing in-place operations in the tutorials. Is that just to keep things simple, or is there either a reason not to worry, or a performance-overhead reason why I actually should be using in-place? (Bearing in mind my "small data" use case.)

Sorry to ask basic questions, but I'm super keen on what I think of as the "Kedro mindset" of "do it neat and tidy the first time to avoid problems later on" :-)
No stupid questions! These are very good questions that you need to consider regardless of whether you use Kedro.

1. There is no one correct way of doing it. Things to consider: how big is your node? Think about how long a node takes to run, because one benefit of Kedro is that you can run one particular node on its own. If you need to debug a certain output quite often, then it makes sense to break it out into its own node. On the other hand, wrapping every processing step in a separate node can be too verbose.
2. No, your "no in-place operations" rule isn't a bad idea. As far as I know, `inplace=True` isn't truly an in-place memory operation, so there is no performance gain at all: under the hood pandas just makes a copy and throws the old one away. It is more like syntactic sugar. The important part of the node concept is that you should try to write pure Python functions, i.e. functions without side effects. The motivation is that they are easier to reason about and debug: as long as you have the input ready, you can always reproduce the same output.
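To make the granularity point concrete, here is a minimal sketch of the "small, pure steps" style. The column names (`active`, `premium`, `policy_type`) and the cleaning rules are hypothetical, just stand-ins for the generic and specific steps described above; each function takes a DataFrame and returns a new one, so each could be registered as its own Kedro node, or several could be composed into one larger node if that turns out too verbose:

```python
import pandas as pd


def format_booleans(df: pd.DataFrame) -> pd.DataFrame:
    """Generic step: convert 'Y'/'N' columns to real booleans (nulls become False)."""
    yn_cols = [c for c in df.columns
               if set(df[c].dropna().unique()) == {"Y", "N"}]
    return df.assign(**{c: df[c].eq("Y") for c in yn_cols})


def replace_nulls(df: pd.DataFrame) -> pd.DataFrame:
    """Generic step: fill nulls in numeric columns with 0."""
    num_cols = df.select_dtypes("number").columns
    return df.fillna({c: 0 for c in num_cols})


def fix_policy_type(df: pd.DataFrame) -> pd.DataFrame:
    """Specific edge-case step for one named field (hypothetical)."""
    return df.assign(policy_type=df["policy_type"].str.strip().str.upper())


raw = pd.DataFrame({
    "active": ["Y", "N", None],
    "premium": [100.0, None, 250.0],
    "policy_type": [" term ", "whole", "TERM"],
})

# Pure, side-effect-free steps chain cleanly; `raw` is never mutated.
cleaned = raw.pipe(format_booleans).pipe(replace_nulls).pipe(fix_policy_type)
```

In a Kedro pipeline each function would become something like `node(format_booleans, inputs="raw_policies", outputs="booleans_fixed")`, with the intermediate dataset names chosen by you; because every step is pure, you can run and debug any single node from its saved input.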
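And a tiny sketch of the `assign`-vs-in-place point (column names are made up for illustration): `assign` returns a new frame and leaves its input untouched, while `inplace=True` merely hides the reassignment, it does not avoid the copy:

```python
import pandas as pd

df = pd.DataFrame({"premium": [100.0, None, 250.0]})

# Side-effect-free style: assign returns a NEW frame; df is unchanged.
out = df.assign(premium_filled=df["premium"].fillna(0.0))
assert "premium_filled" not in df.columns

# "In place" is mostly syntactic sugar: pandas still builds a new
# object under the hood, then rebinds it; it mutates df2 and returns None.
df2 = df.copy()
df2.fillna({"premium": 0.0}, inplace=True)
```

For 100k-row data the copy cost is negligible either way, so the side-effect-free style costs you nothing and keeps node functions pure.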
Thanks so much for your reply, Nok Lam! :-) OK, so I think I'll experiment with node size and see what gives the best balance between verbosity, testability and maintainability. Thanks again for the awesome package!
Awesome 😁