# questions
a
[Kedro version: 0.18.6 currently] Hi, I am working on a sort of 'pipeline monorepo' where I have dozens of pipelines. I have a question: would some sort of lazy configuration validation be a useful feature for Kedro? I have two reasons for asking:
1. It feels a bit cumbersome that even a simple `hello_world.py` takes several seconds to run when the configuration is large enough: first you see all the logs, and all the setup is done for the data catalog etc., none of which actually ends up being used in `hello_world.py`.
2. When setting up the project for someone, it is impossible to provide a credentials file with just the required credentials. In Kedro all of them currently need to be filled in, as everything is validated at once. In a lazy version, only the dependencies that follow from the pipeline would need to be evaluated.
Are there any solutions or modifications I could use to improve my approach here? Thanks in advance! :)
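To make the "lazy" idea concrete, here is a minimal sketch in plain Python of resolving credentials only for the datasets a pipeline actually needs. This is not Kedro's API: `resolve_needed`, the catalog dictionary, the dataset names and the `db_creds` key are all hypothetical and only illustrate the behaviour being asked about.

```python
from typing import Any, Dict, Iterable


def resolve_needed(
    catalog_config: Dict[str, Dict[str, Any]],
    credentials: Dict[str, Any],
    needed: Iterable[str],
) -> Dict[str, Dict[str, Any]]:
    """Resolve credentials only for the catalog entries named in `needed`."""
    resolved = {}
    for name in needed:
        entry = dict(catalog_config[name])  # shallow copy, input stays untouched
        key = entry.get("credentials")
        if isinstance(key, str):
            # Only now do we check that the referenced credentials exist.
            if key not in credentials:
                raise KeyError(f"Dataset '{name}' needs missing credentials '{key}'")
            entry["credentials"] = credentials[key]
        resolved[name] = entry
    return resolved


# Hypothetical catalog: only `warehouse_table` refers to credentials.
catalog_config = {
    "local_csv": {"type": "pandas.CSVDataSet", "filepath": "data/a.csv"},
    "warehouse_table": {"type": "pandas.SQLTableDataSet", "credentials": "db_creds"},
}
credentials = {}  # deliberately empty: no credentials file was provided

# A pipeline that only touches `local_csv` resolves without any credentials.
print(resolve_needed(catalog_config, credentials, ["local_csv"]))
```

With this approach, a missing `db_creds` entry would only raise an error for a pipeline that actually uses `warehouse_table`.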
n
It feels a bit cumbersome that even a simple `hello_world.py` will take several seconds to run when the configuration is large enough, as first you will see all the logs and all the setup will be done for the data catalog etc, none of which would actually end up being used in a `hello_world.py`
I suspect this is related: https://github.com/kedro-org/kedro/issues/2829. How big is your catalog (number of entries)? Can you try removing any SQLDataset and see if it speeds up? Most datasets are lazily initialized, so they should not have any impact on loading the catalog.
When setting up the project for someone, it is impossible to provide a credentials file with just the required credentials. In kedro all of them need to be filled right now as it is all validated at once. In a sort of lazy version, only the dependencies that follow from the pipeline would need to be evaluated.
Is this also related to SQLDataset, which requires a DB connection (and credentials)?
a
Yes there are SQL datasets there, so that could be the issue
and yes both of those cases are in the same repository, so likely it could be the same issue
as for size: about 200-800 LOC for both `parameters.yml` and `catalog.yml`
and I've been using the `TemplatedConfigLoader` to interpolate elements of configuration from env vars
n
Can you do a quick test? We only need to measure how long it takes to initialize the data catalog. To do that, you can run `kedro catalog list`. Can you try commenting out all the SQL datasets and see whether the timing changes before/after?
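One way to run this before/after comparison is to time the command from a small script. The sketch below is only an illustration: `time_command` is a hypothetical helper, and it assumes `kedro` is on the `PATH` and the script is run from the project root.

```python
import shutil
import subprocess
import time
from typing import List, Tuple


def time_command(cmd: List[str]) -> Tuple[float, int]:
    """Run `cmd`, returning (elapsed wall-clock seconds, exit code)."""
    start = time.perf_counter()
    completed = subprocess.run(cmd, capture_output=True, text=True)
    return time.perf_counter() - start, completed.returncode


# Only attempt the measurement if the kedro executable can be found.
if shutil.which("kedro"):
    elapsed, code = time_command(["kedro", "catalog", "list"])
    print(f"kedro catalog list: {elapsed:.1f}s (exit code {code})")
else:
    print("kedro not found on PATH; run this inside the project environment")
```

Running it once with the full catalog and once with the SQL datasets commented out gives the two numbers to compare.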
a
~21 secs for `kedro catalog list`
~6 secs after commenting out the SQL and GBQ datasets
still a lot
n
Can you do another one with all datasets commented out?
Going from 21s to 6s is a significant difference, I think. I am not sure what the remaining 6s is; starting a Kedro project usually shouldn't take longer than 1-2 seconds. You may need to try profiling to find the bottleneck:
• Python imports (these can be quite slow, especially when importing bigger libraries). You can roughly estimate this by running `import <your_module>` and seeing how long it takes. This doesn't load anything from Kedro, so it helps you isolate how much overhead Kedro adds.
• Connections: do you have other connections set up besides SQL?
Normally reading config should be very fast; 200-800 LOC shouldn't create any significant overhead.
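The import check suggested above could be sketched roughly as follows. `time_import` is a hypothetical helper, and the standard-library module names are placeholders for your own package. Note that a module already imported in the same process returns almost instantly, so each measurement is best taken in a fresh interpreter.

```python
import importlib
import time


def time_import(module_name: str) -> float:
    """Return the seconds spent importing `module_name` in this process."""
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


# Placeholder modules; replace with your project's package to measure it.
for name in ["json", "decimal"]:
    print(f"import {name}: {time_import(name):.4f}s")
```

For a per-module breakdown, CPython 3.7+ also offers `python -X importtime -c "import <your_module>"`, which prints the time spent in each nested import.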
a
Ok, thank you, I will let you know if and when I do more profiling