[Kedro-version : 0.18.6 currently] Hi, I am working on a sort of... | Kedro #questions


Aleksander Jaworski

07/24/2023, 11:22 AM

[Kedro-version : 0.18.6 currently] Hi, I am working on a sort of 'pipeline monorepo' where I have dozens of pipelines. I have a question: would some sort of lazy configuration validation be a useful feature for Kedro? I have 2 reasons for asking:

1. It feels a bit cumbersome that even a simple hello_world.py will take several seconds to run when the configuration is large enough: first you see all the logs and all the setup is done for the data catalog etc., none of which actually ends up being used in hello_world.py.

2. When setting up the project for someone, it is impossible to provide a credentials file with just the required credentials. In Kedro all of them need to be filled in right now, as everything is validated at once. In a lazy version, only the dependencies that follow from the pipeline would need to be evaluated.

Are there any solutions or modifications I could use to improve my approach here? Thanks in advance! :)
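To make the "lazy" idea concrete, here is a minimal sketch in plain Python. It is not Kedro's API: `LazyCredentials`, the resolver functions, and the credential names `s3` and `warehouse` are all hypothetical, invented only to show what "evaluate only the credentials a pipeline actually asks for" could look like.

```python
from typing import Callable, Dict


class LazyCredentials:
    """Hypothetical helper: resolve each credential only on first request."""

    def __init__(self, resolvers: Dict[str, Callable[[], dict]]):
        self._resolvers = resolvers
        self._cache: Dict[str, dict] = {}

    def get(self, name: str) -> dict:
        if name not in self._cache:
            if name not in self._resolvers:
                raise KeyError(f"No credentials configured for {name!r}")
            # The resolver runs here, not when the object is constructed.
            self._cache[name] = self._resolvers[name]()
        return self._cache[name]


def missing_warehouse() -> dict:
    # Stands in for a credential the current user has not been given.
    raise RuntimeError("warehouse credentials are not set")


creds = LazyCredentials(
    {
        "s3": lambda: {"key": "demo", "secret": "demo"},
        "warehouse": missing_warehouse,
    }
)

# Only "s3" is resolved; the missing "warehouse" entry is never touched.
print(creds.get("s3")["key"])  # prints: demo
```

With this shape, a missing credential only fails when something actually requests it, which is the behaviour described in point 2.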


Nok Lam Chan

07/24/2023, 11:34 AM

"It feels a bit cumbersome that even a simple hello_world.py will take several seconds to run when the configuration is large enough, as first you will see all the logs and all the setup will be done for the data catalog etc, none of which would actually end up being used in a hello_world.py"

I suspect this is related to https://github.com/kedro-org/kedro/issues/2829. How big is your catalog (number of entries)? Can you try removing any SQLDataset and see if it speeds up? Most datasets are lazily initialised, so they should not have any impact on loading the catalog.

Nok Lam Chan

07/24/2023, 11:35 AM

When setting up the project for someone, it is impossible to provide a credentials file with just the required credentials. In kedro all of them need to be filled right now as it is all validated at once. In a sort of lazy version, only the dependencies that follow from the pipeline would need to be evaluated.

Is this also related to SQLDataset, which requires a DB connection (and credentials)?

Aleksander Jaworski

07/24/2023, 11:35 AM

Yes there are SQL datasets there, so that could be the issue

Aleksander Jaworski

07/24/2023, 11:35 AM

and yes both of those cases are in the same repository, so likely it could be the same issue

Aleksander Jaworski

07/24/2023, 11:38 AM

as for size: about 200 - 800 LOC for both parameters.yml and catalog.yml, and I've been using the TemplatedConfigLoader to interpolate elements of configuration from env vars

Nok Lam Chan

07/24/2023, 11:46 AM

Can you do a quick test? We only need to test how long it takes to initialise the data catalog. To do that, you can run kedro catalog list. Can you try commenting out all the SQLDatasets and see if the speed changes before/after?
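One way to get repeatable numbers for this test is a small timing wrapper, sketched below. `time_command` is a hypothetical helper, not part of Kedro; it only assumes that `kedro catalog list` is run from the project root.

```python
import subprocess
import sys
import time
from typing import Sequence


def time_command(cmd: Sequence[str], repeats: int = 3) -> float:
    """Return the fastest wall-clock time in seconds over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        timings.append(time.perf_counter() - start)
    return min(timings)


# Baseline: how long a bare Python interpreter takes to start and exit.
baseline = time_command([sys.executable, "-c", "pass"])
print(f"bare interpreter: {baseline:.2f}s")

# In the project root, before and after commenting out the SQL datasets:
#   print(time_command(["kedro", "catalog", "list"]))
```

Taking the minimum over a few runs reduces noise from disk caching, and the interpreter baseline shows how much of the total is not attributable to Kedro at all.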

Aleksander Jaworski

07/24/2023, 11:57 AM

~21 secs for kedro catalog list, ~6 secs after commenting out the SQL and GBQ datasets

Aleksander Jaworski

07/24/2023, 11:57 AM

still a lot

Nok Lam Chan

07/24/2023, 12:20 PM

Can you do another one, commenting out all datasets?

Nok Lam Chan

07/24/2023, 12:49 PM

From 21s -> 6s, I think that's a significant amount of time. I am not sure what the remaining 6s is; usually starting a Kedro project shouldn't take longer than 1/2 seconds. You may need to try profiling to find the bottleneck:

• Python imports (these can be quite slow, especially when importing bigger libraries): you can roughly estimate this by running import <your_module> to see how long it takes. This doesn't load any Kedro machinery, so it helps you isolate how much overhead is added by Kedro.

• Connections: do you have other connections set up besides SQL?

Normally reading config should be very fast; 200 - 800 LOC shouldn't create any significant overhead.
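A rough sketch of the import check, using only the standard library. `time_import` is a hypothetical helper name, and `json` stands in for your own module.

```python
import importlib
import time


def time_import(module_name: str) -> float:
    """Return the seconds taken to import `module_name` in this process."""
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


# Replace "json" with the package that holds your pipelines.
print(f"json: {time_import('json'):.4f}s")
```

Run it in a fresh interpreter: once a module is cached in sys.modules, importing it again is close to free, so a second measurement in the same process says nothing. For a per-module breakdown, CPython 3.7+ also offers `python -X importtime -c "import <your_module>"`.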

Aleksander Jaworski

07/25/2023, 8:56 AM

Ok, thank you, I will let you know if and when I do more profiling
