hey everyone Im trying to add some tests to my project but i Kedro #questions

Fazil B. Topal

08/25/2023, 12:30 PM

Hey everyone, I'm trying to add some tests to my project, but I often see that loading the catalog takes quite a bit of time (>10 sec). Any idea why, and how I can speed this up? 😄

Fazil B. Topal

08/25/2023, 12:41 PM

If I only execute config_loader['catalog'] it takes about 1-2 sec, but the following line, which creates the DataCatalog itself, takes quite some time:

catalog = settings.DATA_CATALOG_CLASS.from_config(
    catalog=conf_catalog,
    credentials=conf_creds,
    load_versions=load_versions,
    save_version=save_version,
)

datajoely

08/25/2023, 1:33 PM

Can you run a profiler please 🙏

Fazil B. Topal

08/25/2023, 1:36 PM

How can I run a profiler on Kedro's internal code? 😄 I stepped through it line by line in the debugger to get a sense of which call takes longest, and that confirms that creating the catalog from the dict is the bottleneck, since so much happens in there. For context, I have about 25 entries in my catalog.

Nok Lam Chan

08/25/2023, 5:27 PM

https://github.com/joerick/pyinstrument It should be able to profile arbitrary code, not just your project.
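For anyone following along without pyinstrument installed, here is a minimal sketch of the same idea using the standard library's cProfile instead (a swapped-in profiler, not pyinstrument's API). `build_catalog` and `slow_client_init` are hypothetical stand-ins for the `from_config` call above; the only point is that a profiler can wrap arbitrary code, library internals included.

```python
import cProfile
import io
import pstats
import time


def slow_client_init():
    # Hypothetical stand-in for an expensive step inside a dataset's __init__.
    time.sleep(0.05)


def build_catalog(n_entries=3):
    # Hypothetical stand-in for DATA_CATALOG_CLASS.from_config(...).
    for _ in range(n_entries):
        slow_client_init()
    return {"entries": n_entries}


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the slowest call chains come first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


result, report = profile_call(build_catalog)
print(report)
```

In a real project the wrapped call would be the catalog creation itself, and the rows with the highest cumulative time point at the dataset constructors worth looking into.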

Nok Lam Chan

08/25/2023, 5:27 PM

For the catalog, I suspect either you have a big catalog, or you have some SQLDataSet in there where the connection is created at initialisation.

Nok Lam Chan

08/25/2023, 5:28 PM

Potentially related issue: https://github.com/kedro-org/kedro/issues/2829

Fazil B. Topal

08/25/2023, 7:24 PM

I have a lot of GBQCreateDataset entries, each of which creates a Google client for BigQuery, so I think it might be due to that, as you pointed out. 🤔 I'll try the profiler. Thanks for the tips!

Fazil B. Topal

08/25/2023, 7:49 PM

With your help, I can say the problem is the Google client creation. I used the same style as the pandas.gbq_dataset file, which creates the client during init. I can change my own code, which could speed it up, but the datasets I use directly (GBQQueryDataSet) will still be slow. I wonder if we could have something like a cached client (using globals?) or similar, so the same client isn't created over and over. That would help big catalog definitions scale without suffering from this.
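One way to read the "cached client" idea is a module-level cache keyed by the client's configuration. The sketch below is only an illustration under assumptions: `FakeBigQueryClient`, `get_client` and `QueryDataset` are hypothetical names, not Kedro or Google APIs, and the cache key is just a project string.

```python
import functools
import time


class FakeBigQueryClient:
    """Hypothetical stand-in for a client that is slow to construct."""

    instances_created = 0

    def __init__(self, project):
        time.sleep(0.01)  # simulate the expensive setup
        FakeBigQueryClient.instances_created += 1
        self.project = project


@functools.lru_cache(maxsize=None)
def get_client(project):
    # One client per distinct project; repeated calls reuse the cached one.
    return FakeBigQueryClient(project)


class QueryDataset:
    """Hypothetical dataset that looks its client up on demand."""

    def __init__(self, sql, project):
        self._sql = sql
        self._project = project  # note: no client is built here

    @property
    def client(self):
        return get_client(self._project)


# 25 catalog entries for one project, plus one entry for another project.
datasets = [QueryDataset(f"SELECT {i}", "proj-a") for i in range(25)]
other = QueryDataset("SELECT 1", "proj-b")
```

With this shape, entries with the same configuration share one client and entries with a different configuration get their own. A real key would probably need to include credentials and location as well, and those objects may not be hashable, so `lru_cache` on raw arguments may not work as-is.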

Fazil B. Topal

08/25/2023, 7:56 PM

I can confirm that making this a property and removing it from init made it very fast. In the stack trace I now only see GBQQueryDataSet from pandas. And after customizing that to not create the Google client, I'm down to 3 sec 😄
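A minimal sketch of the "property instead of init" change described here, assuming a hypothetical `SlowClient` in place of the real Google client and a hypothetical `LazyDataset` in place of the custom dataset. It uses `functools.cached_property`, so the client is built on first access and then kept on that instance.

```python
import functools
import time


class SlowClient:
    """Hypothetical stand-in for a client with an expensive constructor."""

    created = 0

    def __init__(self):
        time.sleep(0.01)  # simulate auth / connection setup
        SlowClient.created += 1


class LazyDataset:
    """Hypothetical dataset that defers client creation until first use."""

    def __init__(self, table):
        self.table = table  # cheap: nothing expensive happens in __init__

    @functools.cached_property
    def client(self):
        # Runs once per instance, on first access; the result is then stored
        # on the instance and reused by later load/save calls.
        return SlowClient()


# Building 25 entries is now cheap because no client has been created yet.
catalog = [LazyDataset(f"table_{i}") for i in range(25)]
```

Each dataset still ends up with its own client, but only the datasets that are actually loaded or saved pay the construction cost, which is what matters for tests that build the whole catalog and touch a few entries.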

Fazil B. Topal

08/25/2023, 7:59 PM

Would you take this over, or should I try to make a PR somehow? 🙂

Nok Lam Chan

08/25/2023, 10:26 PM

Issues/PRs are always welcome. Does this assume all datasets share the same client/connection?

Fazil B. Topal

08/28/2023, 8:18 AM

You mean with the cache? If the DataCatalog creates new objects, they should be separate. The cache here would make sure that when I do a load/save operation we don't keep creating a new client every time, but rather reuse the already created/cached Python variable.
