# questions
hey everyone, I'm trying to add some tests to my project, but I often see that the catalog load takes quite a bit of time (>10 sec). Any idea why, and how can I speed this up? 😄
If I only execute config_loader['catalog'] it takes about 1-2 sec, but the following line, which creates the DataCatalog itself, takes quite some time:
catalog = settings.DATA_CATALOG_CLASS.from_config(
Can you run a profiler please 🙏
How can I run a profiler on Kedro's internal code? 😄 I stepped through it line by line in the debugger to get a sense of which line takes longest, and this confirms that creating the catalog from the dict is the bottleneck, since so much happens in there. For context, I have about 25 entries in my catalog.
https://github.com/joerick/pyinstrument It should be able to profile arbitrary code, not just your project
For the catalog, I suspect either you have a big catalog, or you have some SQLDataSet in there where a connection is created at initialisation.
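For anyone following along, here is a minimal sketch of profiling a single slow call. It uses the standard library's cProfile rather than pyinstrument so that it runs without extra installs; pyinstrument's `Profiler` follows a similar start/stop pattern. The `build_catalog` function is a hypothetical stand-in for the slow catalog creation line, not Kedro code.

```python
import cProfile
import io
import pstats
import time


def build_catalog():
    """Hypothetical stand-in for the slow DataCatalog construction."""
    time.sleep(0.05)  # simulate expensive work such as client creation
    return {"my_dataset": object()}


# Profile only the call we suspect is slow.
profiler = cProfile.Profile()
profiler.enable()
catalog = build_catalog()
profiler.disable()

# Render the ten most expensive entries, sorted by cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)
report = stream.getvalue()
print(report)
```

Sorting by cumulative time tends to surface the call that wraps the expensive work, which is usually what you want when hunting for a slow initialisation.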
I have a lot of GBQCreateDataset entries, each of which creates a Google client for BigQuery, so I think it might be due to that, as you pointed out. 🤔 I'll try the profiler, thanks for the tips!
With your help, I can say the problem is the Google client creation. I used the same style as the pandas.gbq_dataset file, which creates the client during init. I can change my own code, which would speed it up, but the datasets I use directly (GBQQueryDataSet) will still be slow. I wonder if we could do something like a cached client (using globals?) or similar, so the same client isn't created too many times. This could help big catalog definitions scale without suffering from this.
I can confirm that making this a property and removing it from init made it very fast. In the stack trace I now only see GBQQueryDataSet from pandas, and after customizing that to not create the Google client, I'm down to 3 sec 😄
Would you take this over, or should I try to make a PR somehow? 🙂
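A rough sketch of the "make it a property" idea described above, assuming a slow-to-create client. `ExpensiveClient` and `LazyQueryDataset` are hypothetical names for illustration only; this is not the actual Kedro or BigQuery code.

```python
class ExpensiveClient:
    """Hypothetical stand-in for a slow-to-create client (e.g. a BigQuery client)."""

    instances_created = 0  # counts constructions so the effect is visible

    def __init__(self, project):
        ExpensiveClient.instances_created += 1
        self.project = project

    def query(self, sql):
        return f"ran {sql!r} on {self.project}"


class LazyQueryDataset:
    """Hypothetical dataset that defers client creation until first use."""

    def __init__(self, sql, project):
        self._sql = sql
        self._project = project
        self._client = None  # nothing expensive happens in __init__

    @property
    def client(self):
        # Create the client on first access, then reuse it.
        if self._client is None:
            self._client = ExpensiveClient(self._project)
        return self._client

    def load(self):
        return self.client.query(self._sql)


# Building 25 datasets creates no clients at all.
datasets = [LazyQueryDataset(f"SELECT {i}", "my-project") for i in range(25)]
created_after_init = ExpensiveClient.instances_created

# The first load creates exactly one client, for that dataset only.
result = datasets[0].load()
created_after_load = ExpensiveClient.instances_created
```

With this shape, catalog construction stays cheap and only datasets that are actually loaded or saved pay for a client.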
Issues/PRs are always welcome. Does this assume all datasets share the same client/connection?
You mean with the cache? If the DataCatalog creates new objects, they should be separate. The cache here would make sure that when I do a load/save operation we don't keep creating a new client every time, but rather reuse the already created/cached Python variable.
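One possible way to sketch the cached-client idea, assuming the client can safely be shared between datasets that use the same project. `ExpensiveClient`, `get_client`, and `CachedClientDataset` are hypothetical names, and `functools.lru_cache` is just one option instead of a hand-rolled global.

```python
from functools import lru_cache


class ExpensiveClient:
    """Hypothetical stand-in for a slow-to-create client."""

    instances_created = 0

    def __init__(self, project):
        ExpensiveClient.instances_created += 1
        self.project = project

    def query(self, sql):
        return f"ran {sql!r} on {self.project}"


@lru_cache(maxsize=None)
def get_client(project):
    """Return one shared client per project, creating it on first request."""
    return ExpensiveClient(project)


class CachedClientDataset:
    """Hypothetical dataset; instances stay separate but reuse cached clients."""

    def __init__(self, sql, project):
        self._sql = sql
        self._project = project

    def load(self):
        return get_client(self._project).query(self._sql)


first = CachedClientDataset("SELECT 1", "project-a")
second = CachedClientDataset("SELECT 2", "project-a")
other = CachedClientDataset("SELECT 3", "project-b")

first.load()
second.load()  # reuses the client cached for "project-a"
created_same_project = ExpensiveClient.instances_created

other.load()  # a different project gets its own client
created_two_projects = ExpensiveClient.instances_created
```

The dataset objects themselves remain distinct; only the client is shared, and only between datasets with the same cache key. Whether sharing is acceptable (credentials, thread safety) would need checking against the real client.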