# questions
r
How can I reduce the execution time of a kedro project? Is there anything I should be looking at?
m
Have you tried the ParallelRunner? šŸ™‚
r
In fact, my nodes execute sequentially, so I won't be able to use the ParallelRunner.
j
what do your nodes do, generally speaking? process tabular data, connect to databases, something else?
n
Many ideas!
• `pip install pandas[performance]`
• `--async` / `--parallel`
• `CachedDataSet`
• `PartitionedDataSet` with lazy loading/saving
• `yield` from a node to process data in chunks — see "Add an example in the documentation about nodes with generator functions" kedro#2170 and https://github.com/kedro-org/kedro-devrel/issues/49#issuecomment-1473735750
šŸ‘šŸ¼ 1
šŸ‘ 1
Btw, you should profile your pipeline to find out the bottleneck first. https://github.com/joerick/pyinstrument
šŸ‘ 1
šŸ‘šŸ¼ 1
r
@Juan Luis I have 5 nodes connected to each other; each one manipulates the dataframe and loads and saves the various results.
j
@Rachid Cherqaoui good to know - are you using pandas, PySpark, or something else?
r
pandas, XGBoost, and FastAPI
j
have a look at https://pythonspeed.com/datascience/#pandas or, alternatively, switch to https://www.pola.rs/
šŸ‘ 1
r
Thank you
d
Are you loading and saving all your datasets to physical catalog entries? You will incur an I/O bottleneck writing to disk, especially if they're large files. In addition to what everybody mentioned, I'd say that performance issues are usually not Kedro-related; Kedro is pretty smart about not adding overhead to the underlying calls. (There are occasional issues caused by something in Kedro; profiling usually finds that, and then we have to fix it.)
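Concretely: any dataset you leave out of the catalog is kept in memory between nodes (an in-memory dataset) rather than written to disk. A hedged sketch of a `catalog.yml` along those lines (entry names and paths are illustrative, and dataset class names vary slightly across Kedro versions):

```yaml
# catalog.yml -- only persist what you actually need on disk.
raw_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/raw.csv

model_output:
  type: pandas.ParquetDataSet
  filepath: data/07_model_output/output.parquet

# Intermediate datasets ("cleaned_data", "features", ...) are simply not
# listed here, so Kedro keeps them in memory between nodes instead of
# writing them to disk after every step.
```

Dropping large intermediate files from the catalog is often the cheapest speedup available, at the cost of not being able to inspect those intermediates afterwards.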