# questions
g
I have a node with a large run time, and I am thinking about running that particular node with PyPy. I am not aware of a supported way to do this. So here is what I am thinking. I could ensure that the required inputs are saved in the Kedro data folder, having been generated in earlier steps of the analysis. Then I could have a Kedro node that uses subprocess.run to call PyPy on a script that performs the computationally expensive steps and writes an intermediate output file. Finally, the Kedro node loads that output back into memory and returns it (and hence it can be a registered output in the catalog). Something resembling this:
import subprocess

PYPY_PATH = ...    # path to the PyPy interpreter
SCRIPT_PATH = ...  # path to the standalone script

def run_pypy_script(input_data):
    # Persist the inputs where the PyPy script expects to find them
    write_to_temp(input_data)
    pypy_cmd = [PYPY_PATH, SCRIPT_PATH]
    # check=True raises CalledProcessError if the script exits non-zero
    subprocess.run(pypy_cmd, check=True)
    # Read the script's output file back so the node can return it
    return load_results()
Any thoughts on this approach, or a better approach? https://docs.python.org/3/library/subprocess.html#subprocess.run
m
Interesting. Is using PyPy in your particular case really a huge improvement?
g
@marrrcin Yes, the node runs a Ciw simulation which is mostly pure Python. PyPy substantially improves the performance thanks to its JIT compiler. https://ciw.readthedocs.io/en/latest/index.html
m
Have you also tried on Py3.11?
Anyway - if it’s a standalone thing that can be run in an isolated manner, then I don’t see any problem with running it in a node like you have right now. It’s just a matter of you being aware that you’re losing some of the reproducibility guarantees from Kedro 😉
g
@marrrcin I am running Python 3.10.4 ATM. Yes, it would be a standalone script within the same project. I believe the check=True argument in subprocess.run should raise an error if something doesn't work in this component. If that is correct, then Kedro should be aware of whether the script passes or fails. Which reproducibility guarantees would I be losing? If I know what they are, then there may be ways for me to mitigate the issue.
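That understanding of check=True is easy to verify: a non-zero exit status from the subprocess becomes a CalledProcessError in the parent, which would propagate out of the node (sys.executable stands in for the PyPy binary here):

```python
import subprocess
import sys

# A deliberately failing command: check=True converts the
# non-zero exit status into an exception
try:
    subprocess.run([sys.executable, "-c", "raise SystemExit(1)"], check=True)
except subprocess.CalledProcessError as e:
    print(f"script failed with exit code {e.returncode}")
    # prints: script failed with exit code 1
```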
m
The I/O will be done outside of the Kedro Data Catalog
The pipeline will not be runnable after simple git checkout + create venv + install requirements, because you will also need to install PyPy separately
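One way to partially mitigate the first point is to declare the script's intermediate files in the Data Catalog so Kedro at least tracks and versions them. A sketch, assuming kedro-datasets JSON datasets (exact class names vary by version) and hypothetical file paths:

```yaml
# Hypothetical catalog entries making the PyPy script's I/O visible to Kedro
simulation_input:
  type: json.JSONDataset
  filepath: data/02_intermediate/simulation_input.json

simulation_output:
  type: json.JSONDataset
  filepath: data/02_intermediate/simulation_output.json
```

The PyPy dependency itself would still need to be documented separately (e.g. in the project README), since it cannot be installed from requirements.txt.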