# questions
g
I have a node with a large run time, and I am thinking about running that particular node with PyPy. I am not aware of a supported way to do this. So here is what I am thinking. I could ensure that the required inputs are saved in the Kedro data folder, having been generated in earlier steps of the analysis. Then I could have a Kedro node that uses subprocess.run to call PyPy on a script that performs the computationally expensive steps and writes an intermediate output file. Finally, the Kedro node loads that output back into memory and returns it (and hence it can be a registered output in the catalog). Something resembling this:
import subprocess

PYPY_PATH = ...    # path to the PyPy interpreter
SCRIPT_PATH = ...  # path to the standalone script

def run_pypy_script(input_data):
    # Persist the inputs where the PyPy script expects to find them
    write_to_temp(input_data)
    pypy_cmd = [PYPY_PATH, SCRIPT_PATH]
    # check=True raises CalledProcessError if the script exits non-zero
    subprocess.run(pypy_cmd, check=True)
    # Read the script's output file back so the node can return it
    return load_results()
Any thoughts on this approach, or a better approach? https://docs.python.org/3/library/subprocess.html#subprocess.run
m
Interesting. Is using PyPy in your particular case really a huge improvement?
g
@marrrcin Yes, the node runs a Ciw simulation which is mostly pure Python. PyPy substantially improves the performance thanks to its JIT compiler. https://ciw.readthedocs.io/en/latest/index.html
m
Have you also tried on Py3.11?
Anyway - if it’s a standalone thing that can be run in an isolated manner, then I don’t see any problem with running it in a node like you have right now. It’s just a matter of you being aware that you’re losing some of the reproducibility guarantees from Kedro 😉
g
@marrrcin I am running Python 3.10.4 ATM. Yes, it would be a standalone script within the same project. I believe the check=True argument in subprocess.run should raise an error if something doesn't work in this component. If that is correct, then Kedro should be aware of whether the script passes or fails. Which reproducibility guarantees would I be losing? If I know what they are, then there may be ways for me to mitigate the issue.
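That understanding of check=True is easy to verify: a non-zero exit status from the subprocess becomes a CalledProcessError in the parent, which would propagate out of the node (sys.executable stands in for the PyPy binary here):

```python
import subprocess
import sys

# A deliberately failing command: check=True converts the
# non-zero exit status into an exception
try:
    subprocess.run([sys.executable, "-c", "raise SystemExit(1)"], check=True)
except subprocess.CalledProcessError as e:
    print(f"script failed with exit code {e.returncode}")
    # prints: script failed with exit code 1
```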
m
The I/O will be done outside of the Kedro Data Catalog
The pipeline will not be runnable after simple git checkout + create venv + install requirements, because you will also need to install PyPy separately
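One way to partially mitigate the first point is to declare the script's intermediate files in the Data Catalog so Kedro at least tracks and versions them. A sketch, assuming kedro-datasets JSON datasets (exact class names vary by version) and hypothetical file paths:

```yaml
# Hypothetical catalog entries making the PyPy script's I/O visible to Kedro
simulation_input:
  type: json.JSONDataset
  filepath: data/02_intermediate/simulation_input.json

simulation_output:
  type: json.JSONDataset
  filepath: data/02_intermediate/simulation_output.json
```

The PyPy dependency itself would still need to be documented separately (e.g. in the project README), since it cannot be installed from requirements.txt.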