when saving a model using PickleDataSet with dill ...
# questions
s
When saving a model using PickleDataSet with the dill backend, it packages the node in which the model instance was created and run; trying to dill.load it raises
ModuleNotFoundError: No module named 'pipelines'
any suggestions on how to handle it?
d
This feels like a working directory / installation issue. Is this all on the same machine?
s
Yes, on the same machine. I have a node which instantiates a model, trains it, and outputs it as a result. It gets saved using the Data Catalog; then I use a different script to load this dill and run inference on the model, but I get this error.
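For reference, the dataset is set up roughly like this (paths are illustrative and the import path depends on the Kedro version):
# roughly equivalent to a catalog.yml entry of type pickle.PickleDataSet with backend: dill
from kedro.extras.datasets.pickle import PickleDataSet  # import path varies by Kedro version

model_dataset = PickleDataSet(
    filepath="data/06_models/model.pkl",  # illustrative path
    backend="dill",                       # serialise with dill instead of the stdlib pickle
)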
d
ah interesting - is it running the same environment and version of python?
s
Yes, same interpreter.
d
can you post more of the stack trace?
s
But I can’t seem to reproduce it… I reran the whole pipeline and successfully loaded the dill. This error occurs on my previously trained model… so I can’t reproduce it right now, but I’ll ping if I see it again 🙂 thx!
d
interesting
The error looks like a Kedro one, but as I said, it’s something I’d expect to see if you were in the wrong directory; it’s not related to the pickle itself.
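If you do hit it again, a quick sanity check in the loading script would be something like this (just a sketch; 'pipelines' being whatever package the traceback names):
# check whether the package the pickle references is importable from this interpreter
import importlib.util
import os
import sys

print(os.getcwd())                            # are we where we think we are?
print(sys.path)                               # is the project / src dir on the path?
print(importlib.util.find_spec("pipelines"))  # None means dill.load will fail the same way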
f
@Sergei Benkovich We ran into a similar issue to this when submitting our Kedro pipeline to a Dask cluster: functions within the same node were unable to call each other (and failed with a `No module named 'my_pipeline'` error). One temporary workaround was to define the functions being called as inner functions, so instead of:
def A(df):
    ...  # body of A

def B(df):
    ...  # body of B
    A(df)
we'd have:
def B(df):
    def A(df):
        ...  # body of A
    ...  # body of B
    A(df)
The assumption was that the code de/serialization wasn't quite working for some reason, though we used cloudpickle (the default?). Packaging stuff into wheels and importing those instead would probably work too. Do note that using these inner functions is a bit slower than calling separate functions, though we haven't benchmarked it accurately. I'm not quite sure if this helps, but it may be worth a shot if this happens when running the nodes themselves...
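Another thing that might be worth a try (untested on our side): cloudpickle >= 2.0, which is what Dask uses for serialisation by default, can be told to serialise a module by value instead of by reference, so the workers don't need to be able to import it:
# sketch only: ship the package's source along with the pickled functions instead of
# a "my_pipeline.xyz" reference that the Dask worker would have to import
import cloudpickle

import my_pipeline  # placeholder for the actual project package

cloudpickle.register_pickle_by_value(my_pipeline)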
d
this is super interesting @Filip Panovski thanks for reporting
f
No problem, but I doubt that writing it up in a random chat like this is very helpful 🙂 I think I may have posted about this previously in fact, but seeing how it's unknown (and maybe related to the problem in this thread?) I'll try to get an actual issue (with MRE) up by the end of the week
s
Pickles have many issues related to co-dependencies; that's why in some cases we use a Docker container of the complete repo. The structure is: the model class lives in models.py, which sits at the same level of the hierarchy as the pipelines folder. There is a pipeline running training; a node file in this pipeline imports the model, creates an instance of the class, and runs fit. The node returns the model to a pickle Data Catalog entry. Then another script, at the same level as the models folder, tries to load the model and run predict on some test input. Not sure that was clear 😅 But thanks for the responses and help! Maybe the instance should also be created in models.py, and then just import the instance and not the class.
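For concreteness, the loading side is basically this (names and paths approximate):
# rough sketch of the separate inference script; the instance was pickled in the
# training node, so (if I understand the replies above correctly) this interpreter
# has to be able to import the same models module, e.g. by running from the project
# root or having it on sys.path
import dill

with open("data/06_models/model.pkl", "rb") as f:  # illustrative path
    model = dill.load(f)

# ... then model.predict(<some test input>)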
i
“then use a different script to load this dill and run inference on the model but get this error.”
@Sergei Benkovich is that different script outside of the Kedro project? Sometimes when serialising/deserialising Python objects, you need to make sure the classes used are importable in both the serialising and the deserialising code. Here is a similar SO issue, not Kedro related: https://stackoverflow.com/questions/63101601/import-error-no-module-named-utils-when-using-pickle-load

As for @Filip Panovski's issue with Dask jobs, the underlying reason is the same as Sergei's, but it probably has more relevance to Kedro. By default, most Kedro starters follow a `src/`-based layout (https://setuptools.pypa.io/en/latest/userguide/package_discovery.html#src-layout) as contrasted with the flat layout (https://setuptools.pypa.io/en/latest/userguide/package_discovery.html#flat-layout). In order to run everything normally, Kedro adds the `src/` folder to your PYTHONPATH under the hood with `bootstrap_project`. So if you are running your code through Kedro, all your modules are importable normally. However, if you submit to other execution engines, they might have different entrypoints, processing modules, etc., and they might have different assumptions about how the packages can be imported. I am not sure about Dask, but it is entirely possible that if you run in distributed mode, rather than just parallel, Dask somehow skips the part of Kedro which adds `src/` to the PYTHONPATH and thus makes none of your code importable.

A quick way to fix this is to move all of your code out of `src/` and add this to `pyproject.toml`:
[tool.kedro]
source_dir = "."
This way you will force Kedro to use the flat layout, which will likely be easier for Dask to pick up.

As a side note, executing Python scripts has implications for what ends up on the PYTHONPATH, so you should always make sure you have one and only one entrypoint, rather than calling `python src/package/script1.py` and then `python src/package/script2.py`. Python is a nice scripting language, but the moment you start using packages, your entrypoints start to matter (I am not an expert on the topic, but I suppose it should be documented somewhere what gets added to the import path and what does not, depending on how you execute your code).
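And for the original issue in this thread: if the separate inference script lives outside `kedro run`, it can do the same bootstrapping itself before unpickling; roughly (project path illustrative):
# make the project's packages importable outside `kedro run` before loading the pickle
from pathlib import Path

import dill
from kedro.framework.startup import bootstrap_project

bootstrap_project(Path("/path/to/kedro-project"))  # adds the configured source dir to sys.path

with open("/path/to/kedro-project/data/06_models/model.pkl", "rb") as f:
    model = dill.load(f)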
f
Thanks for the in-depth response, will definitely take a look at that; we are indeed running Dask in distributed mode. Also, sorry for hijacking this thread a bit : )