when saving a model using PickleDataSet with dill ...
# questions
s
When saving a model using PickleDataSet with the dill backend, it packages the node in which the model instance was created and run; trying to dill.load it raises
ModuleNotFoundError: No module named 'pipelines'
any suggestions on how to handle it?
d
This feels like a working directory / installation issue. Is this all on the same machine?
s
Yes, on the same machine. I have a node which instantiates a model, trains it, and outputs it as a result. It gets saved using the Data Catalog; then I use a different script to load this dill and run inference on the model, but I get this error.
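For reference, the dataset is set up roughly like this (paths are illustrative and the import path depends on the Kedro version):
# roughly equivalent to a catalog.yml entry of type pickle.PickleDataSet with backend: dill
from kedro.extras.datasets.pickle import PickleDataSet  # import path varies by Kedro version

model_dataset = PickleDataSet(
    filepath="data/06_models/model.pkl",  # illustrative path
    backend="dill",                       # serialise with dill instead of the stdlib pickle
)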
d
ah interesting - is it running the same environment and version of python?
s
Yes, same interpreter.
d
can you post more of the stack trace?
s
But I can’t seem to reproduce it… I reran the whole pipeline and successfully loaded the dill. This error occurs on my previously trained model… so I can’t reproduce it right now, but I’ll ping if I see it again 🙂 thx!
d
interesting
The error looks like a Kedro one, but as I said, it’s something I’d expect to see if you were in the wrong directory; it’s not related to the pickle itself.
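If you do hit it again, a quick sanity check in the loading script would be something like this (just a sketch; 'pipelines' being whatever package the traceback names):
# check whether the package the pickle references is importable from this interpreter
import importlib.util
import os
import sys

print(os.getcwd())                            # are we where we think we are?
print(sys.path)                               # is the project / src dir on the path?
print(importlib.util.find_spec("pipelines"))  # None means dill.load will fail the same way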
f
@Sergei Benkovich We ran into a similar issue to this when submitting our Kedro pipeline to a Dask cluster: functions within the same node were unable to call each other (and failed with a `No module named 'my_pipeline'` error). One temporary workaround was to define the functions being called as inner functions, so instead of:
def A(df):
    ...  # body of A

def B(df):
    ...  # body of B
    A(df)
we'd have:
def B(df):
    def A(df):
        ...  # body of A
    ...  # body of B
    A(df)
The assumption was that the code de/serialization wasn't quite working for some reason, though we used cloudpickle (the default?). Packaging stuff into wheels and importing those instead would probably work too. Do note that using these inner functions is a bit slower than calling separate functions, though we haven't benchmarked it accurately. I'm not quite sure if this helps, but it may be worth a shot if this happens when running the nodes themselves...
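Another thing that might be worth a try (untested on our side): cloudpickle >= 2.0, which is what Dask uses for serialisation by default, can be told to serialise a module by value instead of by reference, so the workers don't need to be able to import it:
# sketch only: ship the package's source along with the pickled functions instead of
# a "my_pipeline.xyz" reference that the Dask worker would have to import
import cloudpickle

import my_pipeline  # placeholder for the actual project package

cloudpickle.register_pickle_by_value(my_pipeline)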
d
this is super interesting @Filip Panovski thanks for reporting
f
No problem, but I doubt that writing it up in a random chat like this is very helpful 🙂 I think I may have posted about this previously in fact, but seeing how it's unknown (and maybe related to the problem in this thread?) I'll try to get an actual issue (with MRE) up by the end of the week
s
Pickles have many issues related to co-dependencies; that's why in some cases we use a Docker container of the complete repo. The structure is: the model class lives in models.py, which sits at the same level of the hierarchy as the pipelines folder. There is a pipeline running training; a node file in this pipeline imports the model, creates an instance of the class, and runs fit. The node returns the model to a pickle Data Catalog entry. Then another script, at the same level as the models folder, tries to load the model and run predict on some test input. Not sure that was clear 😅 But thanks for the responses and help! Maybe the instance should also be created in models.py, and then just import the instance and not the class.
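For concreteness, the loading side is basically this (names and paths approximate):
# rough sketch of the separate inference script; the instance was pickled in the
# training node, so (if I understand the replies above correctly) this interpreter
# has to be able to import the same models module, e.g. by running from the project
# root or having it on sys.path
import dill

with open("data/06_models/model.pkl", "rb") as f:  # illustrative path
    model = dill.load(f)

# ... then model.predict(<some test input>)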
i
“then use a different script to load this dill and run inference on the model but get this error.”
@Sergei Benkovich is that different script outside of the Kedro project? Sometimes when serialising/deserialising Python objects, you need to make sure the classes used are importable in both the serialising and the deserialising code. Here is a similar SO issue, not Kedro related: https://stackoverflow.com/questions/63101601/import-error-no-module-named-utils-when-using-pickle-load

As for @Filip Panovski's issue with Dask jobs, the underlying reason is the same as Sergei's, but it probably has more relevance to Kedro. By default, most Kedro starters follow a `src/`-based layout (https://setuptools.pypa.io/en/latest/userguide/package_discovery.html#src-layout) as contrasted with the flat layout (https://setuptools.pypa.io/en/latest/userguide/package_discovery.html#flat-layout). In order to run everything normally, Kedro adds the `src/` folder to your PYTHONPATH under the hood with `bootstrap_project`. So if you are running your code through Kedro, all your modules are importable normally. However, if you submit to other execution engines, they might have different entrypoints, processing modules, etc., and they might have different assumptions about how the packages can be imported. I am not sure about Dask, but it is entirely possible that if you run in distributed mode, rather than just parallel, Dask somehow skips the part of Kedro which adds `src/` to the PYTHONPATH and thus makes none of your code importable.

A quick way to fix this is to move all of your code out of `src/` and add this to `pyproject.toml`:
[tool.kedro]
source_dir = "."
This way you will force Kedro to use the flat layout, which will likely be easier for Dask to pick up.

As a side note, executing Python scripts has implications for what ends up on the PYTHONPATH, so you should always make sure you have one and only one entrypoint, rather than calling `python src/package/script1.py` and then `python src/package/script2.py`. Python is a nice scripting language, but the moment you start using packages, your entrypoints start to matter (I am not an expert on the topic, but I suppose it should be documented somewhere what gets added to the import path and what does not, depending on how you execute your code).
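And for the original issue in this thread: if the separate inference script lives outside `kedro run`, it can do the same bootstrapping itself before unpickling; roughly (project path illustrative):
# make the project's packages importable outside `kedro run` before loading the pickle
from pathlib import Path

import dill
from kedro.framework.startup import bootstrap_project

bootstrap_project(Path("/path/to/kedro-project"))  # adds the configured source dir to sys.path

with open("/path/to/kedro-project/data/06_models/model.pkl", "rb") as f:
    model = dill.load(f)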
f
Thanks for the in-depth response, will definitely take a look at that; we are indeed running Dask in distributed mode. Also, sorry for hijacking this thread a bit : )