# questions
p
Hey guys - I am currently trying to save/load pyspark ml objects through the catalog. The documentation (https://kedro.readthedocs.io/en/stable/tools_integration/pyspark.html#use-memorydataset-with-copy-mode-assign-for-non-dataframe-spark-objects) recommends using MemoryDataSets for those non-dataframe instances. That is all fine and well, but of course not being able to persist any transformers becomes quite tedious at some point. Is there any guidance/development on that front?
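For context, this is roughly what the docs' recommendation looks like through the Python API - just a minimal sketch, and the "spark_model" entry name is purely illustrative:
```python
from kedro.io import DataCatalog, MemoryDataSet

# copy_mode="assign" hands back the original object reference instead of
# attempting a deepcopy, which fails for JVM-backed Spark objects
catalog = DataCatalog(
    {
        "spark_model": MemoryDataSet(copy_mode="assign"),
    }
)
```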
d
can you explain more about the type of object you’re looking to serialise?
I think this is the first time we’ve had a user ask for this - it should be a pretty simple custom dataset to implement
and lastly, what happens if you try and use one of the PickleDataSet engines? Though these will be JVM objects, so not pickleable
p
Exactly, it is not pickleable, since it is a jvm object. Sorry for asking again: what do you mean by building a wrapper around the MLWriter and MLReader? Do you mean building a kedro dataset that utilizes these classes?
And thanks for the quick answers so far
d
exactly
so you just wrap the load and save methods of those classes and it should work - we’d also love a contribution back into the project if you get it working!
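for illustration, something like this minimal sketch - assuming a PipelineModel and the AbstractDataSet base class; the SparkMLModelDataSet name and the filepath handling are just placeholders:
```python
from typing import Any, Dict

from kedro.io import AbstractDataSet
from pyspark.ml import PipelineModel


class SparkMLModelDataSet(AbstractDataSet):
    """Persists Spark ML objects through their native MLWriter/MLReader
    interfaces rather than pickle, so JVM-backed models survive a run."""

    def __init__(self, filepath: str):
        self._filepath = filepath

    def _save(self, model: PipelineModel) -> None:
        # MLWritable objects expose .write(), which returns an MLWriter
        model.write().overwrite().save(self._filepath)

    def _load(self) -> PipelineModel:
        # MLReadable classes expose .load(), backed by their MLReader
        return PipelineModel.load(self._filepath)

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": self._filepath}
```
then you'd register it in the catalog like any other dataset, with filepath pointing at a location Spark can write to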
p
that sounds great - appreciate the help!
Thanks