# questions
m
Hello, is it possible to save a sklearn pipeline object as a pickle? Because I have this error:
DataSetError: <class 'sklearn.pipeline.Pipeline'> was not serialised due to: Can't pickle local object 'fit_best_model.<locals>.<lambda>'
I just return a partitioned pickle dataset like this:
return {'model_' + parameters['model']: pipeline}
and I define the dataset in catalog.yml like this:
models_partionned:
  type: PartitionedDataSet
  path: data/06_models/${date}/${target}/
  filename_suffix: ".pkl"
  dataset:
    type: pickle.PickleDataSet
m
Use the backend: cloudpickle param for the PickleDataSet (install cloudpickle first), or don't use lambdas in your sklearn Pipeline.
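For context, a minimal sketch (outside Kedro) of why that works: the standard pickle module refuses to serialise a lambda, while cloudpickle serialises it by value. In the catalog above that would mean adding backend: cloudpickle under the pickle.PickleDataSet entry (recent Kedro versions accept this; cloudpickle must be installed).

import pickle

import cloudpickle  # pip install cloudpickle

tokenize = lambda x: x.split(' ')

# The standard library pickle cannot serialise a lambda; this is the same
# failure Kedro's PickleDataSet reports above.
try:
    pickle.dumps(tokenize)
except (pickle.PicklingError, AttributeError) as err:
    print(f"pickle failed: {err}")

# cloudpickle serialises the lambda by value, so it round-trips fine.
blob = cloudpickle.dumps(tokenize)
restored = cloudpickle.loads(blob)
print(restored("a b c"))  # ['a', 'b', 'c']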
m
what are lambdas here please?
d
@Massinissa Saïdi Did you define a function called fit_best_model?
m
yes
d
Can you share the definition? Or at least check if you used lambda in there?
m
ooh, when Marcin said lambda he meant the lambda function. Yes, I used it:
TfidfVectorizer(tokenizer=lambda x: x.split(' '),...
d
You can define a separate function instead, or you may even be able to:
from operator import methodcaller

TfidfVectorizer(tokenizer=methodcaller('split', ' '),...
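And if you prefer the "define a separate function" route, a minimal sketch (space_tokenizer is just an illustrative name):

from sklearn.feature_extraction.text import TfidfVectorizer

def space_tokenizer(text):
    """Split on single spaces, same behaviour as the original lambda."""
    return text.split(' ')

# A module-level named function is pickled by reference, so the fitted
# Pipeline can be saved with the plain pickle backend.
vectorizer = TfidfVectorizer(tokenizer=space_tokenizer)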
m
ok nice thanks 🙂
d
Also, just two minor (unsolicited) notes: 1. Maybe you don't need to pass the ' ' argument to split? By default, split already separates on any run of whitespace, unless you really need it to split on single spaces (quick example below). 2. models_partionned is spelled wrong (if it's English) 😉
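Quick example for note 1, assuming a string containing a double space and a tab:

text = "foo  bar\tbaz"

# split() with no argument collapses any run of whitespace
print(text.split())     # ['foo', 'bar', 'baz']

# split(' ') splits on every single space and keeps empty strings
print(text.split(' '))  # ['foo', '', 'bar\tbaz']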
m
yes I wrote too fast haha thx