# questions
h
hi all, i have a question about saving huge pickle files: I get an error when saving one to S3.
Copy code
DatasetError: Failed while saving data to data set PickleDataset(backend=pickle,
filepath=.../performance_optimisation/client/data/08_reporting/explainer.joblib,
load_args={}, protocol=s3, save_args={}).
[Errno 22] Part number must be an integer between 1 and 10000, inclusive
we’re generating a huge explainer for an automatic bidding engine; we’re now sampling roughly 1M datapoints to keep things manageable. the pipeline is deployed to AWS Batch and runs on instances with 600 GB+ of RAM, so that’s not an issue. the issue is that we should probably specify a larger chunksize for saving to S3, but my question is how to do that. i saw there is an `fs_args` key for the PickleDataset, specifically `open_args_save`, but it is unclear to me from the fsspec docs how to specify the chunksize there. does anyone have experience doing this?
👀 1
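(A minimal sketch of what that could look like, assuming a recent kedro-datasets with s3fs: `fs_args` -> `open_args_save` is forwarded to fsspec's `open()`, and s3fs treats `block_size` as the multipart part size. Since S3 caps an upload at 10,000 parts, raising the part size is what lifts the limit for very large files. The bucket and path below are placeholders, and the same keys can go under `fs_args` in catalog.yml.)
Copy code
from kedro_datasets.pickle import PickleDataset

# Hypothetical dataset definition: the filepath is a placeholder.
explainer = PickleDataset(
    filepath="s3://my-bucket/data/08_reporting/explainer.joblib",
    fs_args={
        # open_args_save is passed to fsspec/s3fs open(); block_size sets the
        # multipart part size, so 100 MB parts cover objects up to ~1 TB
        # within the 10,000-part limit.
        "open_args_save": {"block_size": 100 * 1024 * 1024},
    },
)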
d
I’ve not seen this before. Have you tried using a different, more robust pickle backend? Something like joblib/dill/cloudpickle may have better performance out of the box.
h
sorry, to clarify, we are saving it as joblib,
oh, but maybe i should specify the joblib backend <.<"
d
yes!
h
sorry :p, okay, i’ll run that and check whether that, plus some compression, makes a difference. thanks!
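(For reference, a hedged sketch of the joblib-backend-plus-compression variant: with `backend="joblib"`, `save_args` is passed straight through to `joblib.dump`, whose `compress` argument takes a 0-9 level or a (codec, level) tuple. Names and paths below are placeholders.)
Copy code
from kedro_datasets.pickle import PickleDataset

# Hypothetical dataset definition: the filepath is a placeholder.
explainer = PickleDataset(
    filepath="s3://my-bucket/data/08_reporting/explainer.joblib",
    backend="joblib",            # use joblib.dump / joblib.load instead of pickle
    save_args={"compress": 3},   # moderate zlib compression, smaller upload
)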
d
Give it a go - this hasn’t come up before so I need to think about how to achieve this
h
i reckon the issue will still pop up if we push the size of the explainer even further. for now we are turning off interactions, but in the future we’ll run it on an even bigger machine and turn them on again. but maybe we should split those calculations up and do something smart with a custom multipart dataset or something. i’ll try this joblib backend first and see!
d
so according to ChatGPT you can reassemble a pickle from chunks, so you could create the partitions yourself, use PartitionedDataset (or inherit from this and extend) and then read them back
Copy code
import pickle
import fsspec

# Function to write data in chunks
def write_pickle_in_chunks(data, file_path, chunk_size=1024):
    pickled_data = pickle.dumps(data)
    with fsspec.open(file_path, 'wb') as f:
        for i in range(0, len(pickled_data), chunk_size):
            chunk = pickled_data[i:i+chunk_size]
            f.write(chunk)

# Function to read data in chunks and reconstruct the pickle
def read_pickle_in_chunks(file_path, chunk_size=1024):
    chunks = []
    with fsspec.open(file_path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunks.append(chunk)
    pickled_data = b''.join(chunks)
    data = pickle.loads(pickled_data)
    return data

# Sample data to pickle
data = {'key': 'value', 'number': 42, 'list': [1, 2, 3, 4, 5]}

# Specify the file path
file_path = 'data.pickle'

# Write data in chunks
write_pickle_in_chunks(data, file_path)

# Read data in chunks and reconstruct the original data
reconstructed_data = read_pickle_in_chunks(file_path)

# Verify that the reconstructed data matches the original data
print(reconstructed_data == data)  # Should print: True
it’s a bit funky but it could work?
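(One hedged reading of that idea, assuming a recent kedro-datasets where PartitionedDataset lives under kedro_datasets.partitions; the bucket, chunk size and helper names are made up for illustration. The object is pickled once, the bytes are sliced into ordered partitions, and PartitionedDataset takes care of the individual files. The zero-padded keys keep the chunks in order when they are joined back.)
Copy code
import pickle

from kedro_datasets.partitions import PartitionedDataset

# Hypothetical dataset: each chunk is written/read as its own PickleDataset file.
chunked_explainer = PartitionedDataset(
    path="s3://my-bucket/data/08_reporting/explainer_chunks",
    dataset="pickle.PickleDataset",
    filename_suffix=".pkl",
)

CHUNK_SIZE = 100 * 1024 * 1024  # 100 MB per partition


def to_partitions(obj) -> dict:
    """Pickle the object once, then slice the bytes into ordered partitions."""
    blob = pickle.dumps(obj)
    return {
        f"part_{idx:05d}": blob[start:start + CHUNK_SIZE]
        for idx, start in enumerate(range(0, len(blob), CHUNK_SIZE))
    }


def from_partitions(partitions: dict):
    """PartitionedDataset.load() returns {partition_id: loader}; join in key order."""
    blob = b"".join(loader() for _, loader in sorted(partitions.items()))
    return pickle.loads(blob)


# Usage (e.g. inside nodes):
#   chunked_explainer.save(to_partitions(explainer))
#   explainer = from_partitions(chunked_explainer.load())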
what model class is it?
h
hmm, we turned to ChatGPT before and got this:
Copy code
import boto3
import joblib
import io
from boto3.s3.transfer import TransferConfig
from kedro.io.core import AbstractDataset, DatasetError

class S3MultipartPickleDataSet(AbstractDataset):
    def __init__(self, filepath: str, save_args: dict = None):
        self._filepath = filepath
        self._save_args = save_args or {}
        self._s3 = boto3.client('s3')

    def _describe(self) -> dict:
        return dict(filepath=self._filepath, save_args=self._save_args)

    def _load(self) -> object:
        raise DatasetError("Load not implemented")

    def _save(self, data: object) -> None:
        try:
            # Create an in-memory bytes buffer
            with io.BytesIO() as temp_buffer:
                # Dump the data to the in-memory buffer using joblib
                joblib.dump(data, temp_buffer)
                # Ensure the buffer's cursor is at the start
                temp_buffer.seek(0)
                # Stream the buffer to S3
                bucket, key = self._parse_s3_url(self._filepath)
                self._s3.upload_fileobj(temp_buffer, bucket, key, Config=self._get_transfer_config())
        except Exception as exc:
            raise DatasetError(f"Failed to save data to {self._filepath}") from exc

    def _get_transfer_config(self):
        return TransferConfig(
            multipart_threshold=int(self._save_args.get('multipart_chunksize', 25 * 1024 * 1024)),
            max_concurrency=10,
            multipart_chunksize=int(self._save_args.get('multipart_chunksize', 25 * 1024 * 1024)),
            use_threads=True
        )

    @staticmethod
    def _parse_s3_url(s3_url: str):
        assert s3_url.startswith('s3://')
        bucket_key = s3_url[len('s3://'):].split('/', 1)
        return bucket_key[0], bucket_key[1]
but for some reason that also did not work (the batch job failed, but the logs show no error/traceback)
d
This is super interesting, but I don’t have time to prototype it myself
which library is generating the pickle?
h
ExplainerDashboard
it has a built-in to_file method, which pickles self
but pickling it directly works fine
(for smaller sample sizes)
d
looking at the docs, there are some examples of doing that for a Docker target: https://explainerdashboard.readthedocs.io/en/latest/deployment.html#docker-deployment
but further down the page there are also some memory-saving tips
h
yeah, but that’s quite opinionated :p, it’s easier to integrate ExplainerDashboard with Kedro by not using their methods
d
got it
h
but yeah, i was also thinking about making a custom ExplainerDataset
👍 1
incorporating some of those ideas, but the Kedro PickleDataset is just easier, less maintenance for me
d
I think you’ll probably need a custom one, but I think it would be cool for us to build a PartitionedPickleDataset, or prove that it works, longer term
m
h
ahh yes, the CloudPickle one, i think that was also the default dataset of one of the kedro cloud runner deployments, right?
m
Yup, sagemaker, as linked
h
< it’s right in the name :p
sorry, i didn’t have my coffee yet. okay, thanks @marrrcin, i’ll try that as well!
👍 1
d
@marrrcin fancy contributing that to kedro-datasets for consumability?
m
Fancy, fancy 😄
Let's see if it works first 😛
🚀 1
j
so, an explainer is not a model, right? asking in case any of ONNX, safetensors, or skops would help here
h
nope, it’s not a model
✔️ 1
it worked! with the compression, the joblib-backed PickleDataset totally worked. thanks guys!
j
great to know!!
d
So it worked without any custom bits?
h
exactly
💪 1
🎉 1