# questions
h
hi all, i have a question about saving huge pickle files: I get an error when saving one to S3.
Copy code
DatasetError: Failed while saving data to data set PickleDataset(backend=pickle,
filepath=.../performance_optimisation/client/data/08_reporting/explainer.joblib,
load_args={}, protocol=s3, save_args={}).
[Errno 22] Part number must be an integer between 1 and 10000, inclusive
we’re generating a huge explainer for an automatic bidding engine; we’re now sampling roughly 1M datapoints to keep things manageable. the pipeline is deployed to AWS Batch and runs on instances with 600 GB+ of RAM, so that’s not an issue. the issue is that we should probably specify a larger chunksize for saving to S3, but my question is how to do that. i saw there is an `fs_args` key for the PickleDataset, specifically `open_args_save`, but it is unclear to me from the fsspec docs how to specify the chunksize there. does anyone have experience doing this?
👀 1
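(A minimal sketch of what that could look like, assuming a recent kedro-datasets with s3fs: `fs_args` -> `open_args_save` is forwarded to fsspec's `open()`, and s3fs treats `block_size` as the multipart part size. Since S3 caps an upload at 10,000 parts, raising the part size is what lifts the limit for very large files. The bucket and path below are placeholders, and the same keys can go under `fs_args` in catalog.yml.)
Copy code
from kedro_datasets.pickle import PickleDataset

# Hypothetical dataset definition: the filepath is a placeholder.
explainer = PickleDataset(
    filepath="s3://my-bucket/data/08_reporting/explainer.joblib",
    fs_args={
        # open_args_save is passed to fsspec/s3fs open(); block_size sets the
        # multipart part size, so 100 MB parts cover objects up to ~1 TB
        # within the 10,000-part limit.
        "open_args_save": {"block_size": 100 * 1024 * 1024},
    },
)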
d
I’ve not seen this before. Have you tried using a different, more robust pickle backend? Something like joblib/dill/cloudpickle may have better performance out of the box.
h
sorry, to clarify, we are saving it as joblib,
oh, but maybe i should specify the joblib backend <.<"
d
yes!
h
sorry :p, okay, i’ll run that and check whether that, plus some compression, makes a difference. thanks!
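(For reference, a hedged sketch of the joblib-backend-plus-compression variant: with `backend="joblib"`, `save_args` is passed straight through to `joblib.dump`, whose `compress` argument takes a 0-9 level or a (codec, level) tuple. Names and paths below are placeholders.)
Copy code
from kedro_datasets.pickle import PickleDataset

# Hypothetical dataset definition: the filepath is a placeholder.
explainer = PickleDataset(
    filepath="s3://my-bucket/data/08_reporting/explainer.joblib",
    backend="joblib",            # use joblib.dump / joblib.load instead of pickle
    save_args={"compress": 3},   # moderate zlib compression, smaller upload
)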
d
Give it a go - this hasn’t come up before so I need to think about how to achieve this
h
i reckon the issue will still pop up if we push the size of the explainer even further. for now we are turning off interactions, but in the future we’ll run it on an even bigger machine and turn them on again. but maybe we should split those calculations up and do something smart with a custom multipart dataset or something. i’ll try this joblib backend first and see!
d
so according to ChatGPT you can reassemble a pickle from chunks, so you could create the partitions yourself, use PartitionedDataset (or inherit from this and extend) and then read them back
Copy code
import pickle
import fsspec

# Function to write data in chunks
def write_pickle_in_chunks(data, file_path, chunk_size=1024):
    pickled_data = pickle.dumps(data)
    with fsspec.open(file_path, 'wb') as f:
        for i in range(0, len(pickled_data), chunk_size):
            chunk = pickled_data[i:i+chunk_size]
            f.write(chunk)

# Function to read data in chunks and reconstruct the pickle
def read_pickle_in_chunks(file_path, chunk_size=1024):
    chunks = []
    with fsspec.open(file_path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunks.append(chunk)
    pickled_data = b''.join(chunks)
    data = pickle.loads(pickled_data)
    return data

# Sample data to pickle
data = {'key': 'value', 'number': 42, 'list': [1, 2, 3, 4, 5]}

# Specify the file path
file_path = 'data.pickle'

# Write data in chunks
write_pickle_in_chunks(data, file_path)

# Read data in chunks and reconstruct the original data
reconstructed_data = read_pickle_in_chunks(file_path)

# Verify that the reconstructed data matches the original data
print(reconstructed_data == data)  # Should print: True
it’s a bit funky but it could work?
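(One hedged reading of that idea, assuming a recent kedro-datasets where PartitionedDataset lives under kedro_datasets.partitions; the bucket, chunk size and helper names are made up for illustration. The object is pickled once, the bytes are sliced into ordered partitions, and PartitionedDataset takes care of the individual files. The zero-padded keys keep the chunks in order when they are joined back.)
Copy code
import pickle

from kedro_datasets.partitions import PartitionedDataset

# Hypothetical dataset: each chunk is written/read as its own PickleDataset file.
chunked_explainer = PartitionedDataset(
    path="s3://my-bucket/data/08_reporting/explainer_chunks",
    dataset="pickle.PickleDataset",
    filename_suffix=".pkl",
)

CHUNK_SIZE = 100 * 1024 * 1024  # 100 MB per partition


def to_partitions(obj) -> dict:
    """Pickle the object once, then slice the bytes into ordered partitions."""
    blob = pickle.dumps(obj)
    return {
        f"part_{idx:05d}": blob[start:start + CHUNK_SIZE]
        for idx, start in enumerate(range(0, len(blob), CHUNK_SIZE))
    }


def from_partitions(partitions: dict):
    """PartitionedDataset.load() returns {partition_id: loader}; join in key order."""
    blob = b"".join(loader() for _, loader in sorted(partitions.items()))
    return pickle.loads(blob)


# Usage (e.g. inside nodes):
#   chunked_explainer.save(to_partitions(explainer))
#   explainer = from_partitions(chunked_explainer.load())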
what model class is it?
h
hmm, we turned to ChatGPT before and got this:
Copy code
import boto3
import joblib
import io
from boto3.s3.transfer import TransferConfig
from kedro.io.core import AbstractDataset, DatasetError

class S3MultipartPickleDataSet(AbstractDataset):
    def __init__(self, filepath: str, save_args: dict = None):
        self._filepath = filepath
        self._save_args = save_args or {}
        self._s3 = boto3.client('s3')

    def _describe(self) -> dict:
        return dict(filepath=self._filepath, save_args=self._save_args)

    def _load(self) -> object:
        raise DatasetError("Load not implemented")

    def _save(self, data: object) -> None:
        try:
            # Create an in-memory bytes buffer
            with io.BytesIO() as temp_buffer:
                # Dump the data to the in-memory buffer using joblib
                joblib.dump(data, temp_buffer)
                # Ensure the buffer's cursor is at the start
                temp_buffer.seek(0)
                # Stream the buffer to S3
                bucket, key = self._parse_s3_url(self._filepath)
                self._s3.upload_fileobj(temp_buffer, bucket, key, Config=self._get_transfer_config())
        except Exception as exc:
            raise DatasetError(f"Failed to save data to {self._filepath}") from exc

    def _get_transfer_config(self):
        return TransferConfig(
            multipart_threshold=int(self._save_args.get('multipart_chunksize', 25 * 1024 * 1024)),
            max_concurrency=10,
            multipart_chunksize=int(self._save_args.get('multipart_chunksize', 25 * 1024 * 1024)),
            use_threads=True
        )

    @staticmethod
    def _parse_s3_url(s3_url: str):
        assert s3_url.startswith('s3://')
        bucket_key = s3_url[len('s3://'):].split('/', 1)
        return bucket_key[0], bucket_key[1]
but for some reason that also did not work (the batch job failed, but the logs show no error/traceback)
d
This is super interesting, but I don’t have time to prototype it myself
which library is generating the pickle?
h
ExplainerDashboard
it has a built-in to_file method, which pickles self
but pickling it directly works fine
(for smaller sample sizes)
d
looking at the docs, there are some examples of doing that for a Docker target: https://explainerdashboard.readthedocs.io/en/latest/deployment.html#docker-deployment
but further down the page there are also some memory-saving tips
h
yeah, but that’s quite opinionated :p, it’s easier to integrate ExplainerDashboard with Kedro by not using their methods
d
got it
h
but yeah, i was also thinking about making a custom ExplainerDataset
👍 1
incorporating some of those ideas, but the Kedro PickleDataset is just easier, less maintenance for me
d
I think you’ll probably need a custom one, but I think it would be cool for us to build a PartitionedPickleDataset, or prove that it works, longer term
m
h
ahh yes, the CloudPickle one, i think that was also the default dataset of one of the kedro cloud runner deployments, right?
m
Yup, sagemaker, as linked
h
< it’s right in the name :p
sorry, i didn’t have my coffee yet. okay, thanks @marrrcin, i’ll try that as well!
👍 1
d
@marrrcin fancy contributing that to kedro-datasets for consumability?
m
Fancy, fancy 😄
Let's see if it works first 😛
🚀 1
j
so, an explainer is not a model, right? asking in case any of ONNX, safetensors, or skops would help here
h
nope, it’s not a model
✔️ 1
it worked! with the compression, the joblib-backed PickleDataset totally worked. thanks guys!
j
great to know!!
d
So it worked without any custom bits?
h
exactly
💪 1
🎉 1