# questions
f
Hey guys, me again 😄 I had a question regarding the parquet dataset itself. I often encounter issues with custom datatypes during saving. For instance, if I have a custom class in my dataframe, I would like to keep it as is (which is the reason I use parquet). I know custom serializer/deserializer code is required to do this. I can for sure do it in my own code, but since it's IO related, I believe it should be done in the dataset definition, where I can somehow point to my custom class which gets serialized before writing to file. I will work on the extended version now; I was wondering if it was discussed before? I am happy to push this as a PR later.
n
If I summarised this right, the problem is you need some class which is not the usual schema that can be defined in the YAML format? Could you share the constructor of your current class? Maybe using an omegaconf resolver can already solve this. How is the custom serialiser registered?
f
So basically what I imagine is this:
```yaml
data:
  type: pandas.ParquetDataset
  filepath: ....
  serialize_cls: projx.pipelines.MyClass
```
This new `serialize_cls` points to a custom data class that I create and that lives in my dataframe. During the parquet save, I will invoke methods of my class to be able to write and read my parquet dataset. Right now only default data types work in parquet, so anything custom doesn't get serialized. Hence the solution.
The user can then define, let's say, `_serialize_` and `_deserialize_` functions which get invoked pre/post load and save calls.
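Roughly as a sketch, something like this (the class name, `serialize_cls`, `_serialize_` and `_deserialize_` are just the hypothetical names from above, and the exact `_load`/`_save` method names can differ between Kedro / kedro-datasets versions):

```python
# Rough sketch only, not an existing Kedro feature.
from importlib import import_module

import pandas as pd
from kedro_datasets.pandas import ParquetDataset


class SerializingParquetDataset(ParquetDataset):
    def __init__(self, *, serialize_cls: str, **kwargs):
        super().__init__(**kwargs)
        # resolve the dotted path from the catalog entry to the actual class
        module, _, name = serialize_cls.rpartition(".")
        self._cls = getattr(import_module(module), name)

    def _save(self, data: pd.DataFrame) -> None:
        # let the custom class convert itself to parquet-friendly types first
        super()._save(self._cls._serialize_(data))

    def _load(self) -> pd.DataFrame:
        # restore the custom objects after the plain parquet load
        return self._cls._deserialize_(super()._load())
```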
I am just thinking if there is a need from other people to add this into kedro, otherwise I can for sure implement a custom solution locally.
or possibly with omegaconf:
```yaml
data:
  type: pandas.ParquetDataset
  filepath: ....
  serializer: ...    # omegaconf python code
  deserializer: ...  # omegaconf python code
```
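For the resolver route, registering a custom resolver in `settings.py` could look roughly like this (the `obj` resolver name and the dotted path are just illustrative; recent Kedro versions let you pass `custom_resolvers` to the `OmegaConfigLoader`):

```python
# settings.py -- illustrative sketch
from importlib import import_module

from kedro.config import OmegaConfigLoader


def _load_obj(path: str):
    """Resolve a dotted path like 'projx.pipelines.serialize_df' to the object."""
    module, _, name = path.rpartition(".")
    return getattr(import_module(module), name)


CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {"custom_resolvers": {"obj": _load_obj}}
```

Then the catalog entry could reference `serializer: "${obj:projx.pipelines.serialize_df}"`, although the dataset itself would still need to accept such an argument.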
n
My question would be: how would you do that with pure Python code, without Kedro? How is pandas currently supporting this?
https://stackoverflow.com/questions/61271295/how-to-save-a-pandas-dataframe-with-custom-types-using-pyarrow-and-parquet From this thread it seems like if you have the serialisation method implemented properly, you don't need anything extra?
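For illustration, a bare pandas/pyarrow version of that idea with a made-up `MyClass`, pickling the column so pyarrow can store it as bytes:

```python
import pickle

import pandas as pd


class MyClass:
    def __init__(self, value):
        self.value = value


df = pd.DataFrame({"obj": [MyClass(1), MyClass(2)]})

# serialize the custom objects to bytes so pyarrow can write them
df.assign(obj=df["obj"].map(pickle.dumps)).to_parquet("data.parquet")

# read back and restore the custom objects
df_in = pd.read_parquet("data.parquet")
df_in["obj"] = df_in["obj"].map(pickle.loads)
```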
f
I definitely have to provide a serialization method; the solution is the same with or without kedro. I am just porting this support to the kedro reader/writer, as I don't want to do data conversion in my node, since that isn't part of what the node function is supposed to do.
I think an omegaconf resolver could work fine, I'm testing a solution at the moment.
n
Let us know if it works; if there is a need for extending the current class, feel free to open an issue and PR for this.
But I would say `serialiser` sounds like it belongs to `save_args` and `deserialiser` belongs to `load_args`. The arguments available for a specific dataset usually mirror what the underlying API supports, which in this case is `pd.read_parquet` and `pd.to_parquet`.
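For example, something along these lines (the filepath is illustrative; `load_args` and `save_args` are passed straight through to `pd.read_parquet` / `pd.to_parquet`):

```yaml
data:
  type: pandas.ParquetDataset
  filepath: data/01_raw/data.parquet
  load_args:
    engine: pyarrow
  save_args:
    engine: pyarrow
    compression: snappy
```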
f
Yes, you are right, but pandas with the new pyarrow dtype support is somehow super complicated and requires deeper knowledge of the underlying pandas code. I have a working version now, but now that I think about it (see pic), this can be done with hooks. All I have to do is check for the parquet dataset and apply serialization before read/write 😄
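Roughly, the hook version is a sketch along these lines (the dataset name, column name and the use of pickle are placeholders for my own serializer, and hook signatures can vary slightly between Kedro versions):

```python
# hooks.py -- rough sketch of the idea, not the exact working version
import pickle

import pandas as pd
from kedro.framework.hooks import hook_impl


class ParquetSerializationHooks:
    _DATASETS = {"data"}  # datasets known to hold custom objects
    _COLUMN = "obj"       # column containing the custom class instances

    @hook_impl
    def before_dataset_saved(self, dataset_name, data, node):
        if dataset_name in self._DATASETS and isinstance(data, pd.DataFrame):
            # mutate in place: Kedro does not use the hook's return value
            data[self._COLUMN] = data[self._COLUMN].map(pickle.dumps)

    @hook_impl
    def after_dataset_loaded(self, dataset_name, data, node):
        if dataset_name in self._DATASETS and isinstance(data, pd.DataFrame):
            data[self._COLUMN] = data[self._COLUMN].map(pickle.loads)
```

registered in `settings.py` with `HOOKS = (ParquetSerializationHooks(),)`.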
n
Ah, I see what you are doing here. I think this is a smart way of handling it. It would be good if you could open an issue; I think this will be useful for other people who may have the same problem regardless.
f
Yes, so this works like a charm:
I'll open an issue and share this there, but I'm not sure if more is required. Hooks are just awesome, sometimes I forget about them 😄
Maybe we can just put this into the documentation? That way people can discover it.