# questions
f
Hey guys, me again 😄 I had a question regarding the parquet dataset itself. I often encounter issues with custom datatypes during saving. For instance, if I have a custom class in my dataframe, I would like to keep it as is (which is the reason I use parquet). I know custom serializer/deserializer code is required to do this. I can for sure do it in my own code, but since it's IO related, I believe it should be done in the dataset definition, where I can somehow point to my custom class which gets serialized before writing to file. I will work on the extended version now; I was wondering if it was discussed before? I am happy to push this as a PR later.
n
If I summarised this right, the problem is you need some class which is not the usual schema that can be defined in the YAML format? Could you share the constructor of your current class? Maybe using an omegaconf resolver can already solve this. How is the custom serialiser registered?
f
So basically what I imagine is this:
```yaml
data:
  type: pandas.ParquetDataset
  filepath: ....
  serialize_cls: projx.pipelines.MyClass
```
This new `serialize_cls` points to a custom data class that I create and that lives in my dataframe. During the parquet save, I will invoke methods of my class to be able to write and read my parquet dataset. Right now only default data types work in parquet, so anything custom doesn't get serialized. Hence the solution.
The user can then define, let's say, `_serialize_` and `_deserialize_` functions which get invoked pre/post load and save calls.
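Roughly as a sketch, something like this (the class name, `serialize_cls`, `_serialize_` and `_deserialize_` are just the hypothetical names from above, and the exact `_load`/`_save` method names can differ between Kedro / kedro-datasets versions):

```python
# Rough sketch only, not an existing Kedro feature.
from importlib import import_module

import pandas as pd
from kedro_datasets.pandas import ParquetDataset


class SerializingParquetDataset(ParquetDataset):
    def __init__(self, *, serialize_cls: str, **kwargs):
        super().__init__(**kwargs)
        # resolve the dotted path from the catalog entry to the actual class
        module, _, name = serialize_cls.rpartition(".")
        self._cls = getattr(import_module(module), name)

    def _save(self, data: pd.DataFrame) -> None:
        # let the custom class convert itself to parquet-friendly types first
        super()._save(self._cls._serialize_(data))

    def _load(self) -> pd.DataFrame:
        # restore the custom objects after the plain parquet load
        return self._cls._deserialize_(super()._load())
```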
I am just thinking if there is a need from other people to add this into kedro, otherwise I can for sure implement a custom solution locally.
or possibly with omegaconf:
```yaml
data:
  type: pandas.ParquetDataset
  filepath: ....
  serializer: ...    # omegaconf python code
  deserializer: ...  # omegaconf python code
```
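For the resolver route, registering a custom resolver in `settings.py` could look roughly like this (the `obj` resolver name and the dotted path are just illustrative; recent Kedro versions let you pass `custom_resolvers` to the `OmegaConfigLoader`):

```python
# settings.py -- illustrative sketch
from importlib import import_module

from kedro.config import OmegaConfigLoader


def _load_obj(path: str):
    """Resolve a dotted path like 'projx.pipelines.serialize_df' to the object."""
    module, _, name = path.rpartition(".")
    return getattr(import_module(module), name)


CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {"custom_resolvers": {"obj": _load_obj}}
```

Then the catalog entry could reference `serializer: "${obj:projx.pipelines.serialize_df}"`, although the dataset itself would still need to accept such an argument.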
n
My question would be: how would you do that with pure Python code, without Kedro? How is pandas currently supporting this?
https://stackoverflow.com/questions/61271295/how-to-save-a-pandas-dataframe-with-custom-types-using-pyarrow-and-parquet From this thread it seems like if you have the serialisation method implemented properly, you don't need anything extra?
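For illustration, a bare pandas/pyarrow version of that idea with a made-up `MyClass`, pickling the column so pyarrow can store it as bytes:

```python
import pickle

import pandas as pd


class MyClass:
    def __init__(self, value):
        self.value = value


df = pd.DataFrame({"obj": [MyClass(1), MyClass(2)]})

# serialize the custom objects to bytes so pyarrow can write them
df.assign(obj=df["obj"].map(pickle.dumps)).to_parquet("data.parquet")

# read back and restore the custom objects
df_in = pd.read_parquet("data.parquet")
df_in["obj"] = df_in["obj"].map(pickle.loads)
```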
f
I definitely have to provide a serialization method; the solution is the same with or without kedro. I am just porting this support to the kedro reader/writer, as I don't want to do data conversion in my node, since that isn't part of what the node function is supposed to do.
I think an omegaconf resolver could work fine, I'm testing a solution at the moment.
n
Let us know if it works; if there is a need for extending the current class, feel free to open an issue and PR for this.
But I would say `serialiser` sounds like it belongs to `save_args` and `deserialiser` belongs to `load_args`. The arguments available for a specific dataset usually mirror what the underlying API supports, which in this case is `pd.read_parquet` and `pd.to_parquet`.
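For example, something along these lines (the filepath is illustrative; `load_args` and `save_args` are passed straight through to `pd.read_parquet` / `pd.to_parquet`):

```yaml
data:
  type: pandas.ParquetDataset
  filepath: data/01_raw/data.parquet
  load_args:
    engine: pyarrow
  save_args:
    engine: pyarrow
    compression: snappy
```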
f
Yes, you are right, but pandas with the new pyarrow dtype support is somehow super complicated and requires deeper knowledge of the underlying pandas code. I have a working version now, but now that I think about it (see pic), this can be done with hooks. All I have to do is check for the parquet dataset and apply serialization before read/write 😄
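Roughly, the hook version is a sketch along these lines (the dataset name, column name and the use of pickle are placeholders for my own serializer, and hook signatures can vary slightly between Kedro versions):

```python
# hooks.py -- rough sketch of the idea, not the exact working version
import pickle

import pandas as pd
from kedro.framework.hooks import hook_impl


class ParquetSerializationHooks:
    _DATASETS = {"data"}  # datasets known to hold custom objects
    _COLUMN = "obj"       # column containing the custom class instances

    @hook_impl
    def before_dataset_saved(self, dataset_name, data, node):
        if dataset_name in self._DATASETS and isinstance(data, pd.DataFrame):
            # mutate in place: Kedro does not use the hook's return value
            data[self._COLUMN] = data[self._COLUMN].map(pickle.dumps)

    @hook_impl
    def after_dataset_loaded(self, dataset_name, data, node):
        if dataset_name in self._DATASETS and isinstance(data, pd.DataFrame):
            data[self._COLUMN] = data[self._COLUMN].map(pickle.loads)
```

registered in `settings.py` with `HOOKS = (ParquetSerializationHooks(),)`.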
n
Ah, I see what you are doing here. I think this is a smart way of handling it. It would be good if you could open an issue; I think this will be useful for other people who may have the same problem regardless.
f
Yes, so this works like a charm:
I'll open an issue and share this there, but I'm not sure if more is required. Hooks are just awesome, sometimes I forget about them 😄
Maybe we can just put this into the documentation? That way people can discover it.