Fazil Topal
09/20/2024, 1:48 PM
Nok Lam Chan
09/20/2024, 2:15 PM
Fazil Topal
09/20/2024, 2:18 PM
data:
  type: pandas.ParquetDataset
  filepath: ....
  serialize_cls: projx.pipelines.MyClass
This new serialize_cls points to a custom data class that I create and that is stored in my dataframe. During the parquet save, I will invoke methods of my class so that I can write and read this parquet dataset.
Right now, only default data types work in parquet, so anything custom doesn't get serialized. Hence the solution.
Fazil Topal
09/20/2024, 2:19 PM
_serialize_ and _deserialize_ functions which get invoked pre/post load and save calls
Fazil Topal
09/20/2024, 2:20 PM
Fazil Topal
09/20/2024, 2:32 PM
data:
  type: pandas.ParquetDataset
  filepath: ....
  serializer: ... # omegaconf python code
  deserializer: ... # omegaconf python code
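The serializer/deserializer idea above can be sketched as a thin wrapper that runs one hook before save and another after load. This is only an illustration under stated assumptions, not Kedro's API: a list of row dicts stands in for the dataframe, an in-memory store stands in for the parquet file, and `MyClass` (with its fields), `InMemoryDataset`, `HookedDataset`, `serialize` and `deserialize` are all hypothetical names.

```python
from dataclasses import asdict, dataclass
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, Any]]
# Cell types the stand-in store accepts, mimicking "only default types work".
PRIMITIVES = (str, int, float, bool, dict, type(None))


@dataclass
class MyClass:
    """Hypothetical custom type stored in one column."""
    name: str
    score: float


class InMemoryDataset:
    """Stand-in for a file-backed dataset that rejects custom cell types."""

    def __init__(self) -> None:
        self._rows: Rows = []

    def save(self, rows: Rows) -> None:
        for row in rows:
            for key, value in row.items():
                if not isinstance(value, PRIMITIVES):
                    raise TypeError(
                        f"column {key!r}: cannot store {type(value).__name__}"
                    )
        self._rows = [dict(row) for row in rows]

    def load(self) -> Rows:
        return [dict(row) for row in self._rows]


class HookedDataset:
    """Runs `serializer` before save and `deserializer` after load."""

    def __init__(
        self,
        inner: InMemoryDataset,
        serializer: Callable[[Rows], Rows],
        deserializer: Callable[[Rows], Rows],
    ) -> None:
        self._inner = inner
        self._serializer = serializer
        self._deserializer = deserializer

    def save(self, rows: Rows) -> None:
        self._inner.save(self._serializer(rows))

    def load(self) -> Rows:
        return self._deserializer(self._inner.load())


def serialize(rows: Rows) -> Rows:
    # Turn each MyClass cell into a plain dict the store can hold.
    return [{**row, "item": asdict(row["item"])} for row in rows]


def deserialize(rows: Rows) -> Rows:
    # Rebuild MyClass instances from the stored dicts.
    return [{**row, "item": MyClass(**row["item"])} for row in rows]


dataset = HookedDataset(InMemoryDataset(), serialize, deserialize)
rows = [{"id": 1, "item": MyClass(name="a", score=0.5)}]
dataset.save(rows)
restored = dataset.load()
```

Saving `rows` directly into `InMemoryDataset` raises `TypeError`, while the wrapped dataset round-trips the custom objects. With a real dataframe the two hooks would transform the affected column instead of a list of dicts.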
Nok Lam Chan
09/20/2024, 2:56 PM
Nok Lam Chan
09/20/2024, 3:00 PM
Fazil Topal
09/20/2024, 3:02 PM
Fazil Topal
09/20/2024, 3:03 PM
Nok Lam Chan
09/20/2024, 3:06 PM
Nok Lam Chan
09/20/2024, 3:07 PM
serializer sounds like it belongs to save_args and deserializer belongs to load_args. The arguments available for a specific dataset usually mirror what the underlying API supports, which in this case is pd.read_parquet and pd.DataFrame.to_parquet.
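The point about load_args and save_args mirroring the underlying API can be illustrated with a small stand-in. In this sketch the standard-library `json` module plays the role of the underlying reader and writer, so the accepted arguments are exactly the keyword arguments of `json.loads` and `json.dumps`. `ArgsForwardingDataset`, `to_plain` and `from_plain` are hypothetical names, and this is not Kedro code.

```python
import json
from typing import Any, Dict, Optional


class ArgsForwardingDataset:
    """Hypothetical dataset that forwards its args to the underlying API."""

    def __init__(
        self,
        load_args: Optional[Dict[str, Any]] = None,
        save_args: Optional[Dict[str, Any]] = None,
    ) -> None:
        self._load_args = dict(load_args or {})
        self._save_args = dict(save_args or {})
        self._text = ""  # in-memory stand-in for a file on disk

    def save(self, data: Any) -> None:
        # save_args go straight to the writer, so only its keywords are valid.
        self._text = json.dumps(data, **self._save_args)

    def load(self) -> Any:
        # load_args go straight to the reader in the same way.
        return json.loads(self._text, **self._load_args)


def to_plain(obj: Any) -> Any:
    """Serializer hook: encode complex numbers as a tagged dict."""
    if isinstance(obj, complex):
        return {"__complex__": [obj.real, obj.imag]}
    raise TypeError(f"cannot serialize {type(obj).__name__}")


def from_plain(d: Dict[str, Any]) -> Any:
    """Deserializer hook: decode the tagged dict back into a complex number."""
    if "__complex__" in d:
        return complex(*d["__complex__"])
    return d


dataset = ArgsForwardingDataset(
    load_args={"object_hook": from_plain},
    save_args={"default": to_plain, "sort_keys": True},
)
dataset.save({"z": 1 + 2j, "a": 1})
restored = dataset.load()
```

Here the hooks fit under save_args and load_args only because `json.dumps` and `json.loads` happen to accept `default` and `object_hook`. A key that the underlying function does not accept would fail when forwarded like this, which is why a dataset-level serializer option would need its own handling.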
Fazil Topal
09/20/2024, 3:42 PM
Nok Lam Chan
09/20/2024, 3:46 PM
Fazil Topal
09/20/2024, 4:08 PM
Fazil Topal
09/20/2024, 4:09 PM
Fazil Topal
09/20/2024, 4:10 PM