Fazil Topal
09/20/2024, 1:48 PM
Nok Lam Chan
09/20/2024, 2:15 PM
Fazil Topal
09/20/2024, 2:18 PM
data:
  type: pandas.ParquetDataset
  filepath: ....
  serialize_cls: projx.pipelines.MyClass
This new serialize_cls points to a custom data class that I create and that appears in my dataframe. During the parquet save, I will invoke methods of my class so that I can write and read this parquet dataset.
Right now, only the default data types work in parquet, so anything custom doesn't get serialized. Hence the solution
Fazil Topal
09/20/2024, 2:19 PM
_serialize_ and _deserialize_ functions, which get invoked around the save and load calls (serialize before save, deserialize after load)
Fazil Topal
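A minimal sketch of the serialize/deserialize idea described above, using only the standard library. Everything here is an assumption for illustration: the fields on `MyClass`, the method names `serialize` and `deserialize`, and the helpers `before_save` and `after_load` are hypothetical and are not part of Kedro or pandas. Rows are modelled as plain dicts; with a real DataFrame the same conversion would be applied to the affected column before the parquet write and after the parquet read.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class MyClass:
    """Hypothetical custom type stored in one column of the data."""
    name: str
    weight: float

    def serialize(self) -> str:
        # Turn the object into a JSON string, a type parquet can store natively.
        return json.dumps(asdict(self))

    @classmethod
    def deserialize(cls, raw: str) -> "MyClass":
        # Rebuild the object from the JSON string written by serialize().
        return cls(**json.loads(raw))


def before_save(rows, column):
    """Hypothetical hook run before the save call: objects become strings."""
    return [{**row, column: row[column].serialize()} for row in rows]


def after_load(rows, column, cls):
    """Hypothetical hook run after the load call: strings become objects."""
    return [{**row, column: cls.deserialize(row[column])} for row in rows]


rows = [{"id": 1, "item": MyClass("a", 1.5)}]
stored = before_save(rows, "item")          # what would be written to parquet
restored = after_load(stored, "item", MyClass)  # what the node would receive
```

The stored form contains only strings and numbers, so the default parquet types are enough, and the round trip gives back objects equal to the originals.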
09/20/2024, 2:20 PM
Fazil Topal
09/20/2024, 2:32 PM
data:
  type: pandas.ParquetDataset
  filepath: ....
  serializer: ... # omegaconf python code
  deserializer: ... # omegaconf python code
Nok Lam Chan
09/20/2024, 2:56 PM
Nok Lam Chan
09/20/2024, 3:00 PM
Fazil Topal
09/20/2024, 3:02 PM
Fazil Topal
09/20/2024, 3:03 PM
Nok Lam Chan
09/20/2024, 3:06 PM
Nok Lam Chan
09/20/2024, 3:07 PM
serialiser sounds like it belongs to save_args and deserialiser belongs to load_args. The arguments available for a specific dataset usually mirror what the underlying API supports, which in this case is pd.read_parquet and pd.DataFrame.to_parquet
Fazil Topal
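The point about load_args and save_args mirroring the underlying API can be sketched as follows. This is not Kedro code: `JSONFileDataset` is a hypothetical stand-in that uses the standard library's `json.dumps` and `json.loads` as the "underlying API" so the example runs without extra dependencies. For the parquet dataset discussed here, the corresponding calls would be pd.read_parquet and DataFrame.to_parquet, and only keyword arguments those functions accept would be valid.

```python
import json
import tempfile
from pathlib import Path


class JSONFileDataset:
    """Hypothetical stand-in dataset; not a Kedro class."""

    def __init__(self, filepath, load_args=None, save_args=None):
        self._filepath = Path(filepath)
        self._load_args = load_args or {}
        self._save_args = save_args or {}

    def save(self, data):
        # save_args are forwarded unchanged to the underlying writer.
        self._filepath.write_text(json.dumps(data, **self._save_args))

    def load(self):
        # load_args are forwarded unchanged to the underlying reader.
        return json.loads(self._filepath.read_text(), **self._load_args)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.json"
    ds = JSONFileDataset(
        path,
        save_args={"sort_keys": True},   # a json.dumps keyword
        load_args={"parse_float": str},  # a json.loads keyword
    )
    ds.save({"b": 1.5, "a": 2})
    text = path.read_text()
    loaded = ds.load()
```

Because the dataset only forwards keywords, an option such as a serialiser is available through save_args only if the underlying writer itself accepts it; otherwise it needs a custom dataset or a hook around the save and load calls.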
09/20/2024, 3:42 PM
Nok Lam Chan
09/20/2024, 3:46 PM
Fazil Topal
09/20/2024, 4:08 PM
Fazil Topal
09/20/2024, 4:09 PM
Fazil Topal
09/20/2024, 4:10 PM