Júlio Resende
10/18/2023, 12:14 AM
example_dataset:
  type: pandas.ExcelDataSet
  filepath: abfs://my_bucket/01_raw/example_dataset/${globals:raw_data_version}.xlsx
But now I have a new challenge. I need to code some tests to compare different versions of my data, but using the catalog shown above I can only get the version fixed by raw_data_version. Does anyone have any suggestions on how I can version my raw data?
I'm thinking about using PartitionedDataSet, so that my test pipeline can load all versions of the data, but I don't know if it would be a suitable solution.
marrrcin
10/18/2023, 6:50 AM
Júlio Resende
10/18/2023, 12:28 PM
marrrcin
10/18/2023, 12:36 PM
Júlio Resende
10/18/2023, 12:39 PM
marrrcin
10/18/2023, 1:12 PM
Júlio Resende
10/18/2023, 1:47 PM
marrrcin
10/18/2023, 1:51 PM
Pass raw_data_version and second_raw_data_version as parameters, use PartitionedDataset, and only load what's required inside of the node, something along the lines of:
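from typing import Callable, Dict

import pandas as pd


def load_data(
    partitioned_dataset: Dict[str, Callable[[], pd.DataFrame]],
    raw_data_version: str,
) -> pd.DataFrame:
    # PartitionedDataset yields {partition_id: load_function}; only the match is loaded.
    return next(
        load_partition()
        for name, load_partition in partitioned_dataset.items()
        if name.lower().startswith(raw_data_version.lower())
    )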
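For the comparison tests, the same idea can be stretched to several versions at once. This is only a rough sketch, assuming the node receives a dict of partition id to load function (which is what a partitioned dataset hands over in Kedro). The helper name load_versions, the fake_loader stand-in, and the partition ids and rows are all made up for illustration, and plain lists of dicts stand in for the DataFrames so the snippet runs without Kedro or pandas.

```python
from typing import Any, Callable, Dict, Iterable


def load_versions(
    partitions: Dict[str, Callable[[], Any]],
    versions: Iterable[str],
) -> Dict[str, Any]:
    """Load only the partitions whose id starts with one of the requested versions."""
    loaded = {}
    for version in versions:
        prefix = version.lower()
        for name, load_partition in partitions.items():
            if name.lower().startswith(prefix):
                loaded[version] = load_partition()  # the data is read only here
                break
        else:
            # No partition id matched this version prefix.
            raise KeyError(f"no partition found for version {version!r}")
    return loaded


# Hypothetical stand-in for the dict a partitioned dataset would pass to the node.
calls = []


def fake_loader(name, rows):
    def _load():
        calls.append(name)  # record which partitions were actually loaded
        return rows
    return _load


partitions = {
    "v1": fake_loader("v1", [{"id": 1, "value": 10}]),
    "v2": fake_loader("v2", [{"id": 1, "value": 10}, {"id": 2, "value": 20}]),
    "v3": fake_loader("v3", []),
}

data = load_versions(partitions, ["v1", "v2"])
# Rows present in the second version but not in the first.
new_rows = [row for row in data["v2"] if row not in data["v1"]]
```

Only the requested partitions are read, so v3 is never touched. One caveat with prefix matching: a version named v1 would also match a partition id like v10, so an exact comparison on the id may be safer if the version names can collide.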
Júlio Resende
10/18/2023, 2:38 PM