# questions
Hey guys! I'm working on a project in which I receive a new version of my raw data every week and store it on remote storage. The raw data is cumulative, that is, every week I receive the old data I already have plus some new records. It is important for me to keep the version of my data in sync with my code (i.e. I need to know which version of the data is used by the code in each commit). To do this, I used a global variable holding the data version and referenced it in the file path, as follows:
```yaml
type: pandas.ExcelDataSet
filepath: abfs://my_bucket/01_raw/example_dataset/${globals:raw_data_version}.xlsx
```
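For context, the templated value above would normally be defined in a globals file; a minimal sketch, assuming the usual conf/base/globals.yml location (the value itself is only illustrative):

```yaml
# conf/base/globals.yml (assumed location; the value is just an example)
raw_data_version: "2023_week_42"
```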
But now I have a new challenge. I need to write some tests that compare different versions of my data, but with the catalog entry shown above I can only load the version pinned by raw_data_version. Does anyone have suggestions on how I can version my raw data? I'm thinking about using PartitionedDataSet so that my test pipeline can load all versions of the data, but I don't know whether that would be a suitable solution.
Are those “different versions” also required to be in sync with the code (committed to the repo)?
yes, they are
That was the reason I didn't use Kedro Versioned Datasets
What about having two data catalog entries then? One for “current”, one for “different”?
This is one solution. I would need to duplicate everything. Do you think it is better than using Partitioned Dataset?
What do you mean by everything?
All my versioned raw datasets.
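To make that trade-off concrete, a hedged sketch of what two parallel entries might look like; the entry names and the second globals key are assumptions, not anything from the thread:

```yaml
example_dataset_current:
  type: pandas.ExcelDataSet
  filepath: abfs://my_bucket/01_raw/example_dataset/${globals:raw_data_version}.xlsx

# Hypothetical second entry pointing at the version to compare against
example_dataset_comparison:
  type: pandas.ExcelDataSet
  filepath: abfs://my_bucket/01_raw/example_dataset/${globals:comparison_data_version}.xlsx
```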
In that case, you can have the partitioned dataset and the raw_data_version as node inputs, use the version to pick the right partition(s), and only load what's required inside the node, something along the lines of:
```python
from typing import Dict, List
import pandas as pd
from kedro.io import AbstractDataSet

def load_data(
    partitioned_dataset: Dict[str, AbstractDataSet], raw_data_version: str
) -> List[pd.DataFrame]:
    # Load only the partitions whose names match the requested version
    return [
        ds.load()
        for name, ds in partitioned_dataset.items()
        if name.lower().startswith(raw_data_version)
    ]
```
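For completeness, a rough sketch of how such a node could be wired up, assuming the hypothetical example_dataset_all_versions entry sketched earlier and a params:raw_data_version entry in the parameters; none of these names come from the thread:

```python
from kedro.pipeline import Pipeline, node

from .nodes import load_data  # wherever the function above is defined


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(
                func=load_data,
                inputs=["example_dataset_all_versions", "params:raw_data_version"],
                outputs="selected_raw_data",
            )
        ]
    )
```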
Yes, I think this will be the most appropriate solution. Thanks, @marrrcin!
😎 1