# questions
Hey guys! I'm working on a project in which I receive a new version of my raw data every week and store it on remote storage. The raw data is cumulative, that is, every week I receive the old data I already have plus some new records. It is important for me to keep the version of my data in sync with my code (i.e. I need to know which version of the data is used by the code in each commit). To do this, I used a global variable holding the data version and referenced it in the file path, as follows:
```yaml
type: pandas.ExcelDataSet
filepath: abfs://my_bucket/01_raw/example_dataset/${globals:raw_data_version}.xlsx
```
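For context, the templated value above would normally be defined in a globals file; a minimal sketch, assuming the usual conf/base/globals.yml location (the value itself is only illustrative):

```yaml
# conf/base/globals.yml (assumed location; the value is just an example)
raw_data_version: "2023_week_42"
```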
But now I have a new challenge. I need to write some tests that compare different versions of my data, but with the catalog entry shown above I can only load the version pinned by raw_data_version. Does anyone have suggestions on how I can version my raw data? I'm thinking about using PartitionedDataSet so that my test pipeline can load all versions of the data, but I don't know whether that would be a suitable solution.
Are those “different versions” also required to be in sync with the code (committed to the repo)?
yes, they are
That was the reason I didn't use Kedro Versioned Datasets
What about having two data catalog entries then? One for “current”, one for “different”?
This is one solution. I would need to duplicate everything. Do you think it is better than using Partitioned Dataset?
What do you mean by everything?
All my versioned raw datasets.
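To make that trade-off concrete, a hedged sketch of what two parallel entries might look like; the entry names and the second globals key are assumptions, not anything from the thread:

```yaml
example_dataset_current:
  type: pandas.ExcelDataSet
  filepath: abfs://my_bucket/01_raw/example_dataset/${globals:raw_data_version}.xlsx

# Hypothetical second entry pointing at the version to compare against
example_dataset_comparison:
  type: pandas.ExcelDataSet
  filepath: abfs://my_bucket/01_raw/example_dataset/${globals:comparison_data_version}.xlsx
```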
In that case, you can have the partitioned dataset and the raw_data_version as node inputs, use the version to pick the right partition(s), and only load what's required inside the node, something along the lines of:
```python
from typing import Dict, List
import pandas as pd
from kedro.io import AbstractDataSet

def load_data(
    partitioned_dataset: Dict[str, AbstractDataSet], raw_data_version: str
) -> List[pd.DataFrame]:
    # Load only the partitions whose names match the requested version
    return [
        ds.load()
        for name, ds in partitioned_dataset.items()
        if name.lower().startswith(raw_data_version)
    ]
```
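For completeness, a rough sketch of how such a node could be wired up, assuming the hypothetical example_dataset_all_versions entry sketched earlier and a params:raw_data_version entry in the parameters; none of these names come from the thread:

```python
from kedro.pipeline import Pipeline, node

from .nodes import load_data  # wherever the function above is defined


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(
                func=load_data,
                inputs=["example_dataset_all_versions", "params:raw_data_version"],
                outputs="selected_raw_data",
            )
        ]
    )
```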
Yes, I think this will be the most appropriate solution. Thanks, @marrrcin!
😎 1