Hey guys I m working on a project in wich I receive a new ve Kedro #questions

Hey guys! I'm working on a project in wich I recei...

Júlio Resende

10/18/2023, 12:14 AM

Hey guys! I'm working on a project in wich I receive a new version of my raw data every week and store it on remote storage. The raw data is cumulative, that is, every week I receive the old data that I already have plus some new records. It is important for me to synchronize the version of my data with my code (i.e. I need to know which version of my data is being used in the code in each commit). To do this, I used a global var with the data version and used it in the file path, as follows:

Copy code

example_dataset:
    type: pandas.ExcelDataSet
    filepath: <abfs://my_bucket/01_raw/example_dataset/${globals:raw_data_version}.xlsx>

But now I have a new challenge. I need to code some tests to compare different versions of my data, but using the catalog shown above I can only get the version fixed by raw_data_version. Does anyone have any suggestions on how I can version my raw data? I'm thinking about using PartitionedDataSet, so that my test pipeline can load all versions of the data, but I don't know if it would be a suitable solution.

marrrcin

10/18/2023, 6:50 AM

Are those “different versions” also required to be in-sync with the code (commited to repo)?

Júlio Resende

10/18/2023, 12:28 PM

yes, they are

Júlio Resende

10/18/2023, 12:29 PM

That was the reason I didn't use Kedro Versioned Datasets

marrrcin

10/18/2023, 12:36 PM

What about having two data catalog entries then? One for “current” one for “different”?

Júlio Resende

10/18/2023, 12:39 PM

This is one solution. I would need to duplicate everything. Do you think it is better than using Partitioned Dataset?

marrrcin

10/18/2023, 1:12 PM

What do you mean by everything?

Júlio Resende

10/18/2023, 1:47 PM

All my versioned raw datasets.

marrrcin

10/18/2023, 1:51 PM

In that case you can have the

raw_data_version

and the

second_raw_data_version

as parameters, use

PartitionedDataset

and only load what’s required inside of the node, sth in the notion of:

Copy code

def load_data(
    partitioned_dataset: Dict[str, AbstractDataSet], raw_data_version: str
) -> List[pd.DataFrame]:
    return next(
        ds.load()
        for name, ds in partitioned_dataset.items()
        if name.lower().startswith(raw_data_version)
    )

Júlio Resende

10/18/2023, 2:38 PM

Yes, I think this will be the most appropriate solution. Thanks, @marrrcin!

😎 1

3 Views

Open in Slack

Previous Next