Filip Panovski
11/17/2022, 10:03 AM
Is the yaml Loader part of the ConfigLoader configurable in any meaningful way? Or does kedro implement its own yaml parsing mechanism? We're trying to pass some custom filtering to the load_args of kedro.extras.datasets.dask.ParquetDataSet. Specifically, we want to be able to do something like:
# catalog.yml
raw_data:
  type: dask.ParquetDataSet
  filepath: 's3://...'
  load_args:
    filters:
      - !!python/tuple ['year', '=', '2022']
      - !!python/tuple ['day', '=', '3']
      - !!python/tuple ['id', '=', 'someVal']
dask supports row filtering on loaded data this way (via filters, see docs), and yaml supports the above definition (via tuple support in .yml files). Unfortunately, yaml only supports this with either the non-default FullLoader or the UnsafeLoader (for controlled environments, see here). Is it possible to configure the ConfigLoader to use either of these?
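To make the loader difference concrete, here is a minimal sketch using PyYAML directly, outside of kedro. It only shows how the SafeLoader and the FullLoader treat the `!!python/tuple` tag; it makes no claim about how the ConfigLoader calls the parser internally. The variable names (`doc`, `safe_error`, `loaded`) are just for illustration.

```python
import yaml

# The same filter definition as in the catalog entry, reduced to the relevant part.
doc = """
filters:
  - !!python/tuple ['year', '=', '2022']
  - !!python/tuple ['day', '=', '3']
"""

# SafeLoader does not know the python/tuple tag and refuses to construct it.
safe_error = None
try:
    yaml.load(doc, Loader=yaml.SafeLoader)
except yaml.YAMLError as exc:
    safe_error = exc

# FullLoader resolves the tag and builds real Python tuples.
loaded = yaml.load(doc, Loader=yaml.FullLoader)

print("SafeLoader failed:", safe_error is not None)
print("FullLoader result:", loaded["filters"])
```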
An example use case for this would be to filter only the rows belonging to all day = 3 partitions of any month in year = 2022.
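As a rough illustration of what that filter is meant to select, here is a plain-Python sketch that does not call dask. It assumes a flat list of `(column, op, value)` tuples is combined with AND, and it uses made-up partition keys in a hive-style layout (`year=.../month=.../day=...`); the `partitions` list and the `matches` helper are hypothetical and only handle `=`.

```python
# Hypothetical partition keys, as they might appear in hive-style paths
# such as year=2022/month=1/day=3 (values kept as strings).
partitions = [
    {"year": "2022", "month": "1", "day": "3"},
    {"year": "2022", "month": "1", "day": "4"},
    {"year": "2022", "month": "2", "day": "3"},
    {"year": "2021", "month": "2", "day": "3"},
]

# A flat list of predicates, assumed to be combined with AND.
filters = [("year", "=", "2022"), ("day", "=", "3")]


def matches(partition, filters):
    """Return True if the partition satisfies every '=' predicate."""
    for column, op, value in filters:
        if op != "=":
            raise NotImplementedError(f"only '=' is sketched here, got {op!r}")
        if partition.get(column) != value:
            return False
    return True


# Keeps day 3 of every month in 2022 and drops everything else.
selected = [p for p in partitions if matches(p, filters)]
print(selected)
```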
I could alternatively write a DataSet that parses this logic from plain string lists, but I was wondering if there's any existing support for something like this.
Filip Panovski
11/17/2022, 2:47 PM
Filip Panovski
11/17/2022, 2:49 PM