Filip Panovski
11/17/2022, 10:03 AM
Is the yaml Loader part of the ConfigLoader somewhat configurable in any meaningful way? Or does kedro implement its own yaml parsing mechanism? We're trying to use some custom filtering that gets passed to the load_args of kedro.extras.datasets.dask.ParquetDataSet. Specifically, we want to be able to do something like:
# catalog.yml
raw_data:
  type: dask.ParquetDataSet
  filepath: 's3://...'
  load_args:
    filters:
      - !!python/tuple ['year', '=', '2022']
      - !!python/tuple ['day', '=', '3']
      - !!python/tuple ['id', '=', 'someVal']
dask (via filters, see docs) supports row-filtering on loaded data this way, and yaml (via tuple support in .yml files) supports the above definition. However, yaml unfortunately only supports this with either the non-default FullLoader or the UnsafeLoader (for controlled environments, see here). Is it possible to configure the ConfigLoader to use either of these?
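To make the loader difference concrete, here is a minimal sketch using PyYAML directly, outside of kedro. It assumes PyYAML 5.1 or later is installed; the inline document and the variable names are made up for illustration and say nothing about how the ConfigLoader itself calls yaml.

```python
import yaml

# Illustrative document, mirroring the filters section of the catalog entry.
doc = """
filters:
  - !!python/tuple ['year', '=', '2022']
  - !!python/tuple ['day', '=', '3']
"""

# SafeLoader has no constructor registered for the python/tuple tag,
# so loading the document raises a YAMLError subclass.
try:
    yaml.load(doc, Loader=yaml.SafeLoader)
    safe_ok = True
except yaml.YAMLError:
    safe_ok = False

# FullLoader knows the python/tuple tag and builds real tuples.
full = yaml.load(doc, Loader=yaml.FullLoader)

print(safe_ok)
print(full["filters"])
```

UnsafeLoader would also accept the tag, but it can construct arbitrary Python objects, which is why it is only suitable for trusted input.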
An example use case for this would be to filter only the rows belonging to all day = 3 partitions of any month in year = 2022.
I could alternatively write a DataSet that parses this logic from plain string lists, but I was wondering if there's any existing support for something like this.
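For the fallback mentioned above, a rough sketch of the conversion step might look like the following. The helper names to_filter_tuples and normalise_load_args are hypothetical, not part of kedro or dask; the idea is that the catalog holds plain YAML lists such as ['year', '=', '2022'], which any safe loader can read, and a custom DataSet converts them to tuples before passing load_args on. Only a flat list of filters is handled here, not nested lists.

```python
from typing import Any, Dict, List, Optional, Tuple

Filter = Tuple[str, str, Any]


def to_filter_tuples(raw: Optional[List[List[Any]]]) -> Optional[List[Filter]]:
    """Turn [[column, op, value], ...] into [(column, op, value), ...]."""
    if raw is None:
        return None
    filters: List[Filter] = []
    for item in raw:
        if len(item) != 3:
            raise ValueError(f"expected [column, op, value], got {item!r}")
        column, op, value = item
        filters.append((column, op, value))
    return filters


def normalise_load_args(load_args: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of load_args with 'filters' converted to tuples."""
    args = dict(load_args)
    if "filters" in args:
        args["filters"] = to_filter_tuples(args["filters"])
    return args


# What a safe YAML loader would produce for a catalog entry with plain lists.
raw_load_args = {
    "filters": [["year", "=", "2022"], ["day", "=", "3"], ["id", "=", "someVal"]]
}
load_args = normalise_load_args(raw_load_args)
print(load_args["filters"])
```

A custom DataSet could call normalise_load_args on its load_args in its constructor and otherwise defer to the existing ParquetDataSet behaviour; that wiring is left out here because it depends on the kedro version in use.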