Hello! Is the `yaml` Loader part of the `ConfigLoader` somewhat configurable in any meaningful way? Or does kedro implement its own `yaml` parsing mechanism? We're trying to use some custom filtering that gets passed to the `kedro.extras.datasets.dask.ParquetDataSet` `load_args`. Specifically, we want to be able to do something like:
```yaml
# catalog.yml
raw_data:
  type: dask.ParquetDataSet
  filepath: 's3://...'
  load_args:
    filters:
      - !!python/tuple ['year', '=', '2022']
      - !!python/tuple ['day', '=', '3']
      - !!python/tuple ['id', '=', 'someVal']
```
`dask` (via `filters`, see docs) supports row-filtering on loaded data this way, and `yaml` (via tuple support in `.yml` files) supports the above definition. However, `yaml` unfortunately only supports this with either the non-default `FullLoader` or the `UnsafeLoader` (for controlled environments, see here). Is it possible to configure the `ConfigLoader` to use either of these? An example use case for this would be to filter only the rows belonging to all `day = 3` partitions of any month in `year = 2022`. I could alternatively write a DataSet that parses this logic from plain string lists, but I was wondering if there's any existing support for something like this.
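For reference, the "plain string lists" alternative could look roughly like the sketch below. It assumes the catalog entry drops the `!!python/tuple` tags, so a safe YAML load yields each filter as a plain list, and the lists are converted back to tuples before being handed to dask. The helper names `to_filter_tuples` and `_is_predicate` are made up for illustration and are not part of kedro or dask.

```python
from typing import Any, List, Optional


def _is_predicate(item: Any) -> bool:
    """True if item looks like a single [column, op, value] filter."""
    return (
        isinstance(item, (list, tuple))
        and len(item) == 3
        and isinstance(item[0], str)
    )


def to_filter_tuples(raw: Optional[List[Any]]) -> Optional[List[Any]]:
    """Convert filters loaded from plain YAML lists into tuples.

    Accepts a flat list of predicates, e.g. [['year', '=', '2022'], ...],
    or a list of predicate groups, e.g. [[['year', '=', '2022'], ...], ...].
    """
    if raw is None:
        return None
    converted: List[Any] = []
    for item in raw:
        if _is_predicate(item):
            # Single predicate: list -> tuple
            converted.append(tuple(item))
        elif isinstance(item, (list, tuple)) and all(_is_predicate(p) for p in item):
            # Group of predicates: convert each member
            converted.append([tuple(p) for p in item])
        else:
            raise ValueError(f"Not a valid filter entry: {item!r}")
    return converted


# What a safe YAML load would produce for the untagged catalog entry
raw_filters = [["year", "=", "2022"], ["day", "=", "3"], ["id", "=", "someVal"]]
filters = to_filter_tuples(raw_filters)
```

A custom DataSet could then call this on `load_args["filters"]` in its load method before passing the result on as the `filters` argument when reading the parquet data.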
I came back to this after asking this morning, and it seems that https://github.com/kedro-org/kedro/issues/1011 is talking about exactly this use case. Unfortunately it is still an open issue, so it does not seem possible to do this at this time. Custom DataSet it is. : )
For anyone else stumbling on this: here's a workaround using a custom dataset, which was already mentioned in a Q&A discussion: https://github.com/kedro-org/kedro/discussions/973. I apologize for duplicating the discussion here; it seems my Google-fu was not up to par.