# questions
Solomon
Hiya, trying to figure out params for data processing pipelines. I'd like to set parameters for the catalog config so that catalog.load() can load a dataset with `load_args: dtype: dtypes_dict_var`, like:
```yaml
my_dataset:
  type: pandas.CSVDataSet
  filepath: path-to-my-file.csv
  load_args:
    parse_dates: ['col_3']
    dtype: dtypes_dict_var
```
That way catalog.yml won't grow too long. I'd like this dtype dict to live within conf/base/parameters/my_pipeline.yml, as:
```yaml
dtypes_dict_var: {
  "col_1": int,
  "col_2": str,
  "col_3": DateTime<'Y-m-d'>, # assumes YAML API syntax will be converted to datetime object
}
```
Another question here is how to pass a datetime object type to `load_args: dtype`. I'd like this dtype dict to affect only the loading of my_dataset, and not be used as a global var if possible. A separate case could be that I'd like to load the same dataset with different dtypes in different pipelines, which could utilise TemplatedConfigLoader. Passing in certain parameters doesn't seem very straightforward tbh :/ Thanks in advance!
Nok
For your 1st question, two possible solutions I can think of now: 1. Use a native YAML anchor/alias, though this means `dtypes_dict_var` will live in the same catalog.yml. 2. Use TemplatedConfigLoader, though this means it will be a global var. Can I ask why you want to do this? It seems like you are moving part of the dataset definition to another file; is that because you only want to overwrite part of the dataset definition? IMO parameters should be something that gets passed into nodes, while configuration about a dataset should live in catalog.yml instead.
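For illustration, option 1 with a native YAML anchor might be sketched like this (the `_dtypes` name is illustrative; Kedro's DataCatalog skips top-level keys starting with an underscore, so the helper entry is not treated as a dataset):
```yaml
# catalog.yml -- sketch only; "_dtypes" is an illustrative name.
# Keys starting with "_" are ignored by the DataCatalog, so this
# entry only exists to be aliased below.
_dtypes: &dtypes
  col_1: "int64"
  col_2: "str"

my_dataset:
  type: pandas.CSVDataSet
  filepath: path-to-my-file.csv
  load_args:
    parse_dates: ['col_3']
    dtype: *dtypes
```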
Merel
Nok is right about the options for your first question. One other thing you could do is use the new OmegaConfigLoader, which we released on Tuesday. It allows you to do variable interpolation for parameters with omegaconf. See our docs for more info: https://kedro.readthedocs.io/en/stable/kedro_project_setup/configuration.html#templating-for-parameters
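For example, a parameters file could use omegaconf's `${...}` interpolation along these lines (a minimal sketch; the key names are illustrative):
```yaml
# conf/base/parameters/my_pipeline.yml -- sketch only
dtypes_dict_var:
  col_1: "int64"
  col_2: "str"

preprocessing_options:
  dtypes: ${dtypes_dict_var}  # resolved when parameters are loaded
```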
Solomon
Thanks for your solutions Nok and Merel. I do want to overwrite part of the dataset definition, e.g. selected columns to parse for parquet, slightly different naming for usecols, or even possibly changing dtypes on parse for different pipelines. But I'm happy to explore the options suggested :) There are a few columns whose parsed type I'd like to restrict. With read_excel and openpyxl, if you have an overflowing integer (>Int64) in a numeric column, by default it could be parsed as float, which gets converted to scientific notation (e+xx), and the column would end up either erroring (cannot parse string as int/float) or as a string column. It's fine to just set the column type as str in a `dtypes_dict_var`, albeit at the cost of increased memory usage. Extending from this, there are other types/formats I'd like to strictly enforce on read, e.g. DateTime<'Y-m-d'> (YAML syntax, but this errors; this relates to my other question of passing datetime or other object types such as numpy dtypes into this `dtypes_dict_var`). Furthermore, if I'd like to use the pyarrow engine in ParquetDataSet for loading, and pass options to the engine, I may also need to pass in pyarrow objects. (The openpyxl parser problem is best dealt with outside of here, and ideally I wouldn't have to deal with this if I didn't receive a data dump in Excel...)
Is it possible to overload the bound method that gets "lazily loaded" in the values of the resulting dict(str, Callable) from catalog.load(), and pass in config using from_config of the `AbstractDataSet`/`AbstractVersionedDataSet` classes, or pass arguments into the Callable directly?
Nok
This could probably be better solved with an OmegaConfigLoader resolver in the future; you could potentially create any class/object from the config. Note: this is just something I hacked together now, and you can also create a custom OmegaConfigLoader to achieve something similar.
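Along those lines, a minimal sketch of a custom resolver (the `np_dtype` resolver name is invented for illustration; `OmegaConf.register_new_resolver` is omegaconf's real API):
```python
import numpy as np
from omegaconf import OmegaConf

# Register a resolver that turns a dtype name string into an actual
# numpy dtype object when the config is resolved.
OmegaConf.register_new_resolver("np_dtype", lambda name: np.dtype(name))

cfg = OmegaConf.create(
    {
        "load_args": {
            "dtype": {
                "col_1": "${np_dtype:int64}",
                "col_3": "${np_dtype:'datetime64[ns]'}",
            }
        }
    }
)

# A full-string interpolation resolves to whatever object the resolver
# returned, so this is a real numpy dtype, not a string.
assert cfg.load_args.dtype.col_1 == np.dtype("int64")
```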
🌟 1
💯 1
Solomon
Thank you so much! I wonder if there's ongoing work/plans to integrate these extra types, possibly as part of extras?
Nok
What do you mean by “part of extras”?
Solomon
As in like a plugin for kedro or kedro.extras. Just wondering if there have been others with a similar use-case; happy to put in an issue on GH (as should be the practice) 😆
Merel
@Solomon Yu are you referring to the resolvers? Because those are a feature of OmegaConf and we'd advise people to just create their custom ones themselves
Solomon
In this context yes. I'd imagine there would be some use case for e.g. passing pyarrow objects like make_write_options into the kwargs of pandas.read_parquet()? Maybe there's a point where certain resolvers become repeated boilerplate that could be integrated as a plugin/extra? (A sketch of that idea follows below.)
Anyhow this has been massively helpful. Thanks everyone!
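For concreteness, a resolver along the lines Solomon describes might look like this (a hypothetical sketch: the `pa_write_options` resolver name is invented, though pyarrow's `ParquetFileFormat.make_write_options` is a real API):
```python
import pyarrow.dataset as ds
from omegaconf import OmegaConf

# Hypothetical resolver that builds a real pyarrow write-options
# object from a compression codec name given in the config.
OmegaConf.register_new_resolver(
    "pa_write_options",
    lambda compression: ds.ParquetFileFormat().make_write_options(
        compression=compression
    ),
)

cfg = OmegaConf.create({"file_options": "${pa_write_options:snappy}"})
opts = cfg.file_options  # a pyarrow ParquetFileWriteOptions instance
```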
👍 1
Merel
That’s an interesting idea. To be honest we haven’t thought too much about this yet, because we only released the OmegaConfigLoader on Tuesday 🙂 I expect in the 0.19.X series we’ll add a lot more features to it.
🦜 1
👍 1
Nok
@Solomon Yu Feel free to open an issue on GH; this feature is relatively new and any thoughts are welcome!