# questions
Solomon
Hiya, trying to figure out params for data processing pipelines. I'd like to set parameters for the catalog config so that catalog.load() can load a dataset with `load_args: dtype: dtypes_dict_var`, like:
```yaml
my_dataset:
  type: pandas.CSVDataSet
  filepath: path-to-my-file.csv
  load_args:
    parse_dates: ['col_3']
    dtype: dtypes_dict_var
```
That way catalog.yml won't grow too long. I'd like this dtype dict to live within conf/base/parameters/my_pipeline.yml, as:
```yaml
dtypes_dict_var: {
  "col_1": int,
  "col_2": str,
  "col_3": DateTime<'Y-m-d'>, # assumes YAML API syntax will be converted to datetime object
}
```
Another question here is how to pass a datetime object type to `load_args: dtype`. I'd like this dtype dict to affect only the loading of my_dataset, and not be used as a global var if possible. A separate case could be that I'd like to load the same dataset with different dtypes in different pipelines, which could utilise TemplatedConfigLoader. Passing in certain parameters doesn't seem very straightforward tbh :/ Thanks in advance!
Nok
For your 1st question, two possible solutions I can think of now: 1. Use a native YAML anchor/alias, though this means `dtypes_dict_var` will live in the same catalog.yml. 2. Use TemplatedConfigLoader, though this means it will be a global var. Can I ask why you want to do this? It seems like you are moving part of the dataset definition to another file; is that because you only want to overwrite part of the dataset definition? IMO parameters should be something that gets passed into nodes, while configuration about a dataset should live in catalog.yml instead.
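For illustration, option 1 with a native YAML anchor might be sketched like this (the `_dtypes` name is illustrative; Kedro's DataCatalog skips top-level keys starting with an underscore, so the helper entry is not treated as a dataset):
```yaml
# catalog.yml -- sketch only; "_dtypes" is an illustrative name.
# Keys starting with "_" are ignored by the DataCatalog, so this
# entry only exists to be aliased below.
_dtypes: &dtypes
  col_1: "int64"
  col_2: "str"

my_dataset:
  type: pandas.CSVDataSet
  filepath: path-to-my-file.csv
  load_args:
    parse_dates: ['col_3']
    dtype: *dtypes
```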
Merel
Nok is right about the options for your first question. One other thing you could do is use the new OmegaConfigLoader, which we released on Tuesday. It allows you to do variable interpolation for parameters with omegaconf. See our docs for more info: https://kedro.readthedocs.io/en/stable/kedro_project_setup/configuration.html#templating-for-parameters
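For example, a parameters file could use omegaconf's `${...}` interpolation along these lines (a minimal sketch; the key names are illustrative):
```yaml
# conf/base/parameters/my_pipeline.yml -- sketch only
dtypes_dict_var:
  col_1: "int64"
  col_2: "str"

preprocessing_options:
  dtypes: ${dtypes_dict_var}  # resolved when parameters are loaded
```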
Solomon
Thanks for your solutions Nok and Merel. I do want to overwrite part of the dataset definition, e.g. selected columns to parse for parquet, slightly different naming for usecols, or even possibly changing dtypes on parse for different pipelines. But I'm happy to explore the options suggested :) There are a few columns whose parsed type I'd like to restrict. With read_excel and openpyxl, if you have an overflowing integer (>Int64) in a numeric column, by default it could be parsed as float, which gets converted to scientific notation (e+xx), and the column would end up either erroring (cannot parse string as int/float) or as a string column. It's fine to just set the column type as str in a `dtypes_dict_var`, albeit at the cost of increased memory usage. Extending from this, there are other types/formats I'd like to strictly enforce on read, e.g. DateTime<'Y-m-d'> (YAML syntax, but this errors; this relates to my other question of passing datetime or other object types such as numpy dtypes into this `dtypes_dict_var`). Furthermore, if I'd like to use the pyarrow engine in ParquetDataSet for loading, and pass options to the engine, I may also need to pass in pyarrow objects. (The openpyxl parser problem is best dealt with outside of here, and ideally I wouldn't have to deal with this if I didn't receive a data dump in Excel...)
Is it possible to overload the bound method that gets "lazily loaded" in the values of the resulting dict(str, Callable) from catalog.load(), and pass in config using from_config of the `AbstractDataSet`/`AbstractVersionedDataSet` classes, or pass arguments into the Callable directly?
Nok
This could probably be better solved with an OmegaConfigLoader resolver in the future; you could potentially create any class/object from the config. Note: this is just something I hacked together now, and you can also create a custom OmegaConfigLoader to achieve something similar.
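Along those lines, a minimal sketch of a custom resolver (the `np_dtype` resolver name is invented for illustration; `OmegaConf.register_new_resolver` is omegaconf's real API):
```python
import numpy as np
from omegaconf import OmegaConf

# Register a resolver that turns a dtype name string into an actual
# numpy dtype object when the config is resolved.
OmegaConf.register_new_resolver("np_dtype", lambda name: np.dtype(name))

cfg = OmegaConf.create(
    {
        "load_args": {
            "dtype": {
                "col_1": "${np_dtype:int64}",
                "col_3": "${np_dtype:'datetime64[ns]'}",
            }
        }
    }
)

# A full-string interpolation resolves to whatever object the resolver
# returned, so this is a real numpy dtype, not a string.
assert cfg.load_args.dtype.col_1 == np.dtype("int64")
```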
🌟 1
💯 1
Solomon
Thank you so much! I wonder if there's ongoing work/plans to integrate these extra types, possibly as part of extras?
Nok
What do you mean by “part of extras”?
Solomon
As in like a plugin for kedro or kedro.extras. Just wondering if there have been others with a similar use-case; happy to put in an issue on GH (as should be the practice) 😆
Merel
@Solomon Yu are you referring to the resolvers? Because those are a feature of OmegaConf and we'd advise people to just create their custom ones themselves
Solomon
In this context yes. I'd imagine there would be some use case for e.g. passing pyarrow objects like make_write_options into the kwargs of pandas.read_parquet()? Maybe there's a point where certain resolvers become repeated boilerplate that could be integrated as a plugin/extra? (A sketch of that idea follows below.)
Anyhow this has been massively helpful. Thanks everyone!
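For concreteness, a resolver along the lines Solomon describes might look like this (a hypothetical sketch: the `pa_write_options` resolver name is invented, though pyarrow's `ParquetFileFormat.make_write_options` is a real API):
```python
import pyarrow.dataset as ds
from omegaconf import OmegaConf

# Hypothetical resolver that builds a real pyarrow write-options
# object from a compression codec name given in the config.
OmegaConf.register_new_resolver(
    "pa_write_options",
    lambda compression: ds.ParquetFileFormat().make_write_options(
        compression=compression
    ),
)

cfg = OmegaConf.create({"file_options": "${pa_write_options:snappy}"})
opts = cfg.file_options  # a pyarrow ParquetFileWriteOptions instance
```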
👍 1
Merel
That’s an interesting idea. To be honest we haven’t thought too much about this yet, because we only released the OmegaConfigLoader on Tuesday 🙂 I expect in the 0.19.X series we’ll add a lot more features to it.
🦜 1
👍 1
Nok
@Solomon Yu Feel free to open an issue on GH; this feature is relatively new and any thoughts are welcome!