# questions
a
Hi. I have created a custom TrinoSQLDataSet, that:
1. Reads a query from a SQL file
2. Injects some parameters into the query, e.g. `{start_time_utc}` in the query is replaced by the parameter
```yaml
# catalog.yml
raw-query-dataset:
  type: <my-package>.io.TrinoSQLDataSet
  query_param_names:
    - start_time_utc
```
What the dataset does with `query_param_names` is to look for each of those keys in the Kedro parameters and inject them into the query, so essentially this is a parameterised query
```yaml
# parameters.yml
start_time_utc: 2023-01-01 00:00:00
```
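For context, a minimal sketch of what that injection step could look like inside such a dataset (a hypothetical helper, not the actual implementation; it assumes `str.format`-style `{placeholder}` tokens in the SQL file):
```python
from typing import Any, Dict, List


def _render_query(sql_path: str, kedro_params: Dict[str, Any], param_names: List[str]) -> str:
    """Read the SQL file and substitute {placeholder} tokens with Kedro parameters."""
    with open(sql_path, encoding="utf-8") as f:
        query = f.read()
    # keep only the parameters declared under `query_param_names` in the catalog
    selected = {name: kedro_params[name] for name in param_names}
    return query.format(**selected)  # e.g. replaces {start_time_utc}
```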
In the dataset class itself, this is how I load the Kedro parameters:
```python
import os
from typing import Any, Dict
from kedro.config import ConfigLoader
from kedro.framework.project import settings

def _load_kedro_params() -> Dict[str, Any]:
    conf_path = str(settings.CONF_SOURCE)
    # get kedro run env to load the params from the correct conf dir
    env = os.getenv("KEDRO_ENV", "local")
    conf_loader = ConfigLoader(conf_source=conf_path, default_run_env=env)
    return conf_loader.get("parameters*", "parameters*/**")
```
This means the yml file is read, but if I run the pipeline using
```bash
kedro run --params "start_time_utc: 2023-03-01 13:00:00"
```
The dataset still loads the parameter from the file… Is there any way to access the merged dictionary of parameters, i.e. the file values overwritten by the CLI extra/runtime params? I cannot find a way to access the current session/context anymore
d
So the datasets shouldn’t be aware of IO
the right way to do all of this is to use a `before_pipeline_run` hook and pass the parameter to your catalog entry that way
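For illustration, a minimal sketch of such a hook, assuming the custom dataset keeps its parameters in a hypothetical `query_params` attribute and that the catalog entry is named `raw-query-dataset` as above:
```python
from typing import Any, Dict

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog


class InjectQueryParamsHook:
    """Pass values given via `kedro run --params ...` on to the custom dataset."""

    @hook_impl
    def before_pipeline_run(self, run_params: Dict[str, Any], catalog: DataCatalog) -> None:
        # run_params["extra_params"] holds the runtime params supplied on the CLI
        extra_params = run_params.get("extra_params") or {}
        # _get_dataset is technically private API; used here for brevity
        dataset = catalog._get_dataset("raw-query-dataset")
        # hypothetical attribute on the custom dataset holding its query parameters
        dataset.query_params = {**getattr(dataset, "query_params", {}), **extra_params}
```
The hook would then be registered in `settings.py` via `HOOKS = (InjectQueryParamsHook(),)`.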
a
> So the datasets shouldn’t be aware of IO
Not quite sure what you mean by that. Limiting query size at runtime is preferable to filtering later
d
by design you shouldn’t be loading your own config loader
the Hooks mechanism is the right way to integrate into the Kedro run lifecycle
this comes up a lot and there will be examples in the Slack archive too: https://linen-slack.kedro.org/
a
> the right way to do all of this is to use a `before_pipeline_run` hook and pass the parameter to your catalog entry that way
ok, that would work 🙏 any example for adding onto the catalog?
you could also do the environment variable way
as you can pull those using omegaconf easily
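For reference, a small OmegaConf-only snippet showing its built-in `oc.env` resolver pulling an environment variable with a default; whether you can interpolate env vars directly in your Kedro conf files depends on your config loader setup:
```python
import os

from omegaconf import OmegaConf

os.environ["START_TIME_UTC"] = "2023-03-01 13:00:00"
# oc.env reads an environment variable; an optional default follows the comma
cfg = OmegaConf.create({"start_time_utc": "${oc.env:START_TIME_UTC,'2023-01-01 00:00:00'}"})
print(cfg.start_time_utc)  # -> 2023-03-01 13:00:00
```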
n
Could it be achieved better with a custom resolver?
y
I don't see why there is a need for a custom dataset unless I'm missing something. You can use the built-in ``SQLQueryDataset`` and the ``runtime_params`` resolver which are designed to do this
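Something along these lines, as a hedged sketch (the entry, table, and credentials names are placeholders, and it assumes a Kedro version whose `OmegaConfigLoader` provides the `runtime_params` resolver for the catalog):
```yaml
# catalog.yml
raw_query:
  type: pandas.SQLQueryDataset
  sql: >
    SELECT * FROM my_schema.my_table
    WHERE event_time >= TIMESTAMP '${runtime_params:start_time_utc}'
  credentials: trino_credentials
```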
a
1. The query might be more than one line long, so having SQL files that are read/loaded by the DataSet is cleaner
2. We are not using SQLAlchemy to connect to our db (trino)
3. does the `runtime_params` resolver require the `OmegaConfigLoader`?
n
`OmegaConfigLoader` is required, and it will be the default in the 0.19 series
You can still read from a file - you can have a resolver which reads a Jinja template and takes variables (with defaults). Let's call this resolver `read_sql_template`. In your catalog, you would define it something like this:
```yaml
# catalog.yml
MyCustomSqlDataset:
  ...
  sql: ${read_sql_template: <file_path>, <dict_of_variable>}
```
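A hedged sketch of what that resolver and its registration could look like, assuming Jinja2 for the templating and a Kedro version whose `OmegaConfigLoader` accepts `custom_resolvers`:
```python
# src/<my-package>/settings.py
from typing import Any, Dict, Optional

from jinja2 import Template
from kedro.config import OmegaConfigLoader


def read_sql_template(file_path: str, variables: Optional[Dict[str, Any]] = None) -> str:
    """Render a SQL Jinja template from disk with the given variables."""
    with open(file_path, encoding="utf-8") as f:
        return Template(f.read()).render(**(variables or {}))


CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {
    "custom_resolvers": {"read_sql_template": read_sql_template},
}
```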
a
ok great. tried it last night - we use Neptune for tracking our experiments and they have not adjusted their config to work with OmegaConf, so for now I will just use a `before_pipeline_run` hook to reinstantiate my custom datasets (before they are loaded), ask Neptune to update their config loader, and then do what you suggested 🙂
n
Are you using `kedro-neptune`? Do they come with a custom config loader?
a
the current version works with `ConfigLoader` (and `TemplatedConfigLoader`) and I have asked them to make it work with OmegaConf.. they will do it, but have not provided a time estimate
y
BTW I've tested it and it is totally possible to use a Trino backend with the `SQLQueryDataset`, you just need to `pip install trino[sqlalchemy]` and set the connection in the credentials, so you can likely avoid the custom dataset.
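A hedged sketch of that credentials entry (user, host, port, catalog and schema are placeholders):
```yaml
# credentials.yml
trino_credentials:
  con: trino://my_user@my-trino-host:8080/my_catalog/my_schema
```
The catalog entry for the `SQLQueryDataset` would then just reference `credentials: trino_credentials`.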