# questions
a
Hi. I have created a custom TrinoSQLDataSet, that:
1. Reads a query from a SQL file
2. Injects some parameters into the query, e.g. `{start_time_utc}` in the query is replaced by the parameter
```yaml
# catalog.yml
raw-query-dataset:
  type: <my-package>.io.TrinoSQLDataSet
  query_param_names:
    - start_time_utc
```
What the dataset does with `query_param_names` is to look for each of those keys in the Kedro parameters and inject them into the query, so essentially this is a parameterised query
```yaml
# parameters.yml
start_time_utc: 2023-01-01 00:00:00
```
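For context, a minimal sketch of what that injection step could look like inside such a dataset (a hypothetical helper, not the actual implementation; it assumes `str.format`-style `{placeholder}` tokens in the SQL file):
```python
from typing import Any, Dict, List


def _render_query(sql_path: str, kedro_params: Dict[str, Any], param_names: List[str]) -> str:
    """Read the SQL file and substitute {placeholder} tokens with Kedro parameters."""
    with open(sql_path, encoding="utf-8") as f:
        query = f.read()
    # keep only the parameters declared under `query_param_names` in the catalog
    selected = {name: kedro_params[name] for name in param_names}
    return query.format(**selected)  # e.g. replaces {start_time_utc}
```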
In the dataset class itself, this is how I load the Kedro parameters:
```python
import os
from typing import Any, Dict
from kedro.config import ConfigLoader
from kedro.framework.project import settings

def _load_kedro_params() -> Dict[str, Any]:
    conf_path = str(settings.CONF_SOURCE)
    # get kedro run env to load the params from the correct conf dir
    env = os.getenv("KEDRO_ENV", "local")
    conf_loader = ConfigLoader(conf_source=conf_path, default_run_env=env)
    return conf_loader.get("parameters*", "parameters*/**")
```
This means the yml file is read, but if I run the pipeline using
```bash
kedro run --params "start_time_utc: 2023-03-01 13:00:00"
```
The dataset still loads the parameter from the file… Is there any way to access the merged dictionary of parameters, i.e. the file values overwritten by the CLI extra/runtime params? I cannot find a way to access the current session/context anymore
d
So the datasets shouldn’t be aware of IO
the right way to do all of this is to use a `before_pipeline_run` hook and pass the parameter to your catalog entry that way
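For illustration, a minimal sketch of such a hook, assuming the custom dataset keeps its parameters in a hypothetical `query_params` attribute and that the catalog entry is named `raw-query-dataset` as above:
```python
from typing import Any, Dict

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog


class InjectQueryParamsHook:
    """Pass values given via `kedro run --params ...` on to the custom dataset."""

    @hook_impl
    def before_pipeline_run(self, run_params: Dict[str, Any], catalog: DataCatalog) -> None:
        # run_params["extra_params"] holds the runtime params supplied on the CLI
        extra_params = run_params.get("extra_params") or {}
        # _get_dataset is technically private API; used here for brevity
        dataset = catalog._get_dataset("raw-query-dataset")
        # hypothetical attribute on the custom dataset holding its query parameters
        dataset.query_params = {**getattr(dataset, "query_params", {}), **extra_params}
```
The hook would then be registered in `settings.py` via `HOOKS = (InjectQueryParamsHook(),)`.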
a
> So the datasets shouldn’t be aware of IO
Not quite sure what you mean by that. Limiting query size at runtime is preferable to filtering later
d
by design you shouldn’t be loading your own config loader
the Hooks mechanism is the right way to integrate into the Kedro run lifecycle
this comes up a lot and there will be examples in the Slack archive too: https://linen-slack.kedro.org/
a
> the right way to do all of this is to use a `before_pipeline_run` hook and pass the parameter to your catalog entry that way
ok, that would work 🙏 any example for adding onto the catalog?
you could also do the environment variable way
as you can pull those using omegaconf easily
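For reference, a small OmegaConf-only snippet showing its built-in `oc.env` resolver pulling an environment variable with a default; whether you can interpolate env vars directly in your Kedro conf files depends on your config loader setup:
```python
import os

from omegaconf import OmegaConf

os.environ["START_TIME_UTC"] = "2023-03-01 13:00:00"
# oc.env reads an environment variable; an optional default follows the comma
cfg = OmegaConf.create({"start_time_utc": "${oc.env:START_TIME_UTC,'2023-01-01 00:00:00'}"})
print(cfg.start_time_utc)  # -> 2023-03-01 13:00:00
```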
n
Could it be achieved better with a custom resolver?
y
I don't see why there is a need for a custom dataset unless I'm missing something. You can use the built-in ``SQLQueryDataset`` and the ``runtime_params`` resolver which are designed to do this
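Something along these lines, as a hedged sketch (the entry, table, and credentials names are placeholders, and it assumes a Kedro version whose `OmegaConfigLoader` provides the `runtime_params` resolver for the catalog):
```yaml
# catalog.yml
raw_query:
  type: pandas.SQLQueryDataset
  sql: >
    SELECT * FROM my_schema.my_table
    WHERE event_time >= TIMESTAMP '${runtime_params:start_time_utc}'
  credentials: trino_credentials
```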
a
1. The query might be more than one line long, so having SQL files that are read/loaded by the DataSet is cleaner
2. We are not using SQLAlchemy to connect to our db (trino)
3. does the `runtime_params` resolver require the `OmegaConfigLoader`?
n
`OmegaConfigLoader` is required, and it will be the default in the 0.19 series
You can still read from a file - you can have a resolver which reads a Jinja template and takes variables (with defaults). Let's call this resolver `read_sql_template`. In your catalog, you would define it something like this:
```yaml
# catalog.yml
MyCustomSqlDataset:
  ...
  sql: ${read_sql_template: <file_path>, <dict_of_variable>}
```
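A hedged sketch of what that resolver and its registration could look like, assuming Jinja2 for the templating and a Kedro version whose `OmegaConfigLoader` accepts `custom_resolvers`:
```python
# src/<my-package>/settings.py
from typing import Any, Dict, Optional

from jinja2 import Template
from kedro.config import OmegaConfigLoader


def read_sql_template(file_path: str, variables: Optional[Dict[str, Any]] = None) -> str:
    """Render a SQL Jinja template from disk with the given variables."""
    with open(file_path, encoding="utf-8") as f:
        return Template(f.read()).render(**(variables or {}))


CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {
    "custom_resolvers": {"read_sql_template": read_sql_template},
}
```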
a
ok great. tried it last night - we use Neptune for tracking our experiments and they have not adjusted their config to work with OmegaConf, so for now I will just use a `before_pipeline_run` hook to reinstantiate my custom datasets (before they are loaded), ask Neptune to update their config loader, and then do what you suggested 🙂
n
Are you using `kedro-neptune`? Do they come with a custom config loader?
a
the current version works with `ConfigLoader` (and `TemplatedConfigLoader`) and I have asked them to make it work with OmegaConf.. they will do it, but have not provided a time estimate
y
BTW I've tested it and it is totally possible to use a Trino backend with the `SQLQueryDataset`, you just need to `pip install trino[sqlalchemy]` and set the connection in the credentials, so you can likely avoid the custom dataset.
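A hedged sketch of that credentials entry (user, host, port, catalog and schema are placeholders):
```yaml
# credentials.yml
trino_credentials:
  con: trino://my_user@my-trino-host:8080/my_catalog/my_schema
```
The catalog entry for the `SQLQueryDataset` would then just reference `credentials: trino_credentials`.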