# questions
m
Hello all, I have a question about datasets and namespaces. I am not even sure if what I am asking for is possible. Here is a simplified version of the issue: I have a PostgreSQL database which updates daily with data (predictions from some other models, but that is beside the point). One of the columns in `results_table` is called "run_date", so if I want today's results I can do this (in catalog.yml):

```yaml
oneday_of_data:
  type: pandas.SQLQueryDataset
  credentials: db_credentials
  sql: "select * from results_table where run_date = %(run_date)s"
  load_args:
    params:
      run_date: 2024-11-01
```
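Under the hood, `pandas.SQLQueryDataset` passes `load_args` through to `pandas.read_sql_query`, so the driver fills in the placeholder rather than string formatting. Here is a minimal, self-contained sketch of the same parameterised-query idea, using SQLite instead of PostgreSQL (SQLite uses `:name` placeholders where psycopg2 uses the `%(name)s` pyformat style; the table contents are invented for the demo):

```python
import sqlite3

# Hypothetical stand-in for results_table, using an in-memory SQLite DB
# so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("create table results_table (run_date text, prediction real)")
conn.executemany(
    "insert into results_table values (?, ?)",
    [("2024-11-01", 0.9), ("2024-11-01", 0.7), ("2024-10-30", 0.4)],
)

# Equivalent of load_args.params in the catalog entry: the named
# placeholder is filled in by the driver.
rows = conn.execute(
    "select * from results_table where run_date = :run_date",
    {"run_date": "2024-11-01"},
).fetchall()
print(len(rows))  # only the two 2024-11-01 rows
```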
This dataset, combined with this one-node pipeline, lets me get the data from the database onto my local drive (in pipeline.py):

```python
pipeline([
    node(
        func=lambda x: x,
        inputs="oneday_of_data",
        outputs="local_data",
        name="sql_to_local_node",
    ),
])
```
Now, if I wanted more than one day's data as different datasets, this seems like a great candidate for namespacing, because nothing changes except the run date. Like this (in catalog.yml):

```yaml
_run_dates:
  run_1: 2024-11-01 # today
  run_2: 2024-10-30 # yesterday
  run_3: 2024-10-25 # a week ago

"{run_name}.oneday_of_data":
  type: pandas.SQLQueryDataset
  credentials: db_credentials
  sql: "select * from results_table where run_date = %(run_date)s"
  load_args:
    params:
      run_date: ${_run_dates.{run_name}}
```
But no matter what I try, I can't get this to work. I know I can use `{run_name}` in the `filepath` field (if it were, say, a CSV dataset), but is it possible to use it inside a templated/variable-interpolated field like this? I have tried writing my own custom resolver (called "picker"), defined in settings.py as:

```python
def pick_from_rundates(dict1, key):
    return dict1[key]

CONFIG_LOADER_ARGS = {
    "base_env": "base",
    "default_run_env": "local",
    "custom_resolvers": {"picker": lambda x, y: pick_from_rundates(x, y)},
}
```
and then tried this, which also failed (in catalog.yml):

```yaml
"{run_name}.oneday_of_data":
  type: pandas.SQLQueryDataset
  credentials: db_credentials
  sql: "select * from results_table where run_date = %(run_date)s"
  load_args:
    params:
      run_date: ${picker:${_run_dates},"{run_name}"}
```

So am I missing something simple here, or is this fundamentally not allowed? Ideally the run dates would be specified in globals.yml instead of directly in catalog.yml, but I am trying to walk before I run here. I will be grateful for any advice. minesh
y
Not sure if it satisfies your goal, but a workaround could be:
1. Make your function take a list of dates, instead of a single date
2. Load all this as a single table from the DB
3. Partition into dates, and save as a `PartitionedDataset`, where something like `f"predictions{date}"` is the key and the corresponding predictions are the value

So this turns your:
> ...if I wanted more than one day's data as different datasets...
into a single dataset, but multiple files.
m
Thank you very much, Yury, for your response. I will give this a go on Monday and report back here.
Hi Yury, just a quick update: this solution worked as long as the dates are hard-coded in the SQL, which (while not ideal) is OK for my needs.
y
@minmin Hmm, I'm wondering why it depends on hard-coded dates. My idea was to do this:
1. Within your node function, get a dataframe that only contains the dates of your interest.
2. Then do:
```python
data_by_date: dict[str, pd.DataFrame] = {
    str(date): data_for_date for date, data_for_date in data.groupby("date")
}
```
3. This dictionary is the output of your node function.
4. And it gets saved as a `PartitionedDataset`, which doesn't need to know how many dates you have in advance. It can save 5, 10, or 100 dates into separate files, as many as there are keys in that `data_by_date` dictionary.
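The steps above can be sketched without pandas, using plain dicts and `itertools.groupby` in place of `DataFrame.groupby` (the rows and column names are invented for illustration):

```python
from itertools import groupby
from operator import itemgetter

# Toy rows standing in for the single table loaded from the DB.
rows = [
    {"date": "2024-11-01", "prediction": 0.9},
    {"date": "2024-10-30", "prediction": 0.4},
    {"date": "2024-11-01", "prediction": 0.7},
]

# groupby needs sorted input; each group becomes one partition that a
# PartitionedDataset would save to its own file.
rows.sort(key=itemgetter("date"))
data_by_date = {
    date: list(group) for date, group in groupby(rows, key=itemgetter("date"))
}

print(sorted(data_by_date))          # one key per distinct date
print(len(data_by_date["2024-11-01"]))  # two rows for that date
```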
m
Hi Yury, the issue is with step 1: "Within your node function, get a dataframe that only contains dates of your interest." I wanted the "dates of interest" to be a parameterised list in some yml file somewhere. Then I can query the database for those dates, i.e.

```sql
select * from mytable
where run_date in ('2024-11-01', '2024-11-02', ... and all other dates from the yml file...)
```

Once I have that data, you are right, I would be able to create the partitioned dataset. Does that make sense, or have I missed something?
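One way to build that `in (...)` clause from a parameterised list is to generate one placeholder per date so the driver handles the quoting. This is a sketch using SQLite for self-containment (the table and dates are made up; with psycopg2 I believe you can instead pass a tuple to a single `%(run_dates)s` placeholder):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table mytable (run_date text, prediction real)")
conn.executemany(
    "insert into mytable values (?, ?)",
    [("2024-11-01", 0.9), ("2024-11-02", 0.4), ("2024-10-25", 0.7)],
)

# In practice this list would be read from a yml file (e.g. parameters.yml);
# hard-coded here for the demo.
dates_of_interest = ["2024-11-01", "2024-11-02"]

# One "?" per date, filled in by the driver rather than string formatting.
placeholders = ", ".join("?" for _ in dates_of_interest)
query = f"select * from mytable where run_date in ({placeholders})"
rows = conn.execute(query, dates_of_interest).fetchall()
print(len(rows))  # only the rows for the two requested dates
```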
y
@minmin yeah, it makes sense I didn't work with SQL datasets in Kedro, so not sure how that typically gets parametrized, but those should be the kwargs of the
Dataset
in catalog I believe