minmin
11/01/2024, 3:40 PM
(in catalog.yml)
oneday_of_data:
  type: pandas.SQLQueryDataset
  credentials: db_credentials
  sql: "select * from results_table where run_date = %(run_date)s"
  load_args:
    params:
      run_date: 2024-11-01
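As context for the entry above: `pandas.SQLQueryDataset` essentially hands `sql` and `load_args` (including `params`) to pandas' SQL reader, so `run_date` ends up as a bound query parameter rather than text pasted into the SQL. A minimal stdlib sketch of the same parameterized-by-date pattern, using `sqlite3` (whose paramstyle is `:name` rather than the `%(name)s` style above; the table contents here are made up for illustration):

```python
import sqlite3

# Hypothetical in-memory stand-in for results_table.
conn = sqlite3.connect(":memory:")
conn.execute("create table results_table (run_date text, value integer)")
conn.executemany(
    "insert into results_table values (?, ?)",
    [("2024-11-01", 1), ("2024-11-01", 2), ("2024-10-30", 3)],
)

# Same idea as sql + load_args.params in the catalog entry:
# the date is a bound parameter, not interpolated into the string.
rows = conn.execute(
    "select * from results_table where run_date = :run_date",
    {"run_date": "2024-11-01"},
).fetchall()
print(rows)  # only the two 2024-11-01 rows
```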
This dataset, combined with this one-node pipeline, lets me get the data from the database onto my local drive.
(in pipeline.py)
pipeline([
    node(
        func=lambda x: x,
        inputs="oneday_of_data",
        outputs="local_data",
        name="sql_to_local_node",
    ),
])
Now, if I wanted more than one day's data as different datasets, this seems like a great candidate for namespacing, because nothing changes except the run date. Like this:
(in catalog.yml)
_run_dates:
  run_1: 2024-11-01 # today
  run_2: 2024-10-30 # yesterday
  run_3: 2024-10-25 # a week ago

"{run_name}.oneday_of_data":
  type: pandas.SQLQueryDataset
  credentials: db_credentials
  sql: "select * from results_table where run_date = %(run_date)s"
  load_args:
    params:
      run_date: ${_run_dates.{run_name}}
But no matter what I try, I can't get this to work. I know I can use {run_name} in the filepath field (if it were, say, a CSV dataset), but is it possible to use it inside a templated/variable-interpolated field like this?
I have tried writing my own custom resolver (called "picker"), defined as:
(in settings.py)
def pick_from_rundates(dict1, key):
    return dict1[key]

CONFIG_LOADER_ARGS = {
    "base_env": "base",
    "default_run_env": "local",
    "custom_resolvers": {"picker": lambda x, y: pick_from_rundates(x, y)},
}
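For what it's worth, outside of the config system the "picker" resolver is just a dictionary lookup; a quick stdlib check of that logic (the run-dates mapping below is copied from the _run_dates block, the rest mirrors settings.py):

```python
# Same helper as in settings.py above.
def pick_from_rundates(dict1, key):
    return dict1[key]

# The callable registered under "picker" just forwards its two arguments.
picker = lambda x, y: pick_from_rundates(x, y)

# Mapping mirroring _run_dates in catalog.yml.
run_dates = {"run_1": "2024-11-01", "run_2": "2024-10-30", "run_3": "2024-10-25"}
print(picker(run_dates, "run_2"))  # 2024-10-30
```

So the resolver itself behaves as expected in plain Python; the question is purely whether the {run_name} placeholder is available inside the `${...}` interpolation.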
and then tried this... which also failed:
(in catalog.yml)
"{run_name}.oneday_of_data":
  type: pandas.SQLQueryDataset
  credentials: db_credentials
  sql: "select * from results_table where run_date = %(run_date)s"
  load_args:
    params:
      run_date: ${picker:${_run_dates},"{run_name}"}
So am I missing something simple here, or is this fundamentally not allowed? Ideally the run dates would be specified in globals.yml instead of directly in catalog.yml, but I am trying to walk before I run here.
I will be grateful for any advice here.
Yury Fedotov
11/01/2024, 4:40 PM
PartitionedDataset, where something like f"predictions{date}" is the key and the corresponding predictions are the value.
So this turns your:
> ...if I wanted more than one day's data as different datasets...
into a single dataset, but multiple files.

minmin
11/02/2024, 9:11 AM

minmin
11/05/2024, 10:18 AM

Yury Fedotov
11/05/2024, 2:53 PM
data_by_date: dict[str, pd.DataFrame] = {str(date): data_for_date for date, data_for_date in data.groupby("date")}
3. This thing is an output of your node function
4. And it gets saved as a PartitionedDataset - which doesn't need to know how many dates you have in advance. It can save 5, 10, 100 dates into separate files - as many as there are keys in that data_by_date dictionary.
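The dict-comprehension step above leans on pandas groupby; the same split-by-date-into-a-dict shape can be sketched with the stdlib alone (the rows and column names here are made up for illustration):

```python
from collections import defaultdict

# Hypothetical rows, standing in for a DataFrame with a "date" column.
rows = [
    {"date": "2024-11-01", "value": 1},
    {"date": "2024-11-01", "value": 2},
    {"date": "2024-10-30", "value": 3},
]

# Same shape as data_by_date above: one key per date, suitable for
# saving as a PartitionedDataset (one file per key).
data_by_date = defaultdict(list)
for row in rows:
    data_by_date[row["date"]].append(row)

print(sorted(data_by_date))  # ['2024-10-30', '2024-11-01']
```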
minmin
11/05/2024, 3:21 PM

minmin
11/05/2024, 3:27 PM
select * from mytable
where run_date in ('2024-11-01', '2024-11-02', ... and all other dates from the yml file..)
once I have that data, you are right, I would be able to create the partitioned dataset. Does that make sense, or have I missed something?
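Building that IN clause from a list of dates can be sketched with bound parameters rather than string formatting (stdlib sqlite3 here, with its `?` paramstyle; `mytable` is taken from the message above, and its contents are illustrative):

```python
import sqlite3

# e.g. the values from _run_dates in catalog.yml
run_dates = ["2024-11-01", "2024-10-30", "2024-10-25"]

conn = sqlite3.connect(":memory:")
conn.execute("create table mytable (run_date text, value integer)")
conn.executemany(
    "insert into mytable values (?, ?)",
    [("2024-11-01", 1), ("2024-10-30", 2), ("2024-09-01", 3)],
)

# One placeholder per date, so the dates stay bound parameters.
placeholders = ", ".join("?" for _ in run_dates)
sql = f"select * from mytable where run_date in ({placeholders})"
rows = conn.execute(sql, run_dates).fetchall()
print(len(rows))  # 2 - only the dates present in both the table and run_dates
```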
Yury Fedotov
11/05/2024, 8:55 PM
Dataset in catalog I believe