What would be the smartest way to query only data from a database that is less than 5 years old (relative to today or a set end date) through the catalog?
01/31/2023, 5:56 PM
SQLQueryDataSet? That said, we do emphasise that this sort of query limits the reproducibility of your pipelines if the underlying data keeps changing
01/31/2023, 5:59 PM
Well, the requirement is to save memory, and we assume that a limited amount of data (e.g. 5 years) would be enough for model training. The database itself, however, keeps filling up with historical data and thus grows.
however we obviously want to take the most recent 5 years…
01/31/2023, 6:01 PM
so it’s then a case of doing that condition with sql in your relevant dialect:
WHERE date > dateadd('years', -5, today())
some variant of that
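As one concrete variant, here is a minimal sketch of that rolling window in SQLite's dialect, run through Python's built-in sqlite3 module. The table name `measurements`, its columns and the helper `load_recent_rows` are made up for illustration; other databases spell the date arithmetic differently.

```python
import sqlite3
from datetime import date, timedelta


def load_recent_rows(conn):
    """Return rows whose date falls within the last 5 years (SQLite syntax)."""
    # date('now', '-5 years') is SQLite's way of writing "today minus 5 years".
    query = (
        "SELECT id, date FROM measurements "
        "WHERE date > date('now', '-5 years') "
        "ORDER BY date"
    )
    return conn.execute(query).fetchall()


# Hypothetical in-memory table holding one old row and one recent row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (id INTEGER, date TEXT)")
today = date.today()
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [
        (1, (today - timedelta(days=3650)).isoformat()),  # roughly 10 years old
        (2, (today - timedelta(days=30)).isoformat()),    # roughly 1 month old
    ],
)

recent = load_recent_rows(conn)
recent_ids = [row[0] for row in recent]
```

Because the cutoff is evaluated at query time, the same statement returns different rows on different days, which is the reproducibility caveat mentioned earlier.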
01/31/2023, 6:02 PM
yeah, that makes sense. Is there a way to add a specific end date instead of today?
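One possible approach, sketched here under assumptions: compute both bounds of the window in Python from a chosen end date and bind them as query parameters instead of calling a "today" function in SQL. The table `measurements` and the helpers `five_year_window` and `load_window` are hypothetical. With a pandas-backed dataset, bound values like these would normally go through the `params` argument of `pandas.read_sql_query`; whether your catalog entry can forward them is something to check for your setup.

```python
import sqlite3
from datetime import date


def five_year_window(end_date, years=5):
    """Return (start, end) where start lies `years` before end_date."""
    try:
        start = end_date.replace(year=end_date.year - years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; fall back to the 28th.
        start = end_date.replace(year=end_date.year - years, day=28)
    return start, end_date


def load_window(conn, end_date):
    """Select ids of rows dated within the 5 years up to and including end_date."""
    start, end = five_year_window(end_date)
    # ISO-formatted date strings compare correctly as text in SQLite.
    query = (
        "SELECT id FROM measurements "
        "WHERE date > ? AND date <= ? "
        "ORDER BY date"
    )
    rows = conn.execute(query, (start.isoformat(), end.isoformat()))
    return [row[0] for row in rows]


# Hypothetical data straddling both ends of the window.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (id INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [(1, "2015-06-01"), (2, "2019-06-01"), (3, "2022-12-31"), (4, "2023-01-15")],
)

selected = load_window(conn, date(2022, 12, 31))
```

A fixed end date also makes the query return the same rows on every run, as long as the historical rows themselves are not modified.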