What would be the smartest way to query only data ...
# questions
e
What would be the smartest way to query only data from a database that is newer than 5 years (from today/a set enddate) through the catalog?
d
SQLQueryDataSet? That being said, we do emphasise that this sort of thing limits the reproducibility of your pipelines if the underlying data is changing
e
yes, pandas.SQLQueryDataSet. Well, the requirement is to save memory, and we assume that a limited amount of data (e.g. 5 years) would be enough for model training. The database, however, keeps filling up with historical data and thus grows.
however we obviously want to take the most recent 5 years…
d
so it’s then a case of expressing that condition in SQL, in your relevant dialect:
SELECT *
FROM xxxx
WHERE date > dateadd('years', -5, today())
some variant of that
e
yeah, that makes sense. Is there a way to add a specific end date instead of today?