# questions
f
Hello! Does anyone know how to properly update an existing dataset from the catalog? It is not possible for us to download the whole dataset again (from a SQL DB), so we would love to be able to download only the data after a certain date, determined at runtime.
d
+1 I'd like to know this too
😀 1
t
If I understand the situation, you could use the SQLQueryDataSet and have a filter statement in the query argument. With that, you can use TemplatedConfigLoader in settings.py to inject any string into the filter statement from a CLI argument, such as a preferred date
f
Thanks, I'll look that up
👍 1
t
As a simple example with environment variables, you could do this:
```python
# in settings.py
import os

from kedro.config import TemplatedConfigLoader

CONFIG_LOADER_CLASS = TemplatedConfigLoader
CONFIG_LOADER_ARGS = {
    "globals_dict": {"filter_date": os.getenv("FILTER_DATE")},
}
```
then the query argument in the dataset could look like this:
```sql
select * from table where dt > '${filter_date}'
```
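For completeness, a sketch of what the matching catalog entry might look like, assuming pandas.SQLQueryDataSet (the dataset name, table and credentials key here are made up):
```yaml
# conf/base/catalog.yml -- entry and credential names are hypothetical
recent_records:
  type: pandas.SQLQueryDataSet
  sql: select * from table where dt > '${filter_date}'
  credentials: db_credentials
```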
f
Great, thanks again! I'm going to try that soon
Do you know whether it is possible to do this with a parameter as well?
t
if you check here you can see the docs for templating: https://kedro.readthedocs.io/en/stable/kedro.config.TemplatedConfigLoader.html. If you have globals_pattern in the config_loader_args, you can put entries in globals.yml if that's preferred; they then get injected into parameters or catalog entries with the ${variable} pattern
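as a sketch of that variant (the date value here is made up):
```python
# in settings.py -- read template values from globals.yml instead of the environment
from kedro.config import TemplatedConfigLoader

CONFIG_LOADER_CLASS = TemplatedConfigLoader
CONFIG_LOADER_ARGS = {
    "globals_pattern": "*globals.yml",  # picks up conf/<env>/globals.yml
}
```
and the values live in:
```yaml
# conf/base/globals.yml
filter_date: "2023-01-01"  # hypothetical value, referenced as ${filter_date}
```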
f
Alright! 🙏
i
I've been looking at the same thing. We are pulling from some upstream tables that have a long history, and when running our feature engineering, training etc. pipelines we only want to grab a certain window of data (the last year, for example). This isn't a static query and can change with each successive run of the pipeline. What would the current recommended way to do this be? I see two options:
1. Something along the lines of what's discussed in this thread, using hooks to inject this behavior during execution; this feels kind of hacky.
2. Extending the custom SQL query dataset to behave similarly to PartitionedDataSet: evaluating lazily and doing the actual loading in a node which receives the loader as input and can filter using parameters (see the sketch below). I like the idea of this, but it seems like too much work for a "simple" need.
Has anybody faced this issue and would be willing to share how they approached it?
I realize it's been said before that this is against some Kedro design principles regarding repeatability, but I'd disagree that it hinders repeatability, since the run is still deterministic given the input filtering parameters.
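To make option 2 concrete, here's a minimal sketch of what I mean; LazySQLQueryDataSet and load_window are hypothetical names, not an existing Kedro API:
```python
import pandas as pd
from sqlalchemy import create_engine
from kedro.io import AbstractDataSet


class LazySQLQueryDataSet(AbstractDataSet):
    """Hypothetical dataset whose load returns a callable, not a DataFrame."""

    def __init__(self, sql: str, credentials: dict):
        self._sql = sql
        self._con = credentials["con"]

    def _load(self):
        # Defer execution: hand the node a function it can call with a filter.
        def loader(where: str = "") -> pd.DataFrame:
            query = f"{self._sql} {where}" if where else self._sql
            return pd.read_sql(query, create_engine(self._con))

        return loader

    def _save(self, data) -> None:
        raise NotImplementedError("This dataset is read-only")

    def _describe(self) -> dict:
        return {"sql": self._sql}


# A node then receives the loader and applies a parameter-driven window:
def load_window(loader, params: dict) -> pd.DataFrame:
    return loader(where=f"where dt > '{params['filter_date']}'")
```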
d
I think this is a very valid pitch @Iñigo Hidalgo. The short answer is that it needs careful design, as I think it would be non-trivial in execution. My suggestion to you is to build a custom dataset for your purposes and then raise a GitHub issue so it can be discussed and potentially adopted in the framework. Happy to help think through the custom dataset design if helpful
the laziness idea is super neat, and using SQLAlchemy directly would likely let us do this programmatically.
i
I will give it some thought over the next few days. The benefit-to-effort ratio for my team isn't quite there for me to dedicate much work time to it, but it's an interesting problem for me personally, so I might put some of my own time into it
using SQLAlchemy directly would likely let us do this programmatically.
could you explain what you mean by this? I'm not familiar with the SQL dataset implementation at all, haven't gotten round to looking at it just yet, so if I just need to go look at the code, tell me that haha
d
so far we delegate to Spark and pandas for reading SQL, but if we’re talking about lazy implementations an ORM may be the right way to do it. In terms of thinking time, raising an issue with some ideas is still a great contribution to the project :kedroid:
👍 1
i
okay, so at the moment it's just a read_sql with the connection string generated from credentials etc?
d
pretty much, just a thin layer on top of pandas or the Spark JDBC connection
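i.e. the load path boils down to roughly this (a simplified sketch, not the actual dataset source):
```python
import pandas as pd
from sqlalchemy import create_engine

def load(sql: str, con: str) -> pd.DataFrame:
    # build an engine from the credentials' connection string
    # and delegate the actual reading to pandas
    return pd.read_sql(sql, create_engine(con))
```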
i
Thanks, I've started to look into possible options. I'm not familiar with SQLAlchemy at all, but I'd always heard it was geared more towards CRUD applications than analytics workflows. Is there something about its functionality that draws you to it, versus something simpler like composing queries through string concatenation? Because that was my first idea for the lazy evaluation lol
I know you can add filters etc. through methods, but for some reason I get the impression it might be overkill if all we want is simple filters? I guess it would be more extensible
d
pandas actually uses it as the mechanism to talk to any database too
so it’s already a dependency
it’s just a lower-level abstraction
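for illustration, the kind of lazy composition it enables (a sketch assuming SQLAlchemy 1.4+; the table, column and connection string are made up):
```python
import pandas as pd
from sqlalchemy import create_engine, table, column, select

# build a query object without executing anything (lazy composition)
events = table("events", column("dt"), column("value"))
query = select(events)  # roughly: select * from events

# filters can be attached later, e.g. from runtime parameters
query = query.where(column("dt") > "2023-01-01")

# execution only happens when the query is handed to pandas
engine = create_engine("sqlite:///example.db")  # hypothetical connection string
df = pd.read_sql(query, engine)
```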
i
Interesting, had no idea tbh. Thanks 🙂
I'll open an issue some time next week when I've had some time to look at it in some detail, and I can provide some more specific use cases
d
❤️
thanks!
i
One problem I might have when writing this stuff out is that all our projects at work are still on 0.17.1, but as far as I know not much has changed functionality-wise for datasets, right?
The main difference is that we've lost the Context
d
no, on the whole it should be the same
you still have the context in hooks 😛
🧐 1
i
But I am aware of that and wouldn't want to depend on that behavior
Nerdsniped myself lol @datajoely and built a simple implementation (I wasn't aware of SQLAlchemy Core as a separate thing from the ORM; it's actually super useful) https://github.com/kedro-org/kedro/issues/2374
d
thanks for raising that!
now I’ve not had time to understand what you’ve done, but it looks very clever
i
Thanks 🙂 I look forward to you and your team's thoughts when you do get a chance to look at it. I will probably start using this functionality in my pipelines soon (we don't use spark and have simple loading needs)
🥳 1
K 1
d
absolutely
our next backlog grooming is in two weeks
but I’ll try and put some thoughts on there beforehand
K 1
i
No rush, we will start using it so I guess I'll have some additional thoughts soon
K 1
f
Interesting! We really haven't managed to come up with a satisfying solution within Kedro yet, so I'm very happy to see this!