#questions

Frits

12/20/2022, 9:24 AM
Hello! Does anyone know how to properly update an existing dataset from the catalog? It is not possible for us to download the whole dataset again (from a SQL database), so we would love to be able to download only the data after a certain date, which is determined at runtime.

Dhaval Thakkar

12/20/2022, 9:25 AM
+1 I'd like to know this too

Tom Taylor-Vigrass

12/20/2022, 9:30 AM
If I understand the situation, you could use the SQLQueryDataSet and put a filter statement in the query argument. You can then use TemplatedConfigLoader in settings.py to inject any string into the filter statement from a CLI argument, such as a preferred date.

Frits

12/20/2022, 9:35 AM
Thanks, I'll look that up

Tom Taylor-Vigrass

12/20/2022, 9:42 AM
As a simple example with environment variables, you could do this:
# in settings.py
import os

from kedro.config import TemplatedConfigLoader

CONFIG_LOADER_CLASS = TemplatedConfigLoader
CONFIG_LOADER_ARGS = {
    "globals_dict": {"filter_date": os.getenv("FILTER_DATE")},
}
Then the query argument in the dataset could look like this:
SELECT * FROM my_table WHERE dt > '${filter_date}'
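And for completeness, the catalog entry for that dataset might look something like this (the dataset and table names are placeholders, and db_credentials is assumed to be an entry in credentials.yml holding the con string):
# in catalog.yml
my_filtered_table:
  type: pandas.SQLQueryDataSet
  sql: SELECT * FROM my_table WHERE dt > '${filter_date}'
  credentials: db_credentials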

Frits

12/20/2022, 9:54 AM
Great, thanks again! I'm going to try that soon
Do you know whether it is possible to do this with a parameter as well?

Tom Taylor-Vigrass

12/20/2022, 9:58 AM
The docs for templating are here: https://kedro.readthedocs.io/en/stable/kedro.config.TemplatedConfigLoader.html. If you use globals_pattern in CONFIG_LOADER_ARGS, you can put entries in globals.yml instead, if that's preferred; they then get injected into parameters or catalog entries with the ${variable} pattern.
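As a minimal sketch of that variant (assuming a conf/base/globals.yml file):
# in settings.py
from kedro.config import TemplatedConfigLoader

CONFIG_LOADER_CLASS = TemplatedConfigLoader
CONFIG_LOADER_ARGS = {
    "globals_pattern": "*globals.yml",
}

# in conf/base/globals.yml
filter_date: "2022-01-01"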

Frits

12/20/2022, 10:00 AM
Alright! 🙏

Iñigo Hidalgo

02/24/2023, 5:30 PM
I've been looking at the same thing. We are pulling from some upstream tables that have a long history, and when running our feature engineering, training, etc. pipelines we only want to grab a certain window of data (the last year, for example). This isn't a static query and can change with each successive run of the pipeline. What would the currently recommended way to do this be? I see two options:
1. Something along the lines of what's discussed in this thread, using hooks to inject this behavior during execution; this feels kind of hacky.
2. Extending the custom SQLQueryDataSet to behave in a similar way to PartitionedDataSet: evaluating lazily and doing the actual loading in a node which receives the loader as input and can filter using parameters (see the sketch after this message). I like the idea of this, but it seems like too much work for a "simple" need.
Has anybody faced this issue and would be willing to share how they approached it?
I realize it's been said before that this is against some Kedro design principles regarding repeatability, but I would disagree that this hinders repeatability, since it is still a deterministic run given the input filtering parameters.
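To make option 2 concrete, here is a rough sketch of the kind of lazy dataset I have in mind (purely illustrative; LazySQLQueryDataSet is a hypothetical name, not an existing Kedro dataset):
# A dataset whose load() returns a callable instead of a DataFrame,
# so the node decides what actually gets pulled.
import pandas as pd
from sqlalchemy import create_engine

from kedro.io import AbstractDataSet


class LazySQLQueryDataSet(AbstractDataSet):
    def __init__(self, sql: str, con: str):
        self._sql = sql
        self._con = con

    def _load(self):
        # Return a loader function; nothing is read until the node calls it.
        def loader(filter_clause: str = "") -> pd.DataFrame:
            query = f"{self._sql} {filter_clause}".strip()
            return pd.read_sql(query, create_engine(self._con))

        return loader

    def _save(self, data) -> None:
        raise NotImplementedError("This dataset is read-only")

    def _describe(self) -> dict:
        return {"sql": self._sql}


# In a node, the dataset input is the loader itself:
def load_window(loader, params: dict) -> pd.DataFrame:
    return loader(f"WHERE dt > '{params['filter_date']}'")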

datajoely

02/24/2023, 5:33 PM
I think this is a very valid pitch @Iñigo Hidalgo. The short answer is that it needs careful design, as I think it would be non-trivial in execution. My suggestion is to build a custom dataset for your purposes and then raise a GitHub issue for discussion and potential adoption in the framework. Happy to help think through the custom dataset design if that's helpful.
The laziness idea is super neat, and using SQLAlchemy directly would likely let us do this programmatically.

Iñigo Hidalgo

02/24/2023, 5:39 PM
I will give it some thought over the next few days. The benefit-to-effort ratio for my team isn't exactly there for me to dedicate it too much work time, but it's an interesting problem for me personally, so I might dedicate it some extra time on my own.
> using SQLAlchemy directly would likely let us do this programmatically
Could you explain what you mean by this? I'm not familiar with the SQL dataset implementation at all, haven't gotten round to looking at it just yet, so if I just need to go look at the code, tell me that haha

datajoely

02/24/2023, 5:40 PM
So far we delegate to Spark and pandas for reading SQL, but if we're talking about lazy implementations, an ORM may be the right way to do it. In terms of thinking time: raising an issue with some ideas is still a great contribution to the project.

Iñigo Hidalgo

02/24/2023, 5:42 PM
Okay, so at the moment it's just a read_sql with the connection string generated from credentials, etc.?

datajoely

02/24/2023, 5:58 PM
Pretty much: just a thin layer on top of pandas or the Spark JDBC connection.
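i.e. the pandas load path boils down to roughly this (simplified; the real dataset also handles engine creation, connection reuse, and errors):
import pandas as pd

# roughly what pandas.SQLQueryDataSet does on load:
# the query is fixed in the catalog, con comes from credentials
def load(sql: str, con: str) -> pd.DataFrame:
    return pd.read_sql_query(sql, con=con)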

Iñigo Hidalgo

02/24/2023, 6:01 PM
Thanks, I've started to look into possible options. I'm not familiar with SQLAlchemy at all, but I'd always heard it was more geared towards CRUD applications than analytics workflows. Is there something about its functionality that draws you to it, versus something simpler like composing queries through string concatenation? Because that was my first idea for the lazy evaluation lol
I know you can add filters etc. through methods, but for some reason I get the impression it might be overkill if all we want is simple filters? I guess it would be more extensible.
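For reference, composing a filter lazily with SQLAlchemy Core looks something like this (the table, column, and connection details are made up):
import pandas as pd
from sqlalchemy import MetaData, Table, create_engine, select

engine = create_engine("postgresql://user:password@host/db")  # placeholder
my_table = Table("my_table", MetaData(), autoload_with=engine)

# The query object is composed lazily; nothing touches the database yet.
query = select(my_table).where(my_table.c.dt > "2022-01-01")

# Execution only happens when the query is actually run, e.g. in a node:
df = pd.read_sql(query, engine)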

datajoely

02/24/2023, 6:07 PM
pandas actually uses it as the mechanism to talk to any database too
so it's already a dependency
it's just a lower-level abstraction

Iñigo Hidalgo

02/24/2023, 6:07 PM
Interesting, had no idea tbh. Thanks 🙂
I'll open an issue some time next week when I've had time to look at it in some detail, and I can provide some more specific use cases.

datajoely

02/24/2023, 6:08 PM
❤️
thanks!

Iñigo Hidalgo

02/24/2023, 6:08 PM
One problem I might have when writing this stuff out is that all our projects at work are still on 0.17.1, but as far as I know not much has changed functionality-wise for datasets, right?
The main difference is we've lost the Context.

datajoely

02/24/2023, 6:09 PM
No, on the whole it should be the same.
You still have the context in hooks 😛

Iñigo Hidalgo

02/24/2023, 6:09 PM
But I am aware of that and wouldn't want to depend on that behavior.
Nerdsniped myself lol @datajoely and built a simple implementation (I wasn't aware of SQLAlchemy Core as a separate thing from the ORM; it's actually super useful): https://github.com/kedro-org/kedro/issues/2374

datajoely

02/28/2023, 4:13 PM
Thanks for raising that!
Now, I've not had time to understand what you've done, but it looks very clever.

Iñigo Hidalgo

02/28/2023, 4:43 PM
Thanks 🙂 I look forward to your and your team's thoughts when you get a chance to look at it. I will probably start using this functionality in my pipelines soon (we don't use Spark and have simple loading needs).

datajoely

02/28/2023, 4:55 PM
absolutely
our next backlog grooming is in two weeks
but I’ll try and put some thoughts on there beforehand

Iñigo Hidalgo

02/28/2023, 4:55 PM
No rush, we will start using it so I guess I'll have some additional thoughts soon

Frits

02/28/2023, 6:07 PM
Interesting! We haven't managed to come up with a satisfying solution within Kedro yet, so I'm very happy to see this!