# questions
r
Hi all, I’m revisiting Kedro for the first time in a few years and have a question about how best to define a data-enrichment step that calls an external API in my pipeline. The purpose of the pipeline is to generate an enriched geospatial dataset: it takes lon/lats, then calls an (expensive) external API to generate isochrones. I want to avoid calling the external API every time I run the pipeline, so in my dev notebooks I’ve been calling the external API once for each lon/lat, then storing the result in Snowflake before using it for feature generation in future pipeline runs. Based on my understanding of Kedro, one approach I could take is to define a data processing node that:
1. Takes an input dataset with all lon/lats
2. Checks an existing dataset (e.g., a Snowflake table) to see whether the API has already been called for each lon/lat
3. If the API has not been called for a lon/lat, calls the API and inserts the result into the dataset
4. Outputs the enriched dataset
My questions are:
• What is best practice for calling external APIs for data enrichment in Kedro?
• Is this the best way to make sure the API is not being called unnecessarily, or am I missing a trick?
d
I don't think this is Kedro-specific per se, but what I might do in your node, for each lon/lat, is:
1. Check your Snowflake table or local cache to see if the API has already been called (if so, use the value and move on to the next point).
2. If it hasn't been called, get it from the API. Insert the lon/lat-to-value mapping into your local cache (which can be as simple as a dictionary), and use the value.
At the end, you can insert the whole local cache into the database, rather than making so many individual database insert calls.
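The steps above could be sketched roughly like this (a minimal sketch, not Kedro- or Snowflake-specific: `call_api`, the dict-based cache, and the return shape are all illustrative assumptions):

```python
def enrich_with_isochrones(points, cached, call_api):
    """Sketch of the caching logic described above.

    points: iterable of (lon, lat) tuples.
    cached: dict mapping (lon, lat) -> isochrone, loaded from the
        Snowflake table (stand-in for the existing cache).
    call_api: hypothetical function (lon, lat) -> isochrone.
    Returns (enriched, new_results), where new_results holds only the
    freshly fetched values to bulk-insert back into the database.
    """
    new_results = {}
    enriched = {}
    for lonlat in points:
        if lonlat in cached:
            # Already fetched on a previous run: reuse, no API call.
            enriched[lonlat] = cached[lonlat]
        elif lonlat in new_results:
            # Duplicate point within this run: reuse the in-memory value.
            enriched[lonlat] = new_results[lonlat]
        else:
            # Not seen before: pay for one API call, remember the result.
            value = call_api(*lonlat)
            new_results[lonlat] = value
            enriched[lonlat] = value
    # Caller then writes new_results to Snowflake in a single bulk insert,
    # instead of one insert per point.
    return enriched, new_results
```

The key design point is that `new_results` is kept separate from `enriched`, so the write-back to the database touches only the rows that are actually new.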
r
Thanks, that makes sense.
I’m just looking at hooks: do you know if there is any precedent for using them for this sort of thing? Or is what I’m describing pretty standard for code within a node?
d
Don't think a hook makes sense; if anything, you could have a separate node update the Snowflake table first.
r
What would the benefit of that be?
d
No real benefit; it's probably slightly slower, but you get a separation of logic, so the second node looks like it's working on a fully-populated table.
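The two-node split described here might look like this in plain Python (function and argument names are hypothetical; in Kedro each function would become its own `node`, with the first node's output table registered in the catalog and fed as the input to the second):

```python
def update_isochrone_table(points, table, call_api):
    # Node 1: ensure every lon/lat has an isochrone, calling the
    # (expensive) API only for missing entries. Returning a new dict
    # stands in for persisting the fully-populated table to Snowflake.
    updated = dict(table)
    for lonlat in points:
        if lonlat not in updated:
            updated[lonlat] = call_api(*lonlat)
    return updated

def build_features(points, table):
    # Node 2: pure feature logic over a table it can assume is fully
    # populated; it has no knowledge that an API exists.
    return [(lonlat, table[lonlat]) for lonlat in points]
```

The trade-off is as stated above: an extra node boundary (and an extra catalog write) in exchange for the feature node never containing API or caching logic.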
r
Ok great, thanks for your help