# questions
z
Hi all. What is the best way to read in an SQL table when using Kedro? I find that when I use a Kedro catalog entry (shown below) it tends to be quite slow. For example, the below takes about 16 seconds to load, and the actual table size is millions of rows, so running/testing code becomes quite cumbersome. As far as I understand, Kedro uses
pd.read_sql_query
to read in SQL tables, which I think is what causes it to take so long. I was wondering if there are quicker alternative ways to load tables that people use? Thanks!
kedro_athena_test:
  type: pandas.SQLQueryDataSet
  sql: "select * from sigma.fact_wiki_media limit 100000;"
❤️ 1
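For context on the slowness described above, here is a minimal, self-contained sketch (using an in-memory SQLite table as a stand-in for the Athena table, since Athena itself can't be reproduced locally) of what `pandas.SQLQueryDataSet` does under the hood, plus the `chunksize` option pandas offers to avoid materialising everything at once:

```python
import sqlite3
import pandas as pd

# Stand-in for the real table; the actual backend/table are placeholders.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_wiki_media (id INTEGER, views INTEGER)")
con.executemany(
    "INSERT INTO fact_wiki_media VALUES (?, ?)",
    [(i, i * 10) for i in range(1000)],
)

# What the catalog entry does under the hood: one big fetch into pandas.
df = pd.read_sql_query("SELECT * FROM fact_wiki_media", con)

# One mitigation while staying in pandas: stream the result in chunks,
# so testing code can stop after the first chunk instead of waiting
# for millions of rows to arrive.
chunks = pd.read_sql_query("SELECT * FROM fact_wiki_media", con, chunksize=100)
first = next(chunks)
```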
d
As of last week - @Deepyaman Datta’s new
ibis.TableDataset
is by far the best way https://kedro.org/blog/building-scalable-data-pipelines-with-kedro-and-ibis
👍 1
Ibis is lazy by default and gives you a lazy dataframe to work with rather than loading everything into pandas
you’ll have to learn Ibis syntax but it’s very similar to Pandas / Polars https://ibis-project.org/
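The lazy model described above can be pictured with a toy example (this is a hand-rolled illustration, not Ibis itself): method calls only accumulate into a SQL string, and the database does no work until you explicitly ask for results.

```python
import sqlite3

class LazyTable:
    """Toy stand-in for Ibis's lazy tables: methods build up SQL,
    and the database is only queried when .execute() is called."""

    def __init__(self, con, name, where=None, n=None):
        self._con, self._name, self._where, self._n = con, name, where, n

    def filter(self, predicate):  # predicate is raw SQL here, e.g. "views > 500"
        return LazyTable(self._con, self._name, predicate, self._n)

    def limit(self, n):
        return LazyTable(self._con, self._name, self._where, n)

    def compile(self):
        sql = f"SELECT * FROM {self._name}"
        if self._where:
            sql += f" WHERE {self._where}"
        if self._n is not None:
            sql += f" LIMIT {self._n}"
        return sql

    def execute(self):  # only now does any work happen on the backend
        return self._con.execute(self.compile()).fetchall()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE media (id INTEGER, views INTEGER)")
con.executemany("INSERT INTO media VALUES (?, ?)", [(i, i * 10) for i in range(100)])

expr = LazyTable(con, "media").filter("views > 500").limit(3)
# No query has run yet; `expr` is just a recipe:
sql = expr.compile()
rows = expr.execute()  # the database runs one small query here
```

Real Ibis does this with typed expressions compiled to each backend's SQL dialect, rather than string concatenation, but the execution model is the same.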
z
Cool. I'd not heard of Ibis before but will check it out. Thanks
d
💪
let us know how you find it, I’m personally really excited by it
z
@datajoely. So I've tried out the jaffle-shop tutorial in order to better understand ibis. I typically use PyCharm to execute my code. And I find it's super useful when debugging in order to better understand what the code is doing at each step. When I've been using pandas in the past you can then visualise the dataframe within the PyCharm console which is helpful when debugging/testing code. Is there a way to do the same when using
ibis.TableDataset
?
d
So… it’s a bit new for PyCharm to have rich support, but there are two options:
• df.preview() to see a nice text preview
• you can do df.to_pandas() at any point to view a materialised dataframe
For quick previews you may want to do something like
df.limit(5).to_pandas()
as that will only load 5 rows very efficiently
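The reason the limited preview above is cheap is that the limit is pushed into the SQL the backend runs, so only those rows ever leave the database. A stand-alone illustration of the same pattern with SQLite and pandas (not Ibis):

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE big (id INTEGER)")
con.executemany("INSERT INTO big VALUES (?)", [(i,) for i in range(100_000)])

# Equivalent of df.limit(5).to_pandas(): the LIMIT runs inside the
# database, so pandas only ever materialises five rows, no matter
# how large the underlying table is.
preview = pd.read_sql_query("SELECT * FROM big ORDER BY id LIMIT 5", con)
```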
z
Cool. Again thanks for all your help. And is the
ibis.TableDataset
different to the
kedro_ibis_dataset.IbisDataSet
mentioned in this documentation? https://github.com/inigohidalgo/kedro-ibis-dataset
i
Yes, they’re different. The one you just linked was an initial prototype I wrote. They both return Ibis tables, but you should use the ibis.TableDataset from Kedro and not the other one, as it’s unmaintained. I will update the GitHub repository to reflect this
👍 1
z
And is there a link to further documentation around
ibis.TableDataset
? Or is the blog post and tutorial it? I’m still not entirely sure how to create a catalog entry which reads in my SQL query (which is stored in AWS Athena) and converts that into an
ibis.TableDataset
. Thanks!
d
The API docs should at least have been there, but unfortunately I missed including them in the docs index. They will only be added after https://github.com/kedro-org/kedro-plugins/pull/657 Happy to improve the documentation! Got me thinking that maybe we should just have a separate section on working with databases, since that seems to be something that comes up frequently. In the meantime, I can help as best I can here. Are you using Athena as your database?
z
Hey @Deepyaman Datta, no worries. Sounds like you’ve developed a cool technology and I’m eager to understand it better. Yes, I am. Currently my catalog entry looks like the below. How do I modify it to take advantage of
ibis.TableDataset
? I was not entirely sure how to modify the duckdb catalog entries in the jaffle-shop tutorial to achieve this. Thanks!
social_media_dim_table:
  type: pandas.SQLQueryDataSet
  sql: "dummy sql query string"
  credentials:
    con: 'dummy credential string'
d
what SQL backend are you using?
z
Amazon Athena
d
so this isn’t my area of expertise, but I think you can use this Athena MySQL connector
and then this Ibis setup https://ibis-project.org/backends/mysql
the kedro implementation just passes these keys/values to the Ibis native implementation
d
@datajoely reading the description, I think that only applies if your data is in MySQL already, and you want to access it with Athena?
👍 1
@Zubin Roy Sorry taking a bit here, trying to figure out if using Ibis is possible in your use case. One of the 20 backends Ibis currently supports is Trino: https://ibis-project.org/backends/trino If something will work with Athena, it's probably this.
👍 1
If you want to explore this route, I would try using Ibis directly first to make sure you can connect to the database. You can
pip install 'ibis-framework[trino]'
and try to create a connection, see if that works: https://ibis-project.org/backends/trino#connect If this works, we can create the Kedro catalog entry.
👍 1
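If the direct Ibis connection works, a catalog entry would likely follow the pattern from the Kedro + Ibis blog post, where the keys under `connection:` are passed straight through to the Ibis backend's `connect(...)` call (here `ibis.trino.connect`). A sketch, with placeholder values throughout; the exact key names should be checked against the kedro-datasets docs and the Ibis Trino backend docs:

```yaml
social_media_dim_table:
  type: ibis.TableDataset
  table_name: fact_wiki_media        # placeholder table name
  connection:
    backend: trino
    # everything below is forwarded to ibis.trino.connect();
    # host/user/database here are placeholders
    user: my_user
    host: my-trino-host.example.com
    port: 443
    database: sigma
```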
z
So I'm no data engineer, so apologies if this explanation is slightly incorrect. But (as far as I understand) Athena is not actually connecting to an SQL database. It's a query tool that connects to a data lake, where our data lake is an Amazon S3 folder containing loads of .parquet files, and Athena aggregates these on the fly so they can be returned in the form of a table. Let me know if that explanation makes sense and if Trino is still the best option for this case?
c
But (as far as I understand) ATHENA is not actually connecting to an SQL database. It's a query tool that is connecting to a data lake. Where our data lake is an Amazon s3 folder which contains loads of .parquet files and then ATHENA aggregates all of these instantaneously in order for them to be returned in the form of a table.
this is correct -- Trino is basically a distributed SQL query engine that queries over a filesystem (like Amazon S3) -- a "lakehouse" architecture, or a kind of decoupled database. But it still looks like a "database" in that you interface through SQL and it has a catalog of tables (the actual data being in the S3 filesystem or elsewhere). Athena is AWS's managed version of Trino, so in theory you should be able to use Ibis with the Trino backend (though it's not something we directly test with AWS Athena) -- if you can connect and list tables, then you should be good to go. Ibis then converts Python dataframe code into Trino SQL to execute on the backend (Athena in this case)
👍 1
Ibis might not work with AWS Athena out of the box though
z
@Cody Peterson. Cool. Thanks for your help. will give Trino a try then!
g
(as just an observer, this has been a very insightful thread. thanks all for asking/explaining)
❤️ 1
d
Would the Kedro community be interested in an Ibis demo / presentation?
👍 7
z
Yeah for sure that would be great!