# questions
z
Hi all. What is the best way to read in an SQL table when using Kedro? I find that when I use a Kedro catalog entry (shown below) it tends to be quite slow. For example, the below takes about 16 seconds to load, and the actual table size is millions of rows, so running/testing code becomes quite cumbersome. As far as I understand, Kedro uses
pd.read_sql_query
to read in SQL tables, which I think is what causes it to take so long. I was wondering if there are quicker alternative ways to load tables that people use? Thanks!
kedro_athena_test:
  type: pandas.SQLQueryDataSet
  sql: "select * from sigma.fact_wiki_media limit 100000;"
❤️ 1
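For context on the slowness described above, here is a minimal, self-contained sketch (using an in-memory SQLite table as a stand-in for the Athena table, since Athena itself can't be reproduced locally) of what `pandas.SQLQueryDataSet` does under the hood, plus the `chunksize` option pandas offers to avoid materialising everything at once:

```python
import sqlite3
import pandas as pd

# Stand-in for the real table; the actual backend/table are placeholders.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_wiki_media (id INTEGER, views INTEGER)")
con.executemany(
    "INSERT INTO fact_wiki_media VALUES (?, ?)",
    [(i, i * 10) for i in range(1000)],
)

# What the catalog entry does under the hood: one big fetch into pandas.
df = pd.read_sql_query("SELECT * FROM fact_wiki_media", con)

# One mitigation while staying in pandas: stream the result in chunks,
# so testing code can stop after the first chunk instead of waiting
# for millions of rows to arrive.
chunks = pd.read_sql_query("SELECT * FROM fact_wiki_media", con, chunksize=100)
first = next(chunks)
```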
d
As of last week - @Deepyaman Datta’s new
ibis.TableDataset
is by far the best way https://kedro.org/blog/building-scalable-data-pipelines-with-kedro-and-ibis
👍 1
Ibis is lazy by default and gives you a lazy dataframe to work with rather than loading everything into pandas
you’ll have to learn Ibis syntax but it’s very similar to Pandas / Polars https://ibis-project.org/
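The lazy model described above can be pictured with a toy example (this is a hand-rolled illustration, not Ibis itself): method calls only accumulate into a SQL string, and the database does no work until you explicitly ask for results.

```python
import sqlite3

class LazyTable:
    """Toy stand-in for Ibis's lazy tables: methods build up SQL,
    and the database is only queried when .execute() is called."""

    def __init__(self, con, name, where=None, n=None):
        self._con, self._name, self._where, self._n = con, name, where, n

    def filter(self, predicate):  # predicate is raw SQL here, e.g. "views > 500"
        return LazyTable(self._con, self._name, predicate, self._n)

    def limit(self, n):
        return LazyTable(self._con, self._name, self._where, n)

    def compile(self):
        sql = f"SELECT * FROM {self._name}"
        if self._where:
            sql += f" WHERE {self._where}"
        if self._n is not None:
            sql += f" LIMIT {self._n}"
        return sql

    def execute(self):  # only now does any work happen on the backend
        return self._con.execute(self.compile()).fetchall()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE media (id INTEGER, views INTEGER)")
con.executemany("INSERT INTO media VALUES (?, ?)", [(i, i * 10) for i in range(100)])

expr = LazyTable(con, "media").filter("views > 500").limit(3)
# No query has run yet; `expr` is just a recipe:
sql = expr.compile()
rows = expr.execute()  # the database runs one small query here
```

Real Ibis does this with typed expressions compiled to each backend's SQL dialect, rather than string concatenation, but the execution model is the same.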
z
Cool. I'd not heard of Ibis before but will check it out. Thanks
d
💪
let us know how you find it, I’m personally really excited by it
z
@datajoely. So I've tried out the jaffle-shop tutorial in order to better understand ibis. I typically use PyCharm to execute my code. And I find it's super useful when debugging in order to better understand what the code is doing at each step. When I've been using pandas in the past you can then visualise the dataframe within the PyCharm console which is helpful when debugging/testing code. Is there a way to do the same when using
ibis.TableDataset
?
d
So… it’s a bit new for PyCharm to have rich support, but there are two options:
• df.preview() to see a nice text preview
• you can do df.to_pandas() at any point to view a materialised dataframe
For quick previews you may want to do something like
df.limit(5).to_pandas()
as that will only load 5 rows very efficiently
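The reason the limited preview above is cheap is that the limit is pushed into the SQL the backend runs, so only those rows ever leave the database. A stand-alone illustration of the same pattern with SQLite and pandas (not Ibis):

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE big (id INTEGER)")
con.executemany("INSERT INTO big VALUES (?)", [(i,) for i in range(100_000)])

# Equivalent of df.limit(5).to_pandas(): the LIMIT runs inside the
# database, so pandas only ever materialises five rows, no matter
# how large the underlying table is.
preview = pd.read_sql_query("SELECT * FROM big ORDER BY id LIMIT 5", con)
```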
z
Cool. Again thanks for all your help. And is the
ibis.TableDataset
different to the
kedro_ibis_dataset.IbisDataSet
mentioned in this documentation? https://github.com/inigohidalgo/kedro-ibis-dataset
i
Yes, they’re different. The one you just linked was an initial prototype I wrote. They both return Ibis tables, but you should use the ibis.TableDataset from Kedro and not the other one, as it’s unmaintained. I will update the GitHub repository to reflect this
👍 1
z
And is there a link to further documentation around
ibis.TableDataset
? Or is the blog post and tutorial it? I’m still not entirely sure how to create a catalog entry which reads in my SQL query (which is stored in AWS Athena) and converts that into an
ibis.TableDataset
. Thanks!
d
The API docs should at least have been there, but unfortunately I missed including them in the docs index. They will only be added after https://github.com/kedro-org/kedro-plugins/pull/657 Happy to improve the documentation! Got me thinking that maybe we should just have a separate section on working with databases, since that seems to be something that comes up frequently. In the meantime, I can help as best I can here. Are you using Athena as your database?
z
Hey @Deepyaman Datta, no worries. Sounds like you’ve developed a cool technology and I’m eager to understand it better. Yes, I am. Currently my catalog entry looks like the below. How do I modify it to take advantage of
ibis.TableDataset
? I was not entirely sure how to modify the duckdb catalog entries in the jaffle-shop tutorial to achieve this. Thanks!
social_media_dim_table:
  type: pandas.SQLQueryDataSet
  sql: "dummy sql query string"
  credentials:
    con: 'dummy credential string'
d
what SQL backend are you using?
z
Amazon Athena
d
so this isn’t my area of expertise, but I think you can use this Athena MySQL connector
and then this Ibis setup https://ibis-project.org/backends/mysql
the kedro implementation just passes these keys/values to the Ibis native implementation
d
@datajoely reading the description, I think that only applies if your data is in MySQL already, and you want to access it with Athena?
👍 1
@Zubin Roy Sorry taking a bit here, trying to figure out if using Ibis is possible in your use case. One of the 20 backends Ibis currently supports is Trino: https://ibis-project.org/backends/trino If something will work with Athena, it's probably this.
👍 1
If you want to explore this route, I would try using Ibis directly first to make sure you can connect to the database. You can
pip install 'ibis-framework[trino]'
and try to create a connection, see if that works: https://ibis-project.org/backends/trino#connect If this works, we can create the Kedro catalog entry.
👍 1
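If the direct Ibis connection works, a catalog entry would likely follow the pattern from the Kedro + Ibis blog post, where the keys under `connection:` are passed straight through to the Ibis backend's `connect(...)` call (here `ibis.trino.connect`). A sketch, with placeholder values throughout; the exact key names should be checked against the kedro-datasets docs and the Ibis Trino backend docs:

```yaml
social_media_dim_table:
  type: ibis.TableDataset
  table_name: fact_wiki_media        # placeholder table name
  connection:
    backend: trino
    # everything below is forwarded to ibis.trino.connect();
    # host/user/database here are placeholders
    user: my_user
    host: my-trino-host.example.com
    port: 443
    database: sigma
```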
z
So I'm no data engineer, so apologies if this explanation is slightly incorrect. But (as far as I understand) Athena is not actually connecting to an SQL database. It's a query tool that connects to a data lake, where our data lake is an Amazon S3 folder containing loads of .parquet files, and Athena aggregates these on the fly so they can be returned in the form of a table. Let me know if that explanation makes sense and if Trino is still the best option for this case?
c
But (as far as I understand) ATHENA is not actually connecting to an SQL database. It's a query tool that is connecting to a data lake. Where our data lake is an Amazon s3 folder which contains loads of .parquet files and then ATHENA aggregates all of these instantaneously in order for them to be returned in the form of a table.
this is correct -- Trino is basically a distributed SQL query engine that queries over a filesystem (like Amazon S3) -- a "lakehouse" architecture, or a kind of decoupled database. But it still looks like a "database" in that you interface through SQL and it has a catalog of tables (the actual data being in the S3 filesystem or elsewhere). Athena is AWS's managed version of Trino, so in theory you should be able to use Ibis with the Trino backend (though it's not something we directly test with AWS Athena) -- if you can connect and list tables, then you should be good to go. Ibis then converts Python dataframe code into Trino SQL to execute on the backend (Athena in this case)
👍 1
Ibis might not work with AWS Athena out of the box though
z
@Cody Peterson. Cool. Thanks for your help. will give Trino a try then!
g
(as just an observer, this has been a very insightful thread. thanks all for asking/explaining)
❤️ 1
d
Would the Kedro community be interested in an Ibis demo / presentation?
👍 7
z
Yeah for sure that would be great!