Zubin Roy
04/22/2024, 10:22 AM
I'm using pd.read_sql_query to read in SQL tables, which I think is what causes it to take so long. I was wondering if there are quicker alternative ways to load tables that people use? Thanks!
kedro_athena_test:
  type: pandas.SQLQueryDataSet
  sql: "select * from sigma.fact_wiki_media limit 100000;"
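The slowness here is mostly the eager transfer: pandas.SQLQueryDataSet materialises every row the query returns before a node ever runs. A rough stdlib-only sketch of the idea (sqlite3 standing in for Athena; the table and column names are made up for illustration) shows why keeping the LIMIT inside the SQL bounds the transfer:

```python
import sqlite3

# Hypothetical in-memory stand-in for the Athena table in the catalog entry.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_wiki_media (id INTEGER, views INTEGER)")
con.executemany(
    "INSERT INTO fact_wiki_media VALUES (?, ?)",
    [(i, i * 10) for i in range(1000)],
)

# An eager reader (what pandas.SQLQueryDataSet does) fetches every row the
# query returns; pushing the LIMIT into the SQL keeps that fetch small.
rows = con.execute("SELECT * FROM fact_wiki_media LIMIT 100").fetchall()
print(len(rows))  # 100
```

Lazy engines such as Ibis take this further by only generating SQL for the rows an expression actually needs.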
datajoely
04/22/2024, 10:23 AM
ibis.TableDataset is by far the best way
https://kedro.org/blog/building-scalable-data-pipelines-with-kedro-and-ibis
datajoely
04/22/2024, 10:24 AM
datajoely
04/22/2024, 10:25 AM
Zubin Roy
04/22/2024, 10:43 AM
datajoely
04/22/2024, 10:50 AM
datajoely
04/22/2024, 10:50 AM
Zubin Roy
04/22/2024, 11:20 AM
ibis.TableDataset?
datajoely
04/22/2024, 11:25 AM
• you can use df.preview to see a nice text preview
• you can do df.to_pandas at any point to view a materialised dataframe
For quick previews you may want to do something like df.limit(5).to_pandas(), as that will only load 5 rows, very efficiently
Zubin Roy
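The deferred-execution idea behind df.limit(5).to_pandas() can be sketched without Ibis at all: a generator only produces rows when something downstream asks for them, so taking 5 rows never touches the rest. (The generator and row shape below are made up for illustration.)

```python
from itertools import islice

def scan_table():
    # Stand-in for a lazy table scan: rows are produced only on demand.
    for i in range(1_000_000):
        yield {"id": i}

# Analogue of df.limit(5).to_pandas(): materialise only the first 5 rows.
preview = list(islice(scan_table(), 5))
print(preview)
```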
04/22/2024, 11:45 AM
Is ibis.TableDataset different to the kedro_ibis_dataset.IbisDataSet mentioned in this documentation? https://github.com/inigohidalgo/kedro-ibis-dataset
Iñigo Hidalgo
04/22/2024, 11:50 AM
Zubin Roy
04/22/2024, 11:55 AM
Is there more documentation for ibis.TableDataset? Or is the blog post and tutorial it?
I'm still not entirely sure how to create a catalog entry which reads in my SQL query (which is stored in AWS Athena) and converts that into an ibis.TableDataset. Thanks!
Deepyaman Datta
04/22/2024, 2:11 PM
Zubin Roy
04/22/2024, 2:22 PM
How would I turn a catalog entry like the one below into an ibis.TableDataset? I was not entirely sure how to modify the duckdb catalog entries in the jaffle-shop tutorial to achieve this. Thanks!
social_media_dim_table:
  type: pandas.SQLQueryDataSet
  sql: "dummy sql query string"
  credentials:
    con: 'dummy credential string'
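For comparison, an Ibis-backed entry follows the shape used in the jaffle-shop tutorial: you name a table and describe the backend connection instead of embedding a SQL string. This is a sketch only; the table name, backend, and database values below are placeholders, not a working Athena config:

```yaml
social_media_dim_table:
  type: ibis.TableDataset
  table_name: social_media_dim    # placeholder table name
  connection:
    backend: duckdb               # duckdb in the tutorial; swap for your backend
    database: jaffle_shop.duckdb  # placeholder path
```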
datajoely
04/22/2024, 2:23 PM
Zubin Roy
04/22/2024, 2:26 PM
datajoely
04/22/2024, 2:30 PM
datajoely
04/22/2024, 2:30 PM
datajoely
04/22/2024, 2:30 PM
datajoely
04/22/2024, 2:30 PM
Deepyaman Datta
04/22/2024, 2:32 PM
Deepyaman Datta
04/22/2024, 2:35 PM
Deepyaman Datta
04/22/2024, 2:39 PM
pip install 'ibis-framework[trino]' and try to create a connection, see if that works: https://ibis-project.org/backends/trino#connect
If this works, we can create the Kedro catalog entry.
Zubin Roy
04/22/2024, 2:47 PM
Cody Peterson
04/22/2024, 3:08 PM
> But (as far as I understand) ATHENA is not actually connecting to an SQL database. It's a query tool that is connecting to a data lake. Where our data lake is an Amazon s3 folder which contains loads of .parquet files and then ATHENA aggregates all of these instantaneously in order for them to be returned in the form of a table.
This is correct -- Trino is basically a distributed SQL query engine that queries over a filesystem (like Amazon S3) -- a "lakehouse" architecture, or a kind of decoupled database. But it still looks like a "database" in that you interface through SQL and it has a catalog of tables (the actual data being in the S3 filesystem or elsewhere).
Athena is AWS's managed version of Trino, so in theory you should be able to use Ibis with the Trino backend (though it's not something we directly test with AWS Athena) -- if you can connect and list tables, then you should be good to go. Ibis then converts Python dataframe code into Trino SQL to execute on the backend (Athena in this case).
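The "engine over files" point can be illustrated with a stdlib-only toy: a couple of in-memory CSVs stand in for parquet files under an S3 prefix, and the "engine" simply unions them into one logical table (all names and data here are made up):

```python
import csv
import io

# Toy "data lake": file-like objects standing in for files under one S3 prefix.
lake = [
    io.StringIO("id,views\n1,10\n2,20\n"),
    io.StringIO("id,views\n3,30\n4,40\n"),
]

# A query engine like Trino/Athena scans every file under the table's
# location and presents the union as a single logical table.
rows = [row for f in lake for row in csv.DictReader(f)]
total_views = sum(int(r["views"]) for r in rows)
print(len(rows), total_views)  # 4 100
```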
Cody Peterson
04/22/2024, 3:09 PM
Zubin Roy
04/22/2024, 3:15 PM
George p
04/22/2024, 10:02 PM
datajoely
04/23/2024, 7:51 AM
Zubin Roy
04/23/2024, 8:14 AM