Has anyone used ibis.TableDataset with duckdb schemas? | Kedro #plugins-integrations

Mark Druffel

09/13/2024, 6:44 PM

Has anyone used ibis.TableDataset with duckdb schemas? If I set a schema on a data catalog entry, I get this error:

Invalid Input Error: Could not set option "schema" as a global option

bronze_x:
  type: ibis.TableDataset
  filepath: x.csv
  file_format: csv
  table_name: x
  backend: duckdb
  database: data.duckdb
  schema: bronze

I can reproduce this error with vanilla ibis:

import ibis
con = ibis.duckdb.connect(database="data.duckdb", schema="bronze")

I found a related question on the Ibis GitHub; it sounds like duckdb can't set the schema globally, so it has to be done in the table functions. I'm wondering whether this would require a change to ibis.TableDataset and, if so, whether this pattern would work the same with other backends.

Deepyaman Datta

09/13/2024, 6:56 PM

Wondering if this would require a change to ibis.TableDataset,

Probably. If I understand correctly, this would be a request to pass schema (actually, database, since schema is deprecated as an argument to table()) as a table_arg or something in the dataset?

and if so, would this pattern work the same with other backends?

I think so, because https://github.com/ibis-project/ibis/blob/main/ibis/backends/sql/__init__.py#L47, for example (called in the table() function), is generic to SQL backends.

Mark Druffel

09/13/2024, 7:36 PM

Yeah, I was just looking through some of the other backends; agreed. I'm trying to check pyspark now to recall whether table(database) is actually equivalent to catalog, database, or schema in Hive. On the ibis side, it feels like do_connect using a database parameter is confusing. For example:

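import ibis

# In connect(), database is the path to the DuckDB file.
con = ibis.duckdb.connect(database="data/db/spotify/spotify.duckdb")
# In create_database() and table(), database names a schema inside that file.
con.create_database(name="bronze")
con.create_database(name="silver")
con.create_database(name="gold")
con.table("x", database="bronze")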

The create_database and table calls use database to mean something completely different. On the kedro-datasets side, my question becomes: would it make sense to accept an argument called "schema" and just pass that to table(database = schema), since the "database" argument is already used in the connection string for do_connect?
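
A minimal sketch of the renaming idea above, assuming a hypothetical helper load_table (it is not part of kedro-datasets or Ibis). It accepts a user-facing schema argument and forwards it to the connection's table() call as database. MagicMock stands in for an Ibis connection so the sketch runs without any backend installed.

```python
from unittest.mock import MagicMock


def load_table(con, table_name, schema=None):
    """Hypothetical helper: expose ``schema`` to users, pass it on as ``database``."""
    # Only forward the argument when a schema was actually configured.
    table_kwargs = {"database": schema} if schema is not None else {}
    return con.table(table_name, **table_kwargs)


# MagicMock is a stand-in for an Ibis connection (assumption, for illustration only).
con = MagicMock()
load_table(con, "x", schema="bronze")

# The schema name reaches table() under the keyword ``database``.
con.table.assert_called_once_with("x", database="bronze")
```

With this approach, the database key in the catalog entry keeps meaning the connection target, and schema is translated only at the point where the table is looked up.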

Deepyaman Datta

09/14/2024, 1:45 AM

Sorry, I missed your reply. In short: I'd be inclined to match the Ibis name, but create a section for table_args (i.e. arguments that get passed to the underlying table() call), since that's more in line with how most Kedro datasets are structured in my experience, but I haven't thought about it that much. By the way, I've created an issue for this: https://github.com/kedro-org/kedro-plugins/issues/833 (since I was having trouble sharing some of this context with the Ibis team otherwise, as needed)
