# plugins-integrations
**Mark Druffel:**
Question about `ibis.TableDataset`. Is there a way to use the pandas backend in a pipeline? It seems like you can't write pandas output to a file or a database. It seems like this is by design and makes sense for a *Table*Dataset, but is that the intent? I really like the Ibis API and would prefer to use it as my primary dataframe library. I mostly work with pyspark and duckdb, so it's a natural fit there, but I'm wondering if there is a long-term plan or willingness to consider adding `to_*` methods (e.g. `to_csv`, `to_delta`, etc.) to `ibis.TableDataset`? Or perhaps there should be a different Ibis dataset?

Details: I'm trying to pre-process some badly formed CSV files in my pipeline. I know I can use a pandas node separately, but I prefer the Ibis API, so I tried to use TableDataset. I have the following data catalog entries:
```yaml
raw:
  type: ibis.TableDataset
  filepath: data/01_raw/raw.csv
  file_format: csv
  connection: 
    backend: pandas
  load_args:
    sep: ","

preprocessed:
  type: ibis.TableDataset
  table_name: preprocessed
  connection: 
    backend: pandas
    database: test.db
  save_args:
    materialized: table

standardized:
  type: ibis.TableDataset
  table_name: standardized
  file_format: csv
  connection: 
    backend: duckdb
    database: finance.db
  save_args:
    materialized: table
```
The pipeline code looks like this:
```python
def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess_raw,
                inputs="raw",
                outputs="preprocessed",
                name="preprocess"
            ),
            node(
                func=standardize,
                inputs="preprocessed",
                outputs="standardized",
                name="standardize"
            ),
        ]
    )
```
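(For context, with `ibis.TableDataset` each node receives and returns an Ibis table expression. A minimal sketch of what a node like `preprocess_raw` might look like; the column name and the cleanup steps are hypothetical, since the real node code isn't shown in this thread:)

```python
from ibis import _  # deferred-expression helper
from ibis.expr.types import Table


def preprocess_raw(raw: Table) -> Table:
    # Hypothetical cleanup: drop rows with a null amount and de-duplicate.
    # Because this is a lazy Ibis expression, it runs on whichever backend
    # the dataset is connected to (pandas, DuckDB, PySpark, ...).
    return raw.filter(_.amount.notnull()).distinct()
```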
I jump into an ipython session with `kedro ipython`, run `catalog.load("preprocessed")`, and get the error `TypeError: BasePandasBackend.do_connect() got an unexpected keyword argument 'database'`, which is coming from Ibis. After looking at the backend setup, I see `database` isn't a valid argument. I removed `database` and reran, and got the error `DatasetError: Failed while saving data to data set... Unable to convert <class 'ibis.expr.types.relations.Table'> object to backend type: <class 'pandas.core.frame.DataFrame'>`. I didn't exactly expect this to work, but I wasn't sure...
```yaml
preprocessed:
  type: ibis.TableDataset
  table_name: preprocessed
  connection: 
    backend: pandas
```
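(A side note on that first `TypeError`: in Ibis versions that still ship the pandas backend, `connect()` takes a dictionary mapping table names to DataFrames, not a `database` path, which is why the argument was rejected. Roughly, as a sketch:)

```python
import ibis
import pandas as pd

# The pandas backend wraps an in-memory mapping of DataFrames;
# there is no database file involved, hence no `database` argument.
con = ibis.pandas.connect({"raw": pd.DataFrame({"a": [1, 2, 3]})})
expr = con.table("raw")
```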
Then I tried removing `table_name` as well and got the obvious error that I need a table name or a filepath: `DatasetError: Must provide at least one of 'filepath' or 'table_name'.` No doubt 😂
```yaml
preprocessed:
  type: ibis.TableDataset
  connection: 
    backend: pandas
```
Then I tried adding a filepath and got the error `DatasetError: Must provide 'table_name' for materialization.`, which I can see in TableDataset's `_write` method.
```yaml
preprocessed:
  type: ibis.TableDataset
  filepath: data/02_preprocessed/preprocessed.csv
  connection: 
    backend: pandas
```
👍 1
**Deepyaman Datta:**
In implementing the initial `ibis.TableDataset`, I didn't implement output to files, but there's nothing stopping us from adding this functionality. Ibis itself already supports it, as you point out, and I think it would be natural. I was actually thinking about this while I was out somewhere yesterday, so I'm glad you bring it up. 😅 I envision it should be part of a single dataset. I think we just need to be careful about how different combinations of arguments work. Current behavior:
- Table name and filepath specified
  - Load: from filepath, create named Ibis table in memory
  - Save: write to table with given name
- Table name specified, no filepath
  - Load: from existing table (resulting in named Ibis table in memory)
  - Save: write to table with given name

So, how would we specify something being written to a file? As I was writing this out, I crossed out my initial thought; maybe it makes sense to have `ibis.FileDataset` and `ibis.TableDataset`? If you just wanted to load something from a file and write it back to a database table, I guess you could have the node load from an `ibis.FileDataset` and write to an `ibis.TableDataset`. This could make the use of filepaths much more explicit/obvious.
(Sorry I rambled a bit there, but would love your thoughts!)
There's also a side point around your use of the pandas backend. Is there a reason you specifically want the pandas backend, or would it be fine if you could use the DuckDB backend and write to a file, or the Polars backend and write to a file? https://github.com/ibis-project/ibis/pull/9772#issuecomment-2298784456 proposes to remove the pandas (and Dask) backends from Ibis. FYI @Cody Peterson
**Mark Druffel:**
Thanks for the reply @Deepyaman Datta! I agree on the complexity; I've been going back and forth in my head on whether it should be two dataset types or one, so it feels very validating that you've arrived in a similar place 🙂 Regarding why pandas: this issue led me to try something other than DuckDB. My CSV files have random rows that start with `#`, and that was causing problems for DuckDB...
👍 1
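(For reference, pandas can skip such lines at parse time via `read_csv`'s `comment` parameter, which is one way to handle malformed rows like these; the filepath is taken from the catalog entry above:)

```python
import pandas as pd

# Lines beginning with '#' are dropped during parsing.
df = pd.read_csv("data/01_raw/raw.csv", comment="#")
```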
Also, just looking through the PR discussion: that all makes a lot of sense, but what about using UDFs in PySpark? I'm not a pandas person, so I don't know how all the pieces fit together too well, but I know my team sometimes passes pandas UDFs to our Spark session. Would Ibis still support that?
**Deepyaman Datta:**
By the way, just thinking a bit more, I think the other option is to keep it as one `TableDataset`, rename the `file_format` argument to `format`, and let that also be `table`... or some variation on that (another possibility is to require the `materialized` keyword to be explicitly given if not writing to a file). I think having two datasets might be clearer, though, and more in line with how most Kedro datasets currently look. The other solutions basically add that branching to the dataset implementation.

> Also, just looking through the PR discussion: that all makes a lot of sense, but what about using UDFs in PySpark? I'm not a pandas person, so I don't know how all the pieces fit together too well, but I know my team sometimes passes pandas UDFs to our Spark session. Would Ibis still support that?
Yes, pandas UDFs would still work, on backends that support pandas UDFs; we just wouldn't support pandas as the primary execution backend.
❤️ 1
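(For illustration, this is the PySpark mechanism being asked about; the UDF executes on pandas batches inside Spark executors, so it doesn't depend on pandas being an Ibis execution backend:)

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf


@pandas_udf("double")
def times_two(s: pd.Series) -> pd.Series:
    # Runs inside Spark, independent of which library (Ibis, raw PySpark, ...)
    # constructed the surrounding query.
    return s * 2
```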
**Mark Druffel:**
Yea, in the version we wrote before this was released, I actually did `format` like that. I thought `materialized` was confusing, since it just stands in for `create_` (as in `create_table`/`create_view`), so it was hard to see what it did without reading and understanding the TableDataset code. I'll definitely give this some more thought as we continue using it.
**Deepyaman Datta:**
@Mark Druffel this doesn't solve for being able to write to a file (I'll try to put together a PR for that soon), but re the use of the pandas backend: would the Polars backend suffice? It would also be useful if you want to share more on your use case for reading the CSV with comments, either on https://github.com/ibis-project/ibis/pull/9772 or with me directly; we can try to bring it up with the DuckDB folks, since https://github.com/duckdb/duckdb/issues/7896 doesn't have a very satisfying resolution for a valid question.
**Mark Druffel:**
Hey @Deepyaman Datta, I'm sorry I totally missed your last comment. I can definitely share. I don't have my repo on GitHub yet; it's a personal pipeline for all my bank accounts, so it's going to take me a few minutes to make sure my .gitignore is up to snuff, but I'll share the link as soon as I get it published. Regarding Polars, I definitely would have gone with that over pandas, but my personal computer is a really old custom rig with an old LTS CPU. Polars has a package for that (https://pypi.org/project/polars-lts-cpu/), but when I tried to install it a while back I kept having dependency issues. Always something 🤦‍♂️
🙌 1
After coming back to this, transcoding works perfectly for my current use case, because I'm just trying to use pandas to do some pre-processing but want all my data in DuckDB. I still think there are probably plenty of use cases where I'd want to write to Parquet or CSV from Ibis, but this project isn't one of them.
```yaml
raw:
  type: pandas.CSVDataset
  filepath: data/01_raw/raw.csv

preprocessed@pandas:
  type: pandas.CSVDataset
  filepath: data/02_preprocessed/preprocessed.csv

preprocessed@ibis:
  type: ibis.TableDataset
  filepath: data/02_preprocessed/preprocessed.csv
  file_format: csv
  table_name: preprocessed
  connection: 
    backend: duckdb
    database: test.db
  save_args:
    materialized: table

standardized:
  type: ibis.TableDataset
  table_name: standardized
  connection: 
    backend: duckdb
    database: test.db
  save_args:
    materialized: table
```
**Ricardo Quiroz:**
Thanks for the great discussion @Mark Druffel @Deepyaman Datta 🙂 I've faced a similar issue attempting to save an Ibis table (PySpark backend) to a partitioned Parquet folder, but the Ibis TableDataset only supports saving with either `create_table` or `create_view`. I fixed it by modifying the `_save` method to use PySpark's `write.save()` method. This is tailored to my issue, but I'd love to have the `to_*` capability in `ibis.TableDataset`.
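(A rough sketch of that kind of override, for illustration only. The conversion from an Ibis expression to a Spark DataFrame is the uncertain part: `connection.to_pyspark(...)`, the partition columns, and the output path below are all assumptions, not confirmed APIs of the dataset or of Ibis:)

```python
def _save(self, data) -> None:
    # Assumption: obtain a Spark DataFrame from the Ibis expression; the
    # exact conversion call depends on the Ibis version in use.
    spark_df = self.connection.to_pyspark(data)
    (
        spark_df.write.format("parquet")
        .partitionBy("year", "month")  # hypothetical partition columns
        .mode("overwrite")
        .save("data/03_primary/standardized")  # hypothetical output path
    )
```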
**Deepyaman Datta:**
Thanks for the feedback! I'm OOO this week, but this is one of the things at the top of my list to address once I return. 🙂
💙 1
Sorry for the delay @Mark Druffel @Ricardo Quiroz, but I've got a PR up for an initial `FileDataset` implementation: https://github.com/kedro-org/kedro-plugins/pull/842 (I've also listed a few other action items on that PR that I'll try to address later today, but thought I'd share this early before I get distracted by other things)
🙌 3
I've gone ahead and completed most of the open items. I'll try to get a better implementation of versioning in later this week. 🤞 I think we are looking to cut a release of Kedro-Datasets in the next 2-3 weeks (I could be wrong 😅 but I think that's what I heard), in case any of you would like to test it out before then.
For whoever's interested, the initial version of `FileDataset` is out in Kedro-Datasets 5.1.0 (released just now). I'll probably make the official announcement at the beginning of next week, but wanted to let you all know in case you want to check it out before then!
❤️ 2
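(For anyone who wants a quick smoke test before the announcement, a minimal sketch of using the new dataset directly; the argument names follow the existing `TableDataset` conventions, so double-check them against the Kedro-Datasets 5.1.0 docs:)

```python
from kedro_datasets.ibis import FileDataset

dataset = FileDataset(
    filepath="data/01_raw/raw.csv",
    file_format="csv",
    connection={"backend": "duckdb"},  # any backend with read_*/to_* support
)
table = dataset.load()  # an Ibis table expression
dataset.save(table)     # written back out through the backend
```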