# plugins-integrations
**Mark Druffel:**
Question about `ibis.TableDataset`. Is there a way to use the pandas backend in a pipeline? It seems like you can't write pandas output to a file or a database. It seems like this is by design and makes sense for a *Table*Dataset, but is that the intent? I really like the Ibis API and would prefer to use it as my primary dataframe library. I mostly work with pyspark and duckdb, so it's a natural fit there, but I'm wondering if there is a long-term plan or willingness to consider adding `to_*` methods (e.g. `to_csv`, `to_delta`, etc.) to `ibis.TableDataset`? Or perhaps there should be a different Ibis dataset?

Details: I'm trying to pre-process some badly formed CSV files in my pipeline. I know I can use a pandas node separately, but I prefer the Ibis API, so I tried to use TableDataset. I have the following data catalog entries:
```yaml
raw:
  type: ibis.TableDataset
  filepath: data/01_raw/raw.csv
  file_format: csv
  connection: 
    backend: pandas
  load_args:
    sep: ","

preprocessed:
  type: ibis.TableDataset
  table_name: preprocessed
  connection: 
    backend: pandas
    database: test.db
  save_args:
    materialized: table

standardized:
  type: ibis.TableDataset
  table_name: standardized
  file_format: csv
  connection: 
    backend: duckdb
    database: finance.db
  save_args:
    materialized: table
```
The pipeline code looks like this:
```python
def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess_raw,
                inputs="raw",
                outputs="preprocessed",
                name="preprocess"
            ),
            node(
                func=standardize,
                inputs="preprocessed",
                outputs="standardized",
                name="standardize"
            ),
        ]
    )
```
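(For context, with `ibis.TableDataset` each node receives and returns an Ibis table expression. A minimal sketch of what a node like `preprocess_raw` might look like; the column name and the cleanup steps are hypothetical, since the real node code isn't shown in this thread:)

```python
from ibis import _  # deferred-expression helper
from ibis.expr.types import Table


def preprocess_raw(raw: Table) -> Table:
    # Hypothetical cleanup: drop rows with a null amount and de-duplicate.
    # Because this is a lazy Ibis expression, it runs on whichever backend
    # the dataset is connected to (pandas, DuckDB, PySpark, ...).
    return raw.filter(_.amount.notnull()).distinct()
```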
I jump into an ipython session with `kedro ipython`, run `catalog.load("preprocessed")`, and get the error `TypeError: BasePandasBackend.do_connect() got an unexpected keyword argument 'database'`, which is coming from Ibis. After looking at the backend setup, I see `database` isn't a valid argument. I removed `database` and reran, and got the error `DatasetError: Failed while saving data to data set... Unable to convert <class 'ibis.expr.types.relations.Table'> object to backend type: <class 'pandas.core.frame.DataFrame'>`. I didn't exactly expect this to work, but I wasn't sure...
```yaml
preprocessed:
  type: ibis.TableDataset
  table_name: preprocessed
  connection: 
    backend: pandas
```
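(A side note on that first `TypeError`: in Ibis versions that still ship the pandas backend, `connect()` takes a dictionary mapping table names to DataFrames, not a `database` path, which is why the argument was rejected. Roughly, as a sketch:)

```python
import ibis
import pandas as pd

# The pandas backend wraps an in-memory mapping of DataFrames;
# there is no database file involved, hence no `database` argument.
con = ibis.pandas.connect({"raw": pd.DataFrame({"a": [1, 2, 3]})})
expr = con.table("raw")
```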
Then I tried removing `table_name` as well and got the obvious error that I need a table name or a filepath: `DatasetError: Must provide at least one of 'filepath' or 'table_name'.` No doubt 😂
```yaml
preprocessed:
  type: ibis.TableDataset
  connection: 
    backend: pandas
```
Then I tried adding a filepath and got the error `DatasetError: Must provide 'table_name' for materialization.`, which I can see in TableDataset's `_write` method.
```yaml
preprocessed:
  type: ibis.TableDataset
  filepath: data/02_preprocessed/preprocessed.csv
  connection: 
    backend: pandas
```
👍 1
**Deepyaman Datta:**
In implementing the initial `ibis.TableDataset`, I didn't implement output to files, but there's nothing stopping us from adding this functionality. Ibis itself already supports it, as you point out, and I think it would be natural. I was actually thinking about this while I was out somewhere yesterday, so I'm glad you bring it up. 😅 I envision it should be part of a single dataset. I think we just need to be careful about how different combinations of arguments work. Current behavior:
- Table name and filepath specified
  - Load: from filepath, create named Ibis table in memory
  - Save: write to table with given name
- Table name specified, no filepath
  - Load: from existing table (resulting in named Ibis table in memory)
  - Save: write to table with given name

So, how would we specify something being written to a file? As I was writing this out, I crossed out my initial thought; maybe it makes sense to have `ibis.FileDataset` and `ibis.TableDataset`? If you just wanted to load something from a file and write it back to a database table, I guess you could have the node load from an `ibis.FileDataset` and write to an `ibis.TableDataset`. This could make the use of filepaths much more explicit/obvious.
(Sorry I rambled a bit there, but would love your thoughts!)
There's also a side point around your use of the pandas backend. Is there a reason you specifically want the pandas backend, or would it be fine if you could use the DuckDB backend and write to a file, or the Polars backend and write to a file? https://github.com/ibis-project/ibis/pull/9772#issuecomment-2298784456 proposes to remove the pandas (and Dask) backends from Ibis. FYI @Cody Peterson
**Mark Druffel:**
Thanks for the reply @Deepyaman Datta! I agree on the complexity; I've been going back and forth in my head on whether it should be two dataset types or one, so it feels very validating that you've arrived in a similar place 🙂 Regarding why pandas: this issue led me to try something other than DuckDB. My CSV files have random rows that start with `#`, and that was causing problems for DuckDB...
👍 1
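(For reference, pandas can skip such lines at parse time via `read_csv`'s `comment` parameter, which is one way to handle malformed rows like these; the filepath is taken from the catalog entry above:)

```python
import pandas as pd

# Lines beginning with '#' are dropped during parsing.
df = pd.read_csv("data/01_raw/raw.csv", comment="#")
```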
Also, just looking through the PR discussion: that all makes a lot of sense, but what about using UDFs in PySpark? I'm not a pandas person, so I don't know how all the pieces fit together too well, but I know my team sometimes passes pandas UDFs to our Spark session. Would Ibis still support that?
**Deepyaman Datta:**
By the way, just thinking a bit more, I think the other option is to keep it as one `TableDataset`, rename the `file_format` argument to `format`, and let that also be `table`... or some variation on that (another possibility is to require the `materialized` keyword to be explicitly given if not writing to a file). I think having two datasets might be clearer, though, and more in line with how most Kedro datasets currently look. The other solutions basically add that branching to the dataset implementation.

> Also, just looking through the PR discussion: that all makes a lot of sense, but what about using UDFs in PySpark? I'm not a pandas person, so I don't know how all the pieces fit together too well, but I know my team sometimes passes pandas UDFs to our Spark session. Would Ibis still support that?
Yes, pandas UDFs would still work, on backends that support pandas UDFs; we just wouldn't support pandas as the primary execution backend.
❤️ 1
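(For illustration, this is the PySpark mechanism being asked about; the UDF executes on pandas batches inside Spark executors, so it doesn't depend on pandas being an Ibis execution backend:)

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf


@pandas_udf("double")
def times_two(s: pd.Series) -> pd.Series:
    # Runs inside Spark, independent of which library (Ibis, raw PySpark, ...)
    # constructed the surrounding query.
    return s * 2
```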
**Mark Druffel:**
Yea, in the version we wrote before this was released, I actually did `format` like that. I thought `materialized` was confusing, since it just stands in for `create_` (as in `create_table`/`create_view`), so it was hard to see what it did without reading and understanding the TableDataset code. I'll definitely give this some more thought as we continue using it.
**Deepyaman Datta:**
@Mark Druffel this doesn't solve for being able to write to a file (I'll try to put together a PR for that soon), but re the use of the pandas backend: would the Polars backend suffice? It would also be useful if you want to share more on your use case for reading the CSV with comments, either on https://github.com/ibis-project/ibis/pull/9772 or with me directly; we can try to bring it up with the DuckDB folks, since https://github.com/duckdb/duckdb/issues/7896 doesn't have a very satisfying resolution for a valid question.
**Mark Druffel:**
Hey @Deepyaman Datta, I'm sorry I totally missed your last comment. I can definitely share. I don't have my repo on GitHub yet; it's a personal pipeline for all my bank accounts, so it's going to take me a few minutes to make sure my .gitignore is up to snuff, but I'll share the link as soon as I get it published. Regarding Polars, I definitely would have gone with that over pandas, but my personal computer is a really old custom rig with an old LTS CPU. Polars has a package for that (https://pypi.org/project/polars-lts-cpu/), but when I tried to install it a while back I kept having dependency issues. Always something 🤦‍♂️
🙌 1
After coming back to this, transcoding works perfectly for my current use case, because I'm just trying to use pandas to do some pre-processing but want all my data in DuckDB. I still think there are probably plenty of use cases where I'd want to write to Parquet or CSV from Ibis, but this project isn't one of them.
```yaml
raw:
  type: pandas.CSVDataset
  filepath: data/01_raw/raw.csv

preprocessed@pandas:
  type: pandas.CSVDataset
  filepath: data/02_preprocessed/preprocessed.csv

preprocessed@ibis:
  type: ibis.TableDataset
  filepath: data/02_preprocessed/preprocessed.csv
  file_format: csv
  table_name: preprocessed
  connection: 
    backend: duckdb
    database: test.db
  save_args:
    materialized: table

standardized:
  type: ibis.TableDataset
  table_name: standardized
  connection: 
    backend: duckdb
    database: test.db
  save_args:
    materialized: table
```
**Ricardo Quiroz:**
Thanks for the great discussion @Mark Druffel @Deepyaman Datta 🙂 I've faced a similar issue attempting to save an Ibis table (PySpark backend) to a partitioned Parquet folder, but the Ibis TableDataset only supports saving with either `create_table` or `create_view`. I fixed it by modifying the `_save` method to use PySpark's `write.save()` method. This is tailored to my issue, but I'd love to have the `to_*` capability in `ibis.TableDataset`.
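(A rough sketch of that kind of override, for illustration only. The conversion from an Ibis expression to a Spark DataFrame is the uncertain part: `connection.to_pyspark(...)`, the partition columns, and the output path below are all assumptions, not confirmed APIs of the dataset or of Ibis:)

```python
def _save(self, data) -> None:
    # Assumption: obtain a Spark DataFrame from the Ibis expression; the
    # exact conversion call depends on the Ibis version in use.
    spark_df = self.connection.to_pyspark(data)
    (
        spark_df.write.format("parquet")
        .partitionBy("year", "month")  # hypothetical partition columns
        .mode("overwrite")
        .save("data/03_primary/standardized")  # hypothetical output path
    )
```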
**Deepyaman Datta:**
Thanks for the feedback! I'm OOO this week, but this is one of the things at the top of my list to address once I return. 🙂
💙 1
Sorry for the delay @Mark Druffel @Ricardo Quiroz, but I've got a PR up for an initial `FileDataset` implementation: https://github.com/kedro-org/kedro-plugins/pull/842 (I've also listed a few other action items on that PR that I'll try to address later today, but thought I'd share this early before I get distracted by other things)
🙌 3
I've gone ahead and completed most of the open items. I'll try to get a better implementation of versioning in later this week. 🤞 I think we are looking to cut a release of Kedro-Datasets in the next 2-3 weeks (I could be wrong 😅 but I think that's what I heard), in case any of you would like to test it out before then.
For whoever's interested, the initial version of `FileDataset` is out in Kedro-Datasets 5.1.0 (released just now). I'll probably make the official announcement at the beginning of next week, but wanted to let you all know in case you want to check it out before then!
❤️ 2
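(For anyone who wants a quick smoke test before the announcement, a minimal sketch of using the new dataset directly; the argument names follow the existing `TableDataset` conventions, so double-check them against the Kedro-Datasets 5.1.0 docs:)

```python
from kedro_datasets.ibis import FileDataset

dataset = FileDataset(
    filepath="data/01_raw/raw.csv",
    file_format="csv",
    connection={"backend": "duckdb"},  # any backend with read_*/to_* support
)
table = dataset.load()  # an Ibis table expression
dataset.save(table)     # written back out through the backend
```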