Mark Druffel
08/22/2024, 9:31 PM
to_ methods (e.g. to_csv, to_delta) to the ibis.TableDataset? Or perhaps there should be a different ibis Dataset?
Details
I'm trying to pre-process some badly formed csv files in my pipeline. I know I can use a pandas node separately, but I prefer the ibis api so I tried to use TableDataset. I have the following data catalog entries:
raw:
  type: ibis.TableDataset
  filepath: data/01_raw/raw.csv
  file_format: csv
  connection: 
    backend: pandas
  load_args:
    sep: ","
preprocessed:
  type: ibis.TableDataset
  table_name: preprocessed
  connection: 
    backend: pandas
    database: test.db
  save_args:
    materialized: table
standardized:
  type: ibis.TableDataset
  table_name: standardized
  file_format: csv
  connection: 
    backend: duckdb
    database: finance.db
  save_args:
    materialized: table
The pipeline code looks like this:
def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess_raw,
                inputs="raw",
                outputs="preprocessed",
                name="preprocess"
            ),
            node(
                func=standardize,
                inputs="preprocessed",
                outputs="standardized",
                name="standardize"
            ),
        ]
    )
I jumped into an IPython session with `kedro ipython`, ran `catalog.load("preprocessed")`, and got the error `TypeError: BasePandasBackend.do_connect() got an unexpected keyword argument 'database'`, which comes from Ibis. After looking at the backend setup, I saw that database isn't a valid argument for the pandas backend.
I removed database, reran, and got the error `DatasetError: Failed while saving data to data set... Unable to convert <class 'ibis.expr.types.relations.Table'> object to backend type: <class 'pandas.core.frame.DataFrame'>`. I didn't exactly expect this to work, but I wasn't sure...
preprocessed:
  type: ibis.TableDataset
  table_name: preprocessed
  connection: 
    backend: pandas
Then I tried removing table_name as well and got the obvious error that I need a table_name or a filepath. `DatasetError: Must provide at least one of filepath or table_name.` No doubt 😂
preprocessed:
  type: ibis.TableDataset
  connection: 
    backend: pandas
Then I tried adding a filepath and got the error `DatasetError: Must provide table_name for materialization.`, which I can see in TableDataset's _write method.
preprocessed:
  type: ibis.TableDataset
  filepath: data/02_preprocessed/preprocessed.csv
  connection: 
    backend: pandas
Deepyaman Datta
08/22/2024, 9:51 PM
ibis.TableDataset, I didn't implement output to files, but there's nothing stopping us from adding this functionality. Ibis itself already supports it, as you point out, and I think it would be natural. I was actually thinking about this while I was out somewhere yesterday, so I'm glad you bring it up. 😅
ibis.FileDataset and ibis.TableDataset? If you just wanted to load something from a file and write it back to a database table, I guess you could have the node load from an ibis.FileDataset and write to an ibis.TableDataset. This could make the use of filepaths much more explicit/obvious.
Deepyaman Datta
08/22/2024, 9:51 PM
Deepyaman Datta
08/22/2024, 9:53 PM
Mark Druffel
08/22/2024, 11:08 PM
# and it was causing problems for duckdb...
Mark Druffel
08/22/2024, 11:14 PM
Deepyaman Datta
08/22/2024, 11:17 PM
TableDataset, rename the file_format argument to format, and let that also be table... or some variation on that (another possibility is to require the materialized keyword to be explicitly given if not writing to a file).
I think having two datasets might be more clear, though, and in line with how most Kedro datasets currently look. The other solutions look like they basically add that branching to the dataset implementation.
Deepyaman Datta
08/22/2024, 11:18 PM
Also, just looking through the PR discussion. That all makes a lot of sense, but what about using UDFs in pyspark? I'm not a pandas person so I don't know how all the pieces fit together too well, but I know my team sometimes passes pandas UDFs to our spark session. Would ibis still support that?
Yes, pandas UDFs would still work, on backends that support pandas UDFs; we just wouldn't support pandas as the primary execution backend.
Mark Druffel
08/22/2024, 11:18 PM
create_ so it was hard to see what it did without reading and understanding the TableDataset code. I'll definitely give this some more thought as we continue using it
Deepyaman Datta
08/23/2024, 7:08 PM
Mark Druffel
08/27/2024, 6:17 PM
Mark Druffel
08/29/2024, 5:20 AM
raw:
  type: pandas.CSVDataset
  filepath: data/01_raw/raw.csv
preprocessed@pandas:
  type: pandas.CSVDataset
  filepath: data/02_preprocessed/preprocessed.csv
preprocessed@ibis:
  type: ibis.TableDataset
  filepath: data/02_preprocessed/preprocessed.csv
  file_format: csv
  table_name: preprocessed
  connection: 
    backend: duckdb
    database: test.db
  save_args:
    materialized: table
standardized:
  type: ibis.TableDataset
  table_name: standardized
  connection: 
    backend: duckdb
    database: test.db
  save_args:
    materialized: table
Ricardo Quiroz
09/04/2024, 4:07 PM
create_table or create_view.
I fixed it by modifying the _save method to use the pyspark write.save() method. This is tailored to my issue, but I'd love to have the to_ capability in ibis.TableDataset.
Deepyaman Datta
09/05/2024, 11:49 AM
Deepyaman Datta
09/20/2024, 3:30 PM
FileDataset implementation: https://github.com/kedro-org/kedro-plugins/pull/842
(I've also listed a few other action items on that PR that I'll try to address later today, but thought I'd share this early before I get distracted by other things)
Deepyaman Datta
09/24/2024, 5:06 AM
Deepyaman Datta
10/18/2024, 3:27 PM
FileDataset is out in Kedro-Datasets 5.1.0 (released just now). Will probably make the official announcement beginning of next week, but wanted to let you all know in case you want to check it out before then!
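For anyone landing here from the original question, here is a rough sketch of how the catalog from the top of the thread might look with the new dataset. The keys on the ibis.FileDataset entries (filepath, file_format, connection, load_args) are assumed to mirror the ibis.TableDataset entries shown earlier, and the duckdb connection is reused from the standardized entry; treat it as an illustration and check the Kedro-Datasets 5.1.0 docs before relying on it.

```yaml
# Sketch only: assumes ibis.FileDataset accepts the same filepath, file_format,
# connection, and load_args keys as the ibis.TableDataset entries in this thread.
raw:
  type: ibis.FileDataset
  filepath: data/01_raw/raw.csv
  file_format: csv
  connection:
    backend: duckdb
    database: finance.db
  load_args:
    sep: ","

# File output, which was the piece TableDataset did not cover.
preprocessed:
  type: ibis.FileDataset
  filepath: data/02_preprocessed/preprocessed.csv
  file_format: csv
  connection:
    backend: duckdb
    database: finance.db

# Database output stays on TableDataset, unchanged from the original entry
# apart from dropping file_format, which has no filepath to apply to.
standardized:
  type: ibis.TableDataset
  table_name: standardized
  connection:
    backend: duckdb
    database: finance.db
  save_args:
    materialized: table
```

Splitting the entries this way keeps file paths on one dataset type and table names on the other, which is the explicitness discussed above.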