Mark Druffel
08/22/2024, 9:31 PM
[...] to_ methods (e.g. to_csv, to_delta, etc.) to the ibis.TableDataset? Or perhaps there should be a different ibis Dataset?
Details
I'm trying to pre-process some badly formed csv files in my pipeline. I know I can use a pandas node separately, but I prefer the ibis api so I tried to use TableDataset. I have the following data catalog entries:
raw:
  type: ibis.TableDataset
  filepath: data/01_raw/raw.csv
  file_format: csv
  connection:
    backend: pandas
  load_args:
    sep: ","
preprocessed:
  type: ibis.TableDataset
  table_name: preprocessed
  connection:
    backend: pandas
    database: test.db
  save_args:
    materialized: table
standardized:
  type: ibis.TableDataset
  table_name: standardized
  file_format: csv
  connection:
    backend: duckdb
    database: finance.db
  save_args:
    materialized: table
The pipeline code looks like this:
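from kedro.pipeline import Pipeline, node, pipeline

# preprocess_raw and standardize are the node functions defined elsewhere in the project.


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess_raw,
                inputs="raw",
                outputs="preprocessed",
                name="preprocess",
            ),
            node(
                func=standardize,
                inputs="preprocessed",
                outputs="standardized",
                name="standardize",
            ),
        ]
    )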
I jump into an ipython session with `kedro ipython` and run `catalog.load("preprocessed")` and get the error `TypeError: BasePandasBackend.do_connect() got an unexpected keyword argument 'database'`, which is coming from Ibis. After looking at the backend setup, I see database isn't a valid argument.
I removed database, reran, and got the error `DatasetError: Failed while saving data to data set... Unable to convert <class 'ibis.expr.types.relations.Table'> object to backend type: <class 'pandas.core.frame.DataFrame'>`. I didn't exactly expect this to work, but I wasn't sure...
preprocessed:
  type: ibis.TableDataset
  table_name: preprocessed
  connection:
    backend: pandas
Then I tried removing table_name as well and got the obvious error that I need a table_name or a filepath: `DatasetError: Must provide at least one of filepath or table_name.` No doubt 😂
preprocessed:
  type: ibis.TableDataset
  connection:
    backend: pandas
Then I tried adding a filepath and got the error `DatasetError: Must provide table_name for materialization.`, which I can see in TableDataset's _write method.
preprocessed:
  type: ibis.TableDataset
  filepath: data/02_preprocessed/preprocessed.csv
  connection:
    backend: pandas
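The last two errors quoted above suggest a simple rule: loading needs a filepath or a table_name, while saving always needs a table_name because it materializes a table or view. Below is a minimal sketch of that rule, assuming only what the error messages say. The function `check_table_dataset_config`, its `saving` flag, and the local `DatasetError` class are hypothetical stand-ins for illustration, not the actual kedro-datasets code.

```python
from __future__ import annotations


class DatasetError(Exception):
    """Local stand-in for Kedro's DatasetError so the sketch has no dependencies."""


def check_table_dataset_config(
    filepath: str | None = None,
    table_name: str | None = None,
    saving: bool = False,
) -> str:
    """Mimic the checks described in the thread (illustrative only).

    Returns "file" or "table" to indicate where the data would be read from
    or written to.
    """
    if filepath is None and table_name is None:
        raise DatasetError("Must provide at least one of filepath or table_name.")
    if saving:
        # Saving materializes a table or view, so a file path alone is not enough.
        if table_name is None:
            raise DatasetError("Must provide table_name for materialization.")
        return "table"
    # Loading prefers the file when one is configured, otherwise the table.
    return "file" if filepath is not None else "table"
```

Under this sketch, the final catalog entry above (filepath only) can be loaded but not saved, which matches the error reported.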
Deepyaman Datta
08/22/2024, 9:51 PM
[...] ibis.TableDataset, I didn't implement output to files, but there's nothing stopping us from adding this functionality. Ibis itself already supports it, as you point out, and I think it would be natural. I was actually thinking about this while I was out somewhere yesterday, so I'm glad you bring it up. 😅
[...] ibis.FileDataset and ibis.TableDataset? If you just wanted to load something from a file and write it back to a database table, I guess you could have the node load from an ibis.FileDataset and write to an ibis.TableDataset. This could make the use of filepaths much more explicit/obvious.
Deepyaman Datta
08/22/2024, 9:51 PM
Deepyaman Datta
08/22/2024, 9:53 PM
Mark Druffel
08/22/2024, 11:08 PM
[...] # and it was causing problems for duckdb...
Mark Druffel
08/22/2024, 11:14 PM
Deepyaman Datta
08/22/2024, 11:17 PM
[...] TableDataset, rename the file_format argument to format, and let that also be table... or some variation on that (another possibility is to require the materialized keyword to be explicitly given if not writing to a file).
I think having two datasets might be more clear, though, and in line with how most Kedro datasets currently look. The other solutions look like they basically add that branching to the dataset implementation.
Deepyaman Datta
08/22/2024, 11:18 PM
> Also, just looking through the PR discussion. That all makes a lot of sense, but what about using UDFs in pyspark? I'm not a pandas person so I don't know how all the pieces fit together too well, but I know my team sometimes passes pandas UDFs to our spark session. Would ibis still support that?
Yes, pandas UDFs would still work, on backends that support pandas UDFs; we just wouldn't support pandas as the primary execution backend.
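The two-dataset idea discussed in this thread (one dataset for files, one for database tables, with no branching inside either) can be illustrated with a toy sketch. This is not the kedro-datasets or Ibis implementation: `FileDatasetSketch` and `TableDatasetSketch` are hypothetical names, and the standard-library csv and sqlite3 modules stand in for Ibis backends purely so the example runs on its own.

```python
from __future__ import annotations

import csv
import sqlite3
from pathlib import Path


class FileDatasetSketch:
    """Hypothetical file-backed dataset: loads and saves rows as a CSV file."""

    def __init__(self, filepath: str) -> None:
        self._path = Path(filepath)

    def load(self) -> list[dict[str, str]]:
        with self._path.open(newline="") as f:
            return list(csv.DictReader(f))

    def save(self, rows: list[dict[str, str]]) -> None:
        # Assumes at least one row; column names come from the first row.
        with self._path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)


class TableDatasetSketch:
    """Hypothetical table-backed dataset: loads and saves rows as a database table."""

    def __init__(self, database: str, table_name: str) -> None:
        self._database = database
        self._table = table_name

    def load(self) -> list[dict[str, str]]:
        con = sqlite3.connect(self._database)
        try:
            con.row_factory = sqlite3.Row
            cursor = con.execute(f'SELECT * FROM "{self._table}"')
            return [dict(row) for row in cursor]
        finally:
            con.close()

    def save(self, rows: list[dict[str, str]]) -> None:
        # Assumes at least one row; replaces any existing table of the same name.
        columns = list(rows[0])
        column_sql = ", ".join(f'"{c}"' for c in columns)
        placeholders = ", ".join("?" for _ in columns)
        con = sqlite3.connect(self._database)
        try:
            with con:  # commits on success, rolls back on error
                con.execute(f'DROP TABLE IF EXISTS "{self._table}"')
                con.execute(f'CREATE TABLE "{self._table}" ({column_sql})')
                con.executemany(
                    f'INSERT INTO "{self._table}" VALUES ({placeholders})',
                    [tuple(row[c] for c in columns) for row in rows],
                )
        finally:
            con.close()
```

A node could then load from the file dataset and save to the table dataset, so the catalog entry itself states whether a path or a table name is expected.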
Mark Druffel
08/22/2024, 11:18 PM
[...] create_ so it was hard to see what it did without reading and understanding the TableDataset code. I'll definitely give this some more thought as we continue using it
Deepyaman Datta
08/23/2024, 7:08 PM
Mark Druffel
08/27/2024, 6:17 PM
Mark Druffel
08/29/2024, 5:20 AM
raw:
  type: pandas.CSVDataset
  filepath: data/01_raw/raw.csv
preprocessed@pandas:
  type: pandas.CSVDataset
  filepath: data/02_preprocessed/preprocessed.csv
preprocessed@ibis:
  type: ibis.TableDataset
  filepath: data/02_preprocessed/preprocessed.csv
  file_format: csv
  table_name: preprocessed
  connection:
    backend: duckdb
    database: test.db
  save_args:
    materialized: table
standardized:
  type: ibis.TableDataset
  table_name: standardized
  connection:
    backend: duckdb
    database: test.db
  save_args:
    materialized: table
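The catalog above relies on Kedro's transcoding convention: `preprocessed@pandas` and `preprocessed@ibis` point at the same CSV file, and Kedro treats them as one dataset when ordering nodes, so a node that outputs `preprocessed@pandas` runs before one that takes `preprocessed@ibis` as input. A rough sketch of that name matching, assuming only the `name@format` convention; `base_name` and `depends_on` are hypothetical helpers, not Kedro functions.

```python
from __future__ import annotations


def base_name(dataset_name: str) -> str:
    """Drop the '@format' suffix, e.g. 'preprocessed@ibis' -> 'preprocessed'."""
    return dataset_name.split("@", 1)[0]


def depends_on(consumer_inputs: list[str], producer_outputs: list[str]) -> bool:
    """Return True if any input matches an output once suffixes are dropped."""
    produced = {base_name(name) for name in producer_outputs}
    return any(base_name(name) in produced for name in consumer_inputs)
```

With these helpers, `depends_on(["preprocessed@ibis"], ["preprocessed@pandas"])` is true, which is why the pandas-writing node and the ibis-reading node end up chained.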
Ricardo Quiroz
09/04/2024, 4:07 PM
[...] create_table or create_view.
I fixed it by modifying the _save method to use pyspark's write.save() method. This is tailored to my issue, but I'd love to have the to_ capability in ibis.TableDataset.
Deepyaman Datta
09/05/2024, 11:49 AM
Deepyaman Datta
09/20/2024, 3:30 PM
[...] FileDataset implementation: https://github.com/kedro-org/kedro-plugins/pull/842
(I've also listed a few other action items on that PR that I'll try to address later today, but thought I'd share this early before I get distracted by other things)
Deepyaman Datta
09/24/2024, 5:06 AM
Deepyaman Datta
10/18/2024, 3:27 PM
[...] FileDataset is out in Kedro-Datasets 5.1.0 (released just now). Will probably make the official announcement beginning of next week, but wanted to let you all know in case you want to check it out before then!