# questions
t
Hey guys, quick question: is there a way to enforce the schema/data types in the catalog? Like:
cars:
  type: pandas.CSVDataset
  filepath: data/01_raw/company/cars.csv
  load_args:
    sep: ','
  save_args:
    index: False
    date_format: '%Y-%m-%d %H:%M'
    decimal: .
    schema:
t
so that I can specify the columns that I want to be of a certain data type
m
When reading, you can use the dtype arg of pandas' read_csv method in the load_args field. Is that what you are looking for? So:
cars:
  type: pandas.CSVDataset
  filepath: data/01_raw/company/cars.csv
  load_args:
    sep: ','
    dtype: {}
  save_args:
    index: False
    date_format: '%Y-%m-%d %H:%M'
    decimal: .
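For instance, since load_args are just forwarded to read_csv, the equivalent plain-pandas call would be something like this (column names are made up, as I don't know your schema):
import pandas as pd

# "make" and "price" are hypothetical column names; the dtype mapping
# forces those columns to the given types at load time
df = pd.read_csv(
    "data/01_raw/company/cars.csv",
    sep=",",
    dtype={"make": "string", "price": "float64"},
)
print(df.dtypes)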
t
not on loading but when saving it
m
Unfortunately no… pandas does not have such an option, as it infers the types from df.dtypes when writing to CSV. You can always cast to the correct dtype from within the node
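For example, a minimal node sketch (column names are hypothetical) that casts before the data gets saved:
import pandas as pd

def enforce_dtypes(cars: pd.DataFrame) -> pd.DataFrame:
    # cast inside the node so the DataFrame already carries the right dtypes
    # before Kedro passes it to the dataset for saving; column names are hypothetical
    return cars.astype({"make": "string", "price": "float64"})
then wire that node into the pipeline right before the dataset you save to.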
t
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_gbq.html
table_schema : list of dicts, optional
List of BigQuery table fields to which according DataFrame columns conform to, e.g. [{'name': 'col1', 'type': 'STRING'},...]. If schema is not provided, it will be generated according to dtypes of DataFrame columns. See BigQuery API documentation on available names of a field. New in version 0.3.1 of pandas-gbq.
so it seems that it is possible
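e.g. the raw pandas-gbq call that page describes would look something like this (project, table and column names are placeholders):
import pandas as pd

df = pd.DataFrame({"col1": ["a", "b"], "col2": [1.0, 2.0]})  # dummy data

# table_schema pins the BigQuery column types instead of letting them be
# inferred from df.dtypes; all names below are made up
df.to_gbq(
    "my_dataset.cars",
    project_id="my-gcp-project",
    if_exists="replace",
    table_schema=[
        {"name": "col1", "type": "STRING"},
        {"name": "col2", "type": "FLOAT"},
    ],
)
and presumably the same table_schema list could go under save_args of the GBQTableDataSet catalog entry, the same way save_args map onto to_csv above.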
n
to be clear, it's not a pandas issue. CSV does not store types because it is plain text, which is why it's known to be a bad format for any data processing pipeline. The reason you can do that with BigQuery is that it is a typed system that stores your dataframe as a table. If you want to preserve types, use something like Parquet
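e.g. a quick sketch of what Parquet keeps and CSV throws away (made-up data; needs pyarrow or fastparquet installed):
import pandas as pd

df = pd.DataFrame({
    "make": pd.Series(["toyota", "bmw"], dtype="string"),
    "price": pd.Series([20000.0, 45000.0], dtype="float64"),
})
df.to_parquet("cars.parquet")                  # dtypes are stored inside the file
print(pd.read_parquet("cars.parquet").dtypes)  # comes back typed, no dtype args needed
in the catalog that would be something like type: pandas.ParquetDataset instead of pandas.CSVDataset.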
t
but for some reason...
so there's no way to feed a BQ table with specific data types for the columns using type: pandas.GBQTableDataSet?
n
You can definitely do that with BigQuery; I said you cannot save CSV with types, as that was the original example you provided.
t
oh yeah, my bad, I'm using type: pandas.GBQTableDataSet but for some reason I got:
kedro.io.core.DatasetError: Failed while saving data to data set GBQTableDataset
Could not convert DataFrame to Parquet.
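For what it's worth, that last error typically comes from pandas-gbq serialising the DataFrame to Parquet (via pyarrow) before uploading, and mixed-type object columns are the usual culprit, so one thing worth checking inside the node is a quick dtype inspection and explicit cast (column names are hypothetical):
import pandas as pd

def coerce_for_gbq(df: pd.DataFrame) -> pd.DataFrame:
    # print dtypes to spot object columns, then cast them explicitly so that
    # pyarrow can build a Parquet table; column names are hypothetical
    print(df.dtypes)
    df = df.copy()
    df["price"] = pd.to_numeric(df["price"], errors="coerce")
    df["make"] = df["make"].astype("string")
    return df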