# questions
t
Hey guys, quick question: is there a way to enforce the schema/data types in the catalog? Like:
cars:
  type: pandas.CSVDataset
  filepath: data/01_raw/company/cars.csv
  load_args:
    sep: ','
  save_args:
    index: False
    date_format: '%Y-%m-%d %H:%M'
    decimal: .
    schema:
t
so that I can specify the columns that I want to be of a certain data type
m
When reading, you can use the dtype arg of pandas' read_csv method in the load_args field. Is that what you are looking for? So:
cars:
  type: pandas.CSVDataset
  filepath: data/01_raw/company/cars.csv
  load_args:
    sep: ','
    dtype: {}
  save_args:
    index: False
    date_format: '%Y-%m-%d %H:%M'
    decimal: .
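For instance, since load_args are just forwarded to read_csv, the equivalent plain-pandas call would be something like this (column names are made up, as I don't know your schema):
import pandas as pd

# "make" and "price" are hypothetical column names; the dtype mapping
# forces those columns to the given types at load time
df = pd.read_csv(
    "data/01_raw/company/cars.csv",
    sep=",",
    dtype={"make": "string", "price": "float64"},
)
print(df.dtypes)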
t
not on loading but when saving it
m
Unfortunately no… pandas does not have such an option, as it infers the types from df.dtypes when writing to CSV. You can always cast to the correct dtype from within the node
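For example, a minimal node sketch (column names are hypothetical) that casts before the data gets saved:
import pandas as pd

def enforce_dtypes(cars: pd.DataFrame) -> pd.DataFrame:
    # cast inside the node so the DataFrame already carries the right dtypes
    # before Kedro passes it to the dataset for saving; column names are hypothetical
    return cars.astype({"make": "string", "price": "float64"})
then wire that node into the pipeline right before the dataset you save to.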
t
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_gbq.html
table_schema : list of dicts, optional
List of BigQuery table fields to which according DataFrame columns conform to, e.g. [{'name': 'col1', 'type': 'STRING'},...]. If schema is not provided, it will be generated according to dtypes of DataFrame columns. See BigQuery API documentation on available names of a field. New in version 0.3.1 of pandas-gbq.
so it seems that it is possible
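e.g. the raw pandas-gbq call that page describes would look something like this (project, table and column names are placeholders):
import pandas as pd

df = pd.DataFrame({"col1": ["a", "b"], "col2": [1.0, 2.0]})  # dummy data

# table_schema pins the BigQuery column types instead of letting them be
# inferred from df.dtypes; all names below are made up
df.to_gbq(
    "my_dataset.cars",
    project_id="my-gcp-project",
    if_exists="replace",
    table_schema=[
        {"name": "col1", "type": "STRING"},
        {"name": "col2", "type": "FLOAT"},
    ],
)
and presumably the same table_schema list could go under save_args of the GBQTableDataSet catalog entry, the same way save_args map onto to_csv above.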
n
to be clear, it's not a pandas issue. CSV does not store types because it is plain text, which is why it's known to be a bad format for any data processing pipeline. The reason you can do that with BigQuery is that it is a typed system that stores your dataframe as a table. If you want to preserve types, use something like Parquet
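e.g. a quick sketch of what Parquet keeps and CSV throws away (made-up data; needs pyarrow or fastparquet installed):
import pandas as pd

df = pd.DataFrame({
    "make": pd.Series(["toyota", "bmw"], dtype="string"),
    "price": pd.Series([20000.0, 45000.0], dtype="float64"),
})
df.to_parquet("cars.parquet")                  # dtypes are stored inside the file
print(pd.read_parquet("cars.parquet").dtypes)  # comes back typed, no dtype args needed
in the catalog that would be something like type: pandas.ParquetDataset instead of pandas.CSVDataset.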
t
but for some reason...
so there's no way to feed a BQ table with specific data types for the columns using type: pandas.GBQTableDataSet?
n
You can definitely do that with BigQuery; I said you cannot save CSV with types, as that was the original example you provided.
t
oh yeah, my bad, I'm using type: pandas.GBQTableDataSet but for some reason I got:
kedro.io.core.DatasetError: Failed while saving data to data set GBQTableDataset
Could not convert DataFrame to Parquet.
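For what it's worth, that last error typically comes from pandas-gbq serialising the DataFrame to Parquet (via pyarrow) before uploading, and mixed-type object columns are the usual culprit, so one thing worth checking inside the node is a quick dtype inspection and explicit cast (column names are hypothetical):
import pandas as pd

def coerce_for_gbq(df: pd.DataFrame) -> pd.DataFrame:
    # print dtypes to spot object columns, then cast them explicitly so that
    # pyarrow can build a Parquet table; column names are hypothetical
    print(df.dtypes)
    df = df.copy()
    df["price"] = pd.to_numeric(df["price"], errors="coerce")
    df["make"] = df["make"].astype("string")
    return df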