Marc Gris
11/21/2023, 10:28 AM
pd.read_csv() has an nrows param that takes an int as argument but is set by default to None, in which case the full dataset is loaded.
Therefore, I thought I could leverage that to give myself the ability to experiment / iterate quickly by downsampling my datasets at runtime with something like:
"raw_{table}":
  type: pandas.CSVDataset
  filepath: data/01_raw/tables/raw_{table}.csv
  load_args:
    nrows: "${runtime_params:nrows, None}"
While this works perfectly with kedro run --params nrows=100, when leaving the param unspecified, I end up with
DatasetError: Failed while loading data from data set
CSVDataset(filepath=/Users/marc/DODOBIRD/DODO_CODE/kedro-etl/data/01_raw/tables/raw_users.csv, load_args={'nrows': None}, protocol=file, save_args={'index': False}).
'nrows' must be an integer >=0
and yet running pd.read_csv("data.csv", nrows=None) simply returns the full dataset as expected.
Is this a bug? Or am I missing something / doing something wrong?
Thanks for your input,
M.
Marc Gris
11/21/2023, 10:31 AM
Iñigo Hidalgo
11/21/2023, 10:34 AM
Ankita Katiyar
11/21/2023, 10:36 AM
nrows: "${runtime_params:nrows, null}"
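A rough sketch in plain Python of why the two spellings could behave differently. It assumes, without confirmation from this thread, that a bare None in the resolver default reaches pandas as the string "None", while null becomes a real Python None. validate_nrows is a hypothetical stand-in for the kind of integer check the error message suggests; it is not pandas' actual code.

```python
from typing import Optional, Union


def validate_nrows(value: Union[int, str, None]) -> Optional[int]:
    """Hypothetical stand-in for an integer check on nrows (not pandas' real code)."""
    if value is None:
        return None  # None means "load every row"
    # bool is a subclass of int, so reject it explicitly
    if isinstance(value, bool) or not isinstance(value, int) or value < 0:
        raise ValueError("'nrows' must be an integer >=0")
    return value


# Assumption: `null` in the config ends up as a real Python None ...
from_yaml_null = None
# ... while a bare `None` ends up as a plain string.
from_yaml_none = "None"

for label, value in [("null", from_yaml_null), ("None", from_yaml_none)]:
    try:
        validate_nrows(value)
        outcome = "accepted"
    except ValueError as exc:
        outcome = f"rejected: {exc}"
    # str() prints both values the same way; repr() shows the difference
    print(f"{label}: str={value!s} repr={value!r} -> {outcome}")
```

If that assumption holds, it would also explain why the error message shows load_args={'nrows': None}: a string "None" rendered with str() looks exactly like a real None.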
Ankita Katiyar
11/21/2023, 10:39 AM
null should work instead of None, which might be getting treated as a string
Iñigo Hidalgo
11/21/2023, 10:41 AM
Marc Gris
11/21/2023, 10:44 AM
null instead of None thing right now. 👍🏼
Marc Gris
11/21/2023, 10:52 AM
None … If I have time, I’ll put a breakpoint to inspect this… But right now, I have to “deliver” 😅
Thanks again.
Good day to you both.
M.
Iñigo Hidalgo
11/23/2023, 11:55 PM
Marc Gris
11/24/2023, 6:59 AM
Iñigo Hidalgo
11/24/2023, 9:49 AM