Edit: documenting a solution to my own question. ...
# questions
s
Edit: documenting a solution to my own question. I'm trying to load a multisheet ExcelDataSet through the Catalog. I'm trying to load all sheets this way,
Copy code
my_excel_file:
  type: pandas.ExcelDataSet
  filepath: some-excel-file.xlsx
  load_args:
    sheet_name: None
and I get
Worksheet named 'None' not found
Is there a way to load all sheets through the catalog? Yes Edit: Not documented fully in Kedro, but in case someone comes across this, reminder to use YAML API syntax for
None
which is
null
or
~
or . Thanks in advance!
🌟 2
v
hi -- newbie question 🙈 How do you concat your sheets afterwards ? Defining a
pd.concat
node with the dict as input ? Or is there a more direct way of concatenating sheets, if needed ?
s
Precisely using pd.concat. In the concatenation node, using dict.values() which gives you a list of DataFrames, assuming the schema is uniform across the sheets (otherwise you get errors). Otherwise, you can address the sheets by dict["sheet_name(s)"]. Bear in mind the description of the obj argument.
Copy code
objs: sequence or mapping of Series or DataFrame objects
If a mapping is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised.
Feel free to ask any question no matter your level. As far as I know, there's no more direct method other than pd.concat. Depending on your case, you may concatenate by consuming a generator of DataFrames (i.e. streaming batches). Hope you find my comments helpful!
v
Thank you very much for your detailed answer ! That's what I had done, and there was precisely one last different column causing me issues (+ a bad pandas dtype inference in the shortest table).
s
You can add the dtype argument in the catalog for your dataset if you know the desired type for that column. Then handle the different column conditionally based on the dictionary key (sheet name) or column name