Edit documenting a solution to my own question I m trying to Kedro #questions

Edit: documenting a solution to my own question. ...

Solomon Yu

02/17/2023, 3:34 PM

Edit: documenting a solution to my own question. I'm trying to load a multisheet ExcelDataSet through the Catalog. I'm trying to load all sheets this way,

Copy code

my_excel_file:
  type: pandas.ExcelDataSet
  filepath: some-excel-file.xlsx
  load_args:
    sheet_name: None

and I get

Worksheet named 'None' not found

~~Is there a way to load all sheets through the catalog?~~ Yes Edit: Not documented fully in Kedro, but in case someone comes across this, reminder to use YAML API syntax for

None

which is

null

or . Thanks in advance!

🌟 2

vhbtad

04/26/2023, 8:57 AM

hi -- newbie question 🙈 How do you concat your sheets afterwards ? Defining a

pd.concat

node with the dict as input ? Or is there a more direct way of concatenating sheets, if needed ?

Solomon Yu

04/26/2023, 12:18 PM

Precisely using pd.concat. In the concatenation node, using dict.values() which gives you a list of DataFrames, assuming the schema is uniform across the sheets (otherwise you get errors). Otherwise, you can address the sheets by dict["sheet_name(s)"]. Bear in mind the description of the obj argument.

Copy code

objs: sequence or mapping of Series or DataFrame objects
If a mapping is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised.

Solomon Yu

04/26/2023, 12:22 PM

Feel free to ask any question no matter your level. As far as I know, there's no more direct method other than pd.concat. Depending on your case, you may concatenate by consuming a generator of DataFrames (i.e. streaming batches). Hope you find my comments helpful!

vhbtad

04/26/2023, 4:11 PM

Thank you very much for your detailed answer ! That's what I had done, and there was precisely one last different column causing me issues (+ a bad pandas dtype inference in the shortest table).

Solomon Yu

04/27/2023, 8:46 AM

You can add the dtype argument in the catalog for your dataset if you know the desired type for that column. Then handle the different column conditionally based on the dictionary key (sheet name) or column name

5 Views

Open in Slack

Previous Next