Hello I want to use a namespaced pipeline and data catalog t Kedro #questions

Hello, I want to use a namespaced pipeline and dat...

minmin

11/15/2024, 5:34 PM

Hello, I want to use a namespaced pipeline and data catalog to get a series of dataframes, do some manipulations, and then save them all in one Excel spreadsheet in different sheets. I thought something like this would work in the catalog:

Copy code

"{namespace}.spreadsheet_data":
  type: pandas.ExcelDataset
  filepath: data/03_primary/all_data_sources.xlsx
  save_args:
    sheet_name: "{namespace}_data"

but this doesn't work. I just end up with a spreadsheet with one sheet - with the name of whatever namespace ran last. I.e. it must be overwriting it each time. I have read that I will need to specify a writer if I want to write to a file that already exists (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html) but I can't get that to work. Is what I would like to do possible? Many thanks

Hall

11/15/2024, 5:34 PM

Someone will reply to you shortly. In the meantime, this might help:

Ravi Kumar Pilla

11/15/2024, 5:44 PM

Hi Minesh, Thank you for raising the issue. Did you try using the append mode for the Excel Writer. I also see an option of if_sheet_exists
which might help.

Ravi Kumar Pilla

11/15/2024, 5:52 PM

Copy code

"{namespace}.spreadsheet_data":
  type: pandas.ExcelDataset
  filepath: data/03_primary/all_data_sources.xlsx
  save_args:
    writer:
      mode: a
      if_sheet_exists: new
    sheet_name: "{namespace}_data"

Something like above ?

Ankita Katiyar

11/18/2024, 8:12 AM

Like @Ravi Kumar Pilla mentioned, you can pass

kwargs

to the underlying load and save functions with

save_args

and

load_args

in the Data Catalog configuration. There’s a few examples on this page - https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html

minmin

11/18/2024, 4:53 PM

Thank you both for you suggestions. I too thought it would be fairly straight forward. Especially given how easy it is to accomplish with just pd.ExcelWriter. However there were a couple of snags - "append" mode only works for spreadsheets that already exist. As far as I can see there is no "append if exists, but create first if it doesn't exist" option. But even when I manually created the spreadsheet first I would get an "file is not a zip file" error which is some sort of openpyxl error (could be package version issue). In any case, I just got round the issue by created an custom dataset and just using pd.ExcelWriter in the save method there. Thank you both for you help.

Open in Slack

Previous Next