Hello, I want to use a namespaced pipeline and dat...
# questions
m
Hello, I want to use a namespaced pipeline and data catalog to get a series of dataframes, do some manipulations, and then save them all in one Excel spreadsheet in different sheets. I thought something like this would work in the catalog:
Copy code
"{namespace}.spreadsheet_data":
  type: pandas.ExcelDataset
  filepath: data/03_primary/all_data_sources.xlsx
  save_args:
    sheet_name: "{namespace}_data"
but this doesn't work. I just end up with a spreadsheet with one sheet - with the name of whatever namespace ran last. I.e. it must be overwriting it each time. I have read that I will need to specify a writer if I want to write to a file that already exists (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html) but I can't get that to work. Is what I would like to do possible? Many thanks
h
Someone will reply to you shortly. In the meantime, this might help:
r
Hi Minesh, Thank you for raising the issue. Did you try using the append mode for the Excel Writer. I also see an option of
if_sheet_exists
which might help.
Copy code
"{namespace}.spreadsheet_data":
  type: pandas.ExcelDataset
  filepath: data/03_primary/all_data_sources.xlsx
  save_args:
    writer:
      mode: a
      if_sheet_exists: new
    sheet_name: "{namespace}_data"
Something like above ?
a
Like @Ravi Kumar Pilla mentioned, you can pass
kwargs
to the underlying load and save functions with
save_args
and
load_args
in the Data Catalog configuration. There’s a few examples on this page - https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html
m
Thank you both for you suggestions. I too thought it would be fairly straight forward. Especially given how easy it is to accomplish with just pd.ExcelWriter. However there were a couple of snags - "append" mode only works for spreadsheets that already exist. As far as I can see there is no "append if exists, but create first if it doesn't exist" option. But even when I manually created the spreadsheet first I would get an "file is not a zip file" error which is some sort of openpyxl error (could be package version issue). In any case, I just got round the issue by created an custom dataset and just using pd.ExcelWriter in the save method there. Thank you both for you help.