At the end of my data science pipeline I need to save multip Kedro #questions

Jaakko

12/15/2022, 6:59 PM

At the end of my data science pipeline I need to save multiple plots. The number of plots depends on hyperparameters of the model and there could be around 5-30 plots. How would I do this with Kedro? I took a look at https://kedro.readthedocs.io/en/stable/kedro.extras.datasets.matplotlib.MatplotlibWriter.html. However, there is only one example using YAML api (which I think I need to use to be able to see pictures when looking at my experiments through kedro viz) and in that example only one plot is saved. There are also examples where a list of plots is saved but there the python api is used and with the python api approach I can't figure out how to get the list of images be displayed in my experiments section in Kedro viz.

Ian Whalen

12/15/2022, 7:21 PM

For the first part, you could implement your own dataset that saves a list of matplotlib figures to a directory using MatplotlibWriter in a loop. As far as displaying in kedro viz, I’m not sure

Deepyaman Datta

12/16/2022, 2:40 PM

Saving a list of plots and saving a single plot can both be done via the YAML API; the difference is whether or not you return a list from the node that saves to the MatplotlibWriter dataset.

Deepyaman Datta

12/16/2022, 2:40 PM

If you return a list from the node, how does it look in Kedro-Viz?

Jaakko

12/16/2022, 6:17 PM

@Deepyaman Datta This is the example from https://kedro.readthedocs.io/en/stable/kedro.extras.datasets.matplotlib.MatplotlibWriter.html to save one plot:

output_plot:
  type: matplotlib.MatplotlibWriter
  filepath: data/08_reporting/output_plot.png
  save_args:
    format: png

I can't figure out how to modify the yaml in the case where there is a list of plots that we want to save. What should be the filepath argument, for example?

Deepyaman Datta

12/16/2022, 6:26 PM

If you look at the implementation under https://kedro.readthedocs.io/en/stable/_modules/kedro/extras/datasets/matplotlib/matplotlib_writer.html#MatplotlibWriter, you'll see that it will write to data/08_reporting/output_plot.png/0.png, data/08_reporting/output_plot.png/1.png, etc. If you want to control the names, you can return a dictionary instead of a list. Ideally, seeing the filepaths above, you would want to specify a directory-like name (rather than a filename) as the filepath argument. Relevant snippet from that link:

if isinstance(data, list):
    for index, plot in enumerate(data):
        full_key_path = get_filepath_str(
            save_path / f"{index}.png", self._protocol
        )
        self._save_to_fs(full_key_path=full_key_path, plot=plot)

(Of course, it would also be better if this were more clearly documented, and you didn't have to understand the implementation, but just trying to help for now)
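
To make the naming rule concrete, here is a minimal standard-library sketch that only computes the paths such a save would produce; it does not call Kedro or matplotlib. The helper name planned_plot_paths is hypothetical, the placeholder strings stand in for figures, and the dictionary branch (keys used as file names exactly as given) is an assumption based on reading the same module, so check it against your Kedro version.

```python
from pathlib import PurePosixPath
from typing import Any, Dict, List, Union


def planned_plot_paths(
    filepath: str, data: Union[Any, List[Any], Dict[str, Any]]
) -> List[str]:
    """Return the paths a MatplotlibWriter-style save would write to.

    Hypothetical helper for illustration only; it mirrors the naming
    rule quoted above and never touches the file system.
    """
    base = PurePosixPath(filepath)
    if isinstance(data, list):
        # Lists are numbered by position: <filepath>/0.png, <filepath>/1.png, ...
        return [str(base / f"{index}.png") for index, _ in enumerate(data)]
    if isinstance(data, dict):
        # Assumption: dictionary keys are used as file names as given.
        return [str(base / name) for name in data]
    # A single figure is written to the filepath itself.
    return [str(base)]


# Placeholder strings stand in for matplotlib figures.
list_paths = planned_plot_paths("data/08_reporting/output_plots", ["fig_a", "fig_b"])
dict_paths = planned_plot_paths(
    "data/08_reporting/output_plots", {"loss_curve.png": "fig_a"}
)
print(list_paths)
print(dict_paths)
```

With a directory-like filepath the list case yields output_plots/0.png, output_plots/1.png, and so on, while returning a dictionary lets the node choose the names itself.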

Jaakko

12/19/2022, 6:37 AM

Awesome, thanks!

Jaakko

12/22/2022, 7:08 PM

@Deepyaman Datta I am getting back to this. I now have the following dataset entry in my catalog:

output_plot:
  type: matplotlib.MatplotlibWriter
  filepath: data/08_reporting/output_plots
  save_args:
    format: png
  versioned: true

If I have my pipeline return a single plot everything works fine and I can see the plot in the experiments section in kedro viz. However, if I return a list of plots the plots are still created but I can't see them through kedro viz. Also, I see the following warning displayed in the terminal where kedro viz was started:

'output_plot' with version '2022-12-22T19.00.19.079Z' could not be loaded. Full exception: DataSetError: Failed while loading data from data set MatplotlibWriter(filepath=[my project path]/data/08_reporting/output_plots, protocol=file, save_args={'format': png}, version=Version(load='2022-12-22T19.00.19.079Z', save=None)).
[Errno 21] Is a directory: '[my project path]/data/08_reporting/output_plots/2022-12-22T19.00.19.079Z/output_plots'
(log source: experiment_tracking.py:101)

Is this as designed? It would be nice to see the list of plots displayed by kedro viz as well.
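
The [Errno 21] in the warning is what a POSIX system reports when something opens a directory as if it were a single file. The sketch below reproduces that with the standard library only. It assumes (not confirmed here) that the viewer reads the versioned path as one image file; the function name try_read_as_file and the temporary directory layout are hypothetical.

```python
import errno
import tempfile
from pathlib import Path


def try_read_as_file(path: Path) -> str:
    """Try to read `path` as one binary file and report the outcome."""
    try:
        with open(path, "rb") as handle:
            handle.read()
        return "ok"
    except IsADirectoryError as exc:
        # On POSIX systems this carries errno 21 (EISDIR).
        return errno.errorcode[exc.errno]


with tempfile.TemporaryDirectory() as tmp:
    # Mimic the layout produced when a list of plots is saved:
    # the dataset path is a directory holding 0.png, 1.png, ...
    plots_dir = Path(tmp) / "output_plots"
    plots_dir.mkdir()
    single_plot = plots_dir / "0.png"
    single_plot.write_bytes(b"")  # empty placeholder, not a real image

    dir_result = try_read_as_file(plots_dir)
    file_result = try_read_as_file(single_plot)

print(dir_result, file_result)
```

Reading one file inside the directory succeeds, while reading the directory itself fails, which matches the observation that a single plot displays and a list of plots does not.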

Olivia Lihn

12/22/2022, 9:47 PM

@Steeve Ndjila same situation we are having!

Deepyaman Datta

12/22/2022, 10:02 PM

To be honest, I don't know much on the Viz side, but I'm guessing (without looking at the code) it just wasn't written with the use case of handling a directory of paths in mind. @Tynan or @Rashida Kanchwala or somebody else may know better, although I think most of the core team is off. Let me see if I can take a look at the code for my own knowledge later.

Tynan

12/23/2022, 10:20 AM

@Jaakko which version of Kedro are you using?

Jaakko

12/23/2022, 12:12 PM

@Tynan kedro version 0.18.3

Jaakko

12/23/2022, 12:13 PM

kedro-viz 5.1.1

Tynan

12/23/2022, 12:35 PM

thanks. @Deepyaman Datta is right, Viz isn't written to handle this use case. what we handle is one plot per metric per run, not multiple plots
