Hi guys! Can we run a jupyter notebooks in a kedro...
# questions
g
Hi guys! Can we run a jupyter notebook in a kedro node as part of a pipeline? And if so, how can we manage outputs from a jupyter notebook?
d
Why do you want to run a notebook in a pipeline?
😅 2
j
I think @marrrcin has an idea about this 😄
😬 2
d
so the short answer is you probably want something like papermill; the longer answer is that we don’t believe this is great practice
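For context, papermill executes a notebook top to bottom and writes out an executed copy; a minimal sketch (the paths and parameters below are just placeholders) looks like:
import papermill as pm

# run the reporting notebook and keep the executed copy next to it
pm.execute_notebook(
    "notebooks/reporting.ipynb",
    "notebooks/reporting_executed.ipynb",
    parameters={"run_id": "some-run-id"},  # injected into a cell tagged "parameters"
)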
g
The notebook will be used as an interactive reporting layer. And the idea was to track performance & plots using kedro-viz.
So we would need to store the notebook's printouts in the catalog.
d
I guess you could get the notebook to write to a path that is also defined in the catalog
but it’s all a bit fragile
g
oh, I see...
n
hmm... in this case I think it's not too bad to run a notebook as a reporting layer. Can you explain how this is related?
> And the idea was to track performance & plots using kedro-viz.
g
So even though the reporting layer is a notebook, we still would want to use the experiment tracking of kedro-viz to track performance...
n
What does this notebook actually do?
d
g
The notebook calculates performance metrics and creates plots -> the idea is to have a user-friendly model report that can easily be converted to an HTML report
the hook is a good idea! But if we save the metrics/plots to the catalog folder, can we still track versions?
n
I think it will be tricky to make it work, as you essentially have part of your pipeline written in a notebook.
g
Yes..
n
what might work is to keep your reporting pipeline as a pipeline, have a hook or something execute the notebook, and inside the notebook you simply load the metrics/plots
K 1
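A rough sketch of that hook (the paths and the hook class name are placeholders, and papermill is just one way to execute the notebook):
# src/<package>/hooks.py -- sketch of a reporting hook
import papermill as pm
from kedro.framework.hooks import hook_impl


class ReportingNotebookHooks:
    @hook_impl
    def after_pipeline_run(self, run_params, catalog):
        # re-execute the reporting notebook once the pipeline has written its outputs;
        # inside the notebook, catalog.load(...) picks up the freshly saved datasets
        pm.execute_notebook(
            "notebooks/reporting.ipynb",
            "notebooks/reporting_executed.ipynb",
        )
then register it in settings.py with HOOKS = (ReportingNotebookHooks(),)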
i
Why not output the things you want to visualize (plotly plots, dataframes, etc.) as node outputs in kedro, and then just have a notebook which loads those interactively?
👍 1
👍🏼 1
n
And if you absolutely need to run code from the notebook, you can use the regular
session.run
and put that at the top. You should only run it when needed tho.
So it looks like
Copy code
# Cell 1
%load_ext kedro.ipython
session.run(pipeline_name="abc")
# Cell 2
catalog.load("my_plot")  # latest version by default
# Cell 3
catalog.load("my_plot2")  # etc.
The documentation on using Kedro in notebooks may help: https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html
g
You guys have a point here! Let me discuss internally the pros and cons... thank you all!!
👀 1
a
You might also consider using mlflow and putting the reporting objects as artifacts next to the run - you can browse html plots interactively there as well.
👀 1
f
I used the hook plus catalog option in the past for exactly the same use case. All output datasets in the catalog and then a hook that runs papermill plus uses the catalog to load the outputs. Notebook parameters are dataset versions so there is lineage.
🆒 2
K 1
g
@Florian d Amazing! Just to see if I understood correctly, you used hooks to run the notebook and the notebook saved plots and metrics to the catalog? I didn't quite get the "use the catalog to load the outputs"...
👀 1
f
Let’s say you have model train plus eval nodes that write the model and train/test split datasets using the catalog. Maybe they even write some metrics to datasets. The hook then runs papermill, which in turn runs a notebook. That notebook then uses the catalog to load the previous artefacts and calculates metrics, plots, etc.
d
I wonder if we should do a blogpost on this pattern, it’s quite interesting
👍 2
n
It's the same idea described above: you still run the kedro pipeline as is, but with an extra hook to refresh the notebook so it loads artifacts from the run.
f
In addition, we used nbconvert to convert the executed notebook into HTML and hosted that with the run version.
👍 1
This gave us both an HTML report for analysts and stakeholders to check out, plus a notebook that DS could use to dive deeper to answer questions or debug unexpected stuff
🙌 1
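For reference, that executed-notebook-to-HTML step can look roughly like this with nbconvert (filenames are placeholders):
import nbformat
from nbconvert import HTMLExporter

# read the notebook papermill just executed and export it to a standalone HTML file
nb = nbformat.read("notebooks/reporting_executed.ipynb", as_version=4)
html_body, _resources = HTMLExporter().from_notebook_node(nb)

with open("data/08_reporting/report.html", "w", encoding="utf-8") as f:
    f.write(html_body)
the same thing can also be done from the command line with `jupyter nbconvert --to html notebooks/reporting_executed.ipynb`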
g
Makes sense, great approach! One of the things we also wanted was to save some plots and printouts from the notebook as catalog entries... so we can keep track of metrics/plots...
a
you can't do that through the API, as the catalog does not allow saving things via the API. You need to either generate them elsewhere and load them in the notebook, or wrap the notebook execution in a kedro node and handle the inputs/outputs via kedro, imho
👀 2
n
Is it not possible to move that saving bit into a pipeline?
You can save things with catalog but I don't want to diverge the conversation here.
a
I don't like this approach with notebooks and kedro pipeline mixed in general 😄
I think it's better to have the functions written as nodes and maybe another notebook on top of that that reads from the same source/your lib as a package
so the code is shared
if the only thing you want to achieve is interactive plots then there are better solutions
n
The general approach above separates it into two steps: the kedro pipeline does the writing while the notebook is a read-only notebook for reporting.
> One of the things we also wanted was to save some plots and printouts from the notebook as catalog entries... so we can keep track of metrics/plots...
I wonder why you need to save plots from the notebook, is it because someone needs to run the code from the notebook instead?
g
We used to have the reporting layer as a kedro pipeline, but we got some feedback that it was difficult to make changes and it took time to create reports... so that's why we were going for the notebook approach. But maybe we could build the metrics & plots we want to track in separate nodes and leave the notebook run as a hook at the end of the pipeline...
n
> but we got some feedback that it was difficult to make changes and it took time to create reports
Not sure if I am understanding correctly; it sounds like you want to have a "human in the loop" where the rest of the pipeline runs as a kedro pipeline, and for the plotting bit, because it requires some tweaking and formatting, someone would come in to plot things in a notebook instead? Or may I ask if you are using vanilla Kedro or some YAML-based node/pipeline kedro project? I wonder how a notebook makes that "difficult to make changes" easier.
f
> But maybe we could build the metrics & plots we want to track in separate nodes and leave the notebook run as a hook at the end of the pipeline...
My view is to do that in the notebook. Because otherwise you don’t gain anything if you load the pre-generated metrics plus plots in the notebook as you’d have to rerun the pipeline to get those. So I would do all the heavy lifting that needs to happen in the pipeline there and “most” of the reporting stuff in the notebook.
👍 1
Then you can change the visual representation of the data. That way the integrity of the underlying data is the same while the representation can be adjusted
n
https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html#load-node-line-magic and a plug for the
%load_node
magic, which may help with retrieving node logic in a notebook.
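Usage is a single line magic, e.g. with a hypothetical node name:
%load_node report_metrics_node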
a
a side note: it's a nice macro but the downside of
%load_node
is that you need to have all of its inputs defined as loadable by the data catalog; you can't rely on memory datasets
n
@Artur Dobrogowski https://github.com/kedro-org/kedro/discussions/3754 can I ask you to comment here? It's something that I want to build but the priority is not high because no one has asked about it :) (same for anyone else who is using
%load_node
and has problems with it or likes it, just tag me). It's impossible to load something in memory because by the time you open a notebook it's not there anymore. What we can do is figure out what exactly you need to re-run to have those datasets in memory again. You also don't want to keep everything in memory, so in reality you will have different checkpoints in your big pipeline, so the re-run part will be minimal.
f
I’d suggest trying it out at this point and seeing what works for you, e.g. logic in nodes or notebooks etc. But what would be fantastic is if you write back here to say what you used and how it was received 🙂 this would be great for others who showed interest 🙂
K 2
👍🏼 1
👍 1
g
Amazing! Thank you for the support and I will get back to you guys!
Hi team, following this discussion on running a jupyter notebook in a kedro pipeline. We built a node that renders the notebook (using
nbconvert
) with the path as input and no output. The path to the notebook is set in the parameters YAML file. And we wanted to know if there is a way to use the path from the catalog.yml file. We wanted to avoid having paths in parameters.yml
j
hi @Giovanna Cavali! I guess you mean something like:
Copy code
from kedro.pipeline import node, pipeline


def create_pipeline():
  return pipeline([
    node(
      func=render_notebook_with_nbconvert,
      inputs=["params:notebook_0_filepath"],
      outputs=None,
    )
  ])
am I right? one way you can do it, even if it's a bit unorthodox, is to define your own dataset:
Copy code
# catalog.yml
notebook0:
  type: ipynb_datasets.IPYNBDataset
  filepath: notebooks/notebook0.ipynb
and then you'd need to define a custom dataset
Copy code
from dataclasses import dataclass

from kedro.io import AbstractDataset


@dataclass
class IPYNBNotebook:
  filepath: str


class IPYNBDataset(AbstractDataset):
  def __init__(self, filepath: str):
    self._filepath = filepath

  def _load(self) -> IPYNBNotebook:
    # hand the node a lightweight object carrying the notebook path
    return IPYNBNotebook(self._filepath)

  def _save(self, data) -> None:
    raise NotImplementedError("this dataset is read-only")

  def _describe(self) -> dict:
    return {"filepath": self._filepath}
and then your node would do
Copy code
def render_notebook_with_nbconvert(notebook: IPYNBNotebook):
  return nbconvert.render(notebook.filepath)
all of this is pseudocode but I hope it makes sense. see https://docs.kedro.org/en/stable/extend_kedro/custom_datasets.html for more info on custom datasets
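for what it's worth, one runnable interpretation of that render step could use nbconvert's HTMLExporter (it reuses the IPYNBNotebook dataclass from above; returning the HTML string is just one way to wire it up):
import nbformat
from nbconvert import HTMLExporter


def render_notebook_with_nbconvert(notebook: IPYNBNotebook) -> str:
    # load the .ipynb the custom dataset points at and export it to an HTML string
    nb = nbformat.read(notebook.filepath, as_version=4)
    html_body, _ = HTMLExporter().from_notebook_node(nb)
    return html_body
if the notebook also needs to be executed first, papermill or nbconvert's ExecutePreprocessor can run it before exporting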
g
Thank you! it does make sense. Let me try to implement it here 😊
@Juan Luis It did work! Thank you!!! I was just wondering if we could use the Jupyter Notebook custom dataset to render the notebook and save versions as well... because right now we are just using it to provide the path...
j
amazing, good to know @Giovanna Cavali! 🙌🏼
in principle, Kedro encourages you to use datasets for everything that is I/O: loading, saving. the prototype I gave you only had a filepath but it should be possible to extend it with export capabilities
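as a sketch of one option (an assumption, not the only way): instead of extending the dataset itself, the node could return the rendered HTML string and the catalog could declare a versioned text dataset for it, e.g.
# catalog.yml
rendered_report:
  type: text.TextDataset
  filepath: data/08_reporting/report.html
  versioned: true
(depending on your kedro-datasets version the class may be spelled TextDataSet) and then point the node's outputs at rendered_report instead of None, so every pipeline run saves a new version of the report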
g
and just so I can understand, at what point in time is the catalog entry loaded and saved in a pipeline?
whenever they are the output of a node?
j
exactly 🎯 it's `_load`ed when it's an input, and `_save`d when it's an output
g
great!! one last thing. the node that renders the notebook does not need any data input, so by default it runs first, but we need it to run last... Is there a way to hard-code this?