Hi guys! Can we run a jupyter notebooks in a kedro...
# questions
g
Hi guys! Can we run a jupyter notebook in a kedro node as part of a pipeline? And if so, how can we manage outputs from a jupyter notebook?
d
Why do you want to run a notebook in a pipeline?
😅 2
j
I think @marrrcin has an idea about this 😄
😬 2
d
so the short answer is you probably want something like papermill; the longer answer is that we don’t believe this is great practice
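For context, papermill executes a notebook top to bottom and writes out an executed copy; a minimal sketch (the paths and parameters below are just placeholders) looks like:
import papermill as pm

# run the reporting notebook and keep the executed copy next to it
pm.execute_notebook(
    "notebooks/reporting.ipynb",
    "notebooks/reporting_executed.ipynb",
    parameters={"run_id": "some-run-id"},  # injected into a cell tagged "parameters"
)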
g
The notebook will be used as an interactive reporting layer. And the idea was to track performance & plots using kedro-viz.
So we would need to store the notebook's printouts in the catalog.
d
I guess you could get the notebook to write to a path that is also defined in the catalog
but it’s all a bit fragile
g
oh, I see...
n
hmm... in this case I think it's not too bad to run a notebook as a reporting layer. Can you explain how this is related?
> And the idea was to track performance & plots using kedro-viz.
g
So even though the reporting layer is a notebook, we still would want to use the experiment tracking of kedro-viz to track performance...
n
What does this notebook actually do?
d
g
The notebook calculates performance metrics and creates plots -> the idea is to have a user-friendly model report that can easily be converted to an HTML report
the hook is a good idea! But if we save the metrics/plots to the catalog folder, can we still track versions?
n
I think it will be tricky to make it work, as you essentially have part of your pipeline written in a notebook.
g
Yes..
n
what might work is to keep your reporting pipeline as a pipeline, have a hook or something execute the notebook, and inside the notebook you simply load the metrics/plots
K 1
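A rough sketch of that hook (the paths and the hook class name are placeholders, and papermill is just one way to execute the notebook):
# src/<package>/hooks.py -- sketch of a reporting hook
import papermill as pm
from kedro.framework.hooks import hook_impl


class ReportingNotebookHooks:
    @hook_impl
    def after_pipeline_run(self, run_params, catalog):
        # re-execute the reporting notebook once the pipeline has written its outputs;
        # inside the notebook, catalog.load(...) picks up the freshly saved datasets
        pm.execute_notebook(
            "notebooks/reporting.ipynb",
            "notebooks/reporting_executed.ipynb",
        )
then register it in settings.py with HOOKS = (ReportingNotebookHooks(),)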
i
Why not output the things you want to visualize (plotly plots, dataframes, etc.) as node outputs in kedro, and then just have a notebook which loads those interactively?
👍 1
👍🏼 1
n
And if you absolutely need to run code from the notebook, you can use the regular
session.run
and put that at the top. You should only run it when needed tho.
So it looks like
Copy code
# Cell 1
%load_ext kedro.ipython
session.run(pipeline_name="abc")
# Cell 2
catalog.load("my_plot")  # latest version by default
# Cell 3
catalog.load("my_plot2")  # etc.
The documentation on using Kedro in notebooks may help: https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html
g
You guys have a point here! Let me discuss internally the pros and cons... thank you all!!
👀 1
a
You might also consider using mlflow and putting the reporting objects as artifacts next to the run - you can browse html plots interactively there as well.
👀 1
f
I used the hook plus catalog option in the past for exactly the same use case. All output datasets in the catalog and then a hook that runs papermill plus uses the catalog to load the outputs. Notebook parameters are dataset versions so there is lineage.
🆒 2
K 1
g
@Florian d Amazing! Just to see if I understood correctly, you used hooks to run the notebook and the notebook saved plots and metrics to the catalog? I didn't quite get the "use the catalog to load the outputs"...
👀 1
f
Let’s say you have model train plus eval nodes that write the model and train/test split datasets using the catalog. Maybe they even write some metrics to datasets. The hook then runs papermill, which in turn runs a notebook. That notebook then uses the catalog to load the previous artefacts and calculates metrics, plots, etc.
d
I wonder if we should do a blogpost on this pattern, it’s quite interesting
👍 2
n
It's the same idea described above: you still run the kedro pipeline as is, but with an extra hook to refresh the notebook so it loads artifacts from the run.
f
In addition, we used nbconvert to convert the executed notebook into HTML and hosted that with the run version.
👍 1
This gave us both an HTML report for analysts and stakeholders to check out, plus a notebook that DS could use to dive deeper to answer questions or debug unexpected stuff
🙌 1
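For reference, that executed-notebook-to-HTML step can look roughly like this with nbconvert (filenames are placeholders):
import nbformat
from nbconvert import HTMLExporter

# read the notebook papermill just executed and export it to a standalone HTML file
nb = nbformat.read("notebooks/reporting_executed.ipynb", as_version=4)
html_body, _resources = HTMLExporter().from_notebook_node(nb)

with open("data/08_reporting/report.html", "w", encoding="utf-8") as f:
    f.write(html_body)
the same thing can also be done from the command line with `jupyter nbconvert --to html notebooks/reporting_executed.ipynb`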
g
Makes sense, great approach! One of the things we also wanted was to save some plots and printouts from the notebook as catalog entries... so we can keep track of metrics/plots...
a
you can't do that through the API, as the catalog does not allow saving things via the API. You need to either generate them elsewhere and load them in the notebook, or wrap the notebook execution in a kedro node and handle the inputs/outputs via kedro, imho
👀 2
n
Is it not possible to move that saving bit into a pipeline?
You can save things with catalog but I don't want to diverge the conversation here.
a
I don't like this approach with notebooks and kedro pipeline mixed in general 😄
I think it's better to have the functions written as nodes and maybe another notebook on top of that that reads from the same source/your lib as a package
so the code is shared
if the only thing you want to achieve is interactive plots then there are better solutions
n
The general approach above separates it into two steps: the kedro pipeline does the writing while the notebook is a read-only notebook for reporting.
> One of the things we also wanted was to save some plots and printouts from the notebook as catalog entries... so we can keep track of metrics/plots...
I wonder why you need to save plots from the notebook, is it because someone needs to run the code from the notebook instead?
g
We used to have the reporting layer as a kedro pipeline, but we got some feedback that it was difficult to make changes and it took time to create reports... so that's why we were going for the notebook approach. But maybe we could build the metrics & plots we want to track in separate nodes and leave the notebook run as a hook at the end of the pipeline...
n
> but we got some feedback that it was difficult to make changes and it took time to create reports
Not sure if I am understanding correctly; it sounds like you want to have a "human in the loop" where the rest of the pipeline runs as a kedro pipeline, and for the plotting bit, because it requires some tweaking and formatting, someone would come in to plot things in a notebook instead? Or may I ask if you are using vanilla Kedro or some YAML-based node/pipeline kedro project? I wonder how a notebook makes that "difficult to make changes" easier.
f
> But maybe we could build the metrics & plots we want to track in separate nodes and leave the notebook run as a hook at the end of the pipeline...
My view is to do that in the notebook. Because otherwise you don’t gain anything if you load the pre-generated metrics plus plots in the notebook as you’d have to rerun the pipeline to get those. So I would do all the heavy lifting that needs to happen in the pipeline there and “most” of the reporting stuff in the notebook.
👍 1
Then you can change the visual representation of the data. That way the integrity of the underlying data is the same while the representation can be adjusted
n
https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html#load-node-line-magic and a plug for the
%load_node
magic, which may help with retrieving node logic in a notebook.
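Usage is a single line magic, e.g. with a hypothetical node name:
%load_node report_metrics_node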
a
a side note: it's a nice macro but the downside of
%load_node
is that you need to have all of its inputs defined as loadable by the data catalog; you can't rely on memory datasets
n
@Artur Dobrogowski https://github.com/kedro-org/kedro/discussions/3754 can I ask you to comment here? It's something that I want to build but the priority is not high because no one has asked about it :) (same for anyone else who is using
%load_node
and has problems with it or likes it, just tag me). It's impossible to load something in memory because by the time you open a notebook it's not there anymore. What we can do is figure out what exactly you need to re-run to have those datasets in memory again. You also don't want to keep everything in memory, so in reality you will have different checkpoints in your big pipeline, so the re-run part will be minimal.
f
I’d suggest trying it out at this point and seeing what works for you, e.g. logic in nodes or notebooks etc. But what would be fantastic is if you write back here to say what you used and how it was received 🙂 this would be great for others who showed interest 🙂
K 2
👍🏼 1
👍 1
g
Amazing! Thank you for the support and I will get back to you guys!
Hi team, following this discussion on running a jupyter notebook in a kedro pipeline. We built a node that renders the notebook (using
nbconvert
) with the path as input and no output. The path to the notebook is set in the parameters YAML file. And we wanted to know if there is a way to use the path from the catalog.yml file. We wanted to avoid having paths in parameters.yml
j
hi @Giovanna Cavali! I guess you mean something like:
Copy code
from kedro.pipeline import node, pipeline


def create_pipeline():
  return pipeline([
    node(
      func=render_notebook_with_nbconvert,
      inputs=["params:notebook_0_filepath"],
      outputs=None,
    )
  ])
am I right? one way you can do it, even if it's a bit unorthodox, is to define your own dataset:
Copy code
# catalog.yml
notebook0:
  type: ipynb_datasets.IPYNBDataset
  filepath: notebooks/notebook0.ipynb
and then you'd need to define a custom dataset
Copy code
from dataclasses import dataclass

from kedro.io import AbstractDataset


@dataclass
class IPYNBNotebook:
  filepath: str


class IPYNBDataset(AbstractDataset):
  def __init__(self, filepath: str):
    self._filepath = filepath

  def _load(self) -> IPYNBNotebook:
    # hand the node a lightweight object carrying the notebook path
    return IPYNBNotebook(self._filepath)

  def _save(self, data) -> None:
    raise NotImplementedError("this dataset is read-only")

  def _describe(self) -> dict:
    return {"filepath": self._filepath}
and then your node would do
Copy code
def render_notebook_with_nbconvert(notebook: IPYNBNotebook):
  return nbconvert.render(notebook.filepath)
all of this is pseudocode but I hope it makes sense. see https://docs.kedro.org/en/stable/extend_kedro/custom_datasets.html for more info on custom datasets
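for what it's worth, one runnable interpretation of that render step could use nbconvert's HTMLExporter (it reuses the IPYNBNotebook dataclass from above; returning the HTML string is just one way to wire it up):
import nbformat
from nbconvert import HTMLExporter


def render_notebook_with_nbconvert(notebook: IPYNBNotebook) -> str:
    # load the .ipynb the custom dataset points at and export it to an HTML string
    nb = nbformat.read(notebook.filepath, as_version=4)
    html_body, _ = HTMLExporter().from_notebook_node(nb)
    return html_body
if the notebook also needs to be executed first, papermill or nbconvert's ExecutePreprocessor can run it before exporting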
g
Thank you! it does make sense. Let me try to implement it here 😊
@Juan Luis It did work! Thank you!!! I was just wondering if we could use the Jupyter Notebook custom dataset to render the notebook and save versions as well... because right now we are just using it to provide the path...
j
amazing, good to know @Giovanna Cavali! 🙌🏼
in principle, Kedro encourages you to use datasets for everything that is I/O: loading, saving. the prototype I gave you only had a filepath but it should be possible to extend it with export capabilities
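as a sketch of one option (an assumption, not the only way): instead of extending the dataset itself, the node could return the rendered HTML string and the catalog could declare a versioned text dataset for it, e.g.
# catalog.yml
rendered_report:
  type: text.TextDataset
  filepath: data/08_reporting/report.html
  versioned: true
(depending on your kedro-datasets version the class may be spelled TextDataSet) and then point the node's outputs at rendered_report instead of None, so every pipeline run saves a new version of the report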
g
and just so I can understand, at what point in time is the catalog entry loaded and saved in a pipeline?
whenever they are the output of a node?
j
exactly 🎯 it's `_load`ed when it's an input, and `_save`d when it's an output
g
great!! one last thing. the node that renders the notebook does not need any data input, so by default it runs first, but we need it to run last... Is there a way to hard-code this?