# questions
v
Hello, thanks for a great framework! After setting up my pipelines, I am trying to develop new features in Jupyter (and create new pipelines from there). In that case, from what I understand, it is more convenient to run pipelines manually with `SequentialRunner` instead of using the Jupyter `session`. For example, I would like to run the same pipeline in a loop with different partitions of a `PartitionedDataSet`, and I find it weird to call `%reload_ext kedro.ipython` in a loop. Is this a discouraged practice? What is the benefit of having a session in Jupyter if you develop interactively? (Related, but not answering my question: https://kedro-org.slack.com/archives/C03RKP2LW64/p1668423931294329) Thanks a lot!
d
The notebook workflow is more about giving you an interactive way of interfacing with the catalog and running the code; it’s not really a development environment. The recommended workflow is to use a proper IDE and run `kedro run` once you’ve moved your Jupyter prototypes into your codebase.
In the future, it’s highly likely we will enable more creation on the Jupyter side.
v
The question is more about the prototyping part: inevitably some development happens in a notebook if you want to plot results on the fly and quickly improve things iteratively. For example, replacing a node of a pipeline with a new version and seeing (interactively) whether the results have potential before adding it as a new clean pipeline in the project. So the goal is to have existing clean code (the project’s pipelines) alongside new code in an interactive environment like Jupyter. But I understand that this workflow is currently not the recommended one.
n
@Vassilis Kalofolias “For example replace a node of a pipeline with a new version and see (interactively)”: to achieve this, you can run your pipeline up to the node of interest. Then you develop your new node simply as a Jupyter cell. If it depends on outputs from previous nodes, you should be able to get them from `session.run()`, or, if they are saved on disk, use `catalog.load("xxx")` to retrieve them. After the code is developed, you can wrap the cell as a function and put it back into the pipeline as a node.
If you `catalog.load()` a dataset of type `PartitionedDataSet`, it should return an iterable and you can simply run a for loop on it, the same as you would in a node.
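A minimal sketch of that loop, runnable without a Kedro project. To my understanding, loading a `PartitionedDataSet` gives back a dictionary that maps each partition ID to a function which loads that partition lazily. The `fake_partitions` dict and the `summarise_partitions` helper below are hypothetical stand-ins for illustration; in a notebook you would pass in the result of `catalog.load(...)` instead.

```python
from typing import Callable, Dict, List


def summarise_partitions(
    partitions: Dict[str, Callable[[], List[float]]]
) -> Dict[str, float]:
    """Compute the mean of every partition, loading one partition at a time."""
    results = {}
    for partition_id, load_partition in sorted(partitions.items()):
        data = load_partition()  # the actual load only happens here
        results[partition_id] = sum(data) / len(data)
    return results


# Hypothetical stand-in for what catalog.load("<partitioned dataset>") returns:
# partition IDs as keys, zero-argument load functions as values.
fake_partitions = {
    "2022-01": lambda: [1.0, 2.0, 3.0],
    "2022-02": lambda: [10.0, 20.0],
}

summary = summarise_partitions(fake_partitions)
print(summary)
```

Because the function only takes the dictionary as input, the same code can later be registered as a node without changes, and no `%reload_ext kedro.ipython` is needed between iterations.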
v
Thanks! So then, if I want to run a second pipeline after that node, I would have to run that part manually with `SequentialRunner.run()` outside the `session`, right? I mean this scenario: `pipeline_1 -> new_node -> pipeline_2`
n
If the output of the new node is persisted to disk, you can just reload the session and run `pipeline_2` independently. If it’s in memory only, it may be a bit clunky at the moment, and indeed you may need to use the runner object outside of the session.