# questions
Hello, thanks for a great framework! After setting up my pipelines, I am trying to develop new features in Jupyter (and create new pipelines from there). In that case, from what I understand, it is more convenient to run pipelines manually using
instead of using the
of jupyter. For example I would like to run the same pipeline in a loop with different partitions of a
and I find it weird to call
%reload_ext kedro.ipython
in a loop. Is this discouraged practice? What is the benefit of having a session in jupyter if you develop interactively? (related but not answering my question: https://kedro-org.slack.com/archives/C03RKP2LW64/p1668423931294329) Thanks a lot!
the notebook workflow is more about giving you an interactive way of interfacing with the catalog and running the code, it’s not really a development environment. The recommended workflow is for you to use a proper IDE and run
kedro run
once you’ve moved your jupyter prototypes into your codebase
In the future, it’s highly likely we will enable more creation on the Jupyter side.
The question is more about the prototyping part: inevitably some development is happening in a notebook if you want to plot results on the fly and quickly improve things iteratively. For example replace a node of a pipeline with a new version and see (interactively) if the results have potential before adding it as a new clean pipeline in the project. So the goal is to have existing clean code (pipelines of the project) and new code in an interactive env like jupyter. But I understand that currently this workflow is not the recommended one.
@Vassilis Kalofolias “For example replace a node of a pipeline with a new version and see (interactively)”: to achieve this, you can run your pipeline up to the node of interest. Then you develop your new node simply as a Jupyter cell. If it depends on outputs from previous nodes, you should be able to get them from the
or, if it’s saved on disk, use `catalog.load("xxx")` to retrieve it. After the code is developed, you can wrap the cell as a function and put it back as a node into the pipeline.
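As a rough illustration of the "wrap the cell as a function" step, here is a minimal sketch. The function name `normalise_scores` and the dataset names in the comments are made up for the example, and the `node(...)` call is left as a comment so the snippet runs without a Kedro project.

```python
from statistics import mean


def normalise_scores(scores: list[float]) -> list[float]:
    """Centre the scores on their mean. This body started life as a notebook cell."""
    centre = mean(scores)
    return [s - centre for s in scores]


# Prototype call in the notebook, on data you loaded or kept in memory.
result = normalise_scores([1.0, 2.0, 3.0])

# Once it looks promising, the same function can be registered in the project's
# pipeline module (hypothetical dataset names):
# from kedro.pipeline import node
# node(func=normalise_scores, inputs="raw_scores", outputs="normalised_scores")
```

Keeping the cell a pure function of its inputs is what makes the move into the pipeline a copy-paste rather than a rewrite.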
if you do
a dataset with type PartitionedDataSet, it should return an iterable and you can simply run a for loop on it, the same as what you would do in a node.
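A small sketch of such a loop, assuming (as the Kedro docs describe) that loading a PartitionedDataSet gives a dictionary mapping partition ids to load functions. The helper `run_per_partition` and the `fake_partitions` dictionary are stand-ins invented for the example so it runs without a project; in a notebook the dictionary would come from the catalog instead.

```python
from typing import Any, Callable, Dict


def run_per_partition(
    partitions: Dict[str, Callable[[], Any]],
    func: Callable[[Any], Any],
) -> Dict[str, Any]:
    """Apply func to each partition, loading one partition at a time."""
    results = {}
    for partition_id, load_partition in sorted(partitions.items()):
        data = load_partition()  # the partition is only read here, lazily
        results[partition_id] = func(data)
    return results


# Stand-in for what the catalog would return for a partitioned dataset.
fake_partitions = {
    "2022-01": lambda: [1, 2, 3],
    "2022-02": lambda: [10, 20],
}

totals = run_per_partition(fake_partitions, sum)
```

The same loop body works unchanged inside a node, since a node receiving a partitioned dataset sees the same dictionary.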
Thanks! So then, if I want to run a second pipeline after that node I would have to run that part in a manual
outside the
, right? I mean this scenario:
pipeline_1 -> new_node -> pipeline_2
If the output of the new node is persisted to disk, you can just reload the session and run pipeline_2 independently. If it’s in memory only, it may be a bit clunky at the moment, and indeed you may need to use the runner object outside of the session.