Hello thanks for a great framework After having set up my pi Kedro #questions

Vassilis Kalofolias

02/07/2023, 3:15 PM

Hello, thanks for a great framework! Now that my pipelines are set up, I am trying to develop new features in Jupyter (and create new pipelines from there). From what I understand, in that case it is more convenient to run pipelines manually with `SequentialRunner` instead of using the Jupyter `session`. For example, I would like to run the same pipeline in a loop over different partitions of a `PartitionedDataSet`, and I find it weird to call `%reload_ext kedro.ipython` in a loop. Is this a discouraged practice? What is the benefit of having a session in Jupyter if you develop interactively? (Related, but not answering my question: https://kedro-org.slack.com/archives/C03RKP2LW64/p1668423931294329) Thanks a lot!

datajoely

02/07/2023, 3:17 PM

The notebook workflow is more about giving you an interactive way of interfacing with the catalog and running the code; it's not really a development environment. The recommended workflow is to use a proper IDE and run `kedro run` once you've moved your Jupyter prototypes into your codebase.

datajoely

02/07/2023, 3:17 PM

In the future it's highly likely that we will enable more creation on the Jupyter side.

Vassilis Kalofolias

02/07/2023, 3:28 PM

The question is more about the prototyping part: inevitably some development is happening in a notebook if you want to plot results on the fly and quickly improve things iteratively. For example replace a node of a pipeline with a new version and see (interactively) if the results have potential before adding it as a new clean pipeline in the project. So the goal is to have existing clean code (pipelines of the project) and new code in an interactive env like jupyter. But I understand that currently this workflow is not the recommended one.

Nok Lam Chan

02/07/2023, 3:40 PM

@Vassilis Kalofolias "For example replace a node of a pipeline with a new version and see (interactively)": to achieve this, you can run your pipeline up to the node of interest. Then you develop your new node simply as a Jupyter cell. If it depends on outputs from previous nodes, you should be able to get them from `session.run()`, or, if they are saved on disk, retrieve them with `catalog.load("xxx")`. After the code is developed, you can wrap the cell as a function and put it back into the pipeline as a node.

Nok Lam Chan

02/07/2023, 3:42 PM

If you call `catalog.load()` on a dataset of type `PartitionedDataSet`, it should return an iterable (a dictionary mapping each partition ID to a load function), and you can simply run a for loop over it, the same as you would in a node.
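A minimal sketch of that loop, written without Kedro so it runs anywhere. It assumes the loaded object behaves like a dictionary of partition ID to load function; `fake_partitions` and `summarise` are hypothetical names that stand in for `catalog.load(...)` and for your own node logic.

```python
from typing import Callable, Dict, List


def fake_partitions() -> Dict[str, Callable[[], List[int]]]:
    """Hypothetical stand-in for what catalog.load() is assumed to return."""
    data = {"2023-01": [1, 2, 3], "2023-02": [4, 5]}
    # Bind each value through a default argument so that every callable
    # returns its own partition rather than the last one in the loop.
    return {pid: (lambda rows=rows: rows) for pid, rows in data.items()}


def summarise(partitions: Dict[str, Callable[[], List[int]]]) -> Dict[str, int]:
    """Loop over the partitions exactly as a node function would."""
    results = {}
    for partition_id, load_partition in sorted(partitions.items()):
        rows = load_partition()  # the partition is only read at this point
        results[partition_id] = sum(rows)
    return results


totals = summarise(fake_partitions())
print(totals)
```

Because each partition is loaded only when its function is called, the same loop body can be prototyped in a notebook cell and later moved into a node unchanged.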

Vassilis Kalofolias

02/07/2023, 4:00 PM

Thanks! So then, if I want to run a second pipeline after that node, I would have to run that part with a manual `SequentialRunner.run()` outside the `session`, right? I mean this scenario: `pipeline_1 -> new_node -> pipeline_2`

Nok Lam Chan

02/09/2023, 5:41 AM

If the output of the new node is persisted to disk, you can just reload the session and run pipeline2 independently. If it's in memory only, it may be a bit clunky at the moment, and indeed you may need to use the runner object outside of the session.
