Hi all quick newbie Kedro question here If wanted to call ca Kedro #questions

Dotun O

03/21/2023, 1:28 PM

Hi all, quick newbie Kedro question here. If I wanted to call catalog.load directly within the pipeline to observe the dataframes, how do I get the current catalog in the pipeline run? I see that Kedro has from kedro.io import DataCatalog, but I'm not sure how to get the specific catalog context.

datajoely

03/21/2023, 1:28 PM

so we specifically don’t encourage this

datajoely

03/21/2023, 1:28 PM

when building pipelines we want deterministic, stateless functions which enable reproducibility

datajoely

03/21/2023, 1:29 PM

for that matter nodes can’t do IO, it’s done for them - they just receive and produce data
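To illustrate the point about nodes not doing IO, here is a toy sketch in plain Python. It is not Kedro's runner: `run_node` and the dict used as a catalog are hypothetical stand-ins, meant only to show that loading and saving happen around the node function rather than inside it.

```python
def clean_rows(rows):
    """Node-style function: data in, data out, no file or catalog access."""
    return [row for row in rows if row.get("value") is not None]


def run_node(func, catalog, input_name, output_name):
    """Toy stand-in for a runner: it loads the input and saves the output."""
    data = catalog[input_name]         # "load" happens outside the node
    catalog[output_name] = func(data)  # "save" happens outside the node


# A plain dict plays the role of the catalog in this sketch.
catalog = {"raw": [{"value": 1}, {"value": None}, {"value": 3}]}
run_node(clean_rows, catalog, "raw", "clean")
print(catalog["clean"])  # [{'value': 1}, {'value': 3}]
```

Because `clean_rows` never touches the catalog, the same input always gives the same output, which is the reproducibility property described above.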

datajoely

03/21/2023, 1:29 PM

Now you can inspect the live catalog objects during a run lifecycle using hooks

datajoely

03/21/2023, 1:29 PM

but you can’t do it within a node

Dotun O

03/21/2023, 1:29 PM

ok that makes sense.

datajoely

03/21/2023, 1:29 PM

https://kedro.readthedocs.io/en/stable/hooks/introduction.html

Dotun O

03/21/2023, 1:31 PM

Thank you. Ideally, I would like to observe the catalog even before the hooks/nodes are called. For that case, I have created a test.py file. Is there a way to observe the catalog file even before using the hooks?

datajoely

03/21/2023, 1:32 PM

in a live context or as part of a pipeline run?

datajoely

03/21/2023, 1:32 PM

you can use kedro ipython or kedro jupyter notebook to get the live catalog outside of a run too?

Dotun O

03/21/2023, 1:32 PM

As a pipeline run. But before the pipelines are instantiated

Dotun O

03/21/2023, 1:33 PM

The point here is that some of the catalog dataframes are empty now but might not be in future runs

datajoely

03/21/2023, 1:33 PM

I mean the after_catalog_created or the after_context_created hooks are the earliest point you can do that

datajoely

03/21/2023, 1:33 PM

and you can mutate the catalog live at that point if you’d like?
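A minimal sketch of an `after_catalog_created` hook that only inspects the catalog, assuming a Kedro 0.18-style `DataCatalog` that offers `list()` and `exists()`. The class name `CatalogInspectionHooks` and the `missing` attribute are hypothetical. Note that `exists()` reports whether a dataset's underlying data is present, not whether a loaded dataframe has zero rows. The `try`/`except` around the import is there only so the sketch can run without Kedro installed.

```python
import logging

try:
    from kedro.framework.hooks import hook_impl
except ImportError:
    # Fallback so the sketch runs without Kedro; a real project needs the import above.
    def hook_impl(func):
        return func

logger = logging.getLogger(__name__)


class CatalogInspectionHooks:
    """Hypothetical hook class that reports datasets with no data yet."""

    @hook_impl
    def after_catalog_created(self, catalog):
        # Assumption: catalog.list() returns dataset names and
        # catalog.exists(name) returns a bool (Kedro 0.18-style API).
        self.missing = [name for name in catalog.list() if not catalog.exists(name)]
        for name in self.missing:
            logger.warning("Dataset %r has no data yet", name)


# The hook is only called if it is registered in the project's settings.py:
# HOOKS = (CatalogInspectionHooks(),)
```

If a message logged or printed inside a hook never shows up, a missing `HOOKS` entry in `settings.py` is one thing worth checking first.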

Dotun O

03/21/2023, 1:33 PM

ok that makes sense.

Dotun O

03/21/2023, 1:35 PM

I tried mutating after_catalog_created but it seemed like the groups that did not have an empty dataframe were not removed. I tried printing the line to see if it was being called but no output came about. Hence why I thought doing the mutating in a different file might work. Should I be thinking about this separately?

datajoely

03/21/2023, 1:39 PM

after_catalog_created gives you a live catalog which you can mutate with catalog.add(xxxx, replace=True). you can also use the _exists() private method for your purposes: https://docs.kedro.org/en/stable/kedro.framework.hooks.specs.DataCatalogSpecs.html

Dotun O

03/21/2023, 1:43 PM

ok. Thanks. So high level, If I wanted to remove certain groups from running based on the available data, that will be done in this function? Will there be a way to print if this actually happened? Before the pipeline run? I just want to make sure that I can validate the right groups

datajoely

03/21/2023, 1:46 PM

yeah you can do all sorts of things with hooks - but we’re getting into territory where Kedro may not be the right tool since it’s built intentionally to force reproducibility
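One way to validate which groups would run, sketched under assumptions: `select_runnable_groups`, the `group_inputs` mapping, and `StubCatalog` are all hypothetical names, and the only catalog method relied on is `exists(name)`. Inside a hook, the real catalog would be passed in place of the stub.

```python
def select_runnable_groups(catalog, group_inputs):
    """Return the groups whose required datasets all exist in the catalog.

    `group_inputs` is a hypothetical mapping of group name -> dataset names.
    """
    runnable = []
    for group, dataset_names in group_inputs.items():
        missing = [name for name in dataset_names if not catalog.exists(name)]
        if missing:
            print(f"Skipping {group}: missing {missing}")
        else:
            print(f"Keeping {group}")
            runnable.append(group)
    return runnable


class StubCatalog:
    """Stand-in for a real catalog so the sketch runs on its own."""

    def __init__(self, available):
        self._available = set(available)

    def exists(self, name):
        return name in self._available


catalog = StubCatalog({"sales_raw", "customers_raw"})
groups = {"sales": ["sales_raw"], "returns": ["returns_raw"]}
runnable = select_runnable_groups(catalog, groups)
```

The printed lines give a record, before the run, of which groups were kept and why the others were skipped. How the resulting list is then used to drop pipelines is left open here, since the thread does not settle it.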

Dotun O

03/21/2023, 2:27 PM

Thanks for the help. Is there a way to load a few rows from the data catalog within kedro with catalog.load, instead of loading the entire dataframe?

datajoely

03/21/2023, 2:28 PM

if you’re using Pandas - not really, it will eagerly load it all. If you’re using Spark then yes: load().limit(5)
