# questions

Dotun O

03/21/2023, 1:28 PM
Hi all, quick newbie Kedro question here. If I wanted to call catalog.load directly within the pipeline to observe the dataframes, how do I get the current catalog during the pipeline run? I see that kedro has from kedro.io import DataCatalog, but I'm not sure how to get the specific catalog context.

datajoely

03/21/2023, 1:28 PM
so we specifically don’t encourage this
when building pipelines we want deterministic, stateless functions, which is what enables reproducibility
for that matter, nodes can’t do I/O; it’s done for them: they just receive and produce data
now, you can inspect the live catalog objects during the run lifecycle using hooks
but you can’t do it within a node
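A minimal sketch of that pattern, assuming a standard Kedro 0.18-style project (the hook class name, file path, and logging setup are illustrative, not Kedro requirements):

```python
# hooks.py: observe node inputs via a hook instead of calling catalog.load()
# inside the node itself.
import logging

from kedro.framework.hooks import hook_impl

logger = logging.getLogger(__name__)


class NodeInputInspectionHooks:
    """Hypothetical hook class that logs a preview of every node's inputs."""

    @hook_impl
    def before_node_run(self, node, inputs):
        # `inputs` maps dataset names to data already loaded for the node,
        # so the node function itself stays free of I/O.
        for name, data in inputs.items():
            preview = data.head() if hasattr(data, "head") else data
            logger.info("Node %s receives %s:\n%s", node.name, name, preview)
```

Like any hook class, it only takes effect once registered in the project's settings.py, e.g. HOOKS = (NodeInputInspectionHooks(),).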

Dotun O

03/21/2023, 1:29 PM
ok that makes sense.

Dotun O

03/21/2023, 1:31 PM
Thank you. Ideally, I would like to observe the catalog even before the hooks/nodes are called. For that case, I have created a test.py file. Is there a way to observe the catalog even before using the hooks?

datajoely

03/21/2023, 1:32 PM
in a live context or as part of a pipeline run?
you can use kedro ipython or kedro jupyter notebook to get the live catalog outside of a run too
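For the interactive route, a small sketch (the dataset name is a placeholder; catalog, context, session and pipelines are the variables kedro ipython injects into the session):

```python
# Run `kedro ipython` (or `kedro jupyter notebook`) from the project root,
# then inspect the live catalog interactively:
catalog.list()                     # names of every configured dataset
df = catalog.load("companies")     # "companies" is a placeholder dataset name
df.head()
```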

Dotun O

03/21/2023, 1:32 PM
As a pipeline run, but before the pipelines are instantiated
The point here is that some of the catalog dataframes are empty but might not be in future runs

datajoely

03/21/2023, 1:33 PM
I mean the after_catalog_created or after_context_created hooks are the earliest point you can do that
and you can mutate the catalog live at that point if you’d like
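A rough sketch of those hook points (the class name and prints are illustrative; hook implementations may declare only the spec arguments they need):

```python
# hooks.py: the earliest points in the run lifecycle where the context
# and the catalog become visible.
from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog


class CatalogSetupHooks:
    """Hypothetical hooks that fire before any node runs."""

    @hook_impl
    def after_context_created(self, context):
        # Fires once the KedroContext exists, before the catalog is built.
        print("Context created for project:", context.project_path)

    @hook_impl
    def after_catalog_created(self, catalog: DataCatalog):
        # The live catalog is available here and can be inspected or mutated
        # before the pipeline run starts.
        print("Datasets in the catalog:", catalog.list())
```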

Dotun O

03/21/2023, 1:33 PM
ok that makes sense.
I tried mutating it in after_catalog_created, but it seemed like the groups that did not have an empty dataframe were not removed. I tried printing a line to see if the hook was being called, but no output appeared. Hence why I thought doing the mutation in a different file might work. Should I be thinking about this separately?

datajoely

03/21/2023, 1:39 PM
after_catalog_created gives you a live catalog which you can mutate with catalog.add(xxxx, replace=True)
you can also use the _exists() private method for your purposes: https://docs.kedro.org/en/stable/kedro.framework.hooks.specs.DataCatalogSpecs.html
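A hedged sketch of that kind of mutation (the dataset name is a placeholder, the class names are Kedro 0.18-era spellings, and DataCatalog.exists() is used here as the public wrapper around the dataset-level _exists() check):

```python
# hooks.py: patch catalog entries whose underlying data doesn't exist yet.
import pandas as pd

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog, MemoryDataSet


class CatalogMutationHooks:
    """Hypothetical hook that swaps in empty data for missing datasets."""

    @hook_impl
    def after_catalog_created(self, catalog: DataCatalog):
        name = "companies"  # placeholder dataset name
        if not catalog.exists(name):
            # Replace the entry with an empty in-memory dataframe so the rest
            # of the run still sees a well-defined (if empty) dataset.
            catalog.add(name, MemoryDataSet(pd.DataFrame()), replace=True)
```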

Dotun O

03/21/2023, 1:43 PM
ok, thanks. So at a high level, if I wanted to remove certain groups from running based on the available data, that would be done in this function? Will there be a way to print whether this actually happened, before the pipeline run? I just want to make sure that I can validate the right groups

datajoely

03/21/2023, 1:46 PM
yeah you can do all sorts of things with hooks, but we’re getting into territory where Kedro may not be the right tool, since it’s built intentionally to enforce reproducibility

Dotun O

03/21/2023, 2:27 PM
Thanks for the help. Is there a way to load a few rows from the data catalog within kedro with catalog.load, instead of loading the entire dataframe?

datajoely

03/21/2023, 2:28 PM
if you’re using Pandas, not really, it will eagerly load it all. If you’re using Spark then yes: load().limit(5).
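To make the contrast concrete (dataset names are placeholders; the Spark example assumes a spark.SparkDataSet-style entry in the catalog):

```python
# Spark DataFrames are evaluated lazily, so taking a few rows is cheap:
catalog.load("my_spark_dataset").limit(5).show()

# pandas datasets load eagerly: the whole file is read before you can slice.
catalog.load("my_pandas_dataset").head(5)
```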