# questions
d
Hi all, quick newbie Kedro question here. If I wanted to call catalog.load directly within the pipeline to observe the dataframes, how do I get the current catalog in the pipeline run? I see that Kedro has from kedro.io import DataCatalog, but I'm not sure how to get the specific catalog context.
d
so we specifically don’t encourage this
when building pipelines we want deterministic, stateless functions which enable reproducibility
for that matter nodes can’t do IO, it’s done for them - they just receive and produce data
Now you can inspect the live catalog objects during a run lifecycle using hooks
but you can’t do it within a node
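To make the "nodes just receive and produce data" point concrete, here is a minimal sketch of what such a function might look like. The function name, the row format, and the dataset names in the comment are all made up for illustration; the only point is that the body never touches the catalog.

```python
# Hypothetical node function: it only transforms its arguments.
# Kedro loads the inputs from the catalog and saves the return value,
# so no catalog.load / catalog.save call appears in the function body.
def drop_incomplete_rows(rows: list[dict]) -> list[dict]:
    """Return only the rows in which every value is present."""
    return [row for row in rows if all(value is not None for value in row.values())]


# Wiring it into a pipeline would look roughly like this (dataset names are
# assumptions, not taken from the thread):
#   from kedro.pipeline import node
#   node(drop_incomplete_rows, inputs="raw_rows", outputs="clean_rows")

sample = [{"id": 1, "score": 0.5}, {"id": 2, "score": None}]
cleaned = drop_incomplete_rows(sample)
```

Because the function depends only on its inputs, it can be called and checked on its own, without a Kedro session.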
d
ok that makes sense.
d
Thank you. Ideally, I would like to observe the catalog even before the hooks/nodes are called. For that case, I have created a test.py file. Is there a way to observe the catalog file even before using the hooks?
d
in a live context or as part of a pipeline run?
you can use kedro ipython or kedro jupyter notebook to get the live catalog outside of a run too
d
As part of a pipeline run, but before the pipelines are instantiated.
The point here is that some of the catalog dataframes are empty now but might not be in future runs.
d
I mean the after_catalog_created or the after_context_created hooks are the earliest point you can do that
and you can mutate the catalog live at that point if you’d like?
d
ok that makes sense.
I tried mutating after_catalog_created but it seemed like the groups that did not have an empty dataframe were not removed. I tried printing the line to see if it was being called but no output came about. Hence why I thought doing the mutating in a different file might work. Should I be thinking about this separately?
d
after_catalog_created gives you a live catalog which you can mutate with catalog.add(xxxx, replace=True)
you can also use the _exists() private method for your purposes https://docs.kedro.org/en/stable/kedro.framework.hooks.specs.DataCatalogSpecs.html
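A rough sketch of such a hook, in case it helps. The class name and the `missing` attribute are hypothetical, and it assumes the catalog exposes `list()` and the public `exists(name)` wrapper, as the DataCatalog in Kedro 0.18/0.19 does; other versions may differ. The import is guarded only so the sketch can be tried without Kedro installed.

```python
try:
    from kedro.framework.hooks import hook_impl
except ImportError:  # fallback so the sketch can run without Kedro installed
    def hook_impl(func):
        return func


class DataAvailabilityHooks:
    """Hypothetical hook class that reports which catalog entries have no data."""

    def __init__(self):
        self.missing = []

    @hook_impl
    def after_catalog_created(self, catalog):
        # Assumption: catalog.list() returns the registered dataset names and
        # catalog.exists(name) reports whether data is present for that name.
        self.missing = [name for name in catalog.list() if not catalog.exists(name)]
        print(f"Datasets without data: {self.missing}")
```

For Kedro to call the hook, the class has to be registered in the project's settings.py, e.g. `HOOKS = (DataAvailabilityHooks(),)`. A hook that is not registered is never called, which is one possible reason for a print statement producing no output. Note that `exists` checks whether data is present at the dataset's location, not whether a loaded dataframe has zero rows.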
d
ok. Thanks. So at a high level, if I wanted to remove certain groups from running based on the available data, that would be done in this function? Is there a way to print whether this actually happened, before the pipeline run? I just want to make sure that I can validate the right groups.
d
yeah you can do all sorts of things with hooks - but we’re getting into territory where Kedro may not be the right tool since it’s built intentionally to force reproducibility
d
Thanks for the help. Is there a way to load a few rows from the data catalog within kedro with catalog.load, instead of loading the entire dataframe?
d
if you’re using Pandas - not really, it will eagerly load it all. If you’re using Spark then yes: load().limit(5).