# questions
Hi all, quick newbie Kedro question here. If I wanted to call catalog.load directly within the pipeline to observe the dataframes, how do I get the current catalog in the pipeline run? I see that Kedro has kedro.io.DataCatalog, but I’m not sure how to get the specific catalog context.
so we specifically don’t encourage this
when building pipelines we want deterministic, stateless functions which enable reproducibility
for that matter nodes can’t do IO, it’s done for them - they just receive and produce data
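A minimal sketch of what that looks like in practice, in plain Python so it runs anywhere. The function name and the order fields are hypothetical, not taken from any real project.

```python
from typing import Dict, List


def keep_paid_orders(orders: List[Dict]) -> List[Dict]:
    """Node-style function: no file, database, or catalog access inside.

    It only transforms the data it is handed, so the same input always
    gives the same output.
    """
    return [order for order in orders if order.get("paid")]


# Hypothetical in-memory input; in a Kedro run the catalog would load this,
# pass it in, and then save whatever the function returns.
raw_orders = [
    {"id": 1, "paid": True},
    {"id": 2, "paid": False},
    {"id": 3, "paid": True},
]
paid_orders = keep_paid_orders(raw_orders)
print([order["id"] for order in paid_orders])  # prints [1, 3]
```

In a pipeline this would be wired up with something like node(keep_paid_orders, inputs="raw_orders", outputs="paid_orders"), where both dataset names are assumed catalog entries.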
Now you can inspect the live catalog objects during a run lifecycle using hooks
but you can’t do it within a node
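A rough sketch of such a hook, assuming a Kedro release where the catalog exposes list() (as DataCatalog does in 0.18/0.19). The class and helper names are made up for illustration.

```python
import logging

try:
    from kedro.framework.hooks import hook_impl
except ImportError:
    # Fallback so the sketch can be read and run without Kedro installed.
    def hook_impl(func):
        return func

logger = logging.getLogger(__name__)


def dataset_names(catalog):
    """Return the sorted dataset names registered in the catalog.

    Assumes the catalog exposes list(); newer releases may need a
    different call.
    """
    return sorted(catalog.list())


class CatalogInspectionHooks:
    """Hypothetical hook class that only observes the catalog."""

    @hook_impl
    def after_catalog_created(self, catalog):
        # Runs once the catalog is built, before any node executes.
        names = dataset_names(catalog)
        logger.info("Catalog has %d datasets: %s", len(names), names)
```

To take effect the class has to be registered, for example HOOKS = (CatalogInspectionHooks(),) in the project's settings.py.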
ok that makes sense.
Thank you. Ideally, I would like to observe the catalog even before the hooks/nodes are called. For that case, I have created a test.py file. Is there a way to observe the catalog file even before using the hooks?
in a live context or as part of a pipeline run?
you can use kedro ipython or kedro jupyter notebook to get the live catalog outside of a run too
As a pipeline run, but before the pipelines are instantiated.
The point here is that some of the catalog dataframes are empty now but might not be in future runs.
I mean the
or the
hooks are the earliest point you can do that
and you can mutate the catalog live at that point if you’d like?
ok that makes sense.
I tried mutating after_catalog_created but it seemed like the groups that did not have an empty dataframe were not removed. I tried printing the line to see if it was being called but no output came about. Hence why I thought doing the mutating in a different file might work. Should I be thinking about this separately?
gives you a live catalog which you can mutate with
catalog.add(xxxx, replace=True)
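Here is a hedged sketch of how that mutation could look inside a hook, assuming a Kedro release where catalog.add accepts (name, dataset, replace=True) as in the message above. The class name and the replacements mapping are hypothetical. When a hook seems silent, one thing worth ruling out is that it was never listed in HOOKS in settings.py.

```python
import logging

try:
    from kedro.framework.hooks import hook_impl
except ImportError:
    # Fallback so the sketch can be read and run without Kedro installed.
    def hook_impl(func):
        return func

logger = logging.getLogger(__name__)


class ReplaceDatasetHooks:
    """Hypothetical hook class that swaps catalog entries before the run."""

    def __init__(self, replacements):
        # Mapping of dataset name -> dataset object to register instead.
        self.replacements = dict(replacements)

    @hook_impl
    def after_catalog_created(self, catalog):
        for name, dataset in self.replacements.items():
            # replace=True overwrites an entry that already exists
            # under this name.
            catalog.add(name, dataset, replace=True)
            # WARNING level so the message is visible even with a bare
            # logging setup.
            logger.warning("Replaced catalog entry %r", name)
```

Logging each replacement gives a record in the run output of which entries were actually changed.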
you can also use the
private method for your purposes https://docs.kedro.org/en/stable/kedro.framework.hooks.specs.DataCatalogSpecs.html
ok. Thanks. So, at a high level, if I wanted to remove certain groups from running based on the available data, would that be done in this function? Is there a way to print whether this actually happened before the pipeline run? I just want to make sure that I can validate the right groups.
yeah you can do all sorts of things with hooks - but we’re getting into territory where Kedro may not be the right tool since it’s built intentionally to force reproducibility
Thanks for the help. Is there a way to load a few rows from the data catalog within kedro with catalog.load, instead of loading the entire dataframe?
if you’re using Pandas - not really, it will eagerly load it all. If you’re using Spark then yes
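For CSV-backed data there may be a partial workaround: pandas.read_csv accepts nrows, and, assuming the catalog entry is a pandas CSV dataset that forwards its load_args to read_csv, setting nrows there should limit what gets read. The sketch below only shows the pandas side, with made-up data.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file behind a catalog entry.
csv_text = "id,value\n1,10\n2,20\n3,30\n4,40\n"

# nrows stops parsing after the requested number of data rows.
preview = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(len(preview))  # prints 2
```

This applies to readers that support a row limit; other formats may still load in full.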