# questions
k
Hi, I've refactored a project and I'm unsure if every one of my catalog entries is used in any of my pipelines. Is there a way to output a list of unused datasets?
👍🏼 1
d
Something like...
from functools import reduce
from operator import or_

from <project_name>.pipeline_registry import register_pipelines

used_datasets = reduce(or_, (x.datasets() for x in register_pipelines().values()))

# Create a session
# https://docs.kedro.org/en/stable/kedro_project_setup/session.html#create-a-session
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from pathlib import Path

bootstrap_project(Path.cwd())
with KedroSession.create() as session:
    context = session.load_context()
    catalog = context.catalog
    unused_datasets = set(catalog.list()) - used_datasets
👍 1
Should probably work?
(I haven't tested/tried it)
j
this should be easier - @Kelsey Sorrels would you like to open a feature request? https://github.com/kedro-org/kedro/issues
i
I feel like I remember an old plugin by either Waylon Walker or DataEngineerOne that went along these lines, but I can't find it here https://docs.kedro.org/en/0.18.13/extend_kedro/plugins.html#community-developed-plugins Might be misremembering some combination of kedro-wings and steel-toes though
👍🏼 1
y
You should be able to use `kedro catalog list` to tell you which datasets aren't used by your pipelines. Link to the docs: https://docs.kedro.org/en/stable/development/commands_reference.html#list-datasets-per-pipeline-per-type
👍 2
👍🏼 2
k
Thank you
d
this should be easier
@Juan Luis I agree, but I also think it's not that complicated as is. In the code I wrote, the "complex" part (or at least most of the code) is getting the `catalog` object; I considered writing this to run in `kedro ipython`, where the `catalog` object is already available. The rest is fairly straightforward; the code for `used_datasets` might have been nicer if I didn't use `reduce` and `or_`. 🤷
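For reference, a minimal sketch of `used_datasets` without `reduce` and `or_`, using `set().union(...)` instead. `StubPipeline` and `collect_used_datasets` are hypothetical names for illustration only; the stub stands in for a Kedro pipeline and assumes, as in the snippet above, that each pipeline exposes a `datasets()` method returning a set of names.

```python
class StubPipeline:
    """Hypothetical stand-in for a Kedro pipeline (illustration only)."""

    def __init__(self, names):
        self._names = set(names)

    def datasets(self):
        # Assumption: mirrors the datasets() call used in the snippet above.
        return set(self._names)


def collect_used_datasets(pipelines):
    """Return the union of dataset names across a {name: pipeline} mapping."""
    return set().union(*(p.datasets() for p in pipelines.values()))


registry = {
    "__default__": StubPipeline(["companies", "model_input"]),
    "reporting": StubPipeline(["model_input", "report"]),
}
used_datasets = collect_used_datasets(registry)
```

One side effect of this form: it returns an empty set when no pipelines are registered, whereas `reduce(or_, ...)` without an initial value raises `TypeError` on an empty iterable.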
j
still, `kedro catalog list` is one command 😄
d
Sorry, just to be clear--you're saying it would be nice if `kedro catalog list` also had a section for unused datasets? (Also, I just realized, I've never seen this `kedro catalog list`; a lot more information in there than I remember.)
j
I believe `kedro catalog list` already addresses @Kelsey Sorrels' original question? (sorry, it's Friday afternoon and I'm fried)
🍳 1
d
Is there a way to output a list of unused datasets?
Not directly I think. 🙂
👍🏼 1
k
`kedro catalog list` worked. Even though I had to hunt a bit for it in the output, it satisfied my goal.
👍🏼 2
🚀 2
n
used_datasets = reduce(or_, (x.datasets() for x in register_pipelines()))
should be
used_datasets = reduce(or_, (x.datasets() for x in register_pipelines().values()))
🙏 1
d
Thanks! Updated in case anybody comes back to it.
n
And it works, though I think it would be nicer to have something more direct. This also mixes "parameters" with "datasets", which may not match people's expectations.
🙌 1
You will always have `parameters` there because no one uses it, and if you use `params:model_options.feature`, `params:model_options` will be "unused" with this definition.
👀 1
So there is a subtle difference between "direct usage" and "implicit usage"; it's harder to tell if someone uses `**kwargs`.
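One way to sidestep the parameters noise is to drop every parameter entry from the result before reading it. A minimal sketch, assuming the naming seen in this thread (a `parameters` entry plus `params:`-prefixed entries); `drop_parameters` is a hypothetical helper, and it does not try to decide whether a parameter is implicitly used.

```python
def drop_parameters(names):
    """Remove parameter entries so only real datasets remain.

    Assumption: parameters appear as "parameters" or with a "params:" prefix,
    as in the examples mentioned in this thread.
    """
    return {
        name
        for name in names
        if name != "parameters" and not name.startswith("params:")
    }


unused = {
    "parameters",
    "params:model_options",
    "companies",
}
unused_datasets_only = drop_parameters(unused)
```

This trades precision for simplicity: a parameter that really is unused will no longer be reported either.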
d
Yeah. I was also curious how this will work with dataset factories; my intuitive guess is that it won't work directly.
n
Do you mean patterns that aren't used?
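If the question is which factory patterns never match anything a pipeline uses, a rough approximation is to turn each pattern into a regular expression and test it against the used dataset names. This is only a sketch: it assumes patterns use `{placeholder}` syntax (for example `{name}_csv`), the helper names are hypothetical, and the regex matching here is a simplification, not Kedro's own pattern resolution.

```python
import re


def pattern_to_regex(pattern):
    """Approximate a "{placeholder}" pattern with a regular expression."""
    # Split on the placeholders and keep the literal text between them.
    literal_parts = re.split(r"\{[^{}]*\}", pattern)
    # Each placeholder becomes ".+" (one or more characters of any kind).
    return re.compile(".+".join(re.escape(part) for part in literal_parts))


def unused_patterns(patterns, used_datasets):
    """Return the patterns that match none of the used dataset names."""
    return {
        pattern
        for pattern in patterns
        if not any(
            pattern_to_regex(pattern).fullmatch(name) for name in used_datasets
        )
    }


never_matched = unused_patterns(
    ["{name}_csv", "raw.{name}"],
    {"cars_csv", "model_input"},
)
```

Here `{name}_csv` matches `cars_csv`, so only `raw.{name}` ends up in `never_matched`.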
One thing that I notice:
```
from <project_name>.pipeline_registry import register_pipelines
used_datasets = reduce(or_, (x.datasets() for x in register_pipelines().values()))
```
could be further simplified as
from kedro.framework.project import pipelines
used_datasets = reduce(or_, (x.datasets() for x in pipelines.values()))
But it has to be after `bootstrap_project`.
🙌 1
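Putting the pieces from this thread together, a sketch of a standalone script might look like the following. It is untested against a real project and assumes a Kedro version where `Pipeline.datasets()` and `catalog.list()` are available, as used earlier in the thread; `find_unused` and `main` are hypothetical names. The Kedro imports sit inside `main()` so that `pipelines` is only touched after `bootstrap_project`.

```python
from pathlib import Path


def find_unused(catalog_names, pipelines):
    """Return catalog entries that no registered pipeline references."""
    used = set().union(*(p.datasets() for p in pipelines.values()))
    return set(catalog_names) - used


def main(project_path=None):
    # Imported here so find_unused() can be used without Kedro installed.
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    project_path = Path(project_path or Path.cwd())
    bootstrap_project(project_path)

    # The pipeline registry is only read after bootstrap_project has run.
    from kedro.framework.project import pipelines

    with KedroSession.create(project_path=project_path) as session:
        catalog = session.load_context().catalog
        for name in sorted(find_unused(catalog.list(), pipelines)):
            print(name)
```

Calling `main()` from the project root prints one unused entry per line; parameter entries will still show up, as discussed above.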