Kelsey Sorrels
02/01/2024, 11:47 PM
Deepyaman Datta
02/02/2024, 5:13 AM
from functools import reduce
from operator import or_
from <project_name>.pipeline_registry import register_pipelines
used_datasets = reduce(or_, (x.datasets() for x in register_pipelines().values()))
# Create a session
# <https://docs.kedro.org/en/stable/kedro_project_setup/session.html#create-a-session>
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from pathlib import Path
bootstrap_project(Path.cwd())
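with KedroSession.create() as session:
    # Load the project context to get at the data catalog
    context = session.load_context()
    catalog = context.catalog
# Catalog entries that no registered pipeline mentions
unused_datasets = set(catalog.list()) - used_datasets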
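For anyone unfamiliar with the `reduce(or_, ...)` idiom in the snippet above: it folds the set-union operator over each pipeline's dataset names. Here is a minimal sketch with plain sets standing in for what each pipeline's `datasets()` call is assumed to return. All dataset names are made up for illustration. An equivalent `set().union(...)` spelling is shown alongside.

```python
from functools import reduce
from operator import or_

# Hypothetical stand-ins for the sets returned by each pipeline's datasets().
per_pipeline = [
    {"companies", "shuttles", "model_input_table"},
    {"model_input_table", "regressor"},
]

# reduce(or_, ...) computes a | b | c | ... across all the sets.
used_via_reduce = reduce(or_, per_pipeline)

# The same union without reduce/or_; this form also tolerates an empty list,
# whereas reduce() with no initial value raises TypeError on an empty input.
used_via_union = set().union(*per_pipeline)

# Hypothetical catalog entries; "old_report" is not used by any pipeline.
catalog_names = {
    "companies",
    "shuttles",
    "model_input_table",
    "regressor",
    "old_report",
}
unused = catalog_names - used_via_reduce
print(sorted(unused))
```

Both spellings give the same set, so which one to use is mostly a matter of taste.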
Deepyaman Datta
02/02/2024, 5:13 AM
Deepyaman Datta
02/02/2024, 5:26 AM
Juan Luis
02/02/2024, 6:54 AM
Iñigo Hidalgo
02/02/2024, 9:29 AM
Yetunde
02/02/2024, 10:01 AM
`kedro catalog list` to tell you which datasets aren't used by your pipelines.
Links to docs: https://docs.kedro.org/en/stable/development/commands_reference.html#list-datasets-per-pipeline-per-type
Kelsey Sorrels
02/02/2024, 4:09 PM
Deepyaman Datta
02/02/2024, 4:17 PM
> this should be easier
@Juan Luis I agree, but I also think it's not that complicated as is. In the code I wrote, the "complex" part (or at least most of the code) is to get the `catalog` object; I considered writing this to run in `kedro ipython`, which would mean the `catalog` object is already available.
The rest is fairly straightforward; the code for `used_datasets` might have been nicer if I didn't use `reduce` and `or_`. 🤷
Juan Luis
02/02/2024, 4:18 PM
`kedro catalog list` is one command
Deepyaman Datta
02/02/2024, 4:28 PM
`kedro catalog list` also had a section for unused datasets?
(Also, I just realized, I've never seen this `kedro catalog list`, lot more information in there than I remember.)
Juan Luis
02/02/2024, 4:29 PM
`kedro catalog list` already addresses @Kelsey Sorrels' original question? (sorry, it's Friday afternoon and I'm fried)
Deepyaman Datta
02/02/2024, 4:29 PM
> Is there a way to output a list of unused datasets?
Not directly I think.
Kelsey Sorrels
02/02/2024, 4:30 PM
`kedro catalog list` worked. Even though I had to hunt a bit for it in the output, it satisfied my goal
Nok Lam Chan
02/02/2024, 5:02 PM
used_datasets = reduce(or_, (x.datasets() for x in register_pipelines()))
should be
used_datasets = reduce(or_, (x.datasets() for x in register_pipelines().values()))
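Why `.values()` matters here: `register_pipelines()` returns a dict mapping pipeline names to pipeline objects, and iterating a dict directly yields its keys (plain strings), which have no `datasets()` method. A small sketch of the difference follows. `FakePipeline` and this `register_pipelines` are hypothetical stand-ins, not Kedro code.

```python
from functools import reduce
from operator import or_


class FakePipeline:
    """Hypothetical stand-in exposing only a datasets() method."""

    def __init__(self, names):
        self._names = set(names)

    def datasets(self):
        return self._names


def register_pipelines():
    # Stand-in for the project's registry: pipeline name -> pipeline object.
    return {
        "__default__": FakePipeline({"raw", "clean"}),
        "reporting": FakePipeline({"clean", "report"}),
    }


# Iterating the dict directly yields the string keys, which lack datasets().
try:
    reduce(or_, (x.datasets() for x in register_pipelines()))
    failed_without_values = False
except AttributeError:
    failed_without_values = True

# Iterating .values() yields the pipeline objects, as intended.
used_datasets = reduce(
    or_, (x.datasets() for x in register_pipelines().values())
)
print(failed_without_values, sorted(used_datasets))
```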
Deepyaman Datta
02/02/2024, 5:04 PM
Nok Lam Chan
02/02/2024, 5:04 PM
Nok Lam Chan
02/02/2024, 5:05 PM
`parameters` there because no one uses it, and if you use `params:model_options.feature`, `params:model_options` will be "unused" with this definition
Nok Lam Chan
02/02/2024, 5:07 PM
Deepyaman Datta
02/02/2024, 5:12 PM
Nok Lam Chan
02/02/2024, 5:19 PM
Nok Lam Chan
02/02/2024, 10:41 PM
```
from <project_name>.pipeline_registry import register_pipelines
used_datasets = reduce(or_, (x.datasets() for x in register_pipelines().values()))
```
could be further simplified as
```
from kedro.framework.project import pipelines
used_datasets = reduce(or_, (x.datasets() for x in pipelines.values()))
```
But it has to be after `bootstrap_project`
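Pulling the thread together, here is one possible helper that also skips parameter entries, since (as noted earlier in the thread) `parameters` and `params:` entries would otherwise show up as "unused". This is a sketch rather than a Kedro API. `find_unused_datasets` and `FakePipeline` are hypothetical names, and the filter assumes parameter entries are either named `parameters` or start with `params:`.

```python
def find_unused_datasets(catalog_names, pipelines):
    """Return catalog entries that no pipeline mentions, ignoring parameters."""
    used = set()
    for pipeline in pipelines.values():
        used |= pipeline.datasets()
    return {
        name
        for name in set(catalog_names) - used
        # Assumption: parameter entries follow these two naming patterns.
        if name != "parameters" and not name.startswith("params:")
    }


class FakePipeline:
    """Hypothetical stand-in exposing only a datasets() method."""

    def __init__(self, names):
        self._names = set(names)

    def datasets(self):
        return self._names


# Made-up pipelines and catalog entries, for illustration only.
pipelines = {
    "__default__": FakePipeline(
        {"raw", "clean", "params:model_options.feature"}
    ),
}
catalog_names = [
    "raw",
    "clean",
    "stale_table",
    "parameters",
    "params:model_options",
    "params:model_options.feature",
]
unused = find_unused_datasets(catalog_names, pipelines)
print(sorted(unused))
```

In a real project this would presumably be called after `bootstrap_project` as `find_unused_datasets(catalog.list(), pipelines)`, with `pipelines` imported from `kedro.framework.project`.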