Hi, I've refactored a project and I'm unsure if every one of... | Kedro #questions

Kelsey Sorrels

02/01/2024, 11:47 PM

Hi, I've refactored a project and I'm unsure if every one of my catalog entries is used in any of my pipelines. Is there a way to output a list of unused datasets?

👍🏼 1

Deepyaman Datta

02/02/2024, 5:13 AM

Something like...

from functools import reduce
from operator import or_

from <project_name>.pipeline_registry import register_pipelines

used_datasets = reduce(or_, (x.datasets() for x in register_pipelines().values()))

# Create a session
# <https://docs.kedro.org/en/stable/kedro_project_setup/session.html#create-a-session>
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from pathlib import Path

bootstrap_project(Path.cwd())
with KedroSession.create() as session:
    context = session.load_context()
    catalog = context.catalog
    unused_datasets = set(catalog.list()) - used_datasets

👍 1

Deepyaman Datta

02/02/2024, 5:13 AM

Should probably work?

Deepyaman Datta

02/02/2024, 5:26 AM

(I haven't tested/tried it)

Juan Luis

02/02/2024, 6:54 AM

this should be easier - @Kelsey Sorrels would you like to open a feature request? https://github.com/kedro-org/kedro/issues

Iñigo Hidalgo

02/02/2024, 9:29 AM

I feel like I remember an old plugin by either Waylon Walker or DataEngineerOne that went along these lines, but I can't find it here https://docs.kedro.org/en/0.18.13/extend_kedro/plugins.html#community-developed-plugins Might be misremembering some combination of kedro-wings and steel-toes though

👍🏼 1

Yetunde

02/02/2024, 10:01 AM

You should be able to use `kedro catalog list` to tell you which datasets aren't used by your pipelines. Link to the docs: https://docs.kedro.org/en/stable/development/commands_reference.html#list-datasets-per-pipeline-per-type

👍 2

👍🏼 2

Kelsey Sorrels

02/02/2024, 4:09 PM

Thank you

Deepyaman Datta

02/02/2024, 4:17 PM

> this should be easier

@Juan Luis I agree, but I also think it's not that complicated as is. In the code I wrote, the "complex" part (or at least most of the code) is getting the `catalog` object; I considered writing this to run in `kedro ipython`, which would mean the `catalog` object is already available. The rest is fairly straightforward; the code for `used_datasets` might have been nicer if I hadn't used `reduce` and `or_`. 🤷
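A minimal sketch of what the version without `reduce` and `or_` might look like, written as a plain function so it runs without a Kedro project. The names `find_unused_datasets`, `HasDatasets`, and `StubPipeline` are hypothetical and not part of Kedro; the only thing assumed about a pipeline is the `datasets()` method already used in the snippet above.

```python
from typing import Iterable, Mapping, Protocol, Set


class HasDatasets(Protocol):
    """Anything exposing datasets(), like the pipeline objects in the thread."""

    def datasets(self) -> Set[str]: ...


def find_unused_datasets(
    pipelines: Mapping[str, HasDatasets], catalog_names: Iterable[str]
) -> Set[str]:
    """Return catalog entries that no registered pipeline mentions."""
    # set().union(*sets) also handles an empty registry, where
    # reduce(or_, ...) without an initial value would raise TypeError.
    used = set().union(*(p.datasets() for p in pipelines.values()))
    return set(catalog_names) - used


class StubPipeline:
    """Hypothetical stand-in for a pipeline, for demonstration only."""

    def __init__(self, names: Iterable[str]) -> None:
        self._names = set(names)

    def datasets(self) -> Set[str]:
        return self._names


demo_pipelines = {
    "data_processing": StubPipeline({"companies", "model_input_table"}),
    "data_science": StubPipeline({"model_input_table", "regressor"}),
}
demo_unused = find_unused_datasets(
    demo_pipelines, ["companies", "model_input_table", "regressor", "old_reviews"]
)
print(sorted(demo_unused))
```

Inside `kedro ipython` the call would presumably be `find_unused_datasets(pipelines, catalog.list())`, assuming both `pipelines` and `catalog` are preloaded in that session; I have not verified this against a specific Kedro version.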

Juan Luis

02/02/2024, 4:18 PM

still, `kedro catalog list` is one command 😄

Deepyaman Datta

02/02/2024, 4:28 PM

Sorry, just to be clear: you're saying it would be nice if `kedro catalog list` also had a section for unused datasets? (Also, I just realized I've never really looked at this `kedro catalog list`; there's a lot more information in there than I remember.)

Juan Luis

02/02/2024, 4:29 PM

I believe `kedro catalog list` already addresses @Kelsey Sorrels's original question? (sorry, it's Friday afternoon and I'm fried)

🍳 1

Deepyaman Datta

02/02/2024, 4:29 PM

> Is there a way to output a list of unused datasets?

Not directly, I think. 🙂

👍🏼 1

Kelsey Sorrels

02/02/2024, 4:30 PM

`kedro catalog list` worked. Even though I had to hunt a bit for it in the output, it satisfied my goal.

🚀 2

👍🏼 2

Nok Lam Chan

02/02/2024, 5:02 PM

used_datasets = reduce(or_, (x.datasets() for x in register_pipelines()))

should be

used_datasets = reduce(or_, (x.datasets() for x in register_pipelines().values()))

🙏 1

Deepyaman Datta

02/02/2024, 5:04 PM

Thanks! Updated in case anybody comes back to it.

Nok Lam Chan

02/02/2024, 5:04 PM

And it works, though I think it would be nicer to have something more direct. This also mixes "parameters" with "datasets", which may not match people's expectations.

🙌 1

Nok Lam Chan

02/02/2024, 5:05 PM

You will always have `parameters` there because no one uses it directly, and if you use `params:model_options.feature`, then `params:model_options` will be "unused" under this definition.

👀 1

Nok Lam Chan

02/02/2024, 5:07 PM

So there is a subtle difference between "direct usage" and "implicit usage", and it's harder to tell if someone uses `**kwargs`.
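One possible way to soften the "direct" vs "implicit" problem for parameters, as a rough sketch only: treat `parameters` as used whenever any `params:` entry is used, and treat `params:a` as used whenever some `params:a.<child>` is used. The function name `unused_ignoring_implicit_params` is hypothetical, the input names are made up for the example, and this does nothing for the `**kwargs` case.

```python
from typing import Iterable, Set


def unused_ignoring_implicit_params(
    catalog_names: Iterable[str], used_names: Iterable[str]
) -> Set[str]:
    """Set difference that counts parent parameter entries as implicitly used."""
    used = set(used_names)
    unused = set()
    for name in set(catalog_names) - used:
        if name == "parameters":
            # The top-level entry counts as used if any params: entry is used.
            if any(u.startswith("params:") for u in used):
                continue
        elif name.startswith("params:"):
            # params:a counts as used if some params:a.<child> is used.
            if any(u.startswith(name + ".") for u in used):
                continue
        unused.add(name)
    return unused


example_catalog = [
    "parameters",
    "params:model_options",
    "params:model_options.feature",
    "params:model_options.other",
    "companies",
    "orphan",
]
example_used = {"params:model_options.feature", "companies"}
example_unused = unused_ignoring_implicit_params(example_catalog, example_used)
print(sorted(example_unused))
```

Sibling keys such as `params:model_options.other` are still reported, which may or may not be what people expect.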

Deepyaman Datta

02/02/2024, 5:12 PM

Yeah. I was also curious, how this will work with dataset factories; my intuitive guess is that it won't work directly.

Nok Lam Chan

02/02/2024, 5:19 PM

Do you mean patterns that aren't used?
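If "patterns that aren't used" is the question, a rough approximation could check each factory pattern against the dataset names the pipelines use. This sketch turns every `{placeholder}` into a regex wildcard, which is an assumption made for illustration and may not match Kedro's own pattern resolution exactly; `pattern_to_regex` and `unused_patterns` are hypothetical helpers, and the pattern strings are supplied by the caller (for example, copied from the catalog file).

```python
import re
from typing import Iterable, Pattern, Set


def pattern_to_regex(pattern: str) -> Pattern[str]:
    """Convert a pattern like '{name}_csv' into a regex (approximation)."""
    # Split on {placeholder} fields and escape the literal pieces in between.
    literal_parts = re.split(r"\{[^{}]*\}", pattern)
    return re.compile(".+".join(re.escape(part) for part in literal_parts))


def unused_patterns(patterns: Iterable[str], used_names: Iterable[str]) -> Set[str]:
    """Return the patterns that match none of the used dataset names."""
    names = list(used_names)
    result = set()
    for pattern in patterns:
        regex = pattern_to_regex(pattern)
        if not any(regex.fullmatch(name) for name in names):
            result.add(pattern)
    return result


example_patterns = ["{name}_csv", "{namespace}.{name}_pq"]
example_names = {"reviews_csv", "companies"}
stale_patterns = unused_patterns(example_patterns, example_names)
print(sorted(stale_patterns))
```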

Nok Lam Chan

02/02/2024, 10:41 PM

One thing that I notice:

```
from <project_name>.pipeline_registry import register_pipelines

used_datasets = reduce(or_, (x.datasets() for x in register_pipelines().values()))
```

could be further simplified to

from kedro.framework.project import pipelines

used_datasets = reduce(or_, (x.datasets() for x in pipelines.values()))

But it has to run after `bootstrap_project`.

🙌 1
