Hey everyone There is some way to get latest version of a da Kedro #questions

Gleydson Silva

12/06/2023, 8:37 PM

Hey everyone, is there some way to get the latest version of a dataset? I'm currently getting the version in a hook triggered when the catalog is created, but this only works if I explicitly pass the load version in the kedro run command. I would like to get the version of a loaded dataset even when the version is the latest.

Deepyaman Datta

12/06/2023, 8:38 PM

What does your hook look like? (Just to have a starting point)

Gleydson Silva

12/06/2023, 8:39 PM

like this:

@hook_impl
def after_catalog_created(
    self,
    catalog: DataCatalog,
    load_versions: Dict[str, str],
):
    self._logger.info(f"catalog: {catalog.datasets.__dict__}")
    self._logger.info(f"load_versions: {load_versions}")

Gleydson Silva

12/06/2023, 8:43 PM

When I run without specifying the version, this load_versions field is empty. It seems that when Kedro understands that the version is "latest", it doesn't fill this field. What I actually want to do is save the version of the model that I used to score a dataset.

Gleydson Silva

12/06/2023, 8:43 PM

Kedro version 0.18.14

Gleydson Silva

12/06/2023, 8:48 PM

When I run

kedro run --pipeline=scorer

it uses the latest version of my trained model, but I can't get the version in my hook or elsewhere. When I run

kedro run --pipeline=scorer --load-version=myversion

I can get the version on my hook. I would like to get the version of the loaded dataset in both cases.

Deepyaman Datta

12/06/2023, 8:54 PM

Sorry, I was looking through the code, since I'm not very fresh on it. The load_versions in the hook is basically going to pick up whatever was passed to the catalog creation, which is what you're passing in the CLI; it doesn't actually pass the actual loaded versions. I also don't think it could pull the actual versions at that time, since the actual "latest" version is only figured out when you go to load the dataset (by parsing the file structure).

Deepyaman Datta

12/06/2023, 8:56 PM

What you could do, in a hook, is get the catalog object, and get the config associated with your model dataset. From there, you could basically replicate Kedro's version discovery process to figure out what version will get pulled.
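For illustration, here is a minimal sketch of what "replicating the version discovery" could look like with only the standard library. It is not Kedro's actual implementation. It assumes the versioned layout <filepath>/<version>/<filename> and timestamp version names that sort chronologically as plain text; the function name latest_version is hypothetical.

```python
from pathlib import Path
from typing import Optional


def latest_version(dataset_path: Path) -> Optional[str]:
    """Return the newest version directory name under dataset_path, or None.

    Assumption: files are stored as <dataset_path>/<version>/<filename>,
    where <filename> equals the last component of dataset_path and
    <version> is a timestamp string that sorts chronologically as text.
    """
    filename = dataset_path.name
    # Only count version directories that actually contain the data file.
    candidates = sorted(dataset_path.glob(f"*/{filename}"), reverse=True)
    return candidates[0].parent.name if candidates else None
```

Calling latest_version(Path("data/06_models/model.pkl")) (a made-up path) would return the name of the newest version folder that contains model.pkl, or None if no version has been saved yet.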

Gleydson Silva

12/06/2023, 8:57 PM

@Deepyaman Datta where can I find an example of what you describe?

Deepyaman Datta

12/06/2023, 9:04 PM

@hook_impl
def after_catalog_created(
    self,
    catalog: DataCatalog,
    load_versions: Dict[str, str],  # needed because it is logged below
):
    self._logger.info(f"catalog: {catalog.datasets.__dict__}")
    model = catalog.datasets["my_model"]
    self._logger.info(f"load_versions: {load_versions}")
    self._logger.info(f"model load version: {model.resolve_load_version()}")

try something like this? i'm writing it just reading the code, so may need to tweak something if it doesn't work 🙂

Deepyaman Datta

12/06/2023, 9:04 PM

i think this should hypothetically work whether you pass in the load version explicitly or not

Gleydson Silva

12/06/2023, 9:05 PM

Thank you. I will try it and come back here to tell you if it works.

Nok Lam Chan

12/07/2023, 9:39 AM

From vague memory, the load version isn’t propagated to any user-facing API: https://github.com/kedro-org/kedro/issues/1580. Please let us know if you find a way; we are open to changes to Kedro core if necessary.

Deepyaman Datta

12/07/2023, 1:59 PM

@Nok Lam Chan why wouldn't calling resolve_load_version be sufficient in this case? I see it mentioned in the issue, as well. In @Gleydson Silva’s case, specifically, he's using it in a scorer pipeline, so you don't even need to worry about a new version of the model being produced before execution.

Nok Lam Chan

12/07/2023, 2:07 PM

@Deepyaman Datta resolve_load_version should be fine. My point (the GitHub issue) is that this is not currently exposed in any public API. The load version is fetched when needed and it doesn’t update the dataset definition. Thus you will still have self.load_version=None in the dataset object itself. It will probably work if you call resolve_load_version in the hook manually.

👍 2

Deepyaman Datta

12/07/2023, 2:12 PM

"The load version is fetched when needed and it doesn’t update the dataset definition."

I think this is fine TBH, or at least "working as intended".

👍🏼 1

Nok Lam Chan

12/07/2023, 2:16 PM

I think it is fine too, even better if we can expose this in the log (or hook spec)

👍 1

Gleydson Silva

12/08/2023, 8:15 PM

@hook_impl
def after_catalog_created(
    self,
    catalog: DataCatalog,
    load_versions: Dict[str, str],  # needed because it is logged below
):
    self._logger.info(f"catalog: {catalog.datasets.__dict__}")
    model = catalog.datasets.__dict__["my_model"]
    self._logger.info(f"load_versions: {load_versions}")
    self._logger.info(f"model load version: {model.resolve_load_version()}")

This seems to have solved my problem. I just added __dict__ to your example, @Deepyaman Datta. Thank you, guys 🙂 IMHO, this information should be in the docs. I think it's not unusual for people to keep track of which version of a model was used in a scoring pipeline.

🙌 1

Deepyaman Datta

12/08/2023, 11:25 PM

Ah! Maybe catalog.datasets.my_model or getattr(catalog.datasets, "my_model") would work slightly more cleanly; I don't 100% remember (it used to), just trying to avoid accessing non-public parts of the API. 🙂
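Putting the thread together, a small helper along these lines could be called from the hook. This is only a sketch under assumptions: the helper name resolved_load_version is hypothetical, it relies on the dataset being reachable as an attribute of catalog.datasets (as suggested above, not verified here for every Kedro version), and it is duck-typed so that datasets without a resolve_load_version method simply yield None.

```python
from typing import Any, Optional


def resolved_load_version(catalog: Any, dataset_name: str) -> Optional[str]:
    """Return the version a dataset would load, or None if it is unknown.

    Assumption: catalog.datasets exposes each dataset as an attribute and
    versioned datasets provide a resolve_load_version() method.
    """
    datasets = getattr(catalog, "datasets", None)
    # Attribute access avoids reaching into catalog.datasets.__dict__.
    dataset = getattr(datasets, dataset_name, None)
    resolve = getattr(dataset, "resolve_load_version", None)
    # Non-versioned or missing datasets have nothing to resolve.
    return resolve() if callable(resolve) else None
```

Inside after_catalog_created, this could be used as self._logger.info(f"model load version: {resolved_load_version(catalog, 'my_model')}"), whether or not a load version was passed on the command line.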
