# questions
g
Hey everyone, is there some way to get the latest version of a dataset? I'm currently getting the version in a hook triggered when the catalog is created, but this only works if I explicitly pass the load version in the kedro run command. I would like to get the version of a loaded dataset even when the version is the latest.
d
What does your hook look like? (Just to have a starting point)
g
like this:
    @hook_impl
    def after_catalog_created(
        self,
        catalog: DataCatalog,
        load_versions: Dict[str, str],
    ):
        self._logger.info(f"catalog: {catalog.datasets.__dict__}")
        self._logger.info(f"load_versions: {load_versions}")
When I run without specifying the version, this load_versions field is empty. It seems that when Kedro treats the version as "latest", it doesn't fill this field. What I actually want to do is save the version of the model that I used to score a dataset.
Kedro version 0.18.14
When I run
kedro run --pipeline=scorer
it uses the latest version of my trained model, but I can't get the version in my hook or elsewhere. When I run
kedro run --pipeline=scorer --load-version=myversion
I can get the version in my hook. I would like to get the version of the loaded dataset in both cases.
d
Sorry, I was looking through the code, since I'm not very fresh on it. The load_versions in the hook is basically going to pick up whatever was passed to the catalog creation, which is what you're passing in the CLI; it doesn't pass the actual loaded versions. I also don't think it could pull the actual versions at that time, since the actual "latest" version is only figured out when you go to load the dataset (by parsing the file structure).
What you could do, in a hook, is get the catalog object, and get the config associated with your model dataset. From there, you could basically replicate Kedro's version discovery process to figure out what version will get pulled.
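To make the "parsing the file structure" step concrete, here is a minimal sketch of that kind of discovery. It assumes the usual versioned layout of `<filepath>/<version>/<filename>` with timestamp folder names that sort chronologically; `find_latest_version` is a made-up helper name, not a Kedro function, and Kedro's own lookup may differ in detail.

```python
from pathlib import Path
from typing import Optional


def find_latest_version(filepath: str) -> Optional[str]:
    """Return the newest version folder name under a versioned dataset path.

    Assumption: files live at <filepath>/<version>/<filename>, and <version>
    is a timestamp string whose lexicographic order matches its time order.
    """
    base = Path(filepath)
    # Only count version folders that actually contain the data file.
    versions = sorted(
        p.parent.name for p in base.glob(f"*/{base.name}") if p.is_file()
    )
    return versions[-1] if versions else None
```

For a model saved at a hypothetical path like `data/06_models/model.pkl`, the call would be `find_latest_version("data/06_models/model.pkl")`, which returns None if no version folder holds the file.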
g
@Deepyaman Datta where can I find an example of what you describe?
d
    @hook_impl
    def after_catalog_created(
        self,
        catalog: DataCatalog,
        load_versions: Dict[str, str],
    ):
        self._logger.info(f"catalog: {catalog.datasets.__dict__}")
        model = catalog.datasets["my_model"]
        self._logger.info(f"load_versions: {load_versions}")
        self._logger.info(f"model load version: {model.resolve_load_version()}")
try something like this? i'm writing it just reading the code, so may need to tweak something if it doesn't work 🙂
i think this should hypothetically work whether you pass in the load version explicitly or not
g
Thank you. I will try it and come back here to tell you if it works.
n
From vague memory the load version isn’t propagated to any user facing API. https://github.com/kedro-org/kedro/issues/1580. Please let us know if you find a way, we are open to changes to kedro core if it’s necessary.
d
@Nok Lam Chan why wouldn't calling resolve_load_version be sufficient in this case? I see it mentioned in the issue, as well. In @Gleydson Silva’s case, specifically, he's using it in a scorer pipeline, so you don't even need to worry about a new version of the model being produced before execution.
n
@Deepyaman Datta resolve_load_version should be fine. My point (the GitHub issue) is that this is not currently exposed in any public API. The load version is fetched when needed and doesn't update the dataset definition, so you will still have self.load_version=None in the dataset object itself. It will probably work if you call resolve_load_version in the hook manually.
👍 2
d
"The load version is fetched when needed and doesn't update the dataset definition."
I think this is fine TBH, or at least "working as intended".
👍🏼 1
n
I think it is fine too, even better if we can expose this in the log (or hook spec)
👍 1
g
    @hook_impl
    def after_catalog_created(
        self,
        catalog: DataCatalog,
        load_versions: Dict[str, str],
    ):
        self._logger.info(f"catalog: {catalog.datasets.__dict__}")
        model = catalog.datasets.__dict__["my_model"]
        self._logger.info(f"load_versions: {load_versions}")
        self._logger.info(f"model load version: {model.resolve_load_version()}")
This seems to have solved my problem. I just added this __dict__ to your example @Deepyaman Datta. Thank you guys 🙂 IMHO, this information should be in the docs. I think it's not unusual for people to keep track of which version of a model was used in a scoring pipeline.
🙌 1
d
Ah! Maybe catalog.datasets.my_model or getattr(catalog.datasets, "my_model") would work slightly more cleanly; I don't 100% remember (it used to), just trying to avoid accessing non-public parts of the API. 🙂
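A rough sketch of that getattr-style lookup, written so it runs without Kedro installed. `describe_load_version` and `FakeVersionedDataset` are made-up names, and the `SimpleNamespace` only stands in for `catalog.datasets`; the one thing taken from the thread is that a versioned dataset offers `resolve_load_version()`.

```python
from types import SimpleNamespace
from typing import Any, Optional


def describe_load_version(datasets: Any, name: str) -> Optional[str]:
    """Look up a dataset by attribute name and ask it for its load version."""
    dataset = getattr(datasets, name, None)
    resolver = getattr(dataset, "resolve_load_version", None)
    # Unknown names, or datasets without resolve_load_version, yield None.
    return resolver() if callable(resolver) else None


class FakeVersionedDataset:
    """Hypothetical stand-in for a versioned Kedro dataset."""

    def __init__(self, latest: str) -> None:
        self._latest = latest

    def resolve_load_version(self) -> str:
        return self._latest


# Hypothetical stand-in for `catalog.datasets`.
datasets = SimpleNamespace(
    my_model=FakeVersionedDataset("2023-10-01T12.00.00.000Z")
)
print(describe_load_version(datasets, "my_model"))
```

In a real hook you would pass `catalog.datasets` instead of the stand-in and log or store the returned string next to the scoring output.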