#questions

Cãtãlin Alexandru

11/21/2023, 3:22 PM
Hello Kedro team 🙌🏻
TL;DR: How can I run a piece of code after a node has finished and its dataset has been saved, while still being able to get information about the node that ran and where the dataset was saved?
Long form: I am trying to use Weights & Biases (WandB) with Kedro to track the output datasets of nodes. WandB creates an artifact of a dataset by computing its hash; if the hash changes, it is a new version of that dataset. This means the dataset needs to exist at the time I want to log it in WandB. At the end of a node in Kedro the data is not yet saved, so I looked into the `after_node_run` hook, but even at this stage it is still not saved. The `after_dataset_saved` hook could work, but it does not receive the catalog, so I do not know where the dataset was saved and cannot get its hash with WandB. The `after_pipeline_run` hook would work, but it runs very late and I want to track each node individually. The only working option I found is to create custom datasets from `AbstractDataset` that log each file individually after it is saved, but that would mean writing a lot of extra code for each file type. Is there another way? Thank you very much!
🤔 1
🔥 1

marrrcin

11/21/2023, 3:28 PM
That’s a nice one!
😄 1

Nok Lam Chan

11/21/2023, 4:46 PM
See:
• https://github.com/kedro-org/kedro/pull/2296
• https://github.com/kedro-org/kedro/issues/2258
Is the dataset hook not enough? #2258 was created exactly to extend the use case for WandB. IIRC there was some earlier discussion about creating a kedro-wandb extension.
There is an example WandbHook in the issue as well.

Cãtãlin Alexandru

11/21/2023, 5:10 PM
Thank you very much for your reply! This is useful but won't work exactly for my use case. In that PR, the user saves the full data as an artifact, while I am mostly looking for where the data was saved (the path in the catalog) so that I can get its hash. I am not sure whether I can obtain the catalog information from the `node` argument of `after_dataset_saved`. I am doing this because I do not want to upload the entire dataset to WandB 🙂

Nok Lam Chan

11/21/2023, 5:12 PM
What exactly do you need from catalog?

Cãtãlin Alexandru

11/21/2023, 5:16 PM
```yaml
something_about_data:
  type: pandas.CSVDataSet
  filepath: path/to/file.csv
```
I need the `filepath` of the specific dataset I just saved from a node. The entry above is the output of a node. Here it is a single file, but for other nodes I will have a folder of large data. I want WandB to log the hash of a file or directory after it has been saved and after the node has finished. `after_dataset_saved()` is a great place to run my WandB logger, but I am not sure how to obtain the filepath.
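To make "the hash of a file or directory" concrete, here is a minimal sketch of a content hash that works for both a single file and a folder. It is only an illustration using the standard library: `hash_path` is a hypothetical helper, and it is not the algorithm WandB uses internally.

```python
import hashlib
from pathlib import Path


def hash_path(path: str, chunk_size: int = 1 << 20) -> str:
    """Return a SHA-256 hex digest for a file, or for every file under a directory."""
    root = Path(path)
    if not root.exists():
        raise FileNotFoundError(path)
    single = root.is_file()
    # Walk directories in sorted order so the digest does not depend on
    # the order in which the filesystem lists entries.
    files = [root] if single else sorted(p for p in root.rglob("*") if p.is_file())
    digest = hashlib.sha256()
    for file in files:
        # Mix in the (relative) name so that renaming a file changes the digest.
        rel = file.name if single else file.relative_to(root).as_posix()
        digest.update(rel.encode("utf-8"))
        with file.open("rb") as handle:
            # Read in chunks so large files are never loaded fully into memory.
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

Calling `hash_path("path/to/file.csv")` before and after a node rewrites the file gives two different digests whenever the contents changed, which is the signal needed to decide that a new dataset version exists.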

Nok Lam Chan

11/21/2023, 5:17 PM
If some argument is not available, you can always use the fact that Hook is a stateful object.
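```python
from kedro.framework.hooks import hook_impl


class MyHook:
    @hook_impl
    def after_catalog_created(self, catalog):
        # Keep a reference to the catalog on the hook instance.
        self.catalog = catalog

    @hook_impl
    def after_dataset_saved(self, dataset_name):
        # The catalog is not in this hook's signature, but because of the
        # execution order self.catalog is already available here.
        self.catalog
```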
👍 1
I don’t remember whether this is already in our docs. I checked GitHub quickly but couldn’t find the issue. Cc @Juan Luis

Cãtãlin Alexandru

11/21/2023, 5:18 PM
Oh! This would do it! Trying now
K 1

Juan Luis

11/21/2023, 5:25 PM

Cãtãlin Alexandru

11/21/2023, 5:31 PM
The catalog above only gives me the entry names via `.list()`, and I can load the data using `.load()`, but I can’t seem to get the filepath of an entry?
got it!

Nok Lam Chan

11/22/2023, 7:15 AM
@Cãtãlin Alexandru Can you share what you did in the end? It will help other users in the future. We also have our Slack archive, which hopefully can be googled at some point.
this 1
👍 2

Cãtãlin Alexandru

11/22/2023, 9:52 AM
Will do, and I will get back to this in the afternoon! As for the crossed-out message above, it was just me not knowing where to find the data in the catalog object. I found it with the debugger. I will write a more comprehensive message later.
🥳 1
Thank you everyone for your help! The issue was that I could not run a piece of code (Weights & Biases) after a node ends and get a hash of the dataset it created. The best way to approach this is to use the `after_dataset_saved` hook. It runs after a node ends and AFTER the dataset has been saved, which is the important part for me. The `after_node_run` hook runs after a node ends but before the dataset is saved, which won’t work for me. Unfortunately, the `after_dataset_saved` hook doesn’t receive the catalog, which is needed to obtain the path where the dataset was saved, but this can be fixed with Nok’s suggestion. See below something that works for me, including how to obtain the file path of the saved dataset:
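```python
from kedro.framework.hooks import hook_impl


class WandBLogging:
    """Uses Weights and Biases to log artifacts after a node/run ends."""

    def __init__(self):
        self.catalog = None

    @hook_impl
    def after_catalog_created(self, catalog):
        # Runs before any dataset is saved, so the catalog is stored first.
        self.catalog = catalog

    @hook_impl
    def after_dataset_saved(self, dataset_name, data, node):
        # _filepath is a private attribute found with the debugger.
        filepath = str(self.catalog.datasets.__dict__[dataset_name]._filepath)
        do_something_wandb(filepath)  # placeholder for the WandB logging call
```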
K 2
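One possible body for the `do_something_wandb` placeholder, sketched under assumptions: the dataset lives on a local filesystem, `wandb.init()` has already been called somewhere in the run, and the artifact name and type are arbitrary choices. It uses a reference artifact so that WandB records a checksum of the file or directory without uploading the data. `to_file_uri` is a hypothetical helper, not part of Kedro or WandB.

```python
from pathlib import Path


def to_file_uri(filepath: str) -> str:
    """Turn a local path into an absolute file:// URI."""
    return Path(filepath).resolve().as_uri()


def do_something_wandb(filepath: str, artifact_name: str = "dataset") -> None:
    """Log a reference to `filepath` without uploading its contents."""
    # Imported lazily so this module can be loaded without wandb installed.
    import wandb

    # Assumption: an active run exists, i.e. wandb.init() was called earlier.
    artifact = wandb.Artifact(name=artifact_name, type="dataset")
    # A reference tracks checksums of the files instead of copying them.
    artifact.add_reference(to_file_uri(filepath))
    wandb.log_artifact(artifact)
```

Inside the hook, passing `dataset_name` as `artifact_name` would give each catalog entry its own artifact, so that every node output is versioned separately.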

Nok Lam Chan

11/22/2023, 1:47 PM
This is awesome! Thanks for sharing this back to the community :) Optionally, you can use the `catalog._get_dataset` method.
🙌 1
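A sketch of how that suggestion could replace the `__dict__` lookup, assuming the method meant is the single-underscore `DataCatalog._get_dataset` and that file-backed datasets keep their location in a private `_filepath` attribute. Both are private details that may change between Kedro releases, and `resolve_filepath` is a hypothetical helper name.

```python
from typing import Any, Optional


def resolve_filepath(catalog: Any, dataset_name: str) -> Optional[str]:
    """Return the dataset's file path as a string, or None if it has none."""
    # _get_dataset is private API, so treat this call as an assumption.
    dataset = catalog._get_dataset(dataset_name)
    # In-memory datasets have no _filepath, so fall back to None.
    filepath = getattr(dataset, "_filepath", None)
    return None if filepath is None else str(filepath)
```

In the hook this would read `filepath = resolve_filepath(self.catalog, dataset_name)`, followed by a check for `None` before logging, so that outputs that are never written to disk are skipped.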

Cãtãlin Alexandru

11/22/2023, 1:49 PM
Thank you! I will test it and update the above when I can, if it’s simpler 🙂 What I wrote was the first thing I saw in the debugger.
👍🏼 1