Cãtãlin Alexandru
11/21/2023, 3:22 PM
after_node_run hook, but even at this stage it’s still not saved. The after_dataset_saved hook could work, but it does not receive the catalog, so I do not know where the dataset was saved and cannot get its hash with WandB. The after_pipeline_run hook would work, but it runs very late and I want to track each node individually.
The only working option I found is to create custom datasets from AbstractDataset that log each file individually after it is saved. But this would mean a lot of extra code to write for each file type.
Is there another way? Thank you very much!
marrrcin
11/21/2023, 3:28 PM
Nok Lam Chan
11/21/2023, 4:46 PM
Nok Lam Chan
11/21/2023, 4:47 PM
Cãtãlin Alexandru
11/21/2023, 5:10 PM
node variable of after_dataset_saved.
The reason I am doing this is that I do not want to upload the entire data to WandB 🙂
Nok Lam Chan
11/21/2023, 5:12 PM
Cãtãlin Alexandru
11/21/2023, 5:16 PM
something_about_data:
  type: pandas.CSVDataSet
  filepath: path/to/file.csv
The filepath for the specific dataset I just saved out of a node. The above will be the output of a node.
In the above it would be 1 file, but for other nodes I will have a folder of large data. I want WandB to log the hash of a file or directory after it was saved and after a node finished.
after_dataset_saved() is a great place to run my wandb logger, but I am not sure how to obtain the filepath.
Nok Lam Chan
11/21/2023, 5:17 PM
from kedro.framework.hooks import hook_impl

class MyHook:
    @hook_impl
    def after_catalog_created(self, catalog):
        # Keep a reference to the catalog for use in later hooks.
        self.catalog = catalog

    @hook_impl
    def after_dataset_saved(self, dataset_name):
        self.catalog  # it doesn't exist in the signature, but because of the execution order it is available here
Nok Lam Chan
11/21/2023, 5:18 PM
Cãtãlin Alexandru
11/21/2023, 5:18 PM
Juan Luis
11/21/2023, 5:25 PM
Cãtãlin Alexandru
11/21/2023, 5:31 PM
.list() and can load the data using .load(), but I can't seem to get the filepath of an entry?
Nok Lam Chan
11/22/2023, 7:15 AM
Cãtãlin Alexandru
11/22/2023, 9:52 AM
Cãtãlin Alexandru
11/22/2023, 1:44 PM
after_dataset_saved hook. It runs after a node ends and AFTER the dataset was saved, which is the important part for me. The after_node_run hook will run after a node ends but before the dataset is saved, which won’t work for me.
The after_dataset_saved hook unfortunately doesn’t receive the catalog, so on its own it cannot tell where the dataset was saved, but this can be fixed with Nok’s suggestion.
See below something that works for me, including how to obtain the file path of the saved dataset:
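from kedro.framework.hooks import hook_impl


class WandBLogging:
    """Uses Weights and Biases to log artifacts after a node/run ends."""

    def __init__(self):
        self.catalog = None

    @hook_impl
    def after_catalog_created(self, catalog):
        # Store the catalog so later hooks can look datasets up by name.
        self.catalog = catalog

    @hook_impl
    def after_dataset_saved(self, dataset_name, data, node):
        # _filepath is a private attribute of file-based datasets.
        filepath = str(self.catalog.datasets.__dict__[dataset_name]._filepath)
        do_something_wandb(filepath)  # placeholder for the actual WandB logging call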
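For the "log the hash of a file or directory" part, here is a minimal standard-library sketch of what a helper called from do_something_wandb might look like. The names resolve_filepath and hash_path are hypothetical and not part of Kedro or WandB, and this only illustrates computing a content hash locally; it does not claim to match how WandB computes its own checksums. It assumes the dataset exposes a private _filepath attribute, which in-memory datasets will not have.

```python
import hashlib
from pathlib import Path
from typing import Optional


def resolve_filepath(dataset: object) -> Optional[str]:
    """Return the dataset's private _filepath as a string, or None if absent."""
    # Assumption: file-based datasets keep their location in `_filepath`.
    filepath = getattr(dataset, "_filepath", None)
    return None if filepath is None else str(filepath)


def hash_path(path: str, chunk_size: int = 1 << 20) -> str:
    """Return a SHA-256 hex digest of a file, or of all files under a directory."""
    root = Path(path)
    if not root.exists():
        raise FileNotFoundError(path)

    digest = hashlib.sha256()
    if root.is_file():
        files = [root]
    else:
        # Sort so the digest does not depend on filesystem listing order.
        files = sorted(p for p in root.rglob("*") if p.is_file())

    for file in files:
        if root.is_dir():
            # Mix in the relative path so renaming a file changes the digest.
            digest.update(file.relative_to(root).as_posix().encode("utf-8"))
        # Read in chunks so large files are never loaded into memory at once.
        with file.open("rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

Inside after_dataset_saved, one could call resolve_filepath on the looked-up dataset, skip the entry when it returns None, and pass hash_path(filepath) to whatever logging call is used.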
Nok Lam Chan
11/22/2023, 1:47 PM
Cãtãlin Alexandru
11/22/2023, 1:49 PM