Galen Seilis
01/12/2024, 4:35 PM
target_datasets = [...]
for dataset in catalog.datasets:
    if dataset in target_datasets:
        nodes.append(
            node(dataset.describe(), inputs=dataset, outputs=f'{dataset}_describe')
        )
Merel
01/12/2024, 4:37 PM
after_dataset_loaded or after_catalog_created could help here.
Galen Seilis
01/12/2024, 4:46 PM
Merel
01/12/2024, 4:49 PM
Galen Seilis
01/12/2024, 4:58 PM
@hook_spec
def after_catalog_created(
    self,
    catalog: DataCatalog,
    conf_catalog: Dict[str, Any],
    conf_creds: Dict[str, Any],
    save_version: str,
    load_versions: Dict[str, str],
) -> None:
    pass
I am not sure where @hook_spec is supposed to be imported from, or how I am meant to implement functionality in after_catalog_created.
For example, if I wanted to call pandas.DataFrame.describe on each dataset in the catalog and then put those tabular results back into the catalog, I am unsure how I am meant to do that. Do I just monkey patch catalog, or is there a method available on instances of DataCatalog?
It just isn't concrete for me yet 😅
Merel
01/12/2024, 5:03 PM
Merel
01/12/2024, 5:04 PM
Galen Seilis
01/12/2024, 5:04 PM
Galen Seilis
01/12/2024, 5:23 PM
Galen Seilis
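01/12/2024, 5:40 PM
from typing import Any, Dict

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog


class PrintCatalog:
    @hook_impl
    def after_catalog_created(self, catalog: DataCatalog, conf_catalog: Dict[str, Any]) -> None:
        # dir() also lists non-dataset attributes, hence the broad except below.
        for thing in dir(catalog.datasets):
            try:
                print(getattr(catalog.datasets, thing).load().describe())
            except Exception as e:
                print(e)
        # Stop the run once everything has been printed.
        quit()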
Naturally, the catch-all Exception is not great, but it is not permanent. I am just trying to take a look at what I am accessing.
How would I add new entries to the data catalog that would in turn get saved while the hook is executing? For example, if I wanted to write the results of describe to the datasets folder in my Kedro project.
Galen Seilis
01/12/2024, 5:52 PM
add method on DataCatalog.
Help on function add in module kedro.io.data_catalog:

add(self, dataset_name: 'str', dataset: 'AbstractDataset', replace: 'bool' = False) -> 'None'
    Adds a new ``AbstractDataset`` object to the ``DataCatalog``.

    Args:
        dataset_name: A unique data set name which has not been
            registered yet.
        dataset: A data set object to be associated with the given data
            set name.
        replace: Specifies whether to replace an existing dataset
            with the same name is allowed.

    Raises:
        DatasetAlreadyExistsError: When a data set with the same name
            has already been registered.

    Example:
    ::

        >>> from kedro_datasets.pandas import CSVDataset
        >>>
        >>> io = DataCatalog(datasets={
        >>>                   'cars': CSVDataset(filepath="cars.csv")
        >>>                  })
        >>>
        >>> io.add("boats", CSVDataset(filepath="boats.csv"))
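Putting the pieces of this thread together, here is a minimal sketch of how add could be combined with load and save to store the describe output of selected datasets. It is only an illustration, not Kedro's recommended approach: add_describe_datasets and make_dataset are hypothetical names, and catalog is assumed to behave like the Kedro 0.19 DataCatalog shown in the help output above (load(name), add(name, dataset), save(name, data)).

```python
from typing import Any, Callable, Iterable, List


def add_describe_datasets(
    catalog: Any,
    target_names: Iterable[str],
    make_dataset: Callable[[str], Any],
) -> List[str]:
    """Save `.describe()` of each target dataset under a new catalog entry.

    Assumptions (not confirmed in the thread):
    - `catalog` offers load(name), add(name, dataset) and save(name, data),
      as Kedro's DataCatalog does.
    - `make_dataset` is a hypothetical factory that returns the dataset
      object the summary should be written to.
    """
    created = []
    for name in target_names:
        # Load the data and summarise it, e.g. pandas.DataFrame.describe.
        summary = catalog.load(name).describe()
        summary_name = f"{name}_describe"
        # Register the new entry, then persist the summary through it.
        catalog.add(summary_name, make_dataset(summary_name))
        catalog.save(summary_name, summary)
        created.append(summary_name)
    return created
```

Inside a hook, this could be called from after_catalog_created (decorated with @hook_impl from kedro.framework.hooks), passing something like lambda name: CSVDataset(filepath=f"data/08_reporting/{name}.csv") as the factory. That path is only an example, and CSVDataset may need save_args={"index": True} to keep the statistic labels that describe puts in the index.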