# questions
p
Hey Kedro community, I'm currently working on a project trying to use kedro_mlflow to store kedro_datasets_experimental.netcdf datasets as artifacts. Unfortunately I can't make it work. The problem seems to be path related:
kedro.io.core.DatasetError: 
Failed while saving data to dataset MlflowNetCDFDataset(filepath=S:/…/data/07_model_output/D2-24-25/idata.nc, load_args={'decode_times': False}, protocol=file, save_args={'mode': w}).
'str' object has no attribute 'as_posix'
I tried to investigate it to the best of my abilities and it seems to have to do with the initialization of NetCDFDataset. Most datasets inherit from AbstractVersionedDataset and call __init__ with their _filepath as a str, which is where the PurePosixPath gets created. NetCDFDataset is missing this step, so the PurePosixPath is never created. Whether this is the root problem in the end I don't know, but it is the point where other datasets set their path. In the meantime I thought it might be because mlflow isn't capable of tracking datasets which don't inherit from AbstractVersionedDataset, but the kedro-mlflow documentation says MlflowArtifactDataset is a wrapper for all AbstractDatasets. I tried to set self._filepath = PurePosixPath(filepath) myself in the site-packages, but I'm getting a PermissionError on saving, and that's where my journey has to end. Would have been too good if this one-liner had done it ^^ Thank you guys for your help. Here is some reduced code for what I'm trying to achieve.
catalog.yml
"{dataset}.idata":
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
  dataset:
    type: kedro_datasets_experimental.netcdf.NetCDFDataset
    filepath: data/07_model_output/{dataset}/idata.nc
    save_args:
      mode: a
    load_args:
      decode_times: False
node.py
import arviz as az


def predict(model, x_data):
    idata = model.predict(x_data)
    return az.convert_to_dataset(idata)
pipeline.py
pipeline_inference = pipeline(
    [
        node(
            func=predict,
            inputs={
                "model": f"{dataset}.model",
                "x_data": f"{dataset}.x_data",
            },
            outputs=f"{dataset}.idata",
            name=f"{dataset}.predict_node",
            tags=["training"],
        ),
    ]
)
j
hi @Philipp Dahlke, sorry you had a bumpy experience! I have a couple of questions: 1. Is MlflowNetCDFDataset a custom dataset you created? (from the first error you reported) 2. When you used NetCDFDataset inside MlflowArtifactDataset (second code snippet), what error did you get? Could you share the full traceback?
y
Hi @Philipp Dahlke, sorry for the bad experience. Unfortunately Kedro abstract (unversioned) datasets don't necessarily have the _filepath attribute, and neither its format nor its access is standardized. See this issue (and maybe report your bug there to help prioritizing -> this is tangentially related to our DataCatalog refactoring, but for datasets): https://github.com/kedro-org/kedro/discussions/3753. kedro-mlflow has focused a lot on `AbstractVersionedDataset`s and it may have some flaws for such unusual datasets. I think your fix attempt is the right one: in its __init__, the NetCDFDataset should convert the filepath to a Path. Can you: 1. Try pathlib.Path instead of pathlib.PurePosixPath and see if it works? 2. If it doesn't, share a minimal reproducible sample of data in the correct format that you can load and save with NetCDFDataset, so that I can try on my own? @Juan Luis MlflowNetCDFDataset is created under the hood by the MlflowArtifactDataset
r
I am also tagging @Riley Brady, the author of the NetCDFDataset. Riley, is there a reason we don't set self._filepath = PurePosixPath(filepath)? If not, can we make the change to the dataset to handle it?
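For reference, the change being discussed would look roughly like this in the dataset's constructor. This is an illustrative sketch, not the real kedro_datasets_experimental class (whose __init__ takes many more parameters); the class name and the multi-file check here are assumptions based on the thread.

```python
from pathlib import PurePosixPath


class NetCDFDatasetSketch:
    """Illustrative only -- not the real kedro_datasets_experimental class."""

    def __init__(self, filepath: str):
        # Convert the raw string up front, as AbstractVersionedDataset
        # subclasses do. PurePosixPath accepts its own type, so the
        # conversion is safe even if a path object is passed in.
        self._filepath = PurePosixPath(filepath)
        # A glob-style multi-file check (assumed shape) still works on the
        # path object via its stem:
        self._is_multifile = "*" in self._filepath.stem
```

With this in place, downstream code can rely on `self._filepath.as_posix()` existing.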
p
Thanks for your help. I discovered that missing folders in my workspace, which are declared for the dataset in catalog.yml, raise the permission error (see first post). After I created those manually, I can save NetCDFDataset with the change made to self._filepath in NetCDFDataset.__init__. @Yolan Honoré-Rougé Both versions seem to work, pathlib.Path and pathlib.PurePosixPath. I declared them either like in other classes, after self.metadata, or at the end, after self._is_multifile. I didn't want to disturb the is_multifile logic by creating it beforehand, but it seems like PurePosixPath can handle getting its own type passed. A minimal sample:
import numpy as np
import arviz as az

def test_netCDF():
    size = 100
    dataset = az.convert_to_inference_data(np.random.randn(size))

    return az.convert_to_dataset(dataset)
@Juan Luis 1. As mentioned by Yolan, this class is created by kedro-mlflow and is not implemented by me. 2. See below for both traces. Traceback for _filepath missing as an instance of Path:
Traceback (most recent call last):
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\io\core.py", line 271, in save
    save_func(self, data)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro_mlflow\io\artifacts\mlflow_artifact_dataset.py", line 63, in _save
    local_path = local_path.as_posix()
                 ^^^^^^^^^^^^^^^^^^^
AttributeError: 'str' object has no attribute 'as_posix'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Scripts\kedro.exe\__main__.py", line 7, in <module>
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\framework\cli\cli.py", line 263, in main
    cli_collection()
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\framework\cli\cli.py", line 163, in main
    super().main(
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\framework\cli\project.py", line 228, in run
    return session.run(
           ^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\framework\session\session.py", line 399, in run
    run_result = runner.run(
                 ^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\runner\runner.py", line 113, in run
    self._run(pipeline, catalog, hook_or_null_manager, session_id)  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\runner\sequential_runner.py", line 85, in _run
    ).execute()
      ^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\runner\task.py", line 88, in execute
    node = self._run_node_sequential(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\runner\task.py", line 186, in _run_node_sequential
    catalog.save(name, data)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\io\data_catalog.py", line 438, in save
    dataset.save(data)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\io\core.py", line 276, in save
    raise DatasetError(message) from exc
kedro.io.core.DatasetError: Failed while saving data to dataset MlflowNetCDFDataset(filepath=S:/___Studium/Bachelor_Arbeit/ba_env/bundesliga/data/07_model_output/D1-24-25/pymc/idata_fit.nc, load_args={'decode_times': False}, protocol=file, save_args={'mode': a}).
'str' object has no attribute 'as_posix'
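The traceback above bottoms out in `local_path.as_posix()` on a string. A quick pathlib check (plain Python, nothing Kedro-specific) shows why either of the two suggested conversions fixes it, and why re-wrapping an existing path object is harmless:

```python
from pathlib import Path, PurePosixPath

raw = "data/07_model_output/idata.nc"

# A plain string has no .as_posix(), which is exactly the AttributeError above.
assert not hasattr(raw, "as_posix")

# Both suggested conversions expose .as_posix(), and both accept an
# existing path object again without complaint (idempotent wrapping).
for cls in (Path, PurePosixPath):
    p = cls(raw)
    assert p.as_posix() == raw
    assert cls(p).as_posix() == raw
```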
Traceback for _filepath set to Path but missing folders:
Traceback (most recent call last):
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\file_manager.py", line 211, in _acquire_with_cache_info
    file = self._cache[self._key]
           ~~~~~~~~~~~^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\lru_cache.py", line 56, in __getitem__
    value = self._cache[key]
            ~~~~~~~~~~~^^^^^
KeyError: [<class 'netCDF4._netCDF4.Dataset'>, ('S:\\___Studium\\Bachelor_Arbeit\\ba_env\\bundesliga\\data\\07_model_output\\D1-24-25\\pymc\\idata_fit.nc',), 'a', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('persist', False)), '8aa8dfaa-e6a7-47e2-8b44-b700e528ffb8']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\io\core.py", line 271, in save
    save_func(self, data)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro_mlflow\io\artifacts\mlflow_artifact_dataset.py", line 66, in _save
    super().save.__wrapped__(self, data)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro_datasets_experimental\netcdf\netcdf_dataset.py", line 172, in save
    data.to_netcdf(path=self._filepath, **self._save_args)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\core\dataset.py", line 2372, in to_netcdf
    return to_netcdf(  # type: ignore[return-value]  # mypy cannot resolve the overloads:(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\api.py", line 1856, in to_netcdf
    store = store_open(target, mode, format, group, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\netCDF4_.py", line 452, in open
    return cls(manager, group=group, mode=mode, lock=lock, autoclose=autoclose)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\netCDF4_.py", line 393, in __init__
    self.format = self.ds.data_model
                  ^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\netCDF4_.py", line 461, in ds
    return self._acquire()
           ^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\netCDF4_.py", line 455, in _acquire
    with self._manager.acquire_context(needs_lock) as root:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\file_manager.py", line 199, in acquire_context
    file, cached = self._acquire_with_cache_info(needs_lock)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\xarray\backends\file_manager.py", line 217, in _acquire_with_cache_info
    file = self._opener(*self._args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src\\netCDF4\\_netCDF4.pyx", line 2521, in netCDF4._netCDF4.Dataset.__init__
  File "src\\netCDF4\\_netCDF4.pyx", line 2158, in netCDF4._netCDF4._ensure_nc_success
PermissionError: [Errno 13] Permission denied: 'S:\\___Studium\\Bachelor_Arbeit\\ba_env\\bundesliga\\data\\07_model_output\\D1-24-25\\pymc\\idata_fit.nc'  

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Scripts\kedro.exe\__main__.py", line 7, in <module>
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\framework\cli\cli.py", line 263, in main
    cli_collection()
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\framework\cli\cli.py", line 163, in main
    super().main(
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\framework\cli\project.py", line 228, in run
    return session.run(
           ^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\framework\session\session.py", line 399, in run
    run_result = runner.run(
                 ^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\runner\runner.py", line 113, in run
    self._run(pipeline, catalog, hook_or_null_manager, session_id)  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\runner\sequential_runner.py", line 85, in _run
    ).execute()
      ^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\runner\task.py", line 88, in execute
    node = self._run_node_sequential(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\runner\task.py", line 186, in _run_node_sequential
    catalog.save(name, data)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\io\data_catalog.py", line 438, in save
    dataset.save(data)
  File "H:\Programs\Anaconda\envs\.conda_ba_env\Lib\site-packages\kedro\io\core.py", line 276, in save
    raise DatasetError(message) from exc
kedro.io.core.DatasetError: Failed while saving data to dataset MlflowNetCDFDataset(filepath=S:/___Studium/Bachelor_Arbeit/ba_env/bundesliga/data/07_model_output/D1-24-25/pymc/idata_fit.nc, load_args={'decode_times': False}, protocol=file, save_args={'mode': a}).
[Errno 13] Permission denied: 'S:\\___Studium\\Bachelor_Arbeit\\ba_env\\bundesliga\\data\\07_model_output\\D1-24-25\\pymc\\idata_fit.nc'
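Since this second PermissionError turned out to be missing parent directories (per the reply above, netCDF4 on Windows reports Errno 13 rather than a clearer "no such directory"), the manual folder creation could be automated with a small guard before saving. `ensure_parent` is a hypothetical helper, not part of any of these libraries:

```python
import tempfile
from pathlib import Path


def ensure_parent(filepath) -> Path:
    """Create the target's missing parent folders before saving.

    Automates the manual workaround above; hypothetical helper.
    """
    path = Path(filepath)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# e.g. a guard one could add before the dataset's data.to_netcdf(...) call:
target = ensure_parent(
    Path(tempfile.mkdtemp()) / "07_model_output" / "D1-24-25" / "idata.nc"
)
assert target.parent.is_dir()
```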