# questions
f
Are (Abstract)Datasets definable as read-only (even if they have a write function implemented)? If not, that would be a nice feature to easily protect data from being overwritten.
n
If you simply don't implement any `_save` method, it shouldn't be able to write at all.
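A minimal sketch of that approach, assuming a pandas-based dataset (class and file names are hypothetical; older Kedro releases spell these `AbstractDataSet`/`DataSetError`):

```python
# Hypothetical read-only dataset: _save raises, so any write attempt fails.
from pathlib import Path

import pandas as pd
from kedro.io import AbstractDataset
from kedro.io.core import DatasetError


class ReadOnlyCSVDataset(AbstractDataset):
    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self) -> pd.DataFrame:
        return pd.read_csv(self._filepath)

    def _save(self, data: pd.DataFrame) -> None:
        # Refuse all writes: this dataset is a pipeline input only.
        raise DatasetError(f"{self._filepath} is read-only")

    def _describe(self) -> dict:
        return {"filepath": str(self._filepath)}
```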
Ah sorry, I missed the description. What should happen when a file exists? Should it append, or create a new file?
Can you elaborate a bit your use case?
f
> If you simply don't implement any `_save` method, it shouldn't be able to write at all.

yes, if I implement/inherit my own custom dataset. But "locking" any dataset at the config layer so it can't be overwritten unintentionally would be a nice feature. Any situation where one wants to "lock" a dataset because it is always and only a pipeline input (e.g. raw data) would be a use case. Currently I do it by setting `versioned: true`, but then I also have to sanity-check that the latest version is unmodified (e.g. that no other version exists).
i
you could probably achieve this functionality using the `before_dataset_saved` hook (https://docs.kedro.org/en/stable/hooks/common_use_cases.html#use-hooks-to-customise-the-dataset-load-and-save-methods) combined with some metadata in the catalog. That's a new feature I haven't used myself, so somebody from the Kedro team can probably bring up the relevant docs for that side
👍 1
👍🏼 1
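A rough, untested sketch of that idea, assuming a made-up `read_only: true` key under the dataset's `metadata` field in the catalog. The hook keeps a reference to the catalog from `after_catalog_created`, since `before_dataset_saved` doesn't receive it, and the lookup goes through a private `DataCatalog` helper:

```python
# Hedged sketch: block saves to datasets whose catalog metadata marks them
# read-only. The "read_only" key is our own convention, not a Kedro API.
from typing import Any

from kedro.framework.hooks import hook_impl


class ReadOnlyGuardHooks:
    def __init__(self):
        self._catalog = None

    @hook_impl
    def after_catalog_created(self, catalog):
        # Keep a reference so before_dataset_saved can inspect metadata later.
        self._catalog = catalog

    @hook_impl
    def before_dataset_saved(self, dataset_name: str, data: Any):
        dataset = self._catalog._get_dataset(dataset_name)  # private API
        metadata = getattr(dataset, "metadata", None) or {}
        if metadata.get("read_only"):
            raise RuntimeError(
                f"Dataset '{dataset_name}' is marked read_only in the catalog"
            )
```

The matching catalog entry would carry something like `metadata: {read_only: true}` under the dataset definition.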
f
good idea. So while custom implementations to achieve that behaviour are possible, this would still be a nice Kedro feature at the `AbstractDataset` level. I forgot to reply to @Nok Lam Chan's question about what should happen: the user should simply be informed that the current pipeline definition is wrong, because it declares as an output a dataset that is configured as read-only.
👍🏼 1
n
I think a hook should be a good candidate here: either the `before_dataset_saved` hook suggested above, or `before_pipeline_run`. I think you should be able to inspect the pipeline there and check whether it contains anything that shouldn't be an output.
👍 1
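A hedged sketch of that validation, reusing the hypothetical `read_only` metadata convention from above (`Pipeline.all_outputs()` is a real Kedro method; the catalog lookup again goes through a private helper):

```python
# Hedged sketch: fail fast before the run if any pipeline output is marked
# read-only in the catalog (using our made-up "read_only" metadata key).
from kedro.framework.hooks import hook_impl


class PipelineValidationHooks:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        for name in pipeline.all_outputs():
            try:
                dataset = catalog._get_dataset(name)  # private API
            except Exception:
                continue  # e.g. a free/memory output not defined in the catalog
            metadata = getattr(dataset, "metadata", None) or {}
            if metadata.get("read_only"):
                raise ValueError(
                    f"Pipeline declares read-only dataset '{name}' as an output"
                )
```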
i
oh right, yeah if you want that "type-checking" functionality, not just skipping the save, then `before_pipeline_run` or one of those is probably better suited. I think that hook also has access to the catalog, whereas the dataset-save one doesn't
n
If you like, you can make a small plugin just to do this kind of pipeline validation. You can make use of the `metadata` field of a dataset, so there's no change to the core API.
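Registering hooks like the sketches above in a project would look roughly like this (module and class names are hypothetical):

```python
# Hypothetical registration in src/<package>/settings.py, assuming the hook
# classes sketched above live in a module named hooks.py in the same package.
from my_project.hooks import PipelineValidationHooks, ReadOnlyGuardHooks

HOOKS = (ReadOnlyGuardHooks(), PipelineValidationHooks())
```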
i
> So while custom implementations to achieve that behaviour are possible, this would still be a nice Kedro feature at the `AbstractDataset` level.

thinking about this, what would be the cleanest way to subclass `AbstractDataset` in order to modify the behaviour of all datasets which inherit from it? is that even possible/desirable?
y
This is a bit off topic, but an interesting question I don't have a clear answer to @Iñigo Hidalgo. Kedro-mlflow handles it with a `__new__` method (https://github.com/Galileo-Galilei/kedro-mlflow/blob/master/kedro_mlflow%2Fio%2Fartifacts%2Fmlflow_artifact_dataset.py#L16-L147) that decorates all save/load methods automatically, but this looks very hacky.
🧠 2
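A heavily condensed illustration of that pattern (not the actual kedro-mlflow code): `__new__` builds a one-off subclass of an arbitrary dataset class and overrides its `_save`, which is exactly why it feels hacky. The dataset import is spelled `CSVDataSet` in older kedro-datasets releases:

```python
# Condensed sketch of the __new__ trick, not kedro-mlflow's real code.
from kedro_datasets.pandas import CSVDataset  # any dataset class would do


class ReadOnlyDataset:
    def __new__(cls, dataset_cls, *args, **kwargs):
        def _save(self, data):
            raise RuntimeError(f"{type(self).__name__} is read-only")

        # Build a one-off subclass of the given dataset class with _save
        # overridden, then return an instance of it instead of cls.
        wrapped = type(
            f"ReadOnly{dataset_cls.__name__}", (dataset_cls,), {"_save": _save}
        )
        return wrapped(*args, **kwargs)


# Hypothetical usage: behaves like a CSVDataset but refuses to save.
raw = ReadOnlyDataset(CSVDataset, filepath="data/01_raw/raw.csv")
```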
i
i figured if it was possible at all, metaclassing would be the only way. But even in your case, you're only modifying the kedro-mlflow datasets, right? Not every dataset which inherits from `AbstractDataset`