# questions
f
Are (Abstract)Datasets definable as read-only (even if they have a write function implemented)? If not, that would be a nice feature to easily protect data from being overwritten.
n
If you simply don't implement any `_save` method, it shouldn't be able to write at all.
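A minimal sketch of that approach, assuming a pandas-based dataset (class and file names are hypothetical; older Kedro releases spell these `AbstractDataSet`/`DataSetError`):

```python
# Hypothetical read-only dataset: _save raises, so any write attempt fails.
from pathlib import Path

import pandas as pd
from kedro.io import AbstractDataset
from kedro.io.core import DatasetError


class ReadOnlyCSVDataset(AbstractDataset):
    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self) -> pd.DataFrame:
        return pd.read_csv(self._filepath)

    def _save(self, data: pd.DataFrame) -> None:
        # Refuse all writes: this dataset is a pipeline input only.
        raise DatasetError(f"{self._filepath} is read-only")

    def _describe(self) -> dict:
        return {"filepath": str(self._filepath)}
```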
Ah sorry, I missed the description. What should happen when a file exists? Should it append, or create a new file?
Can you elaborate a bit your use case?
f
> If you simply don't implement any `_save` method, it shouldn't be able to write at all.

yes, if I implement/inherit my own custom dataset. But "locking" any dataset at the config layer so it can't be overwritten unintentionally would be a nice feature. Any situation where one wants to "lock" a dataset because it is always and only a pipeline input (e.g. raw data) would be a use case. Currently I do it by setting `versioned: true`, but then I also have to sanity-check that the latest version is unmodified (e.g. that no other version exists).
i
you could probably achieve this functionality using the `before_dataset_saved` hook (https://docs.kedro.org/en/stable/hooks/common_use_cases.html#use-hooks-to-customise-the-dataset-load-and-save-methods) combined with some metadata in the catalog. That's a new feature I haven't used myself, so somebody from the Kedro team can probably bring up the relevant docs for that side
👍 1
👍🏼 1
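A rough, untested sketch of that idea, assuming a made-up `read_only: true` key under the dataset's `metadata` field in the catalog. The hook keeps a reference to the catalog from `after_catalog_created`, since `before_dataset_saved` doesn't receive it, and the lookup goes through a private `DataCatalog` helper:

```python
# Hedged sketch: block saves to datasets whose catalog metadata marks them
# read-only. The "read_only" key is our own convention, not a Kedro API.
from typing import Any

from kedro.framework.hooks import hook_impl


class ReadOnlyGuardHooks:
    def __init__(self):
        self._catalog = None

    @hook_impl
    def after_catalog_created(self, catalog):
        # Keep a reference so before_dataset_saved can inspect metadata later.
        self._catalog = catalog

    @hook_impl
    def before_dataset_saved(self, dataset_name: str, data: Any):
        dataset = self._catalog._get_dataset(dataset_name)  # private API
        metadata = getattr(dataset, "metadata", None) or {}
        if metadata.get("read_only"):
            raise RuntimeError(
                f"Dataset '{dataset_name}' is marked read_only in the catalog"
            )
```

The matching catalog entry would carry something like `metadata: {read_only: true}` under the dataset definition.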
f
good idea. So while custom implementations to achieve that behaviour are possible, this would still be a nice Kedro feature at the `AbstractDataset` level. I forgot to reply to @Nok Lam Chan's question about what should happen: the user should simply be informed that the current pipeline definition is wrong, because it declares as an output a dataset that is configured as read-only.
👍🏼 1
n
I think a hook should be a good candidate here: either the `before_dataset_saved` hook suggested above, or `before_pipeline_run`. I think you should be able to inspect the pipeline there and check whether it contains anything that shouldn't be an output.
👍 1
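A hedged sketch of that validation, reusing the hypothetical `read_only` metadata convention from above (`Pipeline.all_outputs()` is a real Kedro method; the catalog lookup again goes through a private helper):

```python
# Hedged sketch: fail fast before the run if any pipeline output is marked
# read-only in the catalog (using our made-up "read_only" metadata key).
from kedro.framework.hooks import hook_impl


class PipelineValidationHooks:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        for name in pipeline.all_outputs():
            try:
                dataset = catalog._get_dataset(name)  # private API
            except Exception:
                continue  # e.g. a free/memory output not defined in the catalog
            metadata = getattr(dataset, "metadata", None) or {}
            if metadata.get("read_only"):
                raise ValueError(
                    f"Pipeline declares read-only dataset '{name}' as an output"
                )
```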
i
oh right, yeah if you want that "type-checking" functionality, not just skipping the save, then `before_pipeline_run` or one of those is probably better suited. I think that hook also has access to the catalog, whereas the dataset-save one doesn't
n
If you like, you can make a small plugin just to do this kind of pipeline validation. You can make use of the `metadata` field of a dataset, so there's no change to the core API.
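Registering hooks like the sketches above in a project would look roughly like this (module and class names are hypothetical):

```python
# Hypothetical registration in src/<package>/settings.py, assuming the hook
# classes sketched above live in a module named hooks.py in the same package.
from my_project.hooks import PipelineValidationHooks, ReadOnlyGuardHooks

HOOKS = (ReadOnlyGuardHooks(), PipelineValidationHooks())
```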
i
> So while custom implementations to achieve that behaviour are possible, this would still be a nice Kedro feature at the `AbstractDataset` level.

thinking about this, what would be the cleanest way to subclass `AbstractDataset` in order to modify the behaviour of all datasets which inherit from it? is that even possible/desirable?
y
This is a bit off topic, but an interesting question I don't have a clear answer to @Iñigo Hidalgo. Kedro-mlflow handles it with a `__new__` method (https://github.com/Galileo-Galilei/kedro-mlflow/blob/master/kedro_mlflow%2Fio%2Fartifacts%2Fmlflow_artifact_dataset.py#L16-L147) that decorates all save/load methods automatically, but this looks very hacky.
🧠 2
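A heavily condensed illustration of that pattern (not the actual kedro-mlflow code): `__new__` builds a one-off subclass of an arbitrary dataset class and overrides its `_save`, which is exactly why it feels hacky. The dataset import is spelled `CSVDataSet` in older kedro-datasets releases:

```python
# Condensed sketch of the __new__ trick, not kedro-mlflow's real code.
from kedro_datasets.pandas import CSVDataset  # any dataset class would do


class ReadOnlyDataset:
    def __new__(cls, dataset_cls, *args, **kwargs):
        def _save(self, data):
            raise RuntimeError(f"{type(self).__name__} is read-only")

        # Build a one-off subclass of the given dataset class with _save
        # overridden, then return an instance of it instead of cls.
        wrapped = type(
            f"ReadOnly{dataset_cls.__name__}", (dataset_cls,), {"_save": _save}
        )
        return wrapped(*args, **kwargs)


# Hypothetical usage: behaves like a CSVDataset but refuses to save.
raw = ReadOnlyDataset(CSVDataset, filepath="data/01_raw/raw.csv")
```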
i
i figured if it was possible at all, metaclassing would be the only way. But even in your case, you're only modifying the kedro-mlflow datasets, right? Not every dataset which inherits from `AbstractDataset`