#questions

Hugo Evers

11/22/2023, 3:04 PM
hi all, a client uses OpenStack Swift instead of S3 storage due to their cloud provider (medical data, so everything is "local"). Since kedro datasets use fsspec as a backend, we thought it would be a good idea to use swiftspec (https://github.com/fsspec/swiftspec). However, injecting/overriding the fsspec client requires Python code, while these arguments are passed to the client through YAML:
```python
self._fs = fsspec.filesystem(self._protocol, **self._storage_options)
```
Does anyone have experience working with OpenStack Swift and kedro? If so, did you use swiftspec? And secondly, if we would like to add features to the swiftspec client like automatic retry (which requires Python code), what approach would you recommend? A dataset hook? Any suggestions would be very welcome, thanks! To illustrate, the swiftspec docs say: _Sometimes reading or writing from / to a swift storage might fail occasionally. If many objects are accessed, occasional failures can be extremely annoying and could be fixed relatively easily by retrying the request. Fortunately the `aiohttp_retry` package can help out in these situations. `aiohttp_retry` provides a wrapper around an `aiohttp` Client, which will automatically retry requests based on some user-provided rules. You can inject this client into the `swiftspec` filesystem using the `get_client` argument. First you'll have to define an async `get_client` function, which configures the `RetryClient` according to your preferences, e.g.:_
```python
async def get_client(**kwargs):
    import aiohttp
    import aiohttp_retry
    retry_options = aiohttp_retry.ExponentialRetry(
        attempts=3,
        exceptions={OSError, aiohttp.ServerDisconnectedError},
    )
    retry_client = aiohttp_retry.RetryClient(
        raise_for_status=False, retry_options=retry_options
    )
    return retry_client
```
afterwards, you can use this function like:
```python
with fsspec.open("swift://server/account/container/object.txt", "r", get_client=get_client) as f:
    print(f.read())
```

datajoely

11/22/2023, 3:08 PM
so this is the first I've heard about this!
so I think you would need to do a custom dataset, as you'd need to provide the client object getter, but it shouldn't be too hard. If you look at the implementation of our CSVDataSet (which is similar for most others):
• The constructor accepts an `fs_args` dictionary
• We then bind these to the pandas native concept of "storage options":
```python
self._storage_options = {**_credentials, **_fs_args}
self._fs = fsspec.filesystem(self._protocol, **self._storage_options)
```
• Both the write and read methods then have these available, for example:
```python
pd.read_csv(
    load_path, storage_options=self._storage_options, **self._load_args
)
```
-----
In summary:
• prove you can get regular pandas to work with the native storage options
• then you can make a custom dataset / subclass which can do your `get_client` bit
• you may be able to do this without a custom class using an OmegaConf resolver, but I'm not 100% sure here
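To make the subclass route concrete, here is a minimal, untested sketch. The `CSVDataSet` below is a stub that only mimics the constructor lines quoted above (the real one lives in `kedro_datasets.pandas`); `SwiftCSVDataSet` and the injection of `get_client` through `fs_args` are assumptions to be verified against swiftspec:

```python
class CSVDataSet:
    # Stub standing in for kedro_datasets.pandas.CSVDataSet, reduced to the
    # constructor lines quoted above (illustration only).
    def __init__(self, filepath, fs_args=None, credentials=None):
        self._filepath = filepath
        self._storage_options = {**(credentials or {}), **(fs_args or {})}
        # the real dataset then does:
        # self._fs = fsspec.filesystem(self._protocol, **self._storage_options)


async def get_client(**kwargs):
    # As in the aiohttp_retry example earlier in the thread; imports are
    # deferred so this sketch stays self-contained until actually called.
    import aiohttp
    import aiohttp_retry
    retry_options = aiohttp_retry.ExponentialRetry(
        attempts=3, exceptions={OSError, aiohttp.ServerDisconnectedError}
    )
    return aiohttp_retry.RetryClient(
        raise_for_status=False, retry_options=retry_options
    )


class SwiftCSVDataSet(CSVDataSet):
    """Hypothetical CSVDataSet variant that always hands swiftspec a retrying client."""

    def __init__(self, filepath, fs_args=None, credentials=None):
        fs_args = dict(fs_args or {})
        # inject the retry-client factory unless the caller supplied their own
        fs_args.setdefault("get_client", get_client)
        super().__init__(filepath, fs_args=fs_args, credentials=credentials)
```

The idea is simply that `get_client` ends up in `_storage_options` and is therefore forwarded to `fsspec.filesystem(...)` by the parent class.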

Hugo Evers

11/22/2023, 3:19 PM
okay, thanks! My main goal is to avoid having to make a custom dataset for every kedro-dataset. So either I could make an over-arching dataset, like the versioned dataset or PartitionedDataset, e.g.:
```yaml
SwiftDataset:
  dataset: pandas.ParquetDataset
  args: ...
  credentials: ...
```
But I'd rather just use the kedro-datasets and override them using the config loader. I'll run through it and report back if I hit any snags. If it works nicely, would you like to include the example in the docs? It could go in this section: https://docs.kedro.org/en/stable/data/data_catalog.html#dataset-filepath
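A rough Python shape for that over-arching dataset could look like the following. This is a hypothetical sketch, loosely modelled on how wrapper datasets like PartitionedDataset hold an underlying dataset; all names are illustrative and it assumes the wrapped dataset accepts `fs_args` and `credentials`:

```python
async def get_client(**kwargs):
    # placeholder; in practice return an aiohttp_retry.RetryClient as shown
    # earlier in the thread
    ...


class SwiftDataset:
    """Hypothetical wrapper that injects a Swift get_client into any dataset."""

    def __init__(self, dataset, dataset_args=None, credentials=None):
        dataset_args = dict(dataset_args or {})
        fs_args = dict(dataset_args.pop("fs_args", None) or {})
        # ensure the retry-client factory is always present
        fs_args.setdefault("get_client", get_client)
        self._dataset = dataset(
            fs_args=fs_args, credentials=credentials, **dataset_args
        )

    def load(self):
        return self._dataset.load()

    def save(self, data):
        self._dataset.save(data)


# Toy underlying dataset, just to show the wiring:
class DummyDataset:
    def __init__(self, filepath, fs_args=None, credentials=None):
        self.filepath = filepath
        self.fs_args = fs_args

    def load(self):
        return "data"

    def save(self, data):
        pass


wrapped = SwiftDataset(DummyDataset, {"filepath": "swift://server/a/c/o.parquet"})
```

Whether this plays nicely with kedro's catalog instantiation (e.g. versioning) would still need checking.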

datajoely

11/22/2023, 3:21 PM
Some thoughts:
1. Prioritise testing the OmegaConf resolvers; they let us do this neatly
2. A wrapper dataset may be a good shout; PartitionedDataset / CachedDataSet may be inspiration here
3. `pandas.GenericDataSet` may be helpful if that's the majority of your work
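For option 1, the catalog entry might look roughly like this. This is an untested sketch: it assumes a custom resolver (here called `swift_client`) has been registered with the `OmegaConfigLoader` (e.g. via `CONFIG_LOADER_ARGS = {"custom_resolvers": {...}}` in `settings.py`) and that it returns the async `get_client` function; whether a callable survives the trip from the resolved config into `fsspec.filesystem(...)` is exactly the thing to test first:

```yaml
# catalog.yml -- hypothetical; "swift_client" is a custom OmegaConf resolver
# that returns the async get_client function defined in Python
shuttles:
  type: pandas.CSVDataSet
  filepath: swift://server/account/container/shuttles.csv
  fs_args:
    get_client: "${swift_client:}"
```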