#questions

Hugo Evers

11/22/2023, 3:04 PM
hi all, a client uses OpenStack Swift instead of S3 storage due to their cloud provider (medical data, so everything is "local"). Since kedro datasets use fsspec as a backend, we thought it would be a good idea to use swiftspec (https://github.com/fsspec/swiftspec). However, injecting/overriding the fsspec client requires Python code, while these arguments are passed to the client through YAML:
```python
self._fs = fsspec.filesystem(self._protocol, **self._storage_options)
```
Does anyone have experience working with OpenStack Swift and kedro? If so, did you use swiftspec? And secondly, if we would like to add features to the swiftspec client like automatic retry (which requires Python code), what approach would you recommend? A dataset hook? Any suggestions would be very welcome, thanks! To illustrate, the swiftspec docs say: _Sometimes reading or writing from / to a swift storage might fail occasionally. If many objects are accessed, occasional failures can be extremely annoying and could be fixed relatively easily by retrying the request. Fortunately the `aiohttp_retry` package can help out in these situations. `aiohttp_retry` provides a wrapper around an `aiohttp` Client, which will automatically retry requests based on some user-provided rules. You can inject this client into the `swiftspec` filesystem using the `get_client` argument. First you'll have to define an async `get_client` function, which configures the `RetryClient` according to your preferences, e.g.:_
```python
async def get_client(**kwargs):
    import aiohttp
    import aiohttp_retry
    retry_options = aiohttp_retry.ExponentialRetry(
        attempts=3,
        exceptions={OSError, aiohttp.ServerDisconnectedError},
    )
    retry_client = aiohttp_retry.RetryClient(
        raise_for_status=False, retry_options=retry_options
    )
    return retry_client
```
afterwards, you can use this function like:
```python
with fsspec.open("swift://server/account/container/object.txt", "r", get_client=get_client) as f:
    print(f.read())
```

datajoely

11/22/2023, 3:08 PM
so this is the first I've heard about this!
so I think you would need to do a custom dataset, as you'd need to provide the client object getter, but it shouldn't be too hard. If you look at the implementation of our CSVDataSet (which is similar for most others):
• The constructor accepts an `fs_args` dictionary
• We then bind these to the pandas native concept of "storage options":
```python
self._storage_options = {**_credentials, **_fs_args}
self._fs = fsspec.filesystem(self._protocol, **self._storage_options)
```
• Both the write and read methods then have these available, for example:
```python
pd.read_csv(
    load_path, storage_options=self._storage_options, **self._load_args
)
```
-----
In summary:
• prove you can get regular pandas to work with the native storage options
• then you can make a custom dataset / subclass which can do your `get_client` bit
• you may be able to do this without a custom class using an OmegaConf resolver, but I'm not 100% sure here
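To make the subclass route concrete, here is a minimal, untested sketch. The `CSVDataSet` below is a stub that only mimics the constructor lines quoted above (the real one lives in `kedro_datasets.pandas`); `SwiftCSVDataSet` and the injection of `get_client` through `fs_args` are assumptions to be verified against swiftspec:

```python
class CSVDataSet:
    # Stub standing in for kedro_datasets.pandas.CSVDataSet, reduced to the
    # constructor lines quoted above (illustration only).
    def __init__(self, filepath, fs_args=None, credentials=None):
        self._filepath = filepath
        self._storage_options = {**(credentials or {}), **(fs_args or {})}
        # the real dataset then does:
        # self._fs = fsspec.filesystem(self._protocol, **self._storage_options)


async def get_client(**kwargs):
    # As in the aiohttp_retry example earlier in the thread; imports are
    # deferred so this sketch stays self-contained until actually called.
    import aiohttp
    import aiohttp_retry
    retry_options = aiohttp_retry.ExponentialRetry(
        attempts=3, exceptions={OSError, aiohttp.ServerDisconnectedError}
    )
    return aiohttp_retry.RetryClient(
        raise_for_status=False, retry_options=retry_options
    )


class SwiftCSVDataSet(CSVDataSet):
    """Hypothetical CSVDataSet variant that always hands swiftspec a retrying client."""

    def __init__(self, filepath, fs_args=None, credentials=None):
        fs_args = dict(fs_args or {})
        # inject the retry-client factory unless the caller supplied their own
        fs_args.setdefault("get_client", get_client)
        super().__init__(filepath, fs_args=fs_args, credentials=credentials)
```

The idea is simply that `get_client` ends up in `_storage_options` and is therefore forwarded to `fsspec.filesystem(...)` by the parent class.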

Hugo Evers

11/22/2023, 3:19 PM
okay, thanks! My main goal is to avoid having to make a custom dataset for every kedro-dataset. So either I could make an over-arching dataset, like the versioned dataset or PartitionedDataset, e.g.:
```yaml
SwiftDataset:
  dataset: pandas.ParquetDataset
  args: ...
  credentials: ...
```
But I'd rather just use the kedro-datasets and override them using the config loader. I'll run through it and report back if I hit any snags. If it works nicely, would you like to include the example in the docs? It could go in this section: https://docs.kedro.org/en/stable/data/data_catalog.html#dataset-filepath
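A rough Python shape for that over-arching dataset could look like the following. This is a hypothetical sketch, loosely modelled on how wrapper datasets like PartitionedDataset hold an underlying dataset; all names are illustrative and it assumes the wrapped dataset accepts `fs_args` and `credentials`:

```python
async def get_client(**kwargs):
    # placeholder; in practice return an aiohttp_retry.RetryClient as shown
    # earlier in the thread
    ...


class SwiftDataset:
    """Hypothetical wrapper that injects a Swift get_client into any dataset."""

    def __init__(self, dataset, dataset_args=None, credentials=None):
        dataset_args = dict(dataset_args or {})
        fs_args = dict(dataset_args.pop("fs_args", None) or {})
        # ensure the retry-client factory is always present
        fs_args.setdefault("get_client", get_client)
        self._dataset = dataset(
            fs_args=fs_args, credentials=credentials, **dataset_args
        )

    def load(self):
        return self._dataset.load()

    def save(self, data):
        self._dataset.save(data)


# Toy underlying dataset, just to show the wiring:
class DummyDataset:
    def __init__(self, filepath, fs_args=None, credentials=None):
        self.filepath = filepath
        self.fs_args = fs_args

    def load(self):
        return "data"

    def save(self, data):
        pass


wrapped = SwiftDataset(DummyDataset, {"filepath": "swift://server/a/c/o.parquet"})
```

Whether this plays nicely with kedro's catalog instantiation (e.g. versioning) would still need checking.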

datajoely

11/22/2023, 3:21 PM
Some thoughts:
1. Prioritise testing the OmegaConf resolvers; they let us do this neatly
2. A wrapper dataset may be a good shout; PartitionedDataset / CachedDataSet may be inspiration here
3. `pandas.GenericDataSet` may be helpful if that's the majority of your work
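For option 1, the catalog entry might look roughly like this. This is an untested sketch: it assumes a custom resolver (here called `swift_client`) has been registered with the `OmegaConfigLoader` (e.g. via `CONFIG_LOADER_ARGS = {"custom_resolvers": {...}}` in `settings.py`) and that it returns the async `get_client` function; whether a callable survives the trip from the resolved config into `fsspec.filesystem(...)` is exactly the thing to test first:

```yaml
# catalog.yml -- hypothetical; "swift_client" is a custom OmegaConf resolver
# that returns the async get_client function defined in Python
shuttles:
  type: pandas.CSVDataSet
  filepath: swift://server/account/container/shuttles.csv
  fs_args:
    get_client: "${swift_client:}"
```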