Hugo Evers
11/22/2023, 3:04 PM
`swiftspec` https://github.com/fsspec/swiftspec, however, if we would like to inject/override the fsspec client, this requires Python code, but these arguments are passed to the client through YAML as follows:
self._fs = fsspec.filesystem(self._protocol, **self._storage_options)
Does anyone have experience working with OpenStack Swift and Kedro? If so, did you use swiftspec? Secondly, if we would like to add features to the swiftspec client, such as automatic retry (which requires Python code), what approach would you recommend? A dataset hook? Any suggestions would be very welcome, thanks!
To illustrate, the following is mentioned on swiftspec:
_Sometimes reading or writing from / to a swift storage might fail occasionally. If many objects are accessed, occasional failures can be extremely annoying and could be fixed relatively easily by retrying the request. Fortunately the `aiohttp_retry` package can help out in these situations. `aiohttp_retry` provides a wrapper around an `aiohttp` Client, which will automatically retry requests based on some user-provided rules. You can inject this client into the `swiftspec` filesystem using the `get_client` argument. First you’ll have to define an async `get_client` function, which configures the `RetryClient` according to your preferences, e.g.:_
async def get_client(**kwargs):
    import aiohttp
    import aiohttp_retry
    retry_options = aiohttp_retry.ExponentialRetry(
        attempts=3,
        exceptions={OSError, aiohttp.ServerDisconnectedError})
    retry_client = aiohttp_retry.RetryClient(raise_for_status=False, retry_options=retry_options)
    return retry_client
Afterwards, you can use this function like:
with fsspec.open("swift://server/account/container/object.txt", "r", get_client=get_client) as f:
    print(f.read())
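For anyone unfamiliar with what "retry with exponential backoff" means in practice, here is a minimal sketch using only `asyncio`. It is not how `aiohttp_retry` is implemented; `retry_with_backoff`, `flaky_read`, and the parameter names are made up for illustration, and the failure is simulated rather than coming from a real Swift request.

```python
import asyncio


async def retry_with_backoff(operation, attempts=3, start_timeout=0.1,
                             factor=2.0, exceptions=(OSError,)):
    """Await operation() up to `attempts` times, waiting longer after each failure.

    Hypothetical helper for illustration only; names and defaults are assumptions.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return await operation()
        except exceptions:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == attempts - 1:
                raise
            # Wait start_timeout, then start_timeout * factor, and so on.
            await asyncio.sleep(start_timeout * factor ** attempt)


# Simulated flaky read: fails twice with OSError, then succeeds.
calls = {"count": 0}


async def flaky_read():
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("simulated transient failure")
    return "payload"


result = asyncio.run(retry_with_backoff(flaky_read, start_timeout=0.01))
```

With `attempts=3` the third call succeeds, so the caller never sees the two transient errors.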
datajoely
11/22/2023, 3:08 PM
datajoely
11/22/2023, 3:14 PM
`fs_args` dictionary
• We then bind these to the `pandas` native concept “storage options”
```
self._storage_options = {**_credentials, **_fs_args}
self._fs = fsspec.filesystem(self._protocol, **self._storage_options)
```
• Both the write and read methods then have these available, for example:
```
pd.read_csv(
    load_path, storage_options=self._storage_options, **self._load_args
)
```
-----
In summary:
• Prove you can get regular pandas to work with the native storage options
• Then you can make a custom dataset / subclass which can do your `get_client` bit
• You may be able to do this without a custom class but with an OmegaConf resolver, but I’m not 100% sure here
Hugo Evers
11/22/2023, 3:19 PM
SwiftDataset:
  Dataset: Pandas.ParquetDataset
  Args: ...
  Credentials: ...
But I'd rather just use the kedro-datasets and override them using the config loader. I'll run through it and report back if I hit any snags.
If it works nicely, would you like to include the example in the docs? It could be included in this section: https://docs.kedro.org/en/stable/data/data_catalog.html#dataset-filepath
datajoely
11/22/2023, 3:21 PM
`pandas.GenericDataSet` may be helpful if that’s the majority of your work
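A rough sketch of the custom-dataset idea discussed above, assuming the catalog YAML names the client factory as a dotted-path string and a hypothetical dataset subclass resolves it before calling `fsspec.filesystem`. The helpers `resolve_callable` and `build_storage_options`, the `token` credential key, and the use of `json.dumps` as a stand-in for a real `get_client` function are all assumptions for illustration; none of this is Kedro or swiftspec API.

```python
import importlib


def resolve_callable(dotted_path):
    """Import 'package.module.name' and return the attribute called `name`.

    Hypothetical helper: YAML can only carry the string, not the function.
    """
    module_path, _, attr = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"expected a dotted path, got {dotted_path!r}")
    module = importlib.import_module(module_path)
    return getattr(module, attr)


def build_storage_options(credentials=None, fs_args=None):
    """Merge credentials and fs_args the way the thread describes, then
    swap a string-valued `get_client` entry for the callable it names."""
    options = {**(credentials or {}), **(fs_args or {})}
    client_factory = options.get("get_client")
    if isinstance(client_factory, str):
        options["get_client"] = resolve_callable(client_factory)
    return options


# Values as they might arrive from the catalog entry (placeholders only).
options = build_storage_options(
    credentials={"token": "placeholder"},    # hypothetical credential key
    fs_args={"get_client": "json.dumps"},    # stand-in for e.g. "my_pkg.clients.get_client"
)

# A custom dataset could then pass these on, for example:
#   self._fs = fsspec.filesystem(self._protocol, **options)
```

Whether the same resolution can be done by an OmegaConf resolver in the config loader, without a subclass, is left open here, as it was in the thread.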