# questions
b
Hello, first time Kedro user here. I have started experimenting with my own data sources and I am facing some issues. I have a pandas DataFrame that I would like to save as a parquet file in my NetApp StorageGRID S3 bucket. Everything goes smoothly until the next node in the pipeline tries to load the file from S3. I can see the file is present in the bucket. However, I get this exception:
DataSetError: Failed while loading data from data set
ParquetDataSet(filepath=<my file_path>,
load_args={'engine': pyarrow}, protocol=s3, save_args={'engine': pyarrow}).
AioSession.__init__() got an unexpected keyword argument 'target_options'
I have tried different versions of fsspec, s3fs, kedro and Python and I get the same issue. Here is what I am using currently: Python 3.10.10, Kedro 0.18.7, s3fs 2023.3.0, fsspec 2023.3.0, aiobotocore 2.4.2, pandas 1.5.3. pip check does not show any broken requirements. Has anyone experienced this problem before? Extensive googling didn't turn up any results.
This is the data set yaml that I have configured:
pm_raw_data_parquet:
  type: pandas.ParquetDataSet
  filepath: "s3://${input_bucket}/${s3_object_key}_pq"
  credentials: my_s3_credentials
  load_args:
    engine: pyarrow
  save_args:
    engine: pyarrow
n
This looks like incompatible dependencies
b
My guess as well 😞 I have yet to figure out which dependency is broken, though. This is what I have installed. Note that it is different from what I posted above, since I tried different combinations to see if I could get past this error.
Package                Version
---------------------- ---------
aiobotocore            2.4.2
aiohttp                3.8.4
aioitertools           0.11.0
aiosignal              1.3.1
antlr4-python3-runtime 4.9.3
anyconfig              0.10.1
anyio                  3.6.2
argon2-cffi            21.3.0
argon2-cffi-bindings   21.2.0
arrow                  1.2.3
async-timeout          4.0.2
asynctest              0.13.0
attrs                  23.1.0
Babel                  2.12.1
backcall               0.2.0
beautifulsoup4         4.12.2
binaryornot            0.4.4
black                  22.12.0
bleach                 6.0.0
botocore               1.27.59
build                  0.10.0
cachetools             5.3.0
certifi                2022.12.7
cffi                   1.15.1
chardet                5.1.0
charset-normalizer     3.1.0
click                  8.1.3
cookiecutter           2.1.1
coverage               7.2.3
debugpy                1.6.7
decorator              5.1.1
defusedxml             0.7.1
dynaconf               3.1.11
entrypoints            0.4
exceptiongroup         1.1.1
fastjsonschema         2.16.3
flake8                 4.0.1
frozenlist             1.3.3
fsspec                 2023.1.0
gitdb                  4.0.10
GitPython              3.1.31
greenlet               2.0.2
idna                   3.4
importlib-metadata     4.2.0
importlib-resources    5.12.0
iniconfig              2.0.0
ipykernel              6.16.2
ipython                7.34.0
ipython-genutils       0.2.0
ipywidgets             8.0.6
isort                  5.11.5
jedi                   0.18.2
Jinja2                 3.1.2
jinja2-time            0.2.0
jmespath               0.10.0
json5                  0.9.11
jsonschema             4.17.3
jupyter                1.0.0
jupyter_client         7.4.9
jupyter-console        6.6.3
jupyter_core           4.12.0
jupyter-server         1.24.0
jupyterlab             3.5.3
jupyterlab-pygments    0.2.2
jupyterlab_server      2.15.2
jupyterlab-widgets     3.0.7
kedro                  0.18.7
kedro-datasets         1.2.0
kedro-telemetry        0.2.4
markdown-it-py         2.2.0
MarkupSafe             2.1.2
matplotlib-inline      0.1.6
mccabe                 0.6.1
mdurl                  0.1.2
mistune                2.0.5
more-itertools         9.1.0
multidict              6.0.4
mypy-extensions        1.0.0
nbclassic              0.5.5
nbclient               0.7.3
nbconvert              7.3.1
nbformat               5.8.0
nbstripout             0.6.1
nest-asyncio           1.5.6
notebook               6.5.4
notebook_shim          0.2.2
numpy                  1.21.6
omegaconf              2.3.0
packaging              23.1
pandas                 1.3.5
pandocfilters          1.5.0
parso                  0.8.3
pathspec               0.11.1
pexpect                4.8.0
pickleshare            0.7.5
pip                    23.1
pip-tools              6.13.0
pkgutil_resolve_name   1.3.10
platformdirs           3.2.0
pluggy                 1.0.0
prometheus-client      0.16.0
prompt-toolkit         3.0.38
psutil                 5.9.5
psycopg2               2.9.6
ptyprocess             0.7.0
pyarrow                11.0.0
pycodestyle            2.8.0
pycparser              2.21
pyflakes               2.4.0
Pygments               2.15.1
pyproject_hooks        1.0.0
pyrsistent             0.19.3
pytest                 7.3.1
pytest-cov             3.0.0
pytest-mock            1.13.0
python-dateutil        2.8.2
python-slugify         8.0.1
pytoolconfig           1.2.5
pytz                   2023.3
PyYAML                 6.0
pyzmq                  25.0.2
qtconsole              5.4.2
QtPy                   2.3.1
requests               2.28.2
rich                   13.3.4
rope                   1.7.0
s3fs                   2023.1.0
Send2Trash             1.8.0
setuptools             67.7.1
six                    1.16.0
smmap                  5.0.0
sniffio                1.3.0
soupsieve              2.4.1
SQLAlchemy             1.4.47
terminado              0.17.1
text-unidecode         1.3
tinycss2               1.2.1
toml                   0.10.2
tomli                  2.0.1
toposort               1.10
tornado                6.2
traitlets              5.9.0
typed-ast              1.5.4
typing_extensions      4.5.0
urllib3                1.26.15
wcwidth                0.2.6
webencodings           0.5.1
websocket-client       1.5.1
wheel                  0.40.0
widgetsnbextension     4.0.7
wrapt                  1.15.0
yarl                   1.8.2
zipp                   3.15.0
n
Would it work if you read it with pd.read_parquet and pure fsspec?
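For reference, a minimal sketch of what reading the file outside Kedro could look like, once through pandas directly and once through a bare fsspec filesystem. The function names and the key, secret and endpoint_url parameters are hypothetical placeholders, not values from this thread; the storage option keys are the ones s3fs accepts for a custom S3-compatible endpoint.

```python
def s3_storage_options(key, secret, endpoint_url):
    """Build the options dict passed through to s3fs (placeholder credentials)."""
    return {
        "key": key,
        "secret": secret,
        # A custom endpoint is needed for S3-compatible stores that are not AWS.
        "client_kwargs": {"endpoint_url": endpoint_url},
    }


def read_with_pandas(path, key, secret, endpoint_url):
    """Let pandas open the s3:// path itself (it goes through fsspec/s3fs)."""
    import pandas as pd  # imported lazily so this module loads without pandas

    return pd.read_parquet(
        path,
        engine="pyarrow",
        storage_options=s3_storage_options(key, secret, endpoint_url),
    )


def read_with_fsspec(path, key, secret, endpoint_url):
    """Open the object with fsspec first, then hand the file handle to pandas."""
    import fsspec
    import pandas as pd

    fs = fsspec.filesystem("s3", **s3_storage_options(key, secret, endpoint_url))
    with fs.open(path, "rb") as handle:
        return pd.read_parquet(handle, engine="pyarrow")
```

If both calls fail with the same error as the Kedro run, the problem most likely sits in the fsspec/s3fs/aiobotocore layer rather than in Kedro.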
b
I tried in a new virtual environment and I get the same result using pandas.read_parquet, fsspec and s3fs. At least it shows that it has nothing to do with Kedro :)
n
I think pandas also uses fsspec under the hood. I suspect you have some incompatible versions of fsspec, s3fs and aiobotocore. Maybe try downgrading fsspec and s3fs a little bit.
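To compare what is actually installed in the active environment before and after a downgrade, something like the following could help. It is only a sketch: the package list is my assumption about which libraries matter for the S3 read path, and it uses nothing beyond the standard library.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages assumed to be involved in the S3 read path discussed above.
PACKAGES = ("fsspec", "s3fs", "aiobotocore", "botocore", "pandas", "pyarrow")


def installed_versions(packages=PACKAGES):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name:<12} {ver or 'not installed'}")
```

Running it with the same interpreter that runs the pipeline (for example the one inside .venv) avoids reading versions from a different environment by mistake.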
b
Traceback (most recent call last):
  File "test.py", line 3, in <module>
    df=pd.read_parquet(
  File ".venv/lib/python3.9/site-packages/pandas/io/parquet.py", line 509, in read_parquet
    return impl.read(
  File ".venv/lib/python3.9/site-packages/pandas/io/parquet.py", line 338, in read
    parquet_file = self.api.ParquetFile(path, **parquet_kwargs)
  File ".venv/lib/python3.9/site-packages/fastparquet/api.py", line 147, in __init__
    elif "*" in fn or fs.isdir(fn):
  File ".venv/lib/python3.9/site-packages/fsspec/asyn.py", line 88, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File ".venv/lib/python3.9/site-packages/fsspec/asyn.py", line 69, in sync
    raise result[0]
  File ".venv/lib/python3.9/site-packages/fsspec/asyn.py", line 25, in _runner
    result[0] = await coro
  File ".venv/lib/python3.9/site-packages/s3fs/core.py", line 1161, in _isdir
    return bool(await self._lsdir(path))
  File ".venv/lib/python3.9/site-packages/s3fs/core.py", line 559, in _lsdir
    await self.set_session()
  File ".venv/lib/python3.9/site-packages/s3fs/core.py", line 378, in set_session
    self.session = aiobotocore.AioSession(**self.kwargs)
TypeError: __init__() got an unexpected keyword argument 'target_options'