Hello fellow Kedroids :kedro::robot_face:! ...
# questions
j
Hello fellow Kedroids :kedro: 🤖! I'm having a very strange issue when saving a file to parquet. I'm getting this error:
DataSetError: Failed while saving data to data set ParquetDataSet(filepath=/Users/jose_darnott/PycharmProjects/planta-litio/data/01_raw/data_sql.parquet, load_args={'engine': pyarrow}, protocol=file, save_args={'engine': pyarrow}). Duplicate column names found:  ['timestamp', 'lims_BFIL CO3', 'lims_BFIL Ca %', ...]
It's basically showing all the columns in the dataframe (I'm showing only 3 of them here). My catalog entry looks like this:
data_sql:
  type: pandas.ParquetDataSet
  filepath: data/01_raw/data_sql.parquet
  load_args:
    engine: pyarrow
  save_args:
    engine: pyarrow
  layer: III
I'm using: kedro==0.18.8 pandas==2.0.1 pyarrow==12.0.0
The problem is quite similar to this issue from 2022: https://github.com/kedro-org/kedro/discussions/1286 but in my case removing the load and save args, as the OP mentions, doesn't solve my problem.
This is quite puzzling, since I just did a df.to_clipboard() inside the node before returning my output, opened it in a Jupyter notebook, and I see no problems with the dataframe. I can even save it to parquet without any issues. So that makes me think the problem comes from Kedro (?)
Anyway, as a workaround I'm saving the dataframe as csv and it's working just fine. But I'd like to find a way to make parquet work again since this is a huge file. Thanks in advance 🦜!
n
Can you post the full traceback or try to scroll up the error a bit? This is raised by Kedro, but I suspect there should be an error thrown by either pandas or pyarrow.
And do you have kedro-datasets installed?
j
yes, kedro-datasets==1.2.0
[06/12/23 11:38:31] INFO     Kedro project planta-litio                                                                                                       session.py:360
[06/12/23 11:38:34] INFO     Running node: get_local_credentials: get_sql_credentials(None) -> [sql_credentials]                                                 node.py:329
                    INFO     Saving data to 'sql_credentials' (_SharedMemoryDataSet)...                                                                  data_catalog.py:382
[06/12/23 11:38:35] INFO     Loading data from 'diccionario_raw' (ExcelDataSet)...                                                                       data_catalog.py:343
                    INFO     Loading data from 'sql_credentials' (_SharedMemoryDataSet)...                                                               data_catalog.py:343
                    INFO     Loading data from 'params:sql_data.date_init' (MemoryDataSet)...                                                            data_catalog.py:343
                    INFO     Loading data from 'params:sql_data.date_end' (MemoryDataSet)...                                                             data_catalog.py:343
                    INFO     Running node: download_data_sql_azure:                                                                                              node.py:329
                             download_data_azure([diccionario_raw,sql_credentials,params:sql_data.date_init,params:sql_data.date_end]) -> [data_sql]                        
[06/12/23 11:38:36] WARNING  /Users/jose_darnott/PycharmProjects/planta-litio/src/planta_litio/pipelines/pre_processing/nodes.py:344: UserWarning: pandas    warnings.py:109
                             only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2                     
                             objects are not tested. Please consider using SQLAlchemy.                                                                                      
                               df = pd.read_sql(query_data, db_connection)                                                                                                  
                                                                                                                                                                            
[06/12/23 11:38:41] INFO     Saving data to 'data_sql' (ParquetDataSet)...                                                                               data_catalog.py:382
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/io/core.py", line 214, in save
    self._save(data)
  File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro_datasets/pandas/parquet_dataset.py", line 188, in _save
    data.to_parquet(bytes_buffer, **self._save_args)
  File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/pandas/core/frame.py", line 2889, in to_parquet
    return to_parquet(
  File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/pandas/io/parquet.py", line 411, in to_parquet
    impl.write(
  File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/pandas/io/parquet.py", line 159, in write
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
  File "pyarrow/table.pxi", line 3681, in pyarrow.lib.Table.from_pandas
  File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 570, in dataframe_to_arrays
    convert_fields) = _get_columns_to_convert(df, schema, preserve_index,
  File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 352, in _get_columns_to_convert
    raise ValueError(
ValueError: Duplicate column names found: ['timestamp', ...]

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/runner/parallel_runner.py", line 122, in _run_node_synchronization
    return run_node(node, catalog, hook_manager, is_async, session_id)
  File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/runner/runner.py", line 319, in run_node
    node = _run_node_sequential(node, catalog, hook_manager, session_id)
  File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/runner/runner.py", line 435, in _run_node_sequential
    catalog.save(name, data)
  File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/io/data_catalog.py", line 384, in save
    dataset.save(data)
  File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/io/core.py", line 613, in save
    super().save(data)
  File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/io/core.py", line 221, in save
    raise DataSetError(message) from exc
kedro.io.core.DataSetError: Failed while saving data to data set ParquetDataSet(filepath=/Users/jose_darnott/PycharmProjects/planta-litio/data/01_raw/data_sql.parquet, 
load_args={'engine': pyarrow}, protocol=file, save_args={'engine': pyarrow}).
Duplicate column names found: ['timestamp' ...]

The above exception was the direct cause of the following exception:
Traceback (most recent call last)
/Users/jose_darnott/opt/miniconda3/envs/planta-litio/bin/kedro:8 in <module>

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/framework/cli/cli.py:211 in main

  208     """
  209     _init_plugins()
  210     cli_collection = KedroCLI(project_path=Path.cwd())
❱ 211     cli_collection()
  212

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/click/core.py:1130 in __call__

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/framework/cli/cli.py:139 in main

  136         )
  137
  138         try:
❱ 139             super().main(
  140                 args=args,
  141                 prog_name=prog_name,
  142                 complete_var=complete_var,

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/click/core.py:1055 in main

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/click/core.py:1657 in invoke

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/click/core.py:1404 in invoke

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/click/core.py:760 in invoke

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/framework/cli/project.py:472 in run

  469     with KedroSession.create(
  470         env=env, conf_source=conf_source, extra_params=params
  471     ) as session:
❱ 472         session.run(
  473             tags=tag,
  474             runner=runner(is_async=is_async),
  475             node_names=node_names,

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/framework/session/session.py:426 in run

  423         )
  424
  425         try:
❱ 426             run_result = runner.run(
  427                 filtered_pipeline, catalog, hook_manager, session_id
  428             )
  429             self._run_called = True

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/runner/runner.py:91 in run

   88             self._logger.info(
   89                 "Asynchronous mode is enabled for loading and saving data"
   90             )
❱  91         self._run(pipeline, catalog, hook_manager, session_id)
   92
   93         self._logger.info("Pipeline execution completed successfully.")
   94

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/runner/parallel_runner.py:334 in _run

  331                     break  # pragma: no cover
  332                 done, futures = wait(futures, return_when=FIRST_COMPLETED)
  333                 for future in done:
❱ 334                     node = future.result()
  335                     done_nodes.add(node)
  336
  337                     # Decrement load counts, and release any datasets we

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/concurrent/futures/_base.py:437 in result

  434                 if self._state in [CANCELLED, CANCELLED_AND_NOTIFIED]:
  435                     raise CancelledError()
  436                 elif self._state == FINISHED:
❱ 437                     return self.__get_result()
  438
  439                 self._condition.wait(timeout)
  440

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/concurrent/futures/_base.py:389 in __get_result

  386     def __get_result(self):
  387         if self._exception:
  388             try:
❱ 389                 raise self._exception
  390             finally:
  391                 # Break a reference cycle with the exception in self._exception
  392                 self = None

DataSetError: Failed while saving data to data set ParquetDataSet(filepath=/Users/jose_darnott/PycharmProjects/planta-litio/data/01_raw/data_sql.parquet, 
load_args={'engine': pyarrow}, protocol=file, save_args={'engine': pyarrow}).
Duplicate column names found: ['timestamp', ...]
n
Interesting, can you double check if it works with df.to_parquet?
bytes_buffer = BytesIO()
data.to_parquet(bytes_buffer, **self._save_args)
This is literally what Kedro does behind the scenes: we use the same logic as pandas. Kedro handles the path but nothing more.
This is weird. Please try the following:
1. df.to_parquet - see if it works
2. If 1 works, then try
from io import BytesIO
bytes_buffer = BytesIO()
data.to_parquet(bytes_buffer)
If 1 works but 2 fails, then I would suggest creating a minimal reproducible example and opening an issue on either pandas or pyarrow.
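For anyone who wants to run the two checks above side by side, here is a minimal sketch. The helper name try_parquet_writes and the tiny example frame are made up for illustration, and engine="pyarrow" is taken from the catalog entry; it only reports which of the two writes raises, it is not what Kedro itself runs.

```python
import tempfile
from io import BytesIO
from pathlib import Path

import pandas as pd


def try_parquet_writes(df: pd.DataFrame) -> dict:
    """Attempt both writes and return the exception (or None) for each target."""
    results = {}

    # Check 1: write straight to a file path.
    with tempfile.TemporaryDirectory() as tmp_dir:
        try:
            df.to_parquet(Path(tmp_dir) / "check.parquet", engine="pyarrow")
            results["path"] = None
        except Exception as exc:  # keep the error so it can be inspected
            results["path"] = exc

    # Check 2: write to an in-memory buffer, as in the snippet above.
    try:
        df.to_parquet(BytesIO(), engine="pyarrow")
        results["buffer"] = None
    except Exception as exc:
        results["buffer"] = exc

    return results


if __name__ == "__main__":
    # Hypothetical frame with a repeated column label.
    example = pd.DataFrame([[1, 2]], columns=["timestamp", "timestamp"])
    for target, error in try_parquet_writes(example).items():
        print(target, "->", "ok" if error is None else repr(error))
```

If both targets fail with the same error, the buffer is not the culprit and the dataframe itself is the thing to inspect.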
j
I just added df.to_parquet right above the node's return, and it fails (in the same fashion as above)... so you're right, this is probably not a direct problem with Kedro.
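One possible explanation for why the clipboard copy looked fine in the notebook, assuming it was loaded there with pd.read_clipboard(): that function goes through pandas' text parser, which renames repeated column labels on read. The sketch below uses a CSV round trip through StringIO as a stand-in for the clipboard, with a made-up frame, to show the effect. The same renaming would also apply when a CSV is read back, which may be why the csv workaround appeared to work.

```python
from io import StringIO

import pandas as pd

# Hypothetical frame with a repeated column label.
df = pd.DataFrame([[1, 2, 3]], columns=["timestamp", "timestamp", "value"])

# Text round trip, similar in spirit to to_clipboard() followed by read_clipboard().
csv_text = df.to_csv(index=False)
round_tripped = pd.read_csv(StringIO(csv_text))

print(df.columns.is_unique)             # False: the in-memory frame has duplicates
print(list(round_tripped.columns))      # the repeat comes back as 'timestamp.1'
print(round_tripped.columns.is_unique)  # True: the duplicates are no longer visible
```

So a frame that went through text on the way to the notebook is not guaranteed to have the same column labels as the one inside the node.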
n
Maybe try to pickle it and load it from a notebook. There is a good chance that there are just duplicate columns.
Or, if you are a PyCharm/VSCode user, it may be easier to attach a debugger and step through with breakpoints. https://docs.kedro.org/en/stable/development/set_up_pycharm.html#debugging
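To check the duplicate-columns hypothesis directly inside the node, something like the sketch below could help. The helper name duplicated_columns and the example frame are hypothetical. As far as I recall, pyarrow raises this error whenever the column index is not unique and prints the whole column list, which would explain why every column shows up in the message rather than only the repeated ones.

```python
import pandas as pd


def duplicated_columns(df: pd.DataFrame) -> list:
    """Return each column label that occurs more than once, in first-seen order."""
    mask = df.columns.duplicated(keep=False)
    return list(dict.fromkeys(df.columns[mask]))


# Hypothetical frame with a repeated column label.
df = pd.DataFrame([[1, 2, 3]], columns=["timestamp", "value", "timestamp"])
print(duplicated_columns(df))  # ['timestamp']

# One possible fix (an assumption about the desired behaviour):
# keep only the first occurrence of each label.
deduped = df.loc[:, ~df.columns.duplicated()]
print(list(deduped.columns))  # ['timestamp', 'value']
```

Whether dropping the later occurrences is the right fix depends on where the repeats come from, for example a SQL query that selects the same column from two joined tables, so it is worth looking at the query before deduplicating blindly.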