Jose Nuñez
06/12/2023, 3:32 PM
DataSetError: Failed while saving data to data set ParquetDataSet(filepath=/Users/jose_darnott/PycharmProjects/planta-litio/data/01_raw/data_sql.parquet, load_args={'engine': pyarrow}, protocol=file, save_args={'engine': pyarrow}). Duplicate column names found: ['timestamp', 'lims_BFIL CO3', 'lims_BFIL Ca %', ...]
# It's basically listing all the columns in the dataframe (here I'm showing only 3 of them)
.
My catalog entry looks like this:
data_sql:
  type: pandas.ParquetDataSet
  filepath: data/01_raw/data_sql.parquet
  load_args:
    engine: pyarrow
  save_args:
    engine: pyarrow
  layer: III
.
I'm using:
kedro==0.18.8
pandas==2.0.1
pyarrow==12.0.0
.
The problem is quite similar to this issue from 2022: https://github.com/kedro-org/kedro/discussions/1286 but in my case removing the load and save args, as the OP mentions, doesn't solve my problem.
.
This is quite puzzling, since I just did a df.to_clipboard() inside the node before returning my output, opened it in a Jupyter notebook, and I see no problems with the dataframe. I can even save it to parquet without any issues. So that makes me think the problem comes from kedro (?)
.
Anyways, as a workaround I'm saving the dataframe as csv and it's working just fine. But I'd like to find a way to make the parquet work again since this is a huge file.
Thanks in advance!
Nok Lam Chan
06/12/2023, 3:36 PM
Nok Lam Chan
06/12/2023, 3:37 PM
kedro-datasets installed?
Jose Nuñez
06/12/2023, 3:40 PM
Jose Nuñez
06/12/2023, 3:44 PM
[06/12/23 11:38:31] INFO Kedro project planta-litio session.py:360
[06/12/23 11:38:34] INFO Running node: get_local_credentials: get_sql_credentials(None) -> [sql_credentials] node.py:329
INFO Saving data to 'sql_credentials' (_SharedMemoryDataSet)... data_catalog.py:382
[06/12/23 11:38:35] INFO Loading data from 'diccionario_raw' (ExcelDataSet)... data_catalog.py:343
INFO Loading data from 'sql_credentials' (_SharedMemoryDataSet)... data_catalog.py:343
INFO Loading data from 'params:sql_data.date_init' (MemoryDataSet)... data_catalog.py:343
INFO Loading data from 'params:sql_data.date_end' (MemoryDataSet)... data_catalog.py:343
INFO Running node: download_data_sql_azure: node.py:329
download_data_azure([diccionario_raw,sql_credentials,params:sql_data.date_init,params:sql_data.date_end]) -> [data_sql]
[06/12/23 11:38:36] WARNING /Users/jose_darnott/PycharmProjects/planta-litio/src/planta_litio/pipelines/pre_processing/nodes.py:344: UserWarning: pandas warnings.py:109
only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2
objects are not tested. Please consider using SQLAlchemy.
df = pd.read_sql(query_data, db_connection)
[06/12/23 11:38:41] INFO Saving data to 'data_sql' (ParquetDataSet)... data_catalog.py:382
_RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/io/core.py", line 214, in save
self._save(data)
File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro_datasets/pandas/parquet_dataset.py", line 188, in _save
data.to_parquet(bytes_buffer, **self._save_args)
File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/pandas/core/frame.py", line 2889, in to_parquet
return to_parquet(
File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/pandas/io/parquet.py", line 411, in to_parquet
impl.write(
File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/pandas/io/parquet.py", line 159, in write
table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
File "pyarrow/table.pxi", line 3681, in pyarrow.lib.Table.from_pandas
File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 570, in dataframe_to_arrays
convert_fields) = _get_columns_to_convert(df, schema, preserve_index,
File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 352, in _get_columns_to_convert
raise ValueError(
ValueError: Duplicate column names found: ['timestamp', ...]
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/runner/parallel_runner.py", line 122, in _run_node_synchronization
return run_node(node, catalog, hook_manager, is_async, session_id)
File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/runner/runner.py", line 319, in run_node
node = _run_node_sequential(node, catalog, hook_manager, session_id)
File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/runner/runner.py", line 435, in _run_node_sequential
catalog.save(name, data)
File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/io/data_catalog.py", line 384, in save
dataset.save(data)
File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/io/core.py", line 613, in save
super().save(data)
File "/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/io/core.py", line 221, in save
raise DataSetError(message) from exc
kedro.io.core.DataSetError: Failed while saving data to data set ParquetDataSet(filepath=/Users/jose_darnott/PycharmProjects/planta-litio/data/01_raw/data_sql.parquet,
load_args={'engine': pyarrow}, protocol=file, save_args={'engine': pyarrow}).
Duplicate column names found: ['timestamp' ...]
The above exception was the direct cause of the following exception:
Jose Nuñez
06/12/2023, 3:44 PM
Traceback (most recent call last)
/Users/jose_darnott/opt/miniconda3/envs/planta-litio/bin/kedro:8 in <module>

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/framework/cli/cli.py:211 in main

  208     """
  209     _init_plugins()
  210     cli_collection = KedroCLI(project_path=Path.cwd())
❱ 211     cli_collection()
  212

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/click/core.py:1130 in __call__

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/framework/cli/cli.py:139 in main

  136         )
  137
  138         try:
❱ 139             super().main(
  140                 args=args,
  141                 prog_name=prog_name,
  142                 complete_var=complete_var,

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/click/core.py:1055 in main

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/click/core.py:1657 in invoke

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/click/core.py:1404 in invoke

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/click/core.py:760 in invoke

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/framework/cli/project.py:472 in run

  469     with KedroSession.create(
  470         env=env, conf_source=conf_source, extra_params=params
  471     ) as session:
❱ 472         session.run(
  473             tags=tag,
  474             runner=runner(is_async=is_async),
  475             node_names=node_names,

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/framework/session/session.py:426 in run

  423         )
  424
  425         try:
❱ 426             run_result = runner.run(
  427                 filtered_pipeline, catalog, hook_manager, session_id
  428             )
  429             self._run_called = True

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/runner/runner.py:91 in run

   88             self._logger.info(
   89                 "Asynchronous mode is enabled for loading and saving data"
   90             )
❱  91         self._run(pipeline, catalog, hook_manager, session_id)
   92
   93         self._logger.info("Pipeline execution completed successfully.")
   94

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/site-packages/kedro/runner/parallel_runner.py:334 in _run

  331                     break # pragma: no cover
  332                 done, futures = wait(futures, return_when=FIRST_COMPLETED)
  333                 for future in done:
❱ 334                     node = future.result()
  335                     done_nodes.add(node)
  336
  337                     # Decrement load counts, and release any datasets we

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/concurrent/futures/_base.py:437 in result

  434                 if self._state in [CANCELLED, CANCELLED_AND_NOTIFIED]:
  435                     raise CancelledError()
  436                 elif self._state == FINISHED:
❱ 437                     return self.__get_result()
  438
  439                 self._condition.wait(timeout)
  440

/Users/jose_darnott/opt/miniconda3/envs/planta-litio/lib/python3.8/concurrent/futures/_base.py:389 in __get_result

  386     def __get_result(self):
  387         if self._exception:
  388             try:
❱ 389                 raise self._exception
  390             finally:
  391                 # Break a reference cycle with the exception in self._exception
  392                 self = None
DataSetError: Failed while saving data to data set ParquetDataSet(filepath=/Users/jose_darnott/PycharmProjects/planta-litio/data/01_raw/data_sql.parquet,
load_args={'engine': pyarrow}, protocol=file, save_args={'engine': pyarrow}).
Duplicate column names found: ['timestamp', ...]
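Since the error reports duplicate column names, one quick check is to look for repeated labels in the node's output before returning it. Below is a minimal sketch of that check, not something from the thread: the helper name `find_duplicate_columns` and the example labels are hypothetical, and the helper takes any iterable of labels so it can be tried without pandas.

```python
from collections import Counter


def find_duplicate_columns(columns):
    """Return the labels that occur more than once, in first-seen order."""
    counts = Counter(columns)  # Counter keeps insertion order (Python 3.7+)
    return [name for name, n in counts.items() if n > 1]


# Hypothetical labels, loosely based on the names shown in the error message.
example_columns = ["timestamp", "lims_BFIL CO3", "lims_BFIL Ca %", "timestamp"]
duplicates = find_duplicate_columns(example_columns)
print(duplicates)
```

Inside the node this could be called as `find_duplicate_columns(df.columns)`. With pandas alone, `df.columns[df.columns.duplicated()]` gives a similar answer. An empty result would suggest the labels are unique at that point in the pipeline.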
Nok Lam Chan
06/12/2023, 3:44 PM
df.to_parquet?
bytes_buffer = BytesIO()
data.to_parquet(bytes_buffer, **self._save_args)
This is literally what Kedro does behind the scenes. We use the same logic as pandas; Kedro handles the path but nothing more.
Nok Lam Chan
06/12/2023, 3:46 PM
1. df.to_parquet - see if it works
2. If 1 works, then try
from io import BytesIO
bytes_buffer = BytesIO()
data.to_parquet(bytes_buffer)
Nok Lam Chan
06/12/2023, 3:47 PM
pandas or pyarrow
Jose Nuñez
06/12/2023, 3:51 PM
Nok Lam Chan
06/12/2023, 3:58 PM
Nok Lam Chan
06/12/2023, 3:59 PM