# questions
m
Hi, I'm encountering an issue where Kedro adds extra empty lines between each data row when saving a CSV file using the pandas.CSVDataset type. This results in empty rows when opening the file as text. Have you encountered this issue, and do you know how I can prevent these extra lines from being added during the save process? It looks similar to this issue - https://github.com/kedro-org/kedro/issues/492
👀 1
e
Hey, can you please share how you’ve set up your dataset and/or a minimal example to reproduce?
m
Hi, thanks for your reply. catalog.yml:
# Here you can define all your datasets by using simple YAML syntax.
#
# Documentation for this file format can be found in "The Data Catalog"
# Link: <https://docs.kedro.org/en/stable/data/data_catalog.html>

test_csv:
  type: pandas.CSVDataset
  filepath: data/df_test.tsv
  load_args:
    sep: "\t"
    keep_default_na: False
    encoding: "utf-8"
  save_args:
    index: false
    sep: "\t"
    header: False
Example pipeline.py:
"""
This is a boilerplate pipeline 'create_df'
generated using Kedro 0.19.10
"""

from kedro.pipeline import Pipeline, pipeline, node

from .nodes import create_df


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline([
        node(
            func=create_df,
            inputs=None,
            outputs='test_csv',
            name='create_df',
        )
    ])
nodes.py:
"""
This is a boilerplate pipeline 'create_df'
generated using Kedro 0.19.10
"""
import pandas as pd

def create_df():
    data = {
        "label": ["A", "B", "C"],
        "country": ["GB", "ES", "FR"]
    }
    df = pd.DataFrame(data)
    return df
So kedro run creates the CSV file. This file contains an extra \n between the lines. I tried to adjust the lineterminator parameter, but it does not help. I'm using Windows 11.
data_path = 'C:/csv-extra-line/data/df_test.tsv'
with open(data_path, 'r', encoding='utf8') as file:
    lines = file.readlines()

for i, line in enumerate(lines[:5]):
    line_str = f"Line {i + 1}: '{line}'"
    print(repr(line_str))
Code output:
n
Which versions of kedro-datasets and fsspec do you have?
e
It also looks OK on macOS.
@Mikołaj Tym can you please double-check that writing the same df with just pandas, without Kedro, doesn’t produce these empty lines?
m
@Nok Lam Chan and kedro version 0.19.12
@Elena Khaustova Using df.to_csv, the output is correct (without extra lines).
e
I can’t check it on Windows, and it works fine on macOS, but it looks like a bug, possibly something with how the tab separator is encoded. Feel free to create an issue so we can investigate and address it. Sorry about this experience 😔
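One possible explanation, offered here as an assumption and not confirmed in this thread: if the dataset hands pandas a text-mode file handle that translates newlines the way Windows does by default, then the "\r\n" row terminator that pandas uses on Windows would have its "\n" expanded a second time, giving "\r\r\n". Many readers treat the stray "\r" as its own line break, which would look like an empty row. The sketch below simulates that translation with the standard library only; the helper name write_through_text_layer is hypothetical, and newline="\r\n" stands in for the Windows text-mode default so the effect is visible on any OS.

```python
import io


def write_through_text_layer(payload: str, newline: str) -> bytes:
    """Write text through a TextIOWrapper and return the raw bytes produced."""
    raw = io.BytesIO()
    wrapper = io.TextIOWrapper(raw, encoding="utf-8", newline=newline)
    wrapper.write(payload)
    wrapper.flush()
    data = raw.getvalue()
    wrapper.detach()  # keep the wrapper from closing the buffer on cleanup
    return data


# Rows terminated the way pandas terminates them by default on Windows.
windows_rows = "A\tGB\r\nB\tES\r\n"

# newline="\r\n" mimics the Windows text-mode default: every "\n" written
# is expanded to "\r\n", so "\r\n" becomes "\r\r\n".
translated = write_through_text_layer(windows_rows, newline="\r\n")

# newline="" disables translation, so the bytes are written unchanged.
untouched = write_through_text_layer(windows_rows, newline="")

# Rows terminated with a plain "\n" come out as a clean "\r\n".
plain_rows = write_through_text_layer("A\tGB\nB\tES\n", newline="\r\n")

print(translated)
print(untouched)
print(plain_rows)
```

If this is what happens, reading the saved file in binary mode (open(path, "rb")) would show "\r\r\n" at the end of each row, whereas the text-mode check earlier in the thread can only show the result after newline translation.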
m
I really appreciate your help! I discovered that adding the lineterminator: "\n" parameter to save_args solves the issue, so there are no extra lines anymore. Yesterday, I added this parameter using single quotes, which caused YAML to misinterpret it and save the whole file as a single line on Windows. Using double quotes fixes the problem. However, it is not a default parameter, and these extra lines are unexpected output. If you think it is worth investigating further, I can create an issue; otherwise, the lineterminator parameter works as a workaround.
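The quoting difference comes from YAML itself: inside double quotes, \n is an escape for a newline character, while inside single quotes the backslash is literal, so '\n' is the two characters backslash and n. A minimal sketch of the effect follows, using the standard-library csv writer as a stand-in for pandas (an assumption made only to keep the example dependency-free); the helper name write_rows is hypothetical.

```python
import csv
import io


def write_rows(lineterminator: str) -> str:
    """Write three tab-separated rows using the given line terminator."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator=lineterminator)
    writer.writerows([["A", "GB"], ["B", "ES"], ["C", "FR"]])
    return buf.getvalue()


# YAML "\n" (double quotes) yields a real newline character.
double_quoted = write_rows("\n")

# YAML '\n' (single quotes) yields a literal backslash followed by "n".
single_quoted = write_rows("\\n")

print(len(double_quoted.splitlines()))  # three separate rows
print(len(single_quoted.splitlines()))  # everything on one line
print(repr(single_quoted))
```

With the two-character terminator, the rows are joined by a literal \n and never break, which matches the single-line file described above.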
e
Thanks for sharing your workaround. In this case, we don’t need a separate issue for that. Hope that your further experience will be smoother!
👍 1