# questions
Higor Carmanini
I have an issue with Kedro and SparkDatasets. I am using a `PartitionedDataSet` to read many CSVs into Spark DataFrames. I just found an issue where, apparently, Spark automatically appends the column position to the column name (as read from the header) to create the actual final name; see the example in the attached image. Since Spark sometimes does this to de-duplicate column names, I investigated whether something like that was going on, and sure enough there is another dataset in this same `PartitionedDataSet` that reads another column of the same name. This could "explain" Spark's funky behavior of treating it as a duplicate, but of course these are two separate DataFrames. Has anyone stumbled upon this issue before? I can't find any references online. Thank you! EDIT: Solved! It was due to Spark's default case-insensitive setting.
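For context, a setup like the one described typically looks something like the sketch below. This is only an illustration: the folder path, partition layout, and dataset options are assumptions, and the exact class names and import paths vary between Kedro versions.

```python
from kedro.extras.datasets.spark import SparkDataSet
from kedro.io import PartitionedDataSet

# Hypothetical folder of CSV partitions, each loaded as a Spark DataFrame.
many_csvs = PartitionedDataSet(
    path="data/01_raw/reports",  # assumed location
    dataset={
        "type": SparkDataSet,
        "file_format": "csv",
        "load_args": {"header": True, "inferSchema": True},
    },
    filename_suffix=".csv",
)

# load() returns a dict of {partition_id: load_function}; calling each
# function yields a separate Spark DataFrame, one per CSV file.
partitions = many_csvs.load()
dataframes = {name: load_func() for name, load_func in partitions.items()}
```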
Adding an `.alias()` to the table does not help with this, unfortunately.
The weird thing is: I just tried to do this in an IPython session to test whether that's Spark's default behavior, and it doesn't seem like it:
```
In [15]: df1 = spark.read.csv(filepath, sep=',', header=True)

In [16]: df1
Out[16]: DataFrame[col1: string, col2: string, col3: string]

In [17]: df2 = spark.read.csv(filepath, sep=',', header=True)

In [18]: df2
Out[18]: DataFrame[col1: string, col2: string, col3: string]
```
Could it be a bug in `PartitionedDataSet`? 🤔
Welp, I missed that the same CSV had both `min(rank)` and `min(Rank)` columns, so I eventually found out it's due to Spark's default case insensitivity. Not Kedro-related at all! Problem solved 😄
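A minimal sketch of the behavior described above, assuming a hypothetical `report.csv` whose header contains both a `min(rank)` and a `min(Rank)` column: with the default `spark.sql.caseSensitive=false`, Spark treats the two header names as duplicates and de-duplicates them by appending each column's position, whereas with case-sensitive analysis enabled the names are kept as written.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical file whose header row contains both "min(rank)" and "min(Rank)".
filepath = "data/01_raw/report.csv"

# Default: case-insensitive analysis, so the two headers collide and Spark
# appends the column position to each duplicate name (e.g. "min(rank)0").
spark.conf.set("spark.sql.caseSensitive", "false")
print(spark.read.csv(filepath, sep=",", header=True).columns)

# Case-sensitive analysis: "min(rank)" and "min(Rank)" are distinct, so the
# header names are kept exactly as written in the file.
spark.conf.set("spark.sql.caseSensitive", "true")
print(spark.read.csv(filepath, sep=",", header=True).columns)
```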
@Higor Carmanini Nice debugging! Thank you for sharing back your findings!