Higor Carmanini
05/25/2023, 10:31 PM
I'm using a PartitionedDataSet to read many CSVs into Spark DataFrames. I just found an issue where, apparently, Spark automatically appends the column position to the column name (as read from the header) to create the actual final name. See the example in the image.
Since this is sometimes done for deduplication, I investigated whether there was a name collision nearby, and sure enough there is another dataset in this same PartitionedDataSet that reads a column of the same name. This could "explain" Spark's funky behavior of treating it as a duplicate. Of course, though, these are two separate DataFrames.
Has anyone stumbled upon this issue before? I can't find any references online. Thank you!
EDIT: Solved! It was due to Spark's default setting of case insensitivity. Applying .alias() to the table does not help with this, unfortunately.

I spun up an ipython session to test whether that's Spark's default behavior, and it doesn't seem like it:

In [15]: df1 = spark.read.csv(filepath, sep=',', header=True)
In [16]: df1
Out[16]: DataFrame[col1: string, col2: string, col3: string]
In [17]: df2 = spark.read.csv(filepath, sep=',', header=True)
In [18]: df2
Out[18]: DataFrame[col1: string, col2: string, col3: string]
Is this specific to the PartitionedDataSet?

The clash was between the min(rank) and min(Rank) columns, so I eventually found out it's due to Spark's default case insensitivity. Not Kedro-related at all! Problem solved 😄

Nok Lam Chan
05/26/2023, 11:52 AM
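The renaming described in this thread can be sketched in plain Python. This is a hypothetical helper, not Spark's API (the real logic lives inside Spark's CSV header handling); it only illustrates how a case-insensitive comparison makes min(rank) and min(Rank) collide, with the column position appended to disambiguate:

```python
from collections import Counter

def make_safe_header(header, case_sensitive=False):
    """Rough imitation of Spark's CSV header deduplication: when a name
    appears more than once (compared case-insensitively by default),
    every colliding occurrence gets its column position appended."""
    key = (lambda n: n) if case_sensitive else (lambda n: n.lower())
    counts = Counter(key(name) for name in header)
    return [
        f"{name}{i}" if counts[key(name)] > 1 else name
        for i, name in enumerate(header)
    ]

# Default, case-insensitive: "min(rank)" and "min(Rank)" collide,
# so both get their position appended.
print(make_safe_header(["id", "min(rank)", "min(Rank)"]))
# → ['id', 'min(rank)1', 'min(Rank)2']

# Case-sensitive: the two names are distinct and kept as-is.
print(make_safe_header(["id", "min(rank)", "min(Rank)"], case_sensitive=True))
# → ['id', 'min(rank)', 'min(Rank)']
```

This also shows why the two DataFrames being separate doesn't matter: the collision happens within a single file's header, purely because of how the names compare under case insensitivity.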