Higor Carmanini 05/25/2023, 10:31 PM
to read many CSVs into Spark DataFrames. I just found an issue where Spark apparently appends the column's position to the column name (as read from the header) to produce the final name. See the example in the image. Since Spark sometimes does this to deduplicate column names, I checked whether something similar was happening here, and sure enough there is another dataset in this same
that reads another column of the same name. This could "explain" Spark's funky behavior of treating the column as a duplicate. Of course, these are two separate DataFrames, though. Has anyone stumbled upon this issue before? I can't find any references online. Thank you! EDIT: Solved! It was due to Spark's default case insensitivity.
to the table does not help with this, unfortunately.
session to test if that's Spark's default behavior, and it doesn't seem like it:
In : df1 = spark.read.csv(filepath, sep=',', header=True)
In : df1
Out: DataFrame[col1: string, col2: string, col3: string]
In : df2 = spark.read.csv(filepath, sep=',', header=True)
In : df2
Out: DataFrame[col1: string, col2: string, col3: string]
columns, so I eventually found out it's due to Spark's default case insensitivity. Not Kedro-related at all! Problem solved 😄
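For anyone landing here later, this is a minimal pure-Python sketch of the renaming described above. It assumes that Spark's CSV reader appends the column index to any header name that collides with another one, and that the comparison ignores case unless `spark.sql.caseSensitive` is enabled. The function `make_safe_header` and the sample header are hypothetical and only approximate the behavior; this is not Spark's actual code.

```python
def make_safe_header(header, case_sensitive=False):
    """Approximate how duplicate CSV header names seem to get renamed.

    Hypothetical helper for illustration only, not a Spark API.
    """
    # Compare names as-is, or lowercased when case is ignored
    # (assumed to mirror spark.sql.caseSensitive=false, the default).
    keys = [name if case_sensitive else name.lower() for name in header]
    duplicates = {key for key in keys if keys.count(key) > 1}

    result = []
    for index, (name, key) in enumerate(zip(header, keys)):
        if not name:
            # Empty header cells get a positional placeholder name.
            result.append(f"_c{index}")
        elif key in duplicates:
            # Colliding names get their column position appended.
            result.append(f"{name}{index}")
        else:
            result.append(name)
    return result


# Made-up header where "Name" and "name" differ only by case.
print(make_safe_header(["id", "Name", "name"]))
# → ['id', 'Name1', 'name2']
print(make_safe_header(["id", "Name", "name"], case_sensitive=True))
# → ['id', 'Name', 'name']
```

In a real session, setting `spark.conf.set("spark.sql.caseSensitive", "true")` before reading should keep the original names, assuming the columns really do differ only by case. Renaming the columns in the source file is the other option.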
Nok Lam Chan 05/26/2023, 11:52 AM