# questions
t
Hi Team, is it possible to include the filepath in the catalog? I'm recursively loading CSV files, so it can't be hard-coded.
d
What do you mean by "recursively loading csv file"? And filepath is in (most) data catalog entries, including for CSV datasets?
j
Hey Deepyaman, I'm looking at this with Tingting now. The challenge is to read multiple (though not all) CSV files in a directory into a single spark dataframe.
👍 2
t
There are multiple CSV files under one folder. Each file is located at
dbfs:/mnt/weather/forecast/forecast_<date>
with a similar schema (not exactly the same; some files are missing a field, for example). I am interested in the <date> part so that I can filter within each file, so I would like a column containing the source path for each row.
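For the filtering step, the <date> suffix can be pulled out of a source path with a small helper. The helper name and the regex below are assumptions based on the example paths in this thread, not part of the original discussion:

```python
import re


def forecast_date(path: str) -> str:
    # Extract the <date> suffix from a path such as
    # dbfs:/mnt/weather/forecast/forecast_2022-1-2
    # (the "forecast_<date>" naming pattern is assumed from this thread)
    match = re.search(r"forecast_([\d-]+)$", path)
    return match.group(1) if match else ""


print(forecast_date("dbfs:/mnt/weather/forecast/forecast_2022-1-2"))  # 2022-1-2
```

The same pattern could be applied to a source-path column once it exists, e.g. with Spark's regexp_extract.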
d
If you haven't, try df.withColumn('input_file', input_file_name()) in your node.
t
df.withColumn('input_file', input_file_name()) works, but how can I apply similar logic in Kedro?
@Michał Madej I am not sure if I am getting it correctly. Right now I am using spark.SparkDataSet with these load_args:
load_args:
  header: False
  recursiveFileLookup: True
It works by passing
filepath: dbfs:/mnt/weather/forecast/*
I want to add a column containing the exact source path for each row, e.g.,
dbfs:/mnt/weather/forecast/forecast_2022-1-2
dbfs:/mnt/weather/forecast/forecast_2022-1-3
dbfs:/mnt/weather/forecast/forecast_2022-1-4
etc.
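Assembled, the catalog entry being described might look like the sketch below. The dataset name weather_forecast and the file_format value are illustrative assumptions, not taken from the thread:

```yaml
weather_forecast:
  type: spark.SparkDataSet
  filepath: dbfs:/mnt/weather/forecast/*
  file_format: csv
  load_args:
    header: False
    recursiveFileLookup: True
```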
d
You can put it in the node, or you can create a hook that runs after the dataset is loaded.
t
Thanks @Deepyaman Datta. Sorry, one more question: do you mean putting it in the node, replacing loading through YAML?
I got it resolved! Thanks for the help, all!