#questions

tingting wan

01/04/2023, 4:25 PM
Hi Team, is it possible to include the filepath in the catalog? I'm recursively loading CSV files, so the path can't be hard-coded.

Deepyaman Datta

01/04/2023, 4:56 PM
What do you mean by "recursively loading CSV files"? And filepath is in (most) data catalog entries, including for CSV datasets?

Jannic Holzer

01/04/2023, 5:18 PM
Hey Deepyaman, I'm looking at this with Tingting now. The challenge is to read multiple (though not all) CSV files in a directory into a single Spark DataFrame.
👍 2
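Outside the catalog, the usual Spark-side move is to hand the reader a glob or an explicit list of paths. A minimal sketch, with paths assumed from later in the thread:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# spark.read.csv accepts a glob or an explicit list of paths, so a subset
# of the directory's files can be read into one DataFrame in a single pass.
df = spark.read.csv(
    [
        "dbfs:/mnt/weather/forecast/forecast_2022-1-2",
        "dbfs:/mnt/weather/forecast/forecast_2022-1-3",
    ],
    header=False,
)
```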

tingting wan

01/04/2023, 5:19 PM
There are multiple CSV files under one folder. Each file is located at dbfs:/mnt/weather/forecast/forecast_<date> and has a similar schema (not exactly the same; some files are missing a field, for example). I am interested in the <date> part so I can filter within each file, therefore I would like a column containing the source path for each row.

Deepyaman Datta

01/04/2023, 5:24 PM
If you haven't, try
df.withColumn('input_file', input_file_name())
in your node.
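A minimal sketch of that suggestion as a Kedro node (the function and column names here are illustrative, not from the thread):
```python
from pyspark.sql import DataFrame
from pyspark.sql.functions import input_file_name


def tag_source_path(forecasts: DataFrame) -> DataFrame:
    """Add a column recording which CSV file each row was loaded from."""
    # input_file_name() returns the full path of the file backing each row,
    # e.g. dbfs:/mnt/weather/forecast/forecast_2022-1-2
    return forecasts.withColumn("input_file", input_file_name())
```
The function is then wired into the pipeline as an ordinary node whose input is the spark.SparkDataSet catalog entry.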

Michał Madej

01/04/2023, 5:36 PM

tingting wan

01/04/2023, 6:10 PM
df.withColumn('input_file', input_file_name())
works, but how can I apply the similar logic in Kedro?
@Michał Madej I am not sure if I am getting it correctly, right now I am using
spark.SparkDataSet
, and args
Copy code
load_args:
  header: False
  recursiveFileLookup: True
it works by passing
Copy code
filepath: dbfs:/mnt/weather/forecast/*
I want to add a column having exact source path name, e.g.,
dbfs:/mnt/weather/forecast/forecast_2022-1-2
,
dbfs:/mnt/weather/forecast/forecast_2022-1-3, dbfs:/mnt/weather/forecast/forecast_2022-1-4
etc.,
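Since the goal is filtering on the <date> portion, the path column can be parsed one step further. A hedged sketch (the function name, column names, and date regex are assumptions based on the example paths above):
```python
from pyspark.sql import DataFrame
from pyspark.sql.functions import input_file_name, regexp_extract


def add_forecast_date(forecasts: DataFrame) -> DataFrame:
    """Derive a forecast_date column from each row's source file path."""
    with_path = forecasts.withColumn("input_file", input_file_name())
    # Pull the <date> suffix out of .../forecast/forecast_<date>;
    # assumes dates look like 2022-1-2, as in the examples above.
    return with_path.withColumn(
        "forecast_date",
        regexp_extract("input_file", r"forecast_(\d{4}-\d{1,2}-\d{1,2})", 1),
    )
```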

Deepyaman Datta

01/04/2023, 7:17 PM
You can put it in the node, or you can create a hook that runs after the dataset is loaded.
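For completeness, a skeleton of the hook route, assuming Kedro 0.18-style hooks (the class and names are illustrative). Note that Kedro does not replace the loaded data with this hook's return value, and Spark DataFrames are immutable, so adding the column is more naturally done in the node as sketched above; the hook is better suited to inspecting or logging what was loaded:
```python
import logging
from typing import Any

from kedro.framework.hooks import hook_impl

logger = logging.getLogger(__name__)


class DatasetLoggingHook:
    """Observes every dataset load during a Kedro run."""

    @hook_impl
    def after_dataset_loaded(self, dataset_name: str, data: Any) -> None:
        # The return value of this hook is ignored, so it cannot swap in a
        # transformed DataFrame; keep withColumn logic in the node instead.
        logger.info("Loaded dataset %s (%s)", dataset_name, type(data).__name__)
```
The hook would be registered in src/<package_name>/settings.py via HOOKS = (DatasetLoggingHook(),).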

tingting wan

01/06/2023, 9:36 AM
Thanks @Deepyaman Datta, sorry, one more question: do you mean putting it in the node, rather than in the loading configured through YAML?
I got it resolved! Thanks for the help, all!