#questions

tingting wan

01/04/2023, 4:25 PM
Hi Team, is it possible to include the filepath in the catalog? I'm recursively loading CSV files, so the path can't be hard-coded.

Deepyaman Datta

01/04/2023, 4:56 PM
What do you mean by "recursively loading CSV files"? And filepath is in (most) data catalog entries, including for CSV datasets?

Jannic Holzer

01/04/2023, 5:18 PM
Hey Deepyaman, I'm looking at this with Tingting now. The challenge is to read multiple (though not all) CSV files in a directory into a single Spark DataFrame.
👍 2
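Outside the catalog, the usual Spark-side move is to hand the reader a glob or an explicit list of paths. A minimal sketch, with paths assumed from later in the thread:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# spark.read.csv accepts a glob or an explicit list of paths, so a subset
# of the directory's files can be read into one DataFrame in a single pass.
df = spark.read.csv(
    [
        "dbfs:/mnt/weather/forecast/forecast_2022-1-2",
        "dbfs:/mnt/weather/forecast/forecast_2022-1-3",
    ],
    header=False,
)
```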

tingting wan

01/04/2023, 5:19 PM
There are multiple CSV files under one folder. Each file is located at dbfs:/mnt/weather/forecast/forecast_<date> and has a similar schema (not exactly the same; some files are missing a field, for example). I am interested in the <date> part so I can filter within each file, therefore I would like a column containing the source path for each row.

Deepyaman Datta

01/04/2023, 5:24 PM
If you haven't, try
df.withColumn('input_file', input_file_name())
in your node.
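A minimal sketch of that suggestion as a Kedro node (the function and column names here are illustrative, not from the thread):
```python
from pyspark.sql import DataFrame
from pyspark.sql.functions import input_file_name


def tag_source_path(forecasts: DataFrame) -> DataFrame:
    """Add a column recording which CSV file each row was loaded from."""
    # input_file_name() returns the full path of the file backing each row,
    # e.g. dbfs:/mnt/weather/forecast/forecast_2022-1-2
    return forecasts.withColumn("input_file", input_file_name())
```
The function is then wired into the pipeline as an ordinary node whose input is the spark.SparkDataSet catalog entry.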

Michał Madej

01/04/2023, 5:36 PM

tingting wan

01/04/2023, 6:10 PM
df.withColumn('input_file', input_file_name())
works, but how can I apply the similar logic in Kedro?
@Michał Madej I am not sure if I am getting it correctly, right now I am using
spark.SparkDataSet
, and args
Copy code
load_args:
  header: False
  recursiveFileLookup: True
it works by passing
Copy code
filepath: dbfs:/mnt/weather/forecast/*
I want to add a column having exact source path name, e.g.,
dbfs:/mnt/weather/forecast/forecast_2022-1-2
,
dbfs:/mnt/weather/forecast/forecast_2022-1-3, dbfs:/mnt/weather/forecast/forecast_2022-1-4
etc.,
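Since the goal is filtering on the <date> portion, the path column can be parsed one step further. A hedged sketch (the function name, column names, and date regex are assumptions based on the example paths above):
```python
from pyspark.sql import DataFrame
from pyspark.sql.functions import input_file_name, regexp_extract


def add_forecast_date(forecasts: DataFrame) -> DataFrame:
    """Derive a forecast_date column from each row's source file path."""
    with_path = forecasts.withColumn("input_file", input_file_name())
    # Pull the <date> suffix out of .../forecast/forecast_<date>;
    # assumes dates look like 2022-1-2, as in the examples above.
    return with_path.withColumn(
        "forecast_date",
        regexp_extract("input_file", r"forecast_(\d{4}-\d{1,2}-\d{1,2})", 1),
    )
```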

Deepyaman Datta

01/04/2023, 7:17 PM
You can put it in the node, or you can create a hook that runs after the dataset is loaded.
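For completeness, a skeleton of the hook route, assuming Kedro 0.18-style hooks (the class and names are illustrative). Note that Kedro does not replace the loaded data with this hook's return value, and Spark DataFrames are immutable, so adding the column is more naturally done in the node as sketched above; the hook is better suited to inspecting or logging what was loaded:
```python
import logging
from typing import Any

from kedro.framework.hooks import hook_impl

logger = logging.getLogger(__name__)


class DatasetLoggingHook:
    """Observes every dataset load during a Kedro run."""

    @hook_impl
    def after_dataset_loaded(self, dataset_name: str, data: Any) -> None:
        # The return value of this hook is ignored, so it cannot swap in a
        # transformed DataFrame; keep withColumn logic in the node instead.
        logger.info("Loaded dataset %s (%s)", dataset_name, type(data).__name__)
```
The hook would be registered in src/<package_name>/settings.py via HOOKS = (DatasetLoggingHook(),).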

tingting wan

01/06/2023, 9:36 AM
Thanks @Deepyaman Datta, sorry, one more question: do you mean putting it in the node, rather than in the loading configured through YAML?
I got it resolved! Thanks for the help, all!