# questions
e
Hi all, I am starting with Kedro and I have a lot of doubts. What is the recommended way of saving and loading data within the project structure? For example: I have a node that reads raw data, does data wrangling, and saves intermediate data in Parquet format. Is it good practice to do it this way, or is it better to use "from kedro.io import DataCatalog" as I show in the image?
j
hi @Eduardo Romero López! thanks for bringing this question, it's an important one. you don't need to manually use the catalog from the functions. in fact, your functions only need to know how to receive data as inputs, and how to return the result
then, when you define the pipeline, you declare the nodes and map names in the catalog with inputs and outputs. for example,
from kedro.pipeline import node, pipeline

pipeline([
  node(
    func=intermediate_data,           # plain function: takes a DataFrame, returns a DataFrame
    inputs=["raw_dataframe"],         # catalog entry passed as the function's argument
    outputs="intermediate_dataframe"  # catalog entry the return value is saved to
  )
])
where "raw_dataframe" and "intermediate_dataframe" are defined in
conf/base/catalog.yml
then, Kedro takes care of mapping those catalog entries to the function inputs and outputs
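just to make that concrete, here is a minimal sketch (illustrative only, not from your project) of what a node function like intermediate_data could look like: plain pandas, with no catalog or file I/O inside the function

import pandas as pd

def intermediate_data(raw_dataframe: pd.DataFrame) -> pd.DataFrame:
    # placeholder wrangling step; replace with your own logic
    cleaned = raw_dataframe.dropna().reset_index(drop=True)
    # Kedro saves the returned DataFrame to whatever catalog entry is
    # mapped to this node's "outputs" (here, "intermediate_dataframe")
    return cleaned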
let me know if that helps
e
ok, thank you very much 🙂. but is it necessary to add "df.to_parquet" in the node function to save the intermediate data in "./data/02_intermediate"?
j
nope, you can
return df
and then declare the dataset in catalog.yml as
intermediate_dataframe:   # must match the "outputs" name used in the node
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/preprocessed_queries.pq
(pseudocode, didn't test it but you get the idea)
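if you later want to inspect the persisted dataset outside the pipeline, kedro ipython starts a session with a preloaded catalog object, so something like this works (a sketch, assuming the entry name above):

# inside a `kedro ipython` session, `catalog` is already available
df = catalog.load("intermediate_dataframe")  # reads data/02_intermediate/preprocessed_queries.pq
df.head()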
e
ok, but the data will not be saved in the "02_intermediate" folder, it will only be in memory, right?
At first, I would like each return value to be saved in its folder so that the data persists.
n
It will be saved in the folder you declared.
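since the output is declared in catalog.yml, Kedro writes it to disk; only outputs that are not listed in the catalog stay in memory (they default to MemoryDataset). if you want to double-check after a kedro run, a quick sketch:

import pandas as pd

# read back the file Kedro wrote for the declared dataset
df = pd.read_parquet("data/02_intermediate/preprocessed_queries.pq")
print(df.shape)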
e
Thanks to both 😉
it works now!!!