#questions

Eduardo Romero López

07/09/2023, 11:32 AM
Hi all, I am starting with Kedro and I have a lot of doubts. What is the recommended way to save and load data within the project structure? For example: I have a node that reads raw data, does data wrangling, and saves the intermediate data in Parquet format. Is it good practice to do it this way, or is it better to use "from kedro.io import data_catalog" as I show in the image?

Juan Luis

07/09/2023, 11:34 AM
hi @Eduardo Romero López! thanks for bringing this question, it's an important one. you don't need to manually use the catalog from the functions. in fact, your functions only need to know how to receive data as inputs, and how to return the result.
then, when you define the pipeline, you declare the nodes and map names in the catalog to inputs and outputs. for example,
from kedro.pipeline import node, pipeline

pipeline([
    node(
        func=intermediate_data,
        inputs=["raw_dataframe"],
        outputs="intermediate_dataframe",
    )
])
where "raw_dataframe" and "intermediate_dataframe" are defined in
conf/base/catalog.yml
then, Kedro takes care of mapping those catalog entries to the function inputs and outputs
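for context, a minimal sketch of what intermediate_data itself could look like (the wrangling body here is just an assumption, any pandas logic works). note there is no catalog or I/O code inside the function:

import pandas as pd

def intermediate_data(raw_dataframe: pd.DataFrame) -> pd.DataFrame:
    # hypothetical wrangling step, replace with your own cleaning logic
    cleaned = raw_dataframe.dropna().rename(columns=str.lower)
    # just return the result, Kedro persists it according to the catalog
    return cleaned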
let me know if that helps

Eduardo Romero López

07/09/2023, 11:40 AM
ok, thanks very much 🙂 but is it necessary to add "df.to_parquet" in the node function to save the intermediate data in "./data/02_intermediate"?

Juan Luis

07/09/2023, 11:45 AM
nope, you can just return df from the function and then declare the dataset in catalog.yml as
intermediate_df:
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/preprocessed_queries.pq
(pseudocode, didn't test it but you get the idea)
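and to wire it up, the node's outputs name just has to match that catalog entry. roughly (same assumed names as above, also untested):

from kedro.pipeline import node, pipeline

pipeline([
    node(
        func=intermediate_data,
        inputs="raw_dataframe",
        outputs="intermediate_df",  # matches the catalog.yml entry, so Kedro writes the parquet file
    )
])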

Eduardo Romero López

07/09/2023, 11:48 AM
ok, but won't the data then only be in memory, rather than saved in the "02_intermediate" folder?
To start with, I would like to save each return in its own folder so that the data persists.

Nok Lam Chan

07/09/2023, 1:46 PM
It will be saved in the folder you declared.
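for example (assuming the intermediate_df entry sketched above), after a kedro run the file should be on disk, and you can load it back from a kedro ipython session, where the catalog object is already in scope:

# after `kedro run`, the file should sit at data/02_intermediate/preprocessed_queries.pq
# inside `kedro ipython`, `catalog` is pre-loaded for you
df = catalog.load("intermediate_df")
print(df.shape)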

Eduardo Romero López

07/09/2023, 4:25 PM
Thanks to both 😉
it works now!!!