# questions
e
Hi all, I am starting with Kedro and I have a lot of doubts. What is the recommended way of saving and loading data within the project structure? For example: I have a node that reads raw data, does data wrangling, and saves intermediate data in Parquet format. Is it good practice to do it this way, or is it better to use "from kedro.io import DataCatalog" as I show in the image?
j
hi @Eduardo Romero López! thanks for bringing this question, it's an important one. you don't need to manually use the catalog from the functions. in fact, your functions only need to know how to receive data as inputs, and how to return the result
then, when you define the pipeline, you declare the nodes and map names in the catalog with inputs and outputs. for example,
from kedro.pipeline import node, pipeline

pipeline([
  node(
    func=intermediate_data,           # plain function: takes a DataFrame, returns a DataFrame
    inputs=["raw_dataframe"],         # catalog entry passed as the function's argument
    outputs="intermediate_dataframe"  # catalog entry the return value is saved to
  )
])
where "raw_dataframe" and "intermediate_dataframe" are defined in
conf/base/catalog.yml
then, Kedro takes care of mapping those catalog entries to the function inputs and outputs
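just to make that concrete, here is a minimal sketch (illustrative only, not from your project) of what a node function like intermediate_data could look like: plain pandas, with no catalog or file I/O inside the function

import pandas as pd

def intermediate_data(raw_dataframe: pd.DataFrame) -> pd.DataFrame:
    # placeholder wrangling step; replace with your own logic
    cleaned = raw_dataframe.dropna().reset_index(drop=True)
    # Kedro saves the returned DataFrame to whatever catalog entry is
    # mapped to this node's "outputs" (here, "intermediate_dataframe")
    return cleaned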
let me know if that helps
e
ok, thank you very much 🙂. but is it necessary to add "df.to_parquet" in the node function to save the intermediate data in "./data/02_intermediate"?
j
nope, you can
return df
and then declare the dataset in catalog.yml as
intermediate_dataframe:   # must match the "outputs" name used in the node
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/preprocessed_queries.pq
(pseudocode, didn't test it but you get the idea)
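if you later want to inspect the persisted dataset outside the pipeline, kedro ipython starts a session with a preloaded catalog object, so something like this works (a sketch, assuming the entry name above):

# inside a `kedro ipython` session, `catalog` is already available
df = catalog.load("intermediate_dataframe")  # reads data/02_intermediate/preprocessed_queries.pq
df.head()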
e
ok, but the data will not be saved in the "02_intermediate" folder, it will only be in memory, right?
At first, I would like each return value to be saved in its folder so that the data persists.
n
It will be saved in the folder you declared.
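since the output is declared in catalog.yml, Kedro writes it to disk; only outputs that are not listed in the catalog stay in memory (they default to MemoryDataset). if you want to double-check after a kedro run, a quick sketch:

import pandas as pd

# read back the file Kedro wrote for the declared dataset
df = pd.read_parquet("data/02_intermediate/preprocessed_queries.pq")
print(df.shape)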
e
Thanks to both 😉
it works now!!!