Qiuyi Chen
11/30/2022, 6:35 PM

from typing import Dict

import pandas as pd
from pyspark.sql import DataFrame

def function_a(params: Dict, *df_lst: DataFrame):
    report = pd.DataFrame()
    for df in df_lst:
        # `function` is an existing helper elsewhere in the project
        temp = function(df, params)
        report = pd.concat([report, temp])
    return report
I can run the function like this:

function_a(params, df1, df2, df3)
But in the pipeline, how can I define the node and the catalog in this situation? Here is what I did; please let me know where I went wrong:
from kedro.pipeline import Pipeline, node

def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=function_a,
                inputs=["params", "df_lst"],
                outputs="report",
            ),
        ]
    )
from kedro.io import DataCatalog, MemoryDataSet

catalog = DataCatalog(
    data_sets={"df_lst": MemoryDataSet(df1)},
    feed_dict={"params": params},
)
I can only run the pipeline when df_lst is just one dataframe, but I want it to be something like "df_lst": df_1, df_2, df_3, …, df_n (n > 3).

datajoely
11/30/2022, 6:45 PM

["params", "input1", "input2", "input3", …]
it should be passed to a node whose function has a signature like *df_list
This is an older project, but I did something similar in this function and this node
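A minimal runnable sketch of that wiring, assuming Kedro 0.18-style imports (DataCatalog, MemoryDataSet, SequentialRunner), the DataCatalog(feed_dict=...) form from the question, and toy pandas dataframes standing in for the real df1, df2, df3; the dataset names df_1, df_2, df_3 and the trivial function body are placeholders:

from typing import Any, Dict

import pandas as pd
from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline, node
from kedro.runner import SequentialRunner

def function_a(params: Dict[str, Any], *df_lst: pd.DataFrame) -> pd.DataFrame:
    # "params" is bound to the first node input; every remaining input
    # listed on the node is passed positionally and collected into df_lst
    return pd.concat(df_lst, ignore_index=True)

pipeline = Pipeline(
    [
        node(
            func=function_a,
            inputs=["params", "df_1", "df_2", "df_3"],  # one entry per dataframe
            outputs="report",
        ),
    ]
)

# toy dataframes standing in for the real inputs
df1 = pd.DataFrame({"x": [1]})
df2 = pd.DataFrame({"x": [2]})
df3 = pd.DataFrame({"x": [3]})

catalog = DataCatalog(
    data_sets={
        "df_1": MemoryDataSet(df1),
        "df_2": MemoryDataSet(df2),
        "df_3": MemoryDataSet(df3),
    },
    feed_dict={"params": {"some_option": True}},
)

# "report" is not registered in the catalog, so the runner returns it
report = SequentialRunner().run(pipeline, catalog)["report"]

Adding a fourth dataframe is then just one more name in inputs and one more MemoryDataSet entry in the catalog; nothing in function_a changes.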
Qiuyi Chen
11/30/2022, 6:54 PM

Fabian
12/01/2022, 10:55 AM