# questions
q
Hi team, hope this message finds you well. I'm trying to pass a list of dataframes as an input to a Kedro pipeline. Here is what I did, but it doesn't work when I pass multiple dataframes. Can you help me with it? Thank you very much.
```python
from typing import Dict

import pandas as pd
from pyspark.sql import DataFrame

def function_a(params: Dict, *df_lst: DataFrame) -> pd.DataFrame:
    # Run the per-dataframe transformation and stack the results into one report
    report = pd.DataFrame()
    for df in df_lst:
        temp = function(df, params)
        report = pd.concat([report, temp])
    return report
```
I can run the function like this:
```python
function_a(params, df1, df2, df3)
```
But in the pipeline, how can I define the node and the catalog in this situation? Here is what I did; please let me know where I went wrong:
```python
from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline, node

def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=function_a,
                inputs=["params", "df_lst"],
                outputs="report",
            )
        ]
    )

catalog = DataCatalog(
    data_sets={"df_lst": MemoryDataSet(df1)},
    feed_dict={"params": params},
)
```
I can only run the pipeline when df_lst is a single dataframe, but I want it to be something like `"df_lst": df_1, df_2, df_3, …, df_n` (n > 3).
d
So if you do
`["params", "input1", "input2", "input3", …]`
it should be passed to the node with a function signature like
`*df_list`
This is an older project, but I did something similar in this function and this node.
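For what it's worth, here is a minimal sketch of that pattern (the catalog entry names `df_1`, `df_2`, `df_3` are made up for this example): every input after `"params"` is passed positionally and collected by `*df_lst`.

```python
from kedro.pipeline import Pipeline, node

def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=function_a,  # def function_a(params, *df_lst) as above
                # "params" fills the first parameter; the remaining inputs
                # are passed positionally and collected by *df_lst
                inputs=["params", "df_1", "df_2", "df_3"],
                outputs="report",
            )
        ]
    )
```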
q
Thank you for your reply. What if the number of dataframes is only known at the last step, when I call the pipeline? My understanding is that if I define `["params", "input1", "input2", "input3"]`, then I can only use three dataframes.
Hi, I just found a way to do what I want, thank you
f
Hi, may I ask how you did it?
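One possible way to handle a number of dataframes that is only known when the pipeline is created (a sketch, not necessarily what was done above; `n` and the `df_{i}` names are assumptions for illustration) is to build the `inputs` list programmatically:

```python
from kedro.pipeline import Pipeline, node

def create_pipeline(n: int = 5, **kwargs):
    # One catalog entry name per dataframe, e.g. df_0 ... df_4;
    # all of them are collected by *df_lst inside function_a
    df_names = [f"df_{i}" for i in range(n)]
    return Pipeline(
        [
            node(
                func=function_a,
                inputs=["params"] + df_names,
                outputs="report",
            )
        ]
    )
```

The catalog would then need a matching entry (for example a `MemoryDataSet`) for each `df_{i}` name.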