# questions
s
Hi everyone, I'm a novice to Kedro, experimenting with my first implementation. Trying to parametrize every function to take the maximum advantage of the platform. While attempting to access parameters defined in the 'parameters_xxx.yml' file, say for example 'data_processing' pipeline, I have two questions. But first, a glimpse into my
parameters_data_processing.yml
file:
column_rename_params: # Suffix to be added to overlapping columns
    skip_cols: ['Date'] # Columns to skip while renaming
    co2: '_co2'
    motion: '_motion'
    presence: '_presence'

data_clean_params:
  V2_motion: {
        condition: '<0',
        new_val: 0
        }
  V2_presence: {
        condition: '<0',
        new_val: 0
        }

  infinite_values:
    infinite_val_remove: true
    infinite_val_conditions:
      - column_name: V2_motion
        lower_bound: -1e10
        upper_bound: 1e10
      - column_name: V2_presence
        lower_bound: -1e10
        upper_bound: 1e10
I am experimenting with different parameter styles: dictionaries of dictionaries, dictionaries of lists, etc. So the two questions are as follows:
1. How do I pass second- or third-level dictionary parameters to a node? For example, how do I pass the value of `column_rename_params['co2']` to one node and the value of `column_rename_params['motion']` to another? My attempt at passing inputs to a node as `inputs=['co2_processed', 'params:column_rename_params:co2', 'params:column_rename_params:skip_cols']` returned a "not found in the DataCatalog" error. Do I need to define these parameters in `catalog.yml`? Since the parameters are not defined in catalog.yml, yet I can still access the `params:column_rename_params` dictionary, I guess there must be a way to access the next level as well. As a workaround, I have simplified the dictionary, keeping everything at the base level (no nested dictionaries).
2. Curiosity: Why do we write `params:<key>` instead of `parameters:<key>`? Just curious, because I do not remember defining any object as 'params'; I was just following the tutorial.
Thanks ahead, and also thanks for Kedro and this Slack workspace.
d
So this is very possible, but I would argue it's pretty dangerous to do in Kedro's current state with no native parameter validation. To answer your question: you use the dot syntax to access these nested attributes. Also check out the OmegaConf resolver section of the docs for extra power here. I have this open issue which proposes native Pydantic support; if you were to comment your thoughts there, it would be helpful.
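A minimal sketch of what the dot syntax looks like against the `parameters_data_processing.yml` shown above (the `rename_columns` function, the `co2_renamed` output, and the node name are made up for illustration):

```python
from kedro.pipeline import Pipeline, node, pipeline

from .nodes import rename_columns  # hypothetical node function


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline([
        node(
            func=rename_columns,
            inputs=[
                "co2_processed",
                # dot syntax reaches into the nested parameter dictionary
                "params:column_rename_params.co2",
                "params:column_rename_params.skip_cols",
            ],
            outputs="co2_renamed",
            name="rename_co2_columns_node",
        ),
    ])
```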
πŸ‘ 1
m
> Curiosity: Why do we write `params:<key>` instead of `parameters:<key>`? Just curious because I do not remember defining any object as 'params'. I was just following the tutorial.
The unofficial answer is that it's just always been this way 😄 When you want to use all parameters you just reference `parameters`, and otherwise `params:<key>`. My guess is this might be just for convenience, since `params` is shorter to write than `parameters`.
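A tiny sketch to illustrate the difference (the two reporting functions here are made up for the example):

```python
from kedro.pipeline import node


def report_all_params(params: dict) -> None:
    print(sorted(params))  # every top-level parameter key


def report_one_param(rename_params: dict) -> None:
    print(rename_params["co2"])  # just the column_rename_params dict


# "parameters" injects the full parameters dictionary into the node...
all_params_node = node(report_all_params, inputs="parameters", outputs=None)

# ...while "params:<key>" injects only the value stored under that key.
one_param_node = node(report_one_param, inputs="params:column_rename_params", outputs=None)
```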
💡 1
s
@datajoely Thanks, checking the page. Seems like your proposal is more exhaustive. @Merel That actually makes sense, because initially I had written `parameters` and it was still working. Then, following the tutorial, I wrote `params:xxx` and it worked as well.
πŸ‘ 1
The parameters issue is now becoming critical, as I am not able to perform the split. I tried simplifying the parameters, but it seems I am missing something. I even tried passing the whole dictionary by using `parameters`. My parameters.yml now has two options:
split_params:
  test_size: 0.2
  random_state: 42
  features:
    - V2_presence
    - V2_motion
  target: 
    - V17_co2 
split_param_features:
  - V2_presence
  - V2_motion
split_param_target:
  - V17_co2
split_param_test_size: 0.2
split_param_random_state: 42
pipeline.py contains:
from kedro.pipeline import Pipeline, node, pipeline
from .nodes import split_data

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline([
        node(
            func=split_data,
            inputs=["presence_motion_co2_combined", "params:split_param_features", "params:split_param_target", "params:split_param_test_size", "params:split_param_random_state"],
            outputs=["X_train", "X_test", "y_train", "y_test"],
            name="split_data_node",
        ),
    ])
and the nodes.py contains:
import typing as t
import pandas as pd
from sklearn.model_selection import train_test_split

def split_data(df: pd.DataFrame, features, target, test_size, random_state) -> t.Tuple:
    X = df[features]
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    return X_train, X_test, y_train, y_test
Yet I am getting an error while running the pipeline:
KeyError: "None of [Index(['V2_presence', 'V2_motion'], dtype='object')] are in the [columns]"
The parquet file generated in the data_processing pipeline has (or should have) these columns. Is there a way to run the pipeline in debug mode so that I can check the exact DataFrame being passed?
Updates: ✅ The above issue is resolved. It was a bug in the code: redundant concatenation of a suffix string caused unexpected column names. The logger.info() strings helped debug it. I would still like to know if there is a better debug method for Kedro pipelines. ✅ The parameters issue is also resolved, by passing the higher-level dictionary in the node's inputs and unfolding it inside the function. Thank you both!
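For reference, a minimal sketch of that workaround, reusing the `split_params` block and dataset names from the snippets above (the exact shape of the final code is an assumption):

```python
import typing as t
import pandas as pd
from sklearn.model_selection import train_test_split
from kedro.pipeline import node


def split_data(df: pd.DataFrame, split_params: dict) -> t.Tuple:
    # The whole "params:split_params" dictionary arrives as a single input
    # and is unfolded inside the function.
    X = df[split_params["features"]]
    y = df[split_params["target"]]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=split_params["test_size"],
        random_state=split_params["random_state"],
    )
    return X_train, X_test, y_train, y_test


split_data_node = node(
    func=split_data,
    inputs=["presence_motion_co2_combined", "params:split_params"],
    outputs=["X_train", "X_test", "y_train", "y_test"],
    name="split_data_node",
)
```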
🥳 1
d
We're going to discuss this as a Kedro team next week, so if you have any capacity to add your thoughts to the GH issue it would be really useful.
πŸ‘ 1
s
Yes, I will. I'm finishing my first trial project and will share all the feedback there, so whatever is useful/relevant/applicable can be considered for the next Kedro update.