Shah
10/02/2025, 10:49 AM
parameters_data_processing.yml file:
column_rename_params: # Suffix to be added to overlapping columns
  skip_cols: ['Date'] # Columns to skip while renaming
  co2: '_co2'
  motion: '_motion'
  presence: '_presence'
data_clean_params:
  V2_motion: {
    condition: '<0',
    new_val: 0
  }
  V2_presence: {
    condition: '<0',
    new_val: 0
  }
infinite_values:
  infinite_val_remove: true
  infinite_val_conditions:
    - column_name: V2_motion
      lower_bound: -1e10
      upper_bound: 1e10
    - column_name: V2_presence
      lower_bound: -1e10
      upper_bound: 1e10
I am experimenting with different parameter styles: dictionaries of dictionaries, dictionaries of lists, etc. So my two questions are as follows:
1. How do I pass second- or third-level dictionary parameters to a node? E.g., how do I pass the value of column_rename_params['co2'] to one node and the value of column_rename_params['motion'] to another? My attempt at passing inputs to a node as inputs=['co2_processed', 'params:column_rename_params:co2', 'params:column_rename_params:skip_cols'] returned a "not found in the DataCatalog" error. Do I need to define these parameters in catalog.yml? Since the parameters are not defined in catalog.yml, yet I can still access the "params:column_rename_params" dictionary, I guess there must be a way to access the next level as well. As a workaround, I have flattened the dictionary, keeping everything at the base level (no nested dictionaries).
2. Curiosity: Why do we write 'params:<key>' instead of 'parameters:<key>'? Just curious, because I do not remember having defined any object as 'params'. I was just following the tutorial.
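(Editor's note on question 1: Kedro resolves nested parameters with dot syntax after the params: prefix, e.g. params:column_rename_params.co2, not with a second colon, and no catalog.yml entry is needed. A pure-Python sketch of that resolution rule follows; it is an illustration of the lookup behaviour, not Kedro's actual implementation.)

```python
# Sketch of how a parameter reference can resolve against the parameters
# dict above. Assumptions: "parameters" returns the whole dict,
# "params:<key>" one top-level entry, and dots walk into nested keys
# ("params:a:b" style, as tried above, is not a valid reference).
parameters = {
    "column_rename_params": {
        "skip_cols": ["Date"],
        "co2": "_co2",
        "motion": "_motion",
    }
}

def resolve(ref, params):
    """Resolve 'parameters' or a 'params:a.b.c' style reference."""
    if ref == "parameters":
        return params
    value = params
    for key in ref.removeprefix("params:").split("."):
        value = value[key]
    return value

print(resolve("params:column_rename_params.co2", parameters))        # -> _co2
print(resolve("params:column_rename_params.skip_cols", parameters))  # -> ['Date']
```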
Thanks in advance, and also thanks for Kedro and this Slack workspace.
datajoely
10/02/2025, 11:17 AM
datajoely
10/02/2025, 11:18 AM
Merel
10/02/2025, 11:26 AM
> Curiosity: Why do we write 'params:<key>' instead of 'parameters:<key>'? Just curious because I do not remember having defined any object as 'params'. I was just following the tutorial.
The unofficial answer is that it's just always been this way 🙂 When you want to use all parameters you just reference parameters, and otherwise params:<key>. My guess is this is just for convenience, since params is shorter to write than parameters.
Shah
10/02/2025, 11:44 AM
parameters and it was still working. Then, following the tutorial, I wrote params:xxx and it worked as well.
Shah
10/02/2025, 11:51 AM
parameters.
My parameters.yml is now with two options:
split_params:
  test_size: 0.2
  random_state: 42
  features:
    - V2_presence
    - V2_motion
  target:
    - V17_co2

split_param_features:
  - V2_presence
  - V2_motion
split_param_target:
  - V17_co2
split_param_test_size: 0.2
split_param_random_state: 42
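(Editor's note: if the nested split_params style is kept, the same values could be passed to the node with dot syntax, assuming the Kedro version in use supports dot-separated access to nested parameters. The order mirrors the split_data signature shown further down.)

```python
# Hypothetical node inputs using the nested split_params block via dot
# syntax (assumption: dot-separated nested parameter access is available).
inputs = [
    "presence_motion_co2_combined",      # the combined dataframe
    "params:split_params.features",      # ['V2_presence', 'V2_motion']
    "params:split_params.target",        # ['V17_co2']
    "params:split_params.test_size",     # 0.2
    "params:split_params.random_state",  # 42
]
print(len(inputs))  # -> 5
```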
pipeline.py contains:
def create_pipeline(**kwargs) -> Pipeline:
    return pipeline([
        node(
            func=split_data,
            inputs=[
                "presence_motion_co2_combined",
                "params:split_param_features",
                "params:split_param_target",
                "params:split_param_test_size",
                "params:split_param_random_state",
            ],
            outputs=["X_train", "X_test", "y_train", "y_test"],
            name="split_data_node",
        ),
    ])
and the nodes.py contains:
import typing as t

import pandas as pd
from sklearn.model_selection import train_test_split


def split_data(df: pd.DataFrame, features, target, test_size, random_state) -> t.Tuple:
    X = df[features]
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    return X_train, X_test, y_train, y_test
Yet, I am getting an error while running the pipeline:
KeyError: "None of [Index(['V2_presence', 'V2_motion'], dtype='object')] are in the [columns]"
The parquet file generated in the data_processing pipeline has (or at least should have) these columns. Is there a way to run the pipeline in debug mode so that I can inspect the exact dataframe being passed?
Shah
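(Editor's note: one way to answer the debug question without a full debugger is a fail-fast check inside the node, so the KeyError becomes an actionable message. A minimal pandas-free sketch; missing_features is a hypothetical helper, not part of Kedro or pandas.)

```python
def missing_features(df_columns, features):
    """Return the requested feature columns absent from the dataframe's columns."""
    available = set(df_columns)
    return [c for c in features if c not in available]

# Inside split_data you might guard the selection like:
#   missing = missing_features(df.columns, features)
#   if missing:
#       raise ValueError(f"missing columns: {missing}; have: {list(df.columns)}")
# For interactive inspection, a plain breakpoint() call at the top of the
# node drops into pdb while `kedro run` executes, with df in scope.

print(missing_features(["V17_co2", "Date"], ["V2_presence", "V2_motion"]))
# -> ['V2_presence', 'V2_motion']
```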
10/02/2025, 2:41 PM
datajoely
10/03/2025, 1:14 PM
Shah
10/03/2025, 2:01 PM