# questions
s
Hi everyone, I need help and advice with an architecture issue! I want to run an ML pipeline in Kedro for fraud detection. I have several datasets, and they vary in many aspects (some are split into train+validation+test, others aren't; the feature names vary; some only have numeric features while others also have text/categorical variables; feature engineering will likely differ; the target variable is always binary but its name changes). I'm thinking about having a "sidecar" YAML file with information for each dataset, but I have no idea how to bring that into Kedro. And this might not be a good approach. After feature engineering, I envision that exactly the same code can run for every dataset. Any suggestions on how to tackle this with Kedro?
d
so we have a flexible `metadata` field available in the catalog that you can use for any purpose and access programmatically in hooks and other parts of the run lifecycle
is that what you’re looking for?
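A minimal sketch of that idea, assuming a hook that only needs the raw catalog configuration. The class name `DatasetMetadataHooks`, the dataset names, and the metadata keys (`target_column`, `has_predefined_split`) are all made up for illustration; the dict stands in for what a `catalog.yml` with `metadata:` blocks would load as.

```python
# Sketch only: collect each catalog entry's `metadata` block in a hook.
try:
    from kedro.framework.hooks import hook_impl
except ImportError:  # lets the sketch run without Kedro installed
    def hook_impl(func):
        return func


class DatasetMetadataHooks:
    """Hypothetical hook class; in a real project it would be listed in HOOKS in settings.py."""

    def __init__(self):
        self.metadata_by_dataset = {}

    @hook_impl
    def after_catalog_created(self, conf_catalog):
        # conf_catalog is the raw catalog configuration: dataset name -> entry dict.
        for name, entry in conf_catalog.items():
            if isinstance(entry, dict) and "metadata" in entry:
                self.metadata_by_dataset[name] = entry["metadata"]


# Python equivalent of a small catalog.yml (names and keys are assumptions).
example_conf_catalog = {
    "fraud_a.raw": {
        "type": "pandas.CSVDataset",
        "filepath": "data/01_raw/fraud_a.csv",
        "metadata": {"target_column": "is_fraud", "has_predefined_split": False},
    },
    "fraud_b.raw": {
        "type": "pandas.CSVDataset",
        "filepath": "data/01_raw/fraud_b.csv",
    },
}

hooks = DatasetMetadataHooks()
hooks.after_catalog_created(conf_catalog=example_conf_catalog)
print(hooks.metadata_by_dataset)
```

Calling the method directly like this is just for demonstration; in a project, Kedro would invoke the hook itself once the catalog is created.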
d
I'm not 100% sure what the problem you're looking to solve is. It sounds like you can reuse the `modeling` pipeline (or whatever you want to call the post-feature-engineering pipeline). For feature engineering, how different are they? If the general process is pretty similar, you can have e.g. an `encode_categorical_features` node in your pipeline, and it can accept an argument with the list of features to encode. The normal approach would be to pass that list as parameters (instead of inventing a sidecar YAML construct). For example, you may have namespaced parameters:
whatever.categorical_columns:
  - col_a
  - col_d
That will be used for the corresponding dataset `whatever.joined_data` in the modular pipeline instance with namespace `whatever`. In this approach, if a dataset doesn't have text/categorical columns, the `encode_categorical_features` node will just be passed an empty list for `categorical_columns`, and the logic will be robust enough to essentially perform a no-op there.
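As a rough sketch of what such a node function could look like, assuming pandas DataFrames and one-hot encoding via `pd.get_dummies` (the thread doesn't say which encoding is intended, so that choice is an assumption, and the `amount` column and sample values are made up):

```python
import pandas as pd


def encode_categorical_features(df: pd.DataFrame, categorical_columns: list) -> pd.DataFrame:
    """One-hot encode the listed columns; an empty list is a no-op."""
    if not categorical_columns:
        # Dataset has no text/categorical features: return the data unchanged.
        return df.copy()
    return pd.get_dummies(df, columns=list(categorical_columns))


# Illustrative data; col_a and col_d match the parameter example above.
example = pd.DataFrame(
    {
        "amount": [10.0, 25.5, 3.2],
        "col_a": ["x", "y", "x"],
        "col_d": ["p", "p", "q"],
    }
)

encoded = encode_categorical_features(example, ["col_a", "col_d"])
unchanged = encode_categorical_features(example, [])
print(list(encoded.columns))
```

In the pipeline, the second argument would come from the namespaced parameter (e.g. `params:categorical_columns`, resolved per namespace), so the same function serves every dataset.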
s
Thank you @datajoely, I'm new to Kedro; I'll check the metadata attribute and also how hooks work.
Hi @Deepyaman Datta, let me try to give some further context. Indeed, the modeling part is not a problem. I'm not really interested in the predictions I'll be making, but in comparing techniques. For instance, I can use StandardScaler or MinMaxScaler for scaling numeric data. Maybe one of them is usually better than the other, so I want to try both with several datasets. For each dataset, I need to have a list of the variables that are metric and can be scaled. There are several other considerations: some datasets already provide separate train/validation/test sets but others require a split; file formats also vary. If I understood correctly, you suggest that I should use a namespace for each dataset. That seems to make a lot of sense, indeed. And every configuration would be inside the catalog.yaml file, right? I'll have to explore, I'm quite new to Kedro... Tks! Samuel