# questions
Hi everyone, I need help and advice with an architecture issue! I want to run an ML pipeline in Kedro for fraud detection. I have several datasets, and they vary in many aspects: some are split into train+validation+test while others aren't; the feature names vary; some only have numeric features while others also have text/categorical variables; feature engineering will likely differ; and the target variable is always binary, but its name changes. I'm thinking about having a "sidecar" YAML file with information for each dataset, but I have no idea how to bring that into Kedro, and it might not be a good approach anyway. After feature engineering, I envision running exactly the same code for every dataset. Any suggestions on how to tackle this with Kedro?
so we have a flexible metadata field available in the catalog that you can use for any purpose and access programmatically in hooks and other parts of the run lifecycle
is that what you’re looking for?
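A minimal sketch of how that could look, assuming each catalog entry carries a `metadata` block and a hook reads it from the raw catalog configuration that Kedro passes to `after_catalog_created`. The class name, dataset names, file paths, and metadata keys (`target`, `pre_split`) are hypothetical, and the `try`/`except` is only there so the snippet also runs without Kedro installed.

```python
try:
    from kedro.framework.hooks import hook_impl
except ImportError:  # fallback so the sketch runs without Kedro installed
    def hook_impl(func):
        return func


class DatasetMetadataHooks:
    """Collects the `metadata` block of every catalog entry (hypothetical helper)."""

    def __init__(self):
        self.metadata_by_dataset = {}

    @hook_impl
    def after_catalog_created(self, conf_catalog):
        # conf_catalog is the parsed catalog configuration, keyed by dataset name.
        for name, entry in conf_catalog.items():
            if isinstance(entry, dict) and "metadata" in entry:
                self.metadata_by_dataset[name] = entry["metadata"]


# Hypothetical catalog configuration, as it might look once catalog.yml is parsed.
example_conf_catalog = {
    "fraud_a_raw": {
        "type": "pandas.CSVDataset",
        "filepath": "data/01_raw/fraud_a.csv",
        "metadata": {"target": "is_fraud", "pre_split": False},
    },
    "fraud_b_raw": {
        "type": "pandas.ParquetDataset",
        "filepath": "data/01_raw/fraud_b.parquet",
        "metadata": {"target": "label", "pre_split": True},
    },
    "model_output": {
        "type": "pickle.PickleDataset",
        "filepath": "data/06_models/model.pkl",
    },
}

# Calling the hook by hand here; in a project, Kedro would call it during a run.
hooks = DatasetMetadataHooks()
hooks.after_catalog_created(conf_catalog=example_conf_catalog)
print(hooks.metadata_by_dataset["fraud_b_raw"]["target"])
```

In a real project the hook instance would be registered through the `HOOKS` tuple in `settings.py`, and Kedro itself ignores whatever is stored under `metadata`.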
I'm not 100% sure what problem you're looking to solve. It sounds like you can reuse the post-feature-engineering pipeline (whatever you want to call it). For feature engineering, how different are the datasets? If the general process is pretty similar, you can have e.g. an encoding node in your pipeline that accepts an argument with the list of features to encode. The normal approach would be to pass that list as parameters (instead of inventing a sidecar YAML construct). For example, you may have namespaced parameters:
  - col_a
  - col_d
That list will be used for the corresponding dataset in the modular pipeline instance with the matching namespace. In this approach, if a dataset doesn't have text/categorical columns, the encoding node will just be passed an empty list, and the logic will be robust enough to essentially perform a no-op there.
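As a rough illustration of the no-op behaviour described above, here is a plain-Python sketch of such a node function. The namespace names (`dataset_a`, `dataset_b`), the parameter key `categorical_features`, and the function name are assumptions made for the example; only `col_a` and `col_d` come from the snippet above. Rows are modelled as dictionaries to keep the example dependency-free, whereas a real node would more likely receive a DataFrame.

```python
# Hypothetical namespaced parameters, mirroring what parameters.yml might hold.
parameters = {
    "dataset_a": {"categorical_features": ["col_a", "col_d"]},
    "dataset_b": {"categorical_features": []},  # purely numeric dataset
}


def encode_categoricals(rows, categorical_features):
    """One-hot encode the listed columns; with an empty list, return rows unchanged."""
    if not categorical_features:
        return rows  # no-op for datasets without text/categorical columns
    # Collect the distinct values seen in each categorical column.
    levels = {
        col: sorted({row[col] for row in rows}) for col in categorical_features
    }
    encoded = []
    for row in rows:
        # Keep every non-categorical column as-is.
        new_row = {k: v for k, v in row.items() if k not in categorical_features}
        # Add one indicator column per (column, value) pair.
        for col, values in levels.items():
            for value in values:
                new_row[f"{col}={value}"] = int(row[col] == value)
        encoded.append(new_row)
    return encoded


rows_a = [
    {"col_a": "web", "col_d": "x", "amount": 10.0},
    {"col_a": "app", "col_d": "x", "amount": 5.0},
]
rows_b = [{"amount": 3.0}, {"amount": 7.0}]

encoded_a = encode_categoricals(rows_a, parameters["dataset_a"]["categorical_features"])
encoded_b = encode_categoricals(rows_b, parameters["dataset_b"]["categorical_features"])
```

The same function serves every dataset, so only the parameters differ between pipeline instances.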
Thank you @datajoely, I'm new to Kedro; I'll check the metadata attribute and also how hooks work.
Hi @Deepyaman Datta, let me try to give some further context. Indeed, the modeling part is not a problem. I'm not really interested in the predictions I'll be making, but in comparing techniques. For instance, I can use StandardScaler or MinMaxScaler for scaling numeric data. Maybe one of them is usually better than the other, so I want to try both with several datasets. For each dataset, I need a list of the variables that are metric and can be scaled. There are several other considerations: some datasets already provide separate train/validation/test sets while others require a split, and file formats also vary. If I understood correctly, you suggest that I should use a namespace for each dataset. That seems to make a lot of sense, indeed. And every configuration would be inside the catalog.yaml file, right? I'll have to explore, I'm quite new to Kedro... Tks! Samuel
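One way the scaler comparison could be parameterised, sketched in plain Python: the scaler is chosen by name and the list of numeric columns is supplied per dataset. The two functions are simplified stand-ins for scikit-learn's StandardScaler and MinMaxScaler, not the library classes themselves, and the names `SCALERS`, `scale_numeric`, `amount`, and `is_fraud` are hypothetical.

```python
from statistics import fmean, pstdev


def standard_scale(values):
    """Centre on the mean and divide by the population standard deviation."""
    mean, std = fmean(values), pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mean) / std for v in values]


def min_max_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


# The scaler name would come from parameters, so each run can pick a technique.
SCALERS = {"standard": standard_scale, "min_max": min_max_scale}


def scale_numeric(rows, numeric_features, scaler):
    """Scale only the listed numeric columns, leaving all other columns untouched."""
    scale = SCALERS[scaler]
    scaled_rows = [dict(row) for row in rows]  # copy so the input is not mutated
    for col in numeric_features:
        scaled_column = scale([row[col] for row in rows])
        for new_row, value in zip(scaled_rows, scaled_column):
            new_row[col] = value
    return scaled_rows


rows = [
    {"amount": 0.0, "is_fraud": 0},
    {"amount": 5.0, "is_fraud": 1},
    {"amount": 10.0, "is_fraud": 0},
]
min_max_rows = scale_numeric(rows, ["amount"], "min_max")
standard_rows = scale_numeric(rows, ["amount"], "standard")
```

With one namespace per dataset and technique, both `numeric_features` and `scaler` could live in that namespace's parameters, so the comparison becomes a matter of configuration rather than code changes.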