# random
d
Hi, friends! Was writing something, and just wanted to get people's views on the feature engineering process--see if it aligns with what I've seen in the past (mostly on Kedro projects, but the concepts should be framework-agnostic). Let's say my team is doing demand prediction for grocery stores (per item, per location). We'd first engineer a "Spine" dataset (consisting of the label/target column, as well as the timestamp and identifying keys). Over time, we'd create a bunch of feature dataframes, each consisting of the timestamp, identifying key(s) for the feature or set of features, and feature value(s). Whether we had 2 features or 200, the process of constructing the master (or model input) table always looked like a series of joins; e.g. in the attached example:
model_input = (
    spine
    .join(number_of_visitors, how="left", on="store_id")
    .join(price_per_pound, how="left", on=["store_id", "item_id"])
)
Does this process resonate with your experience, or does it look different in your organization? Does the terminology also resonate? E.g.
• spine dataframe vs target dataframe vs something else?
• unit of analysis columns vs something else?
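For anyone who wants to run the join pattern above in plain pandas, here is a minimal sketch with made-up toy data. The values and the `units_sold` label name are assumptions for illustration, and `date` is added to the join keys on the assumption that every feature table carries the timestamp as described. One pandas-specific caveat: `DataFrame.join(other, on=...)` matches the caller's columns against the other frame's index, so when the keys are ordinary columns on both sides, `merge` is the closer equivalent.

```python
import pandas as pd

# Spine: timestamp + unit-of-analysis keys + label (toy values, assumed).
spine = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-01"],
    "store_id": ["s1", "s1", "s2"],
    "item_id": ["i1", "i2", "i1"],
    "units_sold": [10, 4, 7],
})

# Store-level feature table: keyed by date + store_id.
number_of_visitors = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01"],
    "store_id": ["s1", "s2"],
    "number_of_visitors": [120, 80],
})

# Item-level feature table: keyed by date + store_id + item_id.
# There is deliberately no price for (s1, i2).
price_per_pound = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01"],
    "store_id": ["s1", "s2"],
    "item_id": ["i1", "i1"],
    "price_per_pound": [1.99, 2.10],
})

# Left joins keep every spine row, whatever the feature tables contain.
model_input = (
    spine
    .merge(number_of_visitors, how="left", on=["date", "store_id"])
    .merge(price_per_pound, how="left", on=["date", "store_id", "item_id"])
)
```

Because every join is a left join from the spine, the row count of `model_input` matches the spine, and a missing feature value (the price for `s1`/`i2` here) comes through as NaN rather than dropping the row.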
a
This all resonates very strongly with me, but I guess that’s not very surprising because it seems very popular at QB… Is there a reason we shouldn’t just refer to the spine as “primary key(s)” or “index” or something like that though? In pandas I would even make it a (possibly multidimensional) index.
🙏 1
d
@Antony Milne Would not the timestamp + unit of analysis columns (date + item_id + store_id in this example) be the index? I totally agree with making those into a multi-index, and same with making date + join key for each feature into a multi-index, because then your syntax is simplified to spine.join(number_of_visitors).join(price_per_pound). I'd say the label shouldn't be part of the index though (I think?).
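A minimal pandas sketch of the multi-index variant, using made-up toy data (the values and the `units_sold` label name are assumptions). It relies on pandas joining two MultiIndexed frames on their overlapping index level names, so the store-level table only needs `date` and `store_id` in its index. The label stays a regular column.

```python
import pandas as pd

# Spine indexed by timestamp + unit-of-analysis keys; the label is a column.
spine = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-01"],
    "store_id": ["s1", "s1", "s2"],
    "item_id": ["i1", "i2", "i1"],
    "units_sold": [10, 4, 7],
}).set_index(["date", "store_id", "item_id"])

# Store-level feature: indexed by date + its join key.
number_of_visitors = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01"],
    "store_id": ["s1", "s2"],
    "number_of_visitors": [120, 80],
}).set_index(["date", "store_id"])

# Item-level feature: indexed by the same levels as the spine.
price_per_pound = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-01"],
    "store_id": ["s1", "s1", "s2"],
    "item_id": ["i1", "i2", "i1"],
    "price_per_pound": [1.99, 0.50, 2.10],
}).set_index(["date", "store_id", "item_id"])

# join() defaults to how="left" and matches on the shared index level names.
model_input = spine.join(number_of_visitors).join(price_per_pound)
```

With the keys in the index, no `on=` or `how=` arguments are needed at the call site, at the cost of having to keep index level names consistent across every feature table.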
👍 1
a
ok yeah, agree with all that!
p
Resonates, except maybe for the case where performance is an issue, in which case creating multiple feature tables independently and then joining them is suboptimal
👍 1
🙏 1
m
The overall process seems pretty reasonable to me; this is how it usually looks in our projects. As for the terminology, I personally haven’t come across the “spine” dataframe before. We just call them feature tables, and the final stage is the training/testing/validation dataset (the one you have after the joins).
🙏 1
d
@marrrcin Thanks! Quick clarification--is your target variable column also just another feature table then?
m
Pretty much. This is the first time I've seen the nomenclature of a “spine” dataframe 🤷🏻‍♂️
👍 1
The target variable is just one of the columns from the multiple potential data sources (feature tables) we join
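A minimal sketch of that view in pandas, where the target lives in an ordinary feature table and the final dataset is a fold over a list of tables. The helper `join_on_shared_keys`, the `KEY_COLUMNS` list, the toy values, and the choice of a left join starting from the first table are all assumptions for illustration, not a description of any particular project.

```python
from functools import reduce

import pandas as pd

# Assumed names of the timestamp / identifying key columns.
KEY_COLUMNS = ["date", "store_id", "item_id"]


def join_on_shared_keys(left, right):
    """Left-join two tables on whichever key columns both of them have."""
    shared = [k for k in KEY_COLUMNS if k in left.columns and k in right.columns]
    return left.merge(right, how="left", on=shared)


# The target is just another feature table (toy values, assumed).
units_sold = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-01"],
    "store_id": ["s1", "s1", "s2"],
    "item_id": ["i1", "i2", "i1"],
    "units_sold": [10, 4, 7],
})
number_of_visitors = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01"],
    "store_id": ["s1", "s2"],
    "number_of_visitors": [120, 80],
})
price_per_pound = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-01"],
    "store_id": ["s1", "s1", "s2"],
    "item_id": ["i1", "i2", "i1"],
    "price_per_pound": [1.99, 0.50, 2.10],
})

# The same code handles 2 tables or 200; only the list changes.
feature_tables = [units_sold, number_of_visitors, price_per_pound]
training_dataset = reduce(join_on_shared_keys, feature_tables)
```

In this sketch, whichever table comes first in the list determines the rows of the result, so the table holding the target ends up playing the same role as the spine even though it is not named as one.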
👍 1