Hi friends Was writing something and just wanted to get peop Kedro #random

Deepyaman Datta

01/18/2023, 3:17 PM

Hi, friends! Was writing something, and just wanted to get people's views on the feature engineering process--see if it aligns with what I've seen in the past (mostly on Kedro projects, but the concepts should be framework-agnostic). Let's say my team is doing demand prediction for grocery stores (per item, per location). We'd first engineer a "Spine" dataset (consisting of the label/target column, as well as the timestamp and identifying keys). Over time, we'd create a bunch of feature dataframes, each consisting of the timestamp, identifying key(s) for the feature or set of features, and feature value(s). Whether we had 2 features or 200, the process of constructing the master (or model input) table always looked like a series of joins; e.g. in the attached example:

model_input = (
    spine
    .join(number_of_visitors, how="left", on="store_id")
    .join(price_per_pound, how="left", on=["store_id", "item_id"])
)

Does this process resonate with your experience, or does it look different in your organization? Does the terminology also resonate? E.g.
• spine dataframe vs target dataframe vs something else?
• unit of analysis columns vs something else?
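For readers who want to run the join pattern described above, here is a minimal pandas sketch. The toy values, the `units_sold` label column, and the choice to include `date` in every join key are assumptions made for illustration; they are not taken from the attached example. It uses `merge` rather than `join` because the keys here are ordinary columns, not an index.

```python
import pandas as pd

# Spine: timestamp + identifying keys + label (units_sold is a hypothetical label name).
spine = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-01", "2023-01-02"]),
    "store_id": [1, 1, 2],
    "item_id": [10, 11, 10],
    "units_sold": [5, 3, 7],
})

# Store-level feature: keyed by date and store_id.
number_of_visitors = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-02"]),
    "store_id": [1, 2],
    "number_of_visitors": [120, 95],
})

# Item-level feature: keyed by date, store_id and item_id.
price_per_pound = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-01"]),
    "store_id": [1, 1],
    "item_id": [10, 11],
    "price_per_pound": [1.99, 0.79],
})

# Left joins keep every spine row; validate= raises if a feature table
# has duplicate keys, which would otherwise silently multiply spine rows.
model_input = (
    spine
    .merge(number_of_visitors, how="left", on=["date", "store_id"],
           validate="many_to_one")
    .merge(price_per_pound, how="left", on=["date", "store_id", "item_id"],
           validate="many_to_one")
)
```

The row count of `model_input` matches the spine, and any spine row without a matching feature value ends up with a missing value in that feature column.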

Antony Milne

01/18/2023, 3:31 PM

This all resonates very strongly with me, but I guess that’s not very surprising because it seems very popular at QB… Is there a reason we shouldn’t just refer to the spine as “primary key(s)” or “index” or something like that though? In pandas I would even make it a (possibly multidimensional) index.

🙏 1

Deepyaman Datta

01/18/2023, 3:35 PM

@Antony Milne Would not the timestamp + unit of analysis columns (date, item_id, store_id in this example) be the index? I totally agree with making those into a multi-index, and same with making date + join key for each feature into a multi-index, because then your syntax is simplified to spine.join(number_of_visitors).join(price_per_pound). I'd say the label shouldn't be part of the index though (I think?).
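A rough sketch of the multi-index variant discussed here, assuming pandas and the same kind of toy data (the values and the `units_sold` label name are made up for illustration). Each frame is indexed by its timestamp plus join key(s), and the label stays a regular column. It relies on pandas joining MultiIndex frames on their overlapping level names.

```python
import pandas as pd

# Spine indexed by timestamp + unit of analysis; the label remains a column.
spine = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-01", "2023-01-02"]),
    "store_id": [1, 1, 2],
    "item_id": [10, 11, 10],
    "units_sold": [5, 3, 7],  # hypothetical label column
}).set_index(["date", "store_id", "item_id"])

# Store-level feature: index levels are a subset of the spine's levels.
number_of_visitors = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-02"]),
    "store_id": [1, 2],
    "number_of_visitors": [120, 95],
}).set_index(["date", "store_id"])

# Item-level feature: same index levels as the spine.
price_per_pound = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-01"]),
    "store_id": [1, 1],
    "item_id": [10, 11],
    "price_per_pound": [1.99, 0.79],
}).set_index(["date", "store_id", "item_id"])

# DataFrame.join defaults to a left join on the index, so no how= or on=
# arguments are needed once the keys live in the index.
model_input = spine.join(number_of_visitors).join(price_per_pound)
```

One trade-off with this form: the join keys are implicit in each frame's index, so a feature table indexed on the wrong levels fails (or joins differently) without the key list being visible at the call site.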

👍 1

Antony Milne

01/18/2023, 4:07 PM

ok yeah, agree with all that!

Pedro Abreu

01/19/2023, 10:12 AM

Resonates, except maybe for the case where performance is an issue, in which case creating multiple feature tables independently and then joining them is suboptimal

👍 1

🙏 1

marrrcin

01/19/2023, 10:21 AM

The overall process seems pretty reasonable to me, this is how it usually looks in our projects. As for the terminology - I personally haven’t seen the “spine” dataframe, we just call them feature tables, and the final stage is training/testing/validation dataset (the one you have after joins).

🙏 1

Deepyaman Datta

01/19/2023, 12:12 PM

@marrrcin Thanks! Quick clarification--is your target variable column also just another feature table then?

marrrcin

01/19/2023, 12:30 PM

Pretty much. This is the first time I see the nomenclature of a “spine” dataframe 🤷🏻‍♂️

👍 1

marrrcin

01/19/2023, 12:31 PM

The target variable is just one of the columns from the multiple potential data sources (feature tables) we join

👍 1
