Iñigo Hidalgo
04/16/2024, 11:57 AM
> have you considered some of the open source ones, like Feast or Hopsworks?
I know the team looked at them at some point but discarded them for one reason or another. I wasn't involved in the discussions, so I can't really speak to why (I assume it came down to the maintenance required of self-managed options vs. just buying).
Iñigo Hidalgo
04/16/2024, 12:00 PM
> • Do you need discovery in your Kedro flow, or would you go in knowing which features you need?
Nope, I would go in knowing which features we need. So it would be more a question of where we define the needed aggregations, joins, etc. I don't even fully know yet what the requirements would be to set up a feature store flow.
> • What would be your dream flow?
Being able to define all the dataset generation stuff (e.g. filter window, joins, primary keys, aggregations) through config, then managing all the actual model training within a Kedro pipeline.
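A minimal sketch of what that kind of config-driven spec could look like; every key below is invented for illustration and is not a real Kedro or feature-store schema:
```python
# Hypothetical, illustration-only spec for config-driven dataset generation.
# None of these keys come from a real library; they mirror the wish list above.
feature_dataset_spec = {
    "source_table": "catalog.schema.transactions",  # made-up table name
    "primary_keys": ["customer_id", "as_of_date"],
    "filter_window": {"start": "2023-01-01", "end": "2024-01-01"},
    "joins": [
        {"table": "catalog.schema.customers", "on": ["customer_id"]},
    ],
    "aggregations": [
        {"column": "amount", "agg": "sum", "window": "7d"},
        {"column": "amount", "agg": "mean", "window": "30d"},
    ],
}
```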
Deepyaman Datta
04/16/2024, 1:34 PM
> Without even really knowing what a feature store actually does, I would expect that populating the batch features would happen offline, in another kedro pipeline which wouldn't even necessarily know about a feature store, it would just do data transformations and save to a parquet table.
Yes, this is more or less correct for things like Feast and, to my understanding, Databricks. If using Kedro, you can probably define a dataset for the write into the feature store, which would create or retrieve a FeatureEngineeringClient, write the table with the provided name (to Unity Catalog?), and save it to the store.
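A hedged sketch of what such a writer dataset could look like, assuming the databricks-feature-engineering package and Kedro's AbstractDataset interface; the table and key names are placeholders:
```python
from databricks.feature_engineering import FeatureEngineeringClient
from kedro.io import AbstractDataset


class FeatureTableDataset(AbstractDataset):
    """Sketch: writes a Spark DataFrame into a Unity Catalog feature table."""

    def __init__(self, table_name: str, primary_keys: list[str]):
        self._table_name = table_name
        self._primary_keys = primary_keys
        self._client = FeatureEngineeringClient()

    def _save(self, data) -> None:
        # create_table registers the feature table on first write;
        # write_table(mode="merge") upserts into it on subsequent runs.
        try:
            self._client.create_table(
                name=self._table_name,
                primary_keys=self._primary_keys,
                df=data,
            )
        except Exception:  # crude "table already exists" handling for the sketch
            self._client.write_table(
                name=self._table_name, df=data, mode="merge"
            )

    def _load(self):
        return self._client.read_table(name=self._table_name)

    def _describe(self) -> dict:
        return {"table_name": self._table_name, "primary_keys": self._primary_keys}
```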
Your DS workflow will probably start with feature lookups, maybe a read-only dataset (or the same dataset overloaded?) using FeatureEngineeringClient again. It will take some specification of feature lookups to perform. The feature store is responsible for handling things like point-in-time-correct joining of features, etc. This process will yield your training dataset.
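For the lookup side, a sketch using the FeatureLookup / create_training_set API that databricks-feature-engineering documents; all table, key, and label names here are invented:
```python
from databricks.feature_engineering import (
    FeatureEngineeringClient,
    FeatureLookup,
)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
fe = FeatureEngineeringClient()

# A Spark DataFrame holding the entity keys, timestamps, and label (made-up table).
labels_df = spark.table("catalog.schema.training_labels")

lookups = [
    FeatureLookup(
        table_name="catalog.schema.customer_features",  # made-up feature table
        lookup_key="customer_id",
        # timestamp_lookup_key asks the store for point-in-time-correct joins
        timestamp_lookup_key="as_of_date",
    ),
]

training_set = fe.create_training_set(
    df=labels_df,
    feature_lookups=lookups,
    label="churned",
)
training_df = training_set.load_df()  # the joined training dataset
```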
Looking at the examples (https://docs.databricks.com/en/machine-learning/feature-store/example-notebooks.html), FeatureEngineeringClient does also seem to step into the MLflow integration bit; haven't really looked into that much so far.
Iñigo Hidalgo
04/17/2024, 8:39 AM
> It will take some specification of feature lookups to perform. The feature store is responsible for handling things like point-in-time-correct joining of features, etc. This process will yield your training dataset.
This is the part I'm thinking about. What would a "proper" design be? Something like the Ibis approach, where the dataset is a lazy reference to a FeatureEngineeringClient and then, through a pipeline and parameters, we query whatever engineered features we need? Or do we define those queried features and aggregations within the dataset definition? I don't think there's one correct approach, but I would like to get other people's thoughts on this.
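To make the two shapes concrete, a purely hypothetical contrast; neither is a real Kedro or Databricks API:
```python
# Option A ("Ibis-style"): the catalog entry is only a lazy handle to the
# feature store; nodes choose which features to query via parameters.
def select_features(store, params: dict):
    # `store` would lazily wrap a FeatureEngineeringClient; `.lookup()` is
    # a made-up method standing in for whatever query interface it exposes.
    return store.lookup(
        tables=params["feature_tables"],
        keys=params["lookup_keys"],
    )


# Option B: the feature lookups and aggregations live in the dataset
# definition itself, so nodes receive an already-joined training frame.
training_frame_entry = {
    "type": "my_project.datasets.FeatureLookupDataset",  # hypothetical class
    "feature_lookups": [
        {
            "table_name": "catalog.schema.customer_features",
            "lookup_key": "customer_id",
        },
    ],
    "label": "churned",
}
```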
Deepyaman Datta
04/17/2024, 6:24 PM
> I don't think there's one correct approach but would like to get other people's thoughts on this.
I'm happy to help brainstorm more on this, since I did a good amount of looking into feature stores in my previous role (building another feature platform 😅). But I would also love to hear other people's thoughts, and their experience in practice, since I don't come from the user side.