# questions
i
Has there been any discussion around feature stores in kedro? We're currently carrying out a PoC of Databricks' feature store, and its integration with kedro will be quite important to us.
We're also looking into Tecton
j
no discussion yet. have you considered some of the open source ones, like Feast or Hopsworks?
d
Good question - the short answer is that a Kedro dataset abstraction should make it easy to load from / save to a feature store (rough skeleton below). @Lim H. deffo made one in the past, but I can't find it now. The questions I'd like to ask you in return:
• Do you need discovery in your Kedro flow, or would you go in knowing which features you need?
• What would be your dream flow?
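For reference, a bare skeleton of what such a dataset could look like; the class and argument names are made up, and the actual load/save logic would talk to whichever feature store you pick:

```python
from typing import Any

from kedro.io import AbstractDataset  # AbstractDataSet in Kedro < 0.19


class FeatureStoreDataset(AbstractDataset):  # hypothetical name
    """Skeleton only: save a feature table to / load a training set from a store."""

    def __init__(self, table_name: str):
        self._table_name = table_name

    def _load(self):
        raise NotImplementedError("query the feature store here")

    def _save(self, data) -> None:
        raise NotImplementedError("write/register the feature table here")

    def _describe(self) -> dict[str, Any]:
        return {"table_name": self._table_name}
```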
i
> have you considered some of the open source ones, like Feast or Hopsworks?
I know the team looked at them at some point but discarded them for one reason or another. I wasn't involved in the discussions, so I can't really speak as to why (I assume it came down to the maintenance burden of self-managing vs. just buying).
oops
> • Do you need discovery in your Kedro flow, or would you go in knowing which features you need?
nope, I would go in knowing which features we need. So it would be more a question of where we define the needed aggregations, joins, etc. I don't even really know yet what the requirements would be to set up a feature store flow.
> • What would be your dream flow?
Being able to define all the dataset generation stuff, e.g. filter windows, joins, primary keys, aggregations, etc., through config, then managing all the actual model training within a kedro pipeline.
all the discovery stuff would most likely happen outside of kedro. Our DS don't really use kedro for interactive exploration, only for production code
Without even really knowing what a feature store actually does, I would expect that populating the batch features would happen offline, in another kedro pipeline which wouldn't even necessarily know about a feature store; it would just do data transformations and save to a parquet table. Then we would query the feature store at train/inference time, persist an intermediate dataset for traceability, and then train/predict as we do right now. The feature store would basically replace all the loading, joining and feature transformation kedro pipelines. So kedro would live in the ELT part and in the train/predict part; the step from loaded raw features to derived features would live within the feature store.
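Concretely, the split described above might look roughly like this in Kedro; node and dataset names are made up, and `training_set` is assumed to be whatever dataset ends up querying the feature store at load time:

```python
from kedro.pipeline import Pipeline, node


def build_features(raw_orders):
    """Plain data transformations; knows nothing about the feature store."""
    return raw_orders.groupby("customer_id", as_index=False).agg({"amount": "sum"})


def train_model(training_data, model_params):
    """Placeholder trainer; this is where today's train/predict code would sit."""
    model = ...  # fit whatever estimator you use today
    return model


# Offline ELT pipeline: output saved to a parquet table via the catalog
# (that catalog entry could later be swapped for a feature-store dataset).
feature_pipeline = Pipeline(
    [node(build_features, inputs="raw_orders", outputs="customer_features")]
)

# Train pipeline: "training_set" is loaded from the feature store and persisted
# as an intermediate dataset for traceability before training.
training_pipeline = Pipeline(
    [node(train_model, inputs=["training_set", "params:model"], outputs="model")]
)
```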
d
yup
d
I'm not really familiar with Databricks feature store. I've quickly looked at a few examples; if I had to guess, it's a layer on top of Feast (or something very similar).
> Without even really knowing what a feature store actually does, I would expect that populating the batch features would happen offline, in another kedro pipeline which wouldn't even necessarily know about a feature store; it would just do data transformations and save to a parquet table.
Yes, this is more-or-less correct for things like Feast and, to my understanding, Databricks. If using Kedro, you can probably define a dataset for the write into the feature store, which will create or retrieve a `FeatureEngineeringClient`, write the table with the provided name (to Unity Catalog?), and save it to the store. Your DS workflow will probably start with feature lookups, maybe a read-only dataset (or the same dataset overloaded?) using `FeatureEngineeringClient` again. It will take some specification of feature lookups to perform. The feature store is responsible for handling things like point-in-time-correct joining of features, etc. This process will yield your training dataset. Looking at the examples (https://docs.databricks.com/en/machine-learning/feature-store/example-notebooks.html), `FeatureEngineeringClient` does also seem to step into the MLflow integration bit; I haven't really looked into that much so far.
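To make the dataset idea a bit more concrete, here's a hedged sketch: the class, constructor arguments and the create-vs-merge handling are invented for illustration, while the `FeatureEngineeringClient` / `FeatureLookup` calls are based on the Databricks docs, so signatures should be double-checked.

```python
from typing import Any, Optional

from databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup
from kedro.io import AbstractDataset
from pyspark.sql import SparkSession


class DatabricksFeatureTableDataset(AbstractDataset):  # hypothetical name
    """Save: write/register a feature table. Load: build a training set via lookups."""

    def __init__(
        self,
        table_name: str,
        primary_keys: Optional[list] = None,
        spine_table: Optional[str] = None,      # table holding entity keys + label
        feature_lookups: Optional[list] = None,  # list of FeatureLookup kwargs
        label: Optional[str] = None,
    ):
        self._table_name = table_name
        self._primary_keys = primary_keys or []
        self._spine_table = spine_table
        self._feature_lookups = feature_lookups or []
        self._label = label
        self._client = FeatureEngineeringClient()

    def _save(self, data) -> None:
        # First run creates the Unity Catalog feature table; later runs upsert.
        # (Existence check via exception is a shortcut for this sketch.)
        try:
            self._client.create_table(
                name=self._table_name, primary_keys=self._primary_keys, df=data
            )
        except Exception:
            self._client.write_table(name=self._table_name, df=data, mode="merge")

    def _load(self):
        # The "spine" (entity keys + label) comes from a plain table here;
        # the feature store handles the point-in-time-correct joins.
        spine = SparkSession.builder.getOrCreate().table(self._spine_table)
        lookups = [FeatureLookup(**spec) for spec in self._feature_lookups]
        training_set = self._client.create_training_set(
            df=spine, feature_lookups=lookups, label=self._label
        )
        return training_set.load_df()

    def _describe(self) -> dict[str, Any]:
        return {"table_name": self._table_name, "spine_table": self._spine_table}
```

In the catalog, the write and the read could then just be two entries pointing at this dataset with different arguments.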
This is different if you choose something like Tecton, which is more of a feature platform than just the feature store component. Tecton will manage the computation for producing features, too. This means your integration point with Kedro is likely earlier; you would basically do your general data wrangling, but Tecton would want to own the feature creation logic. Most of the offerings I'm familiar with are actually broader feature platforms like this, rather than just feature stores.
i
Thanks for the detailed replies, Deepyaman. I'll reply bit by bit as I digest your messages.
> It will take some specification of feature lookups to perform. The feature store is responsible for handling things like point-in-time-correct joining of features, etc. This process will yield your training dataset.
This is the part I'm thinking about. What would a "proper" design be? Something like the Ibis approach, where the dataset is a lazy reference to a `FeatureEngineeringClient` and we then query whatever features we need through a pipeline and parameters? Or do we define those queried features and aggregations within the dataset definition? I don't think there's one correct approach, but I'd like to get other people's thoughts on this.
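e.g. the lazier option could look something like this, with the feature specs coming in as node parameters rather than being baked into the dataset definition; the names here are hypothetical, and only `FeatureLookup` / `create_training_set` come from the Databricks docs:

```python
from databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup


def build_training_set(spine_df, feature_specs, label="target"):
    """Node: turn parameter-driven feature specs into a training DataFrame."""
    client = FeatureEngineeringClient()
    lookups = [FeatureLookup(**spec) for spec in feature_specs]
    return client.create_training_set(
        df=spine_df, feature_lookups=lookups, label=label
    ).load_df()


# Wired up as something like:
#   node(build_training_set, inputs=["spine", "params:feature_specs"], outputs="training_set")
```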
d
With the Databricks feature store (and something like Feast), all the feature store does is register an already-created feature (i.e. a table), I believe, so there's no hard need to be lazy.
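i.e. registering an already-computed table is roughly a one-liner; this assumes the databricks-feature-engineering client, and the Unity Catalog name and columns are made up:

```python
from databricks.feature_engineering import FeatureEngineeringClient
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for a table produced eagerly by a normal (Kedro) pipeline.
customer_features_df = spark.createDataFrame(
    [(1, 42.0)], ["customer_id", "total_spend"]
)

fe = FeatureEngineeringClient()

# This call just registers/writes the existing DataFrame as a feature table.
fe.create_table(
    name="main.ml.customer_features",  # hypothetical Unity Catalog name
    primary_keys=["customer_id"],
    df=customer_features_df,
    description="Aggregated customer features",
)
```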
> I don't think there's one correct approach, but I'd like to get other people's thoughts on this.
I am happy to help brainstorm more on this, since I did a good amount of looking into feature stores in my previous role (building another feature platform 😅). But I would also love to hear other people's thoughts, and experience in practice, since I don't come from the user side.