What is a Feature Store?

A feature store is a software toolkit designed to manage the transformation of raw data into features. Feature stores often include metadata management tools to register, share, and track features. They also handle the complexity of doing correct point-in-time joins, so the resulting data frame can be ingested by any model training library.

How do feature stores simplify the ML model workflow?

A typical ML model workflow (that is, one without a feature store) generally means building multiple models using hundreds of features. Maintaining the feature transformation pipelines alone becomes tedious and slows down model development. Also, the design isn't easily reusable and doesn't promote sharing of features across teams.

MLFlow without feature store

The following drawing illustrates how the workflow changes with the introduction of a feature store.

MLFlow with feature store

Deciding whether to adopt a feature store

Who should decide whether it's worth the investment for a team and/or company to build a feature store? Feature stores offer benefits for data science and infrastructure/platform teams, so it makes sense to include both groups in the discussion (see Data science factors below).

Platform engineers on large teams or organizations often recommend adoption of a feature store. In contrast, data scientists are likely to first see the benefits in smaller teams and organizations. Buy-in from both groups is critical to successfully implementing a solution like this one.

Data platform teams are looking for ways to provide most of the functionality needed by the data science team. The goal is to provide it in a way that is easy to manage and control, using self-service if possible. Data scientists are looking for a platform where they can find and share features and abstract feature access during training and inference.

The following content highlights decision points most teams will reach while considering implementing a feature store. We'll start with the definition of some terms that are frequently used in feature stores.

Feature Store Terminology

Feature transformation: refers to the process of converting raw data into features. Transformation generally requires building data pipelines to ingest both historical and real-time data.

Feature registry: refers to a location where all features used are defined and registered. By using a registry, data scientists can search, find, and reuse features in their models. Feature definitions include information like type, source, and other relevant metadata.

Feature serving: refers to serving feature values both for batch operations like training, where high latency is acceptable, and for inference, which requires low latency. It abstracts away the complexity of querying feature values while providing functionality like point-in-time joins.

Observation data: refers to the raw input for the data being queried in the feature serving layer. Observation data is, at the least, composed of the IDs of the entities of interest and timestamps. The IDs and timestamps both serve as join keys. Some feature stores call this concept an entity data frame.

Point-in-time joins (PITJ): refers to joins on time-series data that match each observation only with feature values known at or before its timestamp. It is important that the data used for training is not mixed with data ingested after the observation time. Doing so creates feature leakage (also known as label leakage). PITJ ensures that the data served corresponds to the most recent feature values at or before each observation time.
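To make the point-in-time join concrete, here is a minimal sketch in plain Python. It is not how any particular feature store implements the join; the function name `point_in_time_join`, the tuple layouts, and the sample driver data are assumptions made for illustration. The observation data is a list of (entity ID, timestamp) pairs, and each one is matched with the latest feature value recorded at or before its timestamp.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime


def point_in_time_join(observations, feature_rows):
    """Attach to each (entity_id, timestamp) observation the latest feature
    value recorded at or before that timestamp (None if there is none).

    Hypothetical helper: observations are (entity_id, timestamp) pairs and
    feature_rows are (entity_id, timestamp, value) triples.
    """
    # Group the feature history per entity, ordered by timestamp.
    history = defaultdict(list)
    for entity_id, ts, value in sorted(feature_rows, key=lambda r: (r[0], r[1])):
        history[entity_id].append((ts, value))

    joined = []
    for entity_id, obs_ts in observations:
        rows = history.get(entity_id, [])
        timestamps = [ts for ts, _ in rows]
        # Number of feature rows whose timestamp is <= the observation time.
        idx = bisect_right(timestamps, obs_ts)
        value = rows[idx - 1][1] if idx else None
        joined.append((entity_id, obs_ts, value))
    return joined


# Illustrative data only: a per-driver trip count recorded over time.
feature_rows = [
    ("driver_1", datetime(2024, 1, 1), 10),
    ("driver_1", datetime(2024, 1, 3), 30),
    ("driver_2", datetime(2024, 1, 2), 7),
]
observations = [
    ("driver_1", datetime(2024, 1, 2)),  # sees the Jan 1 value, not Jan 3
    ("driver_1", datetime(2024, 1, 3)),  # sees the Jan 3 value
    ("driver_2", datetime(2024, 1, 1)),  # nothing recorded yet
]
training_rows = point_in_time_join(observations, feature_rows)
```

The first observation receives the value from January 1 rather than the later January 3 value, which is what prevents leakage. In pandas, `merge_asof` with its default backward direction performs a comparable as-of join on data frames.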

Data science factors

Data scientists should review the following questions to help decide if it is worth the cost to invest in a feature store.

Decision point: Do your data scientists have problems finding available features for reuse?

Without feature store: Without a centralized repository for features, data scientists often jump directly to creating feature transformation pipelines. As the number of supported use cases grows, these pipelines increase the complexity of the platform and reduce the value of previously acquired domain knowledge.

With feature store: A key component in a feature store is the feature registry. A feature registry is a module that works as a centralized repository for all features created by and within an organization. It makes discovery and management of features easier. A feature registry contains information about feature definitions and their source. Depending on the feature store, it might include information about the transformation code and lineage information. This component is, ideally, searchable, easy to understand, and accessible from a centralized endpoint.

Decision point: Do you want to share your features with business users?

Without feature store: Information about features is scattered throughout docs and code, is not easily shareable with business users, and might become outdated. These users provide domain knowledge about which features to use.

With feature store: Feature stores are a single source of truth with a standardized and structured way of viewing information about features.

Decision point: Do many of your features need to be served/computed in real time?

Without feature store: Clients request predictions without having the feature values and without a way to inject them into the requests. These features need to be computed in near real time. A good use case is a real-time recommendation engine where you aggregate streamed events and generate new recommendations on demand.

With feature store: Feature stores provide a component called an online store. The online store contains the latest value of each feature; the process of populating it is often called materialization. The values persist in a low-latency data store, ensuring that features are served in near real time to your model. The feature store abstracts this materialization process.

Decision point: Are many of your features time-dependent? Do your data scientists spend much time handling complex point-in-time joins?

Without feature store: Data scientists need to spend time learning how to do point-in-time correct joins. Constructing point-in-time correct data is time-consuming and error-prone.

With feature store: A feature store has built-in point-in-time joining capabilities, abstracting this complexity away from the data scientists.

Decision point: Do your data scientists spend time writing complex queries or code to access the feature data?

Without feature store: During feature value retrieval, data scientists must write code to access the data according to the data source of choice. The lack of abstraction can require writing complex queries or spending time on code that has little direct value to their work. Sometimes, the time is spent debugging infrastructure issues instead of higher-value activities like feature engineering or building the ML model itself.

With feature store: Feature stores provide a serving layer that works as an abstraction over the infrastructure. Data scientists can minimize the time spent dealing with the infrastructure and its specific syntax and focus on the features they need. This layer combines the feature registry and the point-in-time joins, providing a powerful mechanism for data scientists to access data without knowing the underlying infrastructure.
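The following sketch illustrates how a registry, an online store, and a serving call can fit together. It is a toy, in-memory model written for this article, not the API of any real feature store: `FeatureDefinition`, `FeatureStoreSketch`, and their methods are hypothetical names, timestamps are plain integers, and the feature names are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureDefinition:
    """Hypothetical registry entry: metadata that describes one feature."""
    name: str
    entity: str
    dtype: type
    source: str


class FeatureStoreSketch:
    """Toy in-memory registry plus online store (illustration only)."""

    def __init__(self):
        self._registry = {}  # feature name -> FeatureDefinition
        self._online = {}    # (feature name, entity id) -> (timestamp, value)

    def register(self, definition):
        self._registry[definition.name] = definition

    def search(self, keyword):
        # Discovery: find registered features whose name contains the keyword.
        return sorted(n for n in self._registry if keyword.lower() in n.lower())

    def materialize(self, feature_name, rows):
        # Keep only the most recent value per entity in the online store.
        if feature_name not in self._registry:
            raise KeyError(f"unregistered feature: {feature_name}")
        for entity_id, ts, value in rows:
            key = (feature_name, entity_id)
            current = self._online.get(key)
            if current is None or ts > current[0]:
                self._online[key] = (ts, value)

    def get_online_features(self, feature_names, entity_id):
        # Serving: callers ask by feature name, not by storage location.
        result = {}
        for name in feature_names:
            if name not in self._registry:
                raise KeyError(f"unregistered feature: {name}")
            stored = self._online.get((name, entity_id))
            result[name] = stored[1] if stored else None
        return result


store = FeatureStoreSketch()
store.register(FeatureDefinition("trip_count_7d", "driver", int, "trips_table"))
store.register(FeatureDefinition("avg_fare_7d", "driver", float, "trips_table"))

# Rows are (entity id, timestamp, value); only the newest per entity is kept.
store.materialize("trip_count_7d", [("d1", 1, 12), ("d1", 3, 15), ("d2", 2, 4)])

found = store.search("fare")
served = store.get_online_features(["trip_count_7d", "avg_fare_7d"], "d1")
```

Here `served` holds the latest trip count for driver `d1` and `None` for the feature that has not been materialized yet. A production online store would sit on a low-latency database rather than a dictionary, but the caller-facing shape (register, discover, look up by name) is the point of the sketch.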

Platform factors

Infrastructure and data platform teams should consider the following questions when evaluating the pros and cons of building a feature store.

Decision point: Do you maintain many duplicated feature transformation code/pipelines?

Without feature store: When data scientists are unaware of existing features, they tend to create and manage duplicate pipelines to perform feature transformations. Managing all these pipelines is expensive and demands a lot of attention from a platform team when making changes or upgrades.

With feature store: Because a feature store is built around sharing, the number of duplicated feature transformation pipelines should drop in favor of reusing existing features.

Decision point: Do you have to serve features for training (batch or high-latency) and inference (low-latency)?

Without feature store: Processing historical data (for training) and streaming data (for inference) is done differently and requires separate pipelines. These pipelines might process the data using different methods and technologies, specific to whether the data is ingested in batch or as a stream. They might also store the results in various data stores according to the latency requirements. All these factors increase the complexity of maintaining these pipelines.

With feature store: Most feature stores provide a module for feature computation that takes care of storing data in a suitable data store according to the requirements of the feature. A module like this enables the processing of batch data from ETL processes or historical data in a data warehouse, and of streaming data from low-latency message bus systems. To make this process consistent, some feature stores provide a domain-specific language (DSL) for transformations that delivers consistent results no matter how the data is ingested. The DSL allows features to be computed once for both training (batch) and inference (real-time) and reused across models. Feathr is one example.

Decision point: Do you need to keep your data systems compliant?

Without feature store: Maintaining control of each training dataset used by your data science team might be daunting. This control is especially difficult as the number of use cases grows.

With feature store: Some feature stores provide the governance tools required by an enterprise to exercise control over the feature data. Access control, data quality checks, policies, and auditing enable the platform team to maintain control over the data ingested and transformed in the feature store from one centralized place.
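The idea of defining a transformation once and reusing it for both batch and streaming data can be sketched in a few lines of Python. This is an illustration of the principle only, not Feathr's DSL or any feature store's API; the function names and the taxi-trip fields (`pickup`, `dropoff`, `fare`, `distance_km`) are hypothetical.

```python
from datetime import datetime


def trip_features(event):
    """Single definition of the transformation, shared by both paths."""
    duration_min = (event["dropoff"] - event["pickup"]).total_seconds() / 60
    distance = event["distance_km"]
    return {
        "trip_id": event["trip_id"],
        "duration_min": duration_min,
        # Guard against zero-distance trips instead of dividing by zero.
        "fare_per_km": event["fare"] / distance if distance else 0.0,
    }


def batch_transform(events):
    # Training path: transform a whole historical dataset at once.
    return [trip_features(e) for e in events]


def stream_transform(event_iter):
    # Inference path: transform events one by one as they arrive.
    for event in event_iter:
        yield trip_features(event)


# Illustrative events only.
events = [
    {
        "trip_id": "t1",
        "pickup": datetime(2024, 1, 1, 10, 0),
        "dropoff": datetime(2024, 1, 1, 10, 30),
        "fare": 20.0,
        "distance_km": 8.0,
    },
    {
        "trip_id": "t2",
        "pickup": datetime(2024, 1, 1, 11, 0),
        "dropoff": datetime(2024, 1, 1, 11, 15),
        "fare": 9.0,
        "distance_km": 0.0,
    },
]

batch_rows = batch_transform(events)
stream_rows = list(stream_transform(iter(events)))
```

Because both paths call the same `trip_features` function, the values computed for training and for inference cannot drift apart, which is the consistency guarantee a shared transformation language aims for.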

Feature store summaries

Here are brief summaries of each feature store solution.

Azure Managed Feature Store

The Azure Machine Learning managed feature store makes it much easier for ML professionals to develop and productionize features. It has the following capabilities:

  • Monitors features
  • Provides network isolation via private endpoints and managed VNets
  • Works with both Azure ML and custom ML solutions

Feast

Feast is an open-source feature store created by Gojek.

It focuses on providing a feature registry for sharing features and a feature serving layer that supports point-in-time joins and abstracts the queries needed to access features in the data store. Feast expects that you bring your own data warehouse and doesn’t provide support for feature transformation. It also expects that your data transformation has been completed and persisted beforehand.

Databricks FS

Databricks FS is a proprietary feature store solution provided within the Databricks environment. It builds on Delta Lake, the storage layer that Databricks uses, to provide a solution that works with different data versions. It only works within Databricks, and integration with other data sources is not available.

Feathr

Feathr is an open-source Feature Store created by LinkedIn and Microsoft.

In terms of functionality, Feathr provides a Feature Registry, support for Feature Transformation through its built-in functions, and the functionality to share features across teams.

Feathr runs the feature computation on Spark against incoming data from multiple sources. It supports different storage systems to persist that data after it has been processed for consumption at training or inference time.

NOTE: As of July 2024, it does not appear that Feathr is receiving active development.

Implementation

This implementation shows a feature engineering system built with the Azure Machine Learning (AML) managed feature store and Microsoft Fabric.

Logical architecture

Mermaid diagram #1

Feature Engineering on Microsoft Fabric

Feature engineering is a crucial process in machine learning where domain knowledge is used to extract features from raw data. These features are then used to train models for predicting values in relevant business scenarios.

The system architecture involves a data pipeline running on Microsoft Fabric that lands, ingests, and transforms incoming public NYC taxi data into features. These features are built, registered, and stored in AML Managed Feature Store, and are used for model training and inferencing. The data pipeline is tracked and monitored by Azure Purview, which also captures and stores the feature lineage.

For more information

Data Pipeline: Best practices for designing and building data platforms

Azure ML Managed Feature Store: What is managed feature store?

Fabric: What is Microsoft Fabric?