We have heard a lot about feature stores and how they are transforming the MLOps space, but always from the data engineering viewpoint. For a data scientist, the toolkit is rapidly expanding. Does it make sense for them to adopt one more tool?
In the development stage of a machine learning project, data scientists do large amounts of feature engineering to find the features that lead to the highest prediction accuracy. Once that process is complete, they usually hand off the project to an engineering colleague who will put those feature engineering pipelines into production.
If you’re a data scientist, you don’t want to be concerned about how the data becomes available or how it is computed. You know which features you want, and you want those features to be available for the model to make live predictions.
Engineers, on the other hand, need to re-implement those data pipelines in a production environment, which quickly becomes very complex as soon as there’s real-time or near-real-time data involved. To power operational ML applications, these pipelines need to run continuously, can’t break, need to be extremely fast, and need to scale with the business.
That’s why there is a pressing need for data scientists to add feature stores to their toolkits.
First, let’s look at what a feature store is.
What is a Feature Store?
A feature store is a centralized repository or platform for managing and serving machine learning features, which are data inputs that are used to train and deploy machine learning models.
In machine learning, features are the input variables that are used to make predictions or classifications. For example, in a model that predicts housing prices, features could include the number of bedrooms, the square footage, and the neighborhood of a given property. A feature store helps organizations manage these features by providing a centralized location to store and track them.
A feature store typically includes tools for data ingestion, data processing, and data versioning to ensure that the features are consistent and up-to-date. It also provides a way to serve these features to machine learning models in a scalable and efficient manner.
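To make the idea concrete, here is a minimal in-memory sketch using the housing example above. It is an illustration only: `ToyFeatureStore` and its `register`, `ingest`, and `get_features` methods are hypothetical names, not the API of any real feature store product, and a production system would add persistence, versioning, and access control.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class ToyFeatureStore:
    """Hypothetical in-memory feature store, for illustration only."""

    # feature name -> human-readable definition (the shared catalog)
    definitions: Dict[str, str] = field(default_factory=dict)
    # (entity id, feature name) -> latest ingested value
    values: Dict[Tuple[str, str], Any] = field(default_factory=dict)

    def register(self, name: str, description: str) -> None:
        """Record what a feature means before any value is stored."""
        self.definitions[name] = description

    def ingest(self, entity_id: str, features: Dict[str, Any]) -> None:
        """Store feature values for one entity; reject undocumented features."""
        for name, value in features.items():
            if name not in self.definitions:
                raise KeyError(f"unregistered feature: {name}")
            self.values[(entity_id, name)] = value

    def get_features(self, entity_id: str, names: List[str]) -> List[Any]:
        """Serve a feature vector in the requested order (None if missing)."""
        return [self.values.get((entity_id, name)) for name in names]


store = ToyFeatureStore()
store.register("bedrooms", "Number of bedrooms in the property")
store.register("square_feet", "Interior area in square feet")
store.register("neighborhood", "Name of the neighborhood")

store.ingest("house_42", {"bedrooms": 3, "square_feet": 1450, "neighborhood": "Riverside"})

feature_vector = store.get_features("house_42", ["bedrooms", "square_feet", "neighborhood"])
```

Both a training job and a live prediction service could call `get_features` with the same feature names, which is the central point of having one place to store and serve them.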
By using a feature store, organizations can improve the productivity of their data science teams, reduce the time to develop and deploy models and improve the accuracy and consistency of their machine learning predictions.
Re-implementing data pipelines in a production environment is the main blocker for operational ML projects.
Currently, feature stores are most useful where models are iterated on quickly and must react to fast-changing data. Some examples of these are recommendation systems, search ranking, dynamic pricing, fraud detection, and loan application approvals.
Feature stores/platforms enable operational machine learning (ML), which happens when a customer-facing application uses ML to autonomously and continuously make real-time decisions that impact the business.
Operational ML’s older sibling is analytical machine learning in the “offline” world. These are applications that help a business user make better decisions with machine learning. Analytical ML applications sit in the company’s analytical stack and typically feed directly into reports, dashboards, and business intelligence tools.
Common examples include sales forecasting, churn predictions, and customer segmentation.
What is the Feature Store Ultimately Solving?
Getting features into ML models should be easy, but currently it is not. A feature platform solves the data challenges associated with putting models into production and operating them. Put simply, it creates a path to production and enables rapid iteration.
Just as package managers handle code dependencies, feature stores handle feature dependencies. Without one, teams often lack feature monitoring infrastructure and tools. If something changes upstream, say a data point that previously captured distance in kilometers is now recorded in meters, the model will be disrupted. That is why it is important to track data lineage for features.
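The kilometers-to-meters example can be caught with even a very simple monitoring check. The sketch below is a hedged illustration, not how any particular feature store implements monitoring: the function name `looks_like_unit_change` and the default `tolerance` of 10 are assumptions, and the check assumes non-negative values.

```python
from statistics import mean


def looks_like_unit_change(baseline, current, tolerance=10.0):
    """Flag a batch whose mean differs from the baseline mean by more than
    a factor of `tolerance` in either direction.

    Assumes non-negative feature values (such as distances).
    """
    base_mean = mean(baseline)
    curr_mean = mean(current)
    if base_mean == 0 or curr_mean == 0:
        # A ratio is undefined here; flag only if exactly one side is zero.
        return base_mean != curr_mean
    ratio = curr_mean / base_mean
    return ratio > tolerance or ratio < 1 / tolerance


# Hypothetical data: the same distances, first in kilometers, then in meters.
distance_km = [1.2, 3.5, 0.8, 2.1]
distance_m = [value * 1000 for value in distance_km]

unit_change_detected = looks_like_unit_change(distance_km, distance_m)
normal_batch_flagged = looks_like_unit_change(distance_km, [1.0, 3.0, 1.1, 2.4])
```

A real monitoring setup would compare full distributions rather than means, but the principle is the same: the feature store knows what each feature looked like historically, so it can raise an alert before the model silently degrades.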
Real-time serving of features at scale is difficult. Training an ML model on historical point-in-time features is difficult. Implementing real-time data pipelines is difficult. A feature store simplifies all of these, with benefits such as:
- Improved Productivity – Implementation time for new features drops from weeks to days
- Improved Collaboration – Applications can share features, which was difficult to do previously
- Improved Performance – Running time can improve over custom pipelines by as much as 50%
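Of the difficulties listed above, training on historical point-in-time features is perhaps the easiest to illustrate in a few lines. For every labeled event, the training row must use the feature value that was known at that moment, never a later one. The sketch below assumes a hypothetical fraud feature (`txn_count_7d`) and made-up timestamps expressed as day numbers; `value_as_of` is an illustrative helper, not a library function.

```python
from bisect import bisect_right


def value_as_of(history, ts):
    """Return the latest value whose timestamp is <= ts, or None.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [t for t, _ in history]
    idx = bisect_right(timestamps, ts)
    return history[idx - 1][1] if idx > 0 else None


# Hypothetical feature history for one customer:
# (day, number of transactions in the previous 7 days)
txn_count_7d = [(1, 2), (5, 4), (9, 7)]

# Hypothetical label events: (day, was_fraud)
labels = [(0, 0), (5, 0), (6, 1), (10, 0)]

# Join each label with the feature value known on that day.
training_rows = [(value_as_of(txn_count_7d, day), label) for day, label in labels]
```

The label on day 6 is paired with the value recorded on day 5, not the one from day 9. Using the later value would leak future information into training, which is the mistake a point-in-time join exists to prevent.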
What a feature store should be is still an open industry question, and feature stores will evolve as the field progresses. For now, they solve some of the persistent challenges that data scientists face, such as:
- Varying feature definitions: Different teams might name and define features differently, which makes finding a feature or its documentation challenging. Feature stores keep the features and their definitions consistent. This creates a standardized language around all features, stating in one place how every feature is computed and what it represents.
- Non-reusability of features: Redeveloping features is a common bottleneck that data scientists face. With a feature store, reusing previously developed features, including ones developed by others, becomes a feasible option.
- Inconsistency between training and production features: The production and research environments often use different programming languages and technologies. The data streaming into the production systems needs to be processed into features in real time to be fed into the ML models. A feature store is environment agnostic and ensures that, given the same data, the model will be fed the same feature.
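The last point, consistency between training and production, comes down to defining the feature logic once and reusing it in both paths. The sketch below shows that idea in plain Python; the feature `spend_ratio` and the field names `amount` and `avg_amount` are hypothetical and chosen only for illustration.

```python
def spend_ratio(raw):
    """Feature logic defined once: purchase amount relative to the
    customer's average purchase amount (0.0 if there is no average yet)."""
    avg = raw["avg_amount"]
    return raw["amount"] / avg if avg else 0.0


def build_training_features(rows):
    """Offline path: apply the shared definition to historical rows."""
    return [spend_ratio(row) for row in rows]


def build_serving_feature(event):
    """Online path: apply the same definition to a single live event."""
    return spend_ratio(event)


historical = [
    {"amount": 50.0, "avg_amount": 25.0},
    {"amount": 10.0, "avg_amount": 40.0},
    {"amount": 30.0, "avg_amount": 0.0},
]

offline_values = build_training_features(historical)
online_value = build_serving_feature(historical[0])
```

Because both paths call the same function, the first historical row and the equivalent live event produce the same feature value. A feature store plays the role of that shared definition across languages and systems, where simply importing one function is not possible.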
The Three Approaches to a Feature Store
There are three ways a feature store can be implemented. These are:
- Literal Feature Store
- Physical Feature Store
- Virtual Feature Store
Literal Feature Store: A literal feature store only stores pre-processed features. Staying true to the name “Literal” feature store, it does not manage computing and creating features and just provides storage for features.
Physical Feature Store: Beyond storage, a physical feature store also computes the features. This is the most common type of feature store among vendors and in-house feature stores and has its own domain-specific language to define transformations. It also has its own storage layer for storing and serving features.
Virtual Feature Store: With low adoption cost and more flexibility, a virtual feature store coordinates and manages the transformations rather than actually computing them. The computations are offloaded to the organization’s existing data infrastructure. The virtual feature store essentially transforms your existing data infrastructure into a feature store.
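The difference between the three approaches is mostly about who computes and who stores. The sketch below contrasts them in a few lines of Python. All class and method names are hypothetical, and the "existing infrastructure" is simulated with a dictionary lookup; it is a rough model of the division of responsibilities, not a description of any vendor's design.

```python
class LiteralStore:
    """Only stores feature values that were computed elsewhere."""

    def __init__(self):
        self._values = {}

    def put(self, name, value):
        self._values[name] = value

    def get(self, name):
        return self._values[name]


class PhysicalStore(LiteralStore):
    """Computes features itself and keeps them in its own storage."""

    def __init__(self):
        super().__init__()
        self._transforms = {}

    def define(self, name, transform):
        self._transforms[name] = transform

    def materialize(self, name, raw_data):
        # The store runs the transformation and stores the result.
        self.put(name, self._transforms[name](raw_data))


class VirtualStore:
    """Keeps definitions only; computation runs on existing infrastructure."""

    def __init__(self, run_on_existing_infra):
        self._run = run_on_existing_infra
        self._definitions = {}

    def define(self, name, query):
        self._definitions[name] = query

    def get(self, name):
        # Delegate the actual work to the organization's own systems.
        return self._run(self._definitions[name])


# Simulated warehouse: maps a query string to a precomputed answer.
fake_warehouse = {"SELECT AVG(amount) FROM orders": 42.5}


def run_query(query):
    return fake_warehouse[query]


literal = LiteralStore()
literal.put("avg_order_amount", 42.5)

physical = PhysicalStore()
physical.define("avg_order_amount", lambda amounts: sum(amounts) / len(amounts))
physical.materialize("avg_order_amount", [40.0, 45.0])

virtual = VirtualStore(run_query)
virtual.define("avg_order_amount", "SELECT AVG(amount) FROM orders")

results = (
    literal.get("avg_order_amount"),
    physical.get("avg_order_amount"),
    virtual.get("avg_order_amount"),
)
```

All three return the same feature to the caller. What changes is where the transformation logic lives and which system does the work, which is what drives the differences in adoption cost and flexibility.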
To Conclude: The importance of having a feature store
- At an enterprise level, multiple features are hosted in multiple places by different teams. Here, the feature store provides a central storage option.
- Feature stores eliminate the need for extensive data engineering by transforming features into a form suitable for production environments.
- As already mentioned, feature stores enable feature reusability and sharing across teams, saving time and effort on repetitive tasks.
- Feature stores monitor the correctness of feature pipelines in the production environment.
- Feature stores also provide consistent feature definitions, versioning, and metadata.
Embedding stores are the next frontier as more and more deep-learning models are used. Data scientists’ requirements and roadblocks will continue to shape the future of feature stores, and it will be interesting to see how the area develops.