I’m expecting 2021 to be the year of the feature store
— Mike Del Balso, CEO and co-founder of Tecton
According to a Gartner study, 85 percent of AI projects will flatline by 2022. Even carefully built machine learning models may fall short of expectations when deployed in an enterprise setting, mainly for two reasons: inadequate data infrastructure and talent scarcity.
In the machine learning pipeline, searching for appropriate data and preparing datasets are among the most time-consuming steps. A data scientist spends around 80 percent of their time managing and preparing data for analysis. The gap between demand for and supply of qualified data scientists is another pressing challenge.
Enter the feature store.
What Are Feature Stores?
A feature store allows features to be registered, discovered, and used both in machine learning pipelines and in online applications that run model inference. It can store large volumes of feature data and provide low-latency access to features for online applications. A feature store automates the input, tracking, and governance of the data fed into machine learning models. Enterprise AI can benefit immensely from such a centralised and reproducible framework for managing the data behind machine learning models.
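To make the register, discover, and use cycle concrete, here is a minimal in-memory sketch. It is not the API of Michelangelo, Feast, or any other product; the class names, method names, and the example feature `trips_last_7d` are all hypothetical, and a real feature store would back the online lookup with a low-latency database rather than a Python dictionary.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class FeatureDefinition:
    """Metadata for one registered feature (hypothetical schema)."""
    name: str
    entity: str        # the kind of object the feature describes, e.g. "driver"
    description: str
    owner: str


class InMemoryFeatureStore:
    """Toy feature store: a registry of definitions plus a key-value online store."""

    def __init__(self) -> None:
        self._registry: Dict[str, FeatureDefinition] = {}
        self._online: Dict[Tuple[str, str], Any] = {}

    def register(self, definition: FeatureDefinition) -> None:
        """Add a feature definition so that others can find and reuse it."""
        if definition.name in self._registry:
            raise ValueError(f"feature {definition.name!r} is already registered")
        self._registry[definition.name] = definition

    def discover(self, keyword: str) -> List[str]:
        """Return the names of features whose name or description mentions keyword."""
        kw = keyword.lower()
        return sorted(
            name
            for name, d in self._registry.items()
            if kw in name.lower() or kw in d.description.lower()
        )

    def write(self, feature: str, entity_id: str, value: Any) -> None:
        """Store the current value of a registered feature for one entity."""
        if feature not in self._registry:
            raise KeyError(f"unknown feature {feature!r}")
        self._online[(feature, entity_id)] = value

    def get_online_features(self, features: List[str], entity_id: str) -> Dict[str, Any]:
        """Look up current values at inference time; missing values come back as None."""
        return {f: self._online.get((f, entity_id)) for f in features}


# Example usage with made-up data.
store = InMemoryFeatureStore()
store.register(
    FeatureDefinition(
        name="trips_last_7d",
        entity="driver",
        description="Completed trips in the last 7 days",
        owner="marketplace-team",
    )
)
store.write("trips_last_7d", "driver_42", 31)
print(store.discover("trips"))
print(store.get_online_features(["trips_last_7d"], "driver_42"))
```

The point of the sketch is the separation of concerns: definitions live in a shared registry, while values are read through a single lookup path that an online application can call at inference time.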
In 2017, Uber changed the game with the introduction of Michelangelo, its machine learning platform, which included a feature store. In 2019, the Feast project, built in collaboration with Google Cloud, announced an open-source feature store.
The latest to join the bandwagon is Amazon SageMaker Feature Store, a fully managed, purpose-built repository. Airbnb, Twitter, Facebook, and Netflix are other major players with feature stores.
By taking over the most mundane yet time-intensive data tasks, feature stores let data scientists focus on essential work such as model building and experimentation rather than on cleaning and managing data.
Feature stores manage the data pipelines that transform raw data into feature values. These can be either scheduled pipelines that aggregate large amounts of data (up to petabytes) or real-time pipelines triggered by events. In both cases, the feature store serves the ‘freshest’ feature values to machine learning models.
A feature store exposes APIs and UIs that show data scientists which features, pipelines, and training datasets are currently available or under development. Data scientists can choose the features required for their use cases and incorporate them into their models.
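The two kinds of pipeline can be sketched as follows. This is an illustration under simple assumptions, not how any particular feature store implements them: the raw event shape (customer ID, order amount), the feature names `order_count` and `avg_order_value`, and both function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Event = Tuple[str, float]            # (customer_id, order_amount), an assumed raw event shape
FeatureTable = Dict[str, Dict[str, float]]


def batch_pipeline(events: Iterable[Event]) -> FeatureTable:
    """Scheduled job: recompute every customer's features from the full event history."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for customer_id, amount in events:
        totals[customer_id] += amount
        counts[customer_id] += 1
    return {
        cid: {
            "order_count": float(counts[cid]),
            "avg_order_value": totals[cid] / counts[cid],
        }
        for cid in counts
    }


def on_event(features: FeatureTable, event: Event) -> None:
    """Real-time path: update one customer's features in place when a new event arrives."""
    customer_id, amount = event
    current = features.setdefault(
        customer_id, {"order_count": 0.0, "avg_order_value": 0.0}
    )
    n = current["order_count"]
    # Running mean: fold the new amount into the existing average.
    current["avg_order_value"] = (current["avg_order_value"] * n + amount) / (n + 1)
    current["order_count"] = n + 1


# Example usage with made-up data.
history = [("c1", 10.0), ("c1", 30.0), ("c2", 5.0)]
features = batch_pipeline(history)      # what a nightly job might produce
on_event(features, ("c1", 50.0))        # a new order arrives during the day
print(features["c1"])
```

The batch job is simple to reason about but only as fresh as its last run; the event-triggered update keeps values current between runs at the cost of maintaining incremental logic.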
Feature stores offer the following benefits:
- One of the main challenges in implementing a machine learning model in an enterprise environment is that the features used to train the model may not match those available in the production serving layer. A feature store provides a consistent feature set to both, enabling a smoother deployment process.
- The feature store keeps metadata in addition to the actual features. This helps data scientists select features that have performed well in existing models.
- Unlike traditional methods where features are developed in silos, feature stores allow sharing features and their metadata with peers. This helps in collaboration and avoids duplication.
- In critical services such as finance, healthcare, and security, it becomes essential to track the lineage of the algorithms being developed. To achieve this, data scientists require visibility into the end-to-end flow of the model. A feature store exposes the data lineage of each feature, capturing how it was developed and providing insights and reports for regulatory compliance.
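Two of the benefits above, consistency between training and serving and lineage tracking, can be illustrated with one small sketch. It assumes that each feature keeps a timestamped history of values, so that training data can be read as of a past moment while serving reads the latest value from the same record. The class `FeatureRecord`, its fields, and the integer timestamps are hypothetical and do not reflect any specific product.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class FeatureRecord:
    """One feature with its lineage metadata and a timestamped value history."""
    name: str
    source: str            # lineage: where the raw data comes from
    transformation: str    # lineage: how the value is derived from the source
    owner: str
    # entity_id -> list of (timestamp, value), kept sorted by timestamp
    history: Dict[str, List[Tuple[int, float]]] = field(default_factory=dict)

    def write(self, entity_id: str, timestamp: int, value: float) -> None:
        rows = self.history.setdefault(entity_id, [])
        rows.append((timestamp, value))
        rows.sort(key=lambda row: row[0])

    def as_of(self, entity_id: str, timestamp: int) -> Optional[float]:
        """Training path: the value that was known at `timestamp`, never a later one."""
        rows = self.history.get(entity_id, [])
        idx = bisect_right([t for t, _ in rows], timestamp)
        return rows[idx - 1][1] if idx else None

    def latest(self, entity_id: str) -> Optional[float]:
        """Serving path: the most recent value, read from the same history."""
        rows = self.history.get(entity_id, [])
        return rows[-1][1] if rows else None

    def lineage(self) -> Dict[str, str]:
        """Summary of how the feature was produced, e.g. for a compliance report."""
        return {
            "feature": self.name,
            "source": self.source,
            "transformation": self.transformation,
            "owner": self.owner,
        }


# Example usage with made-up data.
avg_order_value = FeatureRecord(
    name="avg_order_value",
    source="orders table",
    transformation="mean(order_amount) per customer",
    owner="growth-team",
)
avg_order_value.write("c1", timestamp=100, value=20.0)
avg_order_value.write("c1", timestamp=200, value=30.0)
print(avg_order_value.as_of("c1", 150))   # value a training row labelled at t=150 should see
print(avg_order_value.latest("c1"))       # value the production model sees now
print(avg_order_value.lineage())
```

Because training and serving both read from one definition and one history, the model is trained on values computed the same way as the ones it receives in production, and the lineage summary travels with the feature rather than living in a separate document.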
Wrapping Up
As mentioned earlier, large tech companies that work extensively with AI have built their own feature stores. The rest of the industry still needs to standardise and automate the core of feature engineering, and feature stores are slated to become a prerequisite in the machine learning pipeline.