Feature Engineering

Intro

  • Scale to large datasets
  • Find good features
    • Synthetic features
  • Preprocess with Cloud MLE
  • Hyperparameter tuning

Tools

GitHub - feast-dev/feast: The Open Source Feature Store for Machine Learning

Feast (Feature Store) is an open source feature store for machine learning. Feast is the fastest path to manage existing infrastructure to productionize analytic data for model training and online inference.

Feast allows ML platform teams to:

  • Make features consistently available for training and serving by managing an offline store (to process historical data for scale-out batch scoring or model training), a low-latency online store (to power real-time prediction), and a battle-tested feature server (to serve pre-computed features online).
  • Avoid data leakage by generating point-in-time correct feature sets so data scientists can focus on feature engineering rather than debugging error-prone dataset joining logic. This ensures that future feature values do not leak to models during training.
  • Decouple ML from data infrastructure by providing a single data access layer that abstracts feature storage from feature retrieval, ensuring models remain portable as you move from training models to serving models, from batch models to realtime models, and from one data infra system to another.
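
The point-in-time idea in the second bullet can be illustrated with a minimal pure-Python sketch. This is not Feast's API: the `feature_log` layout, the `point_in_time_value` helper, and the integer timestamps are all hypothetical, chosen only to show that each training row may see the latest feature value recorded at or before its own event time, and never a later one.

```python
from bisect import bisect_right

# Hypothetical feature log: (timestamp, value) pairs per entity, sorted by time.
feature_log = {
    "driver_1": [(1, 0.50), (5, 0.70), (9, 0.90)],
}

def point_in_time_value(entity_id, event_ts, log=feature_log):
    """Return the latest feature value recorded at or before event_ts."""
    rows = log.get(entity_id, [])
    timestamps = [ts for ts, _ in rows]
    idx = bisect_right(timestamps, event_ts)  # rows[:idx] all have ts <= event_ts
    if idx == 0:
        return None  # no feature value was known yet at event time
    return rows[idx - 1][1]

# Hypothetical training events: (entity, event timestamp, label).
events = [("driver_1", 4, 1), ("driver_1", 6, 0), ("driver_1", 0, 1)]
training_rows = [
    (entity, ts, point_in_time_value(entity, ts), label)
    for entity, ts, label in events
]
```

A naive join on entity id alone would attach the value recorded at time 9 to every row, which is exactly the leakage the bullet warns about.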

Good vs Bad features

  • Good Feature
    • Be related to objective
    • Be known at prediction-time
    • Be numeric with meaningful magnitude
      • Numeric features
      • Able to do mathematical operations
    • Have enough examples
    • Bring human insight to problem

  • Sparse columns
  • If you don't know the full list of keys in advance, create a vocabulary from the training data (building it is a preprocessing step)

  • The vocabulary and its mapping to indices need to be identical at prediction time.
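
A minimal sketch of this, assuming a categorical column of string keys: the vocabulary is built once from training data, and the same dictionary is reused unchanged at prediction time. The helper names and the example keys are hypothetical, and the extra trailing slot for unseen keys is one common convention rather than the only option.

```python
def build_vocabulary(values):
    """Assign each distinct key a stable integer id, in sorted order."""
    return {key: idx for idx, key in enumerate(sorted(set(values)))}

def one_hot(value, vocabulary):
    """One-hot encode a key, with one extra trailing slot for unseen keys."""
    vector = [0] * (len(vocabulary) + 1)
    vector[vocabulary.get(value, len(vocabulary))] = 1
    return vector

# Built once at training time (hypothetical employee ids) ...
train_employee_ids = ["8345", "72365", "87654", "8345"]
vocab = build_vocabulary(train_employee_ids)

# ... and reused as-is at prediction time, so indices never shift.
seen = one_hot("8345", vocab)
unseen = one_hot("99999", vocab)
```

Sorting before numbering makes the mapping independent of the order in which keys happened to appear in the training data.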

PS - Take care of cases where user doesn't provide a value, i.e. missing values.
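
One common way to handle this, sketched under the assumption of a numeric column such as a user rating: emit the value together with a separate "is present" indicator, so that a missing value is not confused with a genuine 0. The `encode_rating` name and the default of 0.0 are illustrative choices.

```python
def encode_rating(rating, default=0.0):
    """Return [value, is_present]; a missing rating gets the default and flag 0."""
    if rating is None:
        return [default, 0]
    return [float(rating), 1]

provided = encode_rating(4)
missing = encode_rating(None)
```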

ML - with a lot of data, keep outliers and build a model for them

Statistics - with all the data you'll ever get, throw away outliers

Preprocessing and Feature Creation

  • Apache Beam
  • BigQuery
  • TensorFlow

Apache Beam and Cloud Dataflow

Preprocessing with Cloud Dataprep

  • Ingesting, Transforming and Analyzing Taxi Data

Feature Crosses

A way to bring non-linear inputs to a linear learner
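
A small illustration, assuming the classic XOR-style layout of four points: no single line in (x1, x2) separates the two classes, but a linear rule on the crossed feature x3 = x1 * x2 does. The points, labels, and the weight of -1 are chosen by hand for the sketch, not learned.

```python
# (x1, x2, label): the label is 1 exactly when the signs of x1 and x2 differ.
points = [(-1, -1, 0), (-1, 1, 1), (1, -1, 1), (1, 1, 0)]

def predict_with_cross(x1, x2):
    """Linear rule on one crossed feature: weight -1 on x3, threshold 0."""
    x3 = x1 * x2
    return 1 if -1 * x3 > 0 else 0

predictions = [predict_with_cross(x1, x2) for x1, x2, _ in points]
labels = [label for _, _, label in points]
```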

A feature cross memorizes the input space
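
To make "memorizes" concrete, here is a hedged sketch that assumes two already-discretized inputs, hour of day and day of week. Crossing them gives every (hour, day) cell its own id, and a linear model over the one-hot crossed feature has one weight per cell. As a stand-in for training such a model, the sketch sets each cell's weight to the mean label seen in that cell. All names and data are hypothetical.

```python
from collections import defaultdict

N_DAYS = 7  # number of day-of-week buckets

def cross(hour, day):
    """Combine two bucket ids into one id, unique per (hour, day) cell."""
    return hour * N_DAYS + day

def fit_cell_weights(examples):
    """Per-cell mean label: what a one-weight-per-cell model would memorize."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for hour, day, label in examples:
        cell = cross(hour, day)
        sums[cell] += label
        counts[cell] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}

# Hypothetical (hour, day, trip_minutes) examples.
examples = [(17, 4, 30.0), (17, 4, 34.0), (3, 2, 8.0)]
weights = fit_cell_weights(examples)
```

A cell with no training examples gets no weight at all, which is why memorization of this kind depends on having enough data in every cell.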

Beware - feature crosses are a temptation for a model to overfit

Implementing Feature Crosses

By Feature Crossing the two grids.

Embeddings allow the model to generalize across grid cells: for example, all the grid cells on the ocean front should have similar values.

Feature Creation in TensorFlow

Data Type - Python Dictionary

Ex - Distance between a house and a metro station (public transport) is a key feature for house prices
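
A sketch of creating such a feature, assuming the features arrive as a Python dictionary of raw columns. The key names (`house_lat`, `metro_lat`, and so on) are hypothetical, and the haversine great-circle distance is only one reasonable choice of distance.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def add_engineered_features(features):
    """Return a copy of the feature dict with the distance feature added."""
    out = dict(features)
    out["metro_distance_km"] = haversine_km(
        features["house_lat"], features["house_lon"],
        features["metro_lat"], features["metro_lon"],
    )
    return out

raw = {"house_lat": 0.0, "house_lon": 0.0, "metro_lat": 0.0, "metro_lon": 1.0}
engineered = add_engineered_features(raw)
```

Returning a new dictionary rather than mutating the input keeps the same function safe to call from both training and serving code.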

Feature engineering can be done in 3 places

  • Training
  • Evaluation
  • Serving

Using Dataflow

tf.transform allows users to define preprocessing pipelines and run them using large-scale data processing frameworks, while also exporting the pipeline in a way that can be run as part of a TensorFlow graph.
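
The underlying pattern can be sketched in plain Python without the tf.transform API: an analysis step makes a full pass over the training data to compute statistics, and a transform step applies those frozen statistics to one row at a time, identically in training and serving. The `analyze` and `transform` helpers below are hypothetical stand-ins, not functions from the library, and min-max scaling is just one example of a transformation that needs a full pass.

```python
def analyze(rows, column):
    """Full pass over training data: compute the statistics the transform needs."""
    values = [row[column] for row in rows]
    return {"min": min(values), "max": max(values)}

def transform(row, column, stats):
    """Per-row step using frozen statistics; safe to reuse at serving time."""
    span = stats["max"] - stats["min"]
    scaled = (row[column] - stats["min"]) / span if span else 0.0
    return {**row, column + "_scaled": scaled}

# Hypothetical training rows.
train_rows = [{"fare": 5.0}, {"fare": 25.0}, {"fare": 15.0}]
stats = analyze(train_rows, "fare")

# The same stats are applied to a row that arrives at serving time.
served = transform({"fare": 10.0}, "fare", stats)
```

Recomputing the minimum and maximum at serving time would give a different scaling, so the statistics have to travel with the model.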

  • A feature cross is only useful with a large dataset: because it works by memorization, each crossed bucket must have enough examples.
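
One way to check this before training, as a rough sketch: count the examples that land in each crossed bucket and list the buckets that fall below a chosen threshold. The `sparse_cells` helper and the threshold value are hypothetical, and buckets that receive no examples at all do not appear in the counts.

```python
from collections import Counter

def sparse_cells(crossed_ids, min_examples):
    """Return the crossed bucket ids that have fewer than min_examples rows."""
    counts = Counter(crossed_ids)
    return sorted(cell for cell, n in counts.items() if n < min_examples)

# Hypothetical crossed bucket ids, one per training example.
crossed_ids = [1, 1, 1, 2, 3, 3]
too_sparse = sparse_cells(crossed_ids, min_examples=2)
```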

TensorFlow Transform

Analysis Phase

Transform Phase

Summary

  • Convert raw data into features
  • Preprocess data in such a way that the preprocessing is also done during serving
  • Choose among the various feature columns in TensorFlow
  • Memorize large datasets using feature crosses and simple models
  • Simplify preprocessing pipelines using TensorFlow Transform

Feature Engineering A-Z | Preface