Feature Engineering

  • Scale to large datasets
  • Find good features
    • Synthetic features
  • Preprocess with Cloud MLE
  • Hyperparameter tuning

Good vs Bad features

  • A good feature should:
    • Be related to the objective
    • Be known at prediction time
    • Be numeric with a meaningful magnitude
      • Numeric features
      • Mathematical operations on the values must make sense
    • Have enough examples
    • Bring human insight to the problem

  • Sparse columns
  • If you don't know the list of keys in advance, create a vocabulary; building it is a preprocessing step

  • The vocabulary and its mapping to indices need to be identical at prediction time.
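
As a rough illustration of the vocabulary idea, here is a minimal sketch in plain Python. The helper names (`build_vocabulary`, `one_hot`) and the employee IDs are made up for the example, and reserving an extra slot for unseen keys is an assumption, not something these notes prescribe.

```python
from typing import Dict, Iterable, List


def build_vocabulary(values: Iterable[str]) -> Dict[str, int]:
    """Map each distinct key seen in training to a stable integer index."""
    # Sorting makes the mapping deterministic, so it can be saved and reused.
    return {key: idx for idx, key in enumerate(sorted(set(values)))}


def one_hot(value: str, vocab: Dict[str, int]) -> List[int]:
    """One-hot encode a value; the last slot is reserved for unseen keys."""
    vec = [0] * (len(vocab) + 1)
    vec[vocab.get(value, len(vocab))] = 1
    return vec


# Hypothetical training column (illustrative values only).
train_employee_ids = ["e72365", "e8345", "e3299", "e8345"]
vocab = build_vocabulary(train_employee_ids)

# At prediction time the SAME vocab object must be used.
encoded_known = one_hot("e8345", vocab)
encoded_unseen = one_hot("e0001", vocab)
```

The point of the sketch is that `vocab` is built once from training data and then reused unchanged when serving; rebuilding it from serving data would shift the indices.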

Note: take care of cases where the user doesn't provide a value, i.e. missing values.
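
One common way to handle this, sketched below under assumptions, is to pair the value with a "was it provided" flag so that a real 0 is not confused with a missing value. The function name `encode_rating` and the default of 0.0 are hypothetical choices for illustration.

```python
from typing import Optional, Tuple


def encode_rating(rating: Optional[float], default: float = 0.0) -> Tuple[float, int]:
    """Return (value, was_provided) so a real 0 differs from a missing value."""
    if rating is None:
        # Missing: substitute the default and flag the value as absent.
        return (default, 0)
    return (float(rating), 1)


# Illustrative inputs: a real rating, a missing one, and a genuine zero.
ratings = [4.0, None, 0.0]
encoded = [encode_rating(r) for r in ratings]
```

Here the missing rating and the genuine zero share the value 0.0 but differ in the flag, which gives the model a way to treat them differently.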

ML: there is a lot of data, so keep the outliers and build a model for them.

Statistics: "I've got all the data I'll ever get", so throw away the outliers.

Preprocessing and Feature Creation

  • Apache Beam
  • BigQuery
  • TensorFlow

Apache Beam and Cloud Dataflow

Preprocessing with Cloud Dataprep

  • Ingesting, Transforming and Analyzing Taxi Data

Feature Crosses

A feature cross is a way to bring non-linear inputs to a linear learner.
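
To make this concrete, here is a small sketch on XOR-style data (a made-up toy dataset, not taken from the course material). It checks a coarse grid of weights for a plain linear classifier, then shows that adding the crossed feature `x1 * x2` lets a linear rule classify every point. The grid search is only illustrative.

```python
from itertools import product

# XOR-style data: the label is 1 when the two inputs have opposite signs.
points = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
labels = [1 if x1 * x2 < 0 else 0 for x1, x2 in points]


def linear_predict(weights, bias, features):
    """Classify with a linear score: 1 if w . x + b > 0, else 0."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def accuracy(weights, bias, rows):
    """Fraction of rows whose prediction matches the label."""
    hits = sum(linear_predict(weights, bias, row) == y for row, y in zip(rows, labels))
    return hits / len(labels)


# Without the cross: no weight setting on this coarse grid gets all four right.
grid = [-2, -1, 0, 1, 2]
best_raw = max(accuracy((w1, w2), b, points) for w1, w2, b in product(grid, repeat=3))

# With the cross x1 * x2 appended, one weight on the crossed feature suffices.
crossed = [(x1, x2, x1 * x2) for x1, x2 in points]
acc_crossed = accuracy((0, 0, -1), 0, crossed)
```

The model stays linear in its inputs; only the input space changed, which is the whole appeal of a feature cross.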

A feature cross memorizes the input space

Beware: feature crosses are a temptation for a model to overfit.

Implementing Feature Crosses

The grid cells are obtained by feature crossing the two grids.

Embeddings allow the model to generalize across grid cells; for example, all the grid cells on the ocean front should have similar values.
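
A minimal sketch of the idea, assuming the two grids are latitude buckets and longitude buckets (the notes do not say which grids are meant). The bucket counts, coordinate ranges, and embedding size are hypothetical, and the embedding table is randomly initialized here, whereas in a real model it would be learned during training.

```python
import random

# Hypothetical sizes chosen only for illustration.
N_LAT_BUCKETS, N_LON_BUCKETS, EMBED_DIM = 4, 4, 2


def bucketize(value, low, high, n_buckets):
    """Index of the equal-width bucket containing value, clipped to the valid range."""
    idx = int((value - low) / (high - low) * n_buckets)
    return min(max(idx, 0), n_buckets - 1)


def cross_cell(lat, lon):
    """Combine the two bucket indices into a single grid-cell id (the feature cross)."""
    lat_b = bucketize(lat, 32.0, 42.0, N_LAT_BUCKETS)
    lon_b = bucketize(lon, -125.0, -114.0, N_LON_BUCKETS)
    return lat_b * N_LON_BUCKETS + lon_b


# One small dense vector per grid cell; random here, learned in practice.
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
    for _ in range(N_LAT_BUCKETS * N_LON_BUCKETS)
]

cell = cross_cell(37.77, -122.42)
vector = embedding_table[cell]
```

After training, cells that behave alike (such as ocean-front cells) would be expected to end up with nearby vectors, which is what lets the model generalize between them.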

Feature Creation in TensorFlow

Data type: Python dictionary

Example: the distance between a house and the nearest metro station (public transport) is a key factor for house prices.
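
As a sketch of what creating such a feature on a feature dictionary could look like, the snippet below adds a distance column computed with the haversine formula. The key names (`house_lat`, `metro_lat`, and so on) and the output key `dist_to_metro_km` are assumptions made up for this example.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def add_engineered_features(features: dict) -> dict:
    """Return a copy of the feature dict with the distance feature added."""
    out = dict(features)  # do not mutate the caller's dictionary
    out["dist_to_metro_km"] = haversine_km(
        features["house_lat"], features["house_lon"],
        features["metro_lat"], features["metro_lon"],
    )
    return out


# Hypothetical raw input with illustrative coordinates.
raw = {"house_lat": 0.0, "house_lon": 0.0, "metro_lat": 1.0, "metro_lon": 0.0}
engineered = add_engineered_features(raw)
```

Keeping the logic in one function makes it easier to call the same code from training, evaluation, and serving.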

Feature engineering can be done in 3 places

  • Training
  • Evaluation
  • Serving

Using Dataflow

tf.transform allows users to define preprocessing pipelines and run them using large-scale data processing frameworks, while also exporting the pipeline in a way that can be run as part of a TensorFlow graph.
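
The two-phase pattern behind this can be sketched in plain Python. This is not the tf.transform API; the functions `analyze` and `transform` and the `fare` column are hypothetical stand-ins that only illustrate computing statistics once over the training data and then reusing them unchanged at serving time.

```python
def analyze(training_rows, column):
    """Analysis phase: one full pass over the training data to compute statistics."""
    values = [row[column] for row in training_rows]
    return {"min": min(values), "max": max(values)}


def transform(row, column, stats):
    """Transform phase: apply the saved statistics to a single row."""
    span = stats["max"] - stats["min"]
    scaled = (row[column] - stats["min"]) / span if span else 0.0
    return {**row, column + "_scaled": scaled}


# Illustrative training data.
train = [{"fare": 5.0}, {"fare": 25.0}, {"fare": 15.0}]
stats = analyze(train, "fare")

# The same stats are used for training rows and for a later serving request.
train_features = [transform(r, "fare", stats) for r in train]
serving_features = transform({"fare": 10.0}, "fare", stats)
```

Because the serving request is scaled with the training minimum and maximum, training and serving see features on the same scale.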

  • A feature cross is only useful when we have a large dataset: because it works by memorization, each bucket must contain enough samples.
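
A quick way to check this on a dataset, sketched under assumptions: count the examples that fall into each crossed bucket and list the ones below a threshold. The function name `sparse_cells`, the column names, and the threshold are all hypothetical.

```python
from collections import Counter


def sparse_cells(rows, keys, min_examples):
    """Return the crossed buckets that have fewer than min_examples rows."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return {cell: n for cell, n in counts.items() if n < min_examples}


# Illustrative rows with two categorical columns to cross.
rows = [
    {"day": "Mon", "hour": 8},
    {"day": "Mon", "hour": 8},
    {"day": "Mon", "hour": 8},
    {"day": "Sun", "hour": 3},
]
too_sparse = sparse_cells(rows, ["day", "hour"], min_examples=2)
```

Buckets that show up in the result have too little data to be memorized reliably, which suggests coarser buckets or more data.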

TensorFlow Transform

Analysis Phase

Transform Phase

Summary

  • Convert raw data into features
  • Preprocess data so that the same preprocessing is also applied during serving
  • Choose among the various feature columns in TensorFlow
  • Memorize large datasets using feature crosses and simple models
  • Simplify preprocessing pipelines using TensorFlow Transform
