Statistics

Statistics in Data Preparation

Statistical methods are required when preparing the training and test data for your machine learning model.

This includes techniques for:

  • Outlier detection.
  • Missing value imputation.
  • Data sampling.
  • Data scaling.
  • Variable encoding.

A basic understanding of data distributions, descriptive statistics, and data visualization helps you choose appropriate methods for each of these tasks.
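As a minimal sketch of three of these tasks, the example below imputes a missing value with the median, flags outliers with the interquartile range (IQR) rule, and standardizes what remains. The data values are made up for illustration, and the median, the 1.5 × IQR cutoff, and standardization are common default choices rather than the only options. Sampling and encoding are not shown.

```python
import statistics

# Hypothetical feature column with one missing value (None) and one extreme value.
raw = [12.0, 15.0, 14.0, None, 13.0, 95.0, 16.0, 14.0]

# Missing value imputation: replace None with the median of the observed values.
observed = [x for x in raw if x is not None]
median = statistics.median(observed)
imputed = [median if x is None else x for x in raw]

# Outlier detection: flag values more than 1.5 * IQR outside the quartiles.
q1, _, q3 = statistics.quantiles(imputed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in imputed if x < low or x > high]
kept = [x for x in imputed if low <= x <= high]

# Data scaling: standardize to zero mean and unit standard deviation.
mean = statistics.fmean(kept)
stdev = statistics.pstdev(kept)
scaled = [(x - mean) / stdev for x in kept]

print("Outliers:", outliers)
print("Scaled:", [round(x, 2) for x in scaled])
```

In a real project, the median, quartiles, mean, and standard deviation should be computed on the training data only and then applied to the test data, so that no information leaks from the test set.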

Statistics in Model Evaluation

Statistical methods are required when evaluating the skill of a machine learning model on data not seen during training.

This includes techniques for:

  • Data sampling.
  • Data resampling.
  • Experimental design.

Resampling techniques such as k-fold cross-validation are often well understood by machine learning practitioners, but the rationale for using them often is not.

k-Fold Cross-Validation

Cross-validation is a statistical method used to estimate the skill of machine learning models.

It is commonly used in applied machine learning to compare and select a model for a given predictive modeling problem. It is easy to understand and easy to implement, and it generally produces skill estimates with lower bias than other methods, such as a single train/test split.

https://machinelearningmastery.com/k-fold-cross-validation
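The procedure can be sketched in a few lines of plain Python. This is an illustrative sketch, not a reference implementation: the function names `k_fold_indices` and `cross_validate` are hypothetical, the data is synthetic, and the "model" is a stand-in that predicts the mean of the training targets. The point is the structure: each fold is held out once, the model is fit on the remaining folds, and the per-fold scores are summarized.

```python
import random
import statistics

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(ys, k=5, seed=0):
    """Return one mean absolute error per fold for a mean-predictor baseline."""
    folds = k_fold_indices(len(ys), k, seed)
    scores = []
    for held_out in folds:
        held_out_set = set(held_out)
        train_y = [ys[i] for i in range(len(ys)) if i not in held_out_set]
        # "Fit" the stand-in model on the training folds only.
        prediction = statistics.fmean(train_y)
        # Score it on the held-out fold.
        errors = [abs(ys[i] - prediction) for i in held_out]
        scores.append(statistics.fmean(errors))
    return scores

# Synthetic target values, for illustration only.
rng = random.Random(42)
ys = [rng.gauss(10.0, 2.0) for _ in range(50)]

folds = k_fold_indices(len(ys), k=5)
scores = cross_validate(ys, k=5)
print("MAE per fold:", [round(s, 3) for s in scores])
print(f"Mean MAE: {statistics.fmean(scores):.3f}, std: {statistics.stdev(scores):.3f}")
```

Reporting the spread of the fold scores alongside the mean gives a rough sense of how much the estimate depends on the particular split.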

Statistics in Model Selection

Statistical methods are required when selecting a final model or model configuration to use for a predictive modeling problem.

These include techniques for:

  • Checking for a significant difference between results.
  • Quantifying the size of the difference between results.

This might include the use of statistical hypothesis tests.
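As one possible illustration, the sketch below compares two models that were scored on the same folds. It uses a sign-flip permutation test on the paired differences to check for a significant difference, and reports the mean difference and a standardized effect size to quantify it. The scores are made up, the function name `paired_permutation_test` is hypothetical, and the permutation test is only one of several tests you might choose.

```python
import random
import statistics

# Hypothetical per-fold accuracy of two models evaluated on the same 10 folds.
scores_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.81]
scores_b = [0.78, 0.77, 0.80, 0.79, 0.80, 0.76, 0.79, 0.81, 0.78, 0.80]

def paired_permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided sign-flip permutation test on the mean paired difference."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(statistics.fmean(diffs))
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis, the sign of each difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(flipped)) >= observed:
            extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)

diffs = [x - y for x, y in zip(scores_a, scores_b)]
p_value = paired_permutation_test(scores_a, scores_b)

# Size of the difference: raw mean difference and a standardized version
# (mean difference divided by the standard deviation of the differences).
mean_diff = statistics.fmean(diffs)
effect_size = mean_diff / statistics.stdev(diffs)

print(f"p-value: {p_value:.4f}")
print(f"Mean difference: {mean_diff:.3f}, standardized effect size: {effect_size:.2f}")
```

Scores from cross-validation folds are not fully independent because the training sets overlap, so a p-value computed this way should be read as a rough guide rather than an exact error rate.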

Statistics in Model Presentation

Statistical methods are required when presenting the skill of a final model to stakeholders.

This includes techniques for:

  • Summarizing the expected skill of the model on average.
  • Quantifying the expected variability of the skill of the model in practice.

This might include estimation statistics such as confidence intervals.
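A bootstrap confidence interval is one way to do both at once. The sketch below resamples a set of skill scores with replacement and takes percentiles of the resampled means. The scores are hypothetical, the function name `bootstrap_ci` is made up for this example, and the percentile bootstrap is only one of several interval methods.

```python
import random
import statistics

# Hypothetical accuracy scores from repeated evaluation of a final model.
scores = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.81,
          0.77, 0.83, 0.82, 0.79, 0.84, 0.80, 0.81, 0.86, 0.78, 0.82]

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[round((alpha / 2) * n_resamples)]
    upper = means[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

mean_score = statistics.fmean(scores)
lower, upper = bootstrap_ci(scores)
print(f"Mean accuracy: {mean_score:.3f}")
print(f"95% confidence interval: [{lower:.3f}, {upper:.3f}]")
```

Presenting the interval alongside the mean tells stakeholders how much the average skill estimate might move if the evaluation were repeated.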

Statistics in Prediction

Statistical methods are required when making a prediction with a finalized model on new data.

This includes techniques for:

  • Quantifying the expected variability for the prediction.

This might include estimation statistics such as prediction intervals.
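For a simple linear regression, a rough prediction interval can be built from the spread of the residuals. The sketch below is an approximation under stated assumptions: the data is synthetic, the residuals are assumed to be Gaussian with constant variance, and the interval uses a normal quantile and ignores uncertainty in the fitted coefficients, so it will be slightly too narrow for small samples. The helper names `fit_line` and `predict_with_interval` are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

# Synthetic data: a linear relationship plus Gaussian noise.
rng = random.Random(1)
xs = [float(i) for i in range(30)]
ys = [2.0 * x + 5.0 + rng.gauss(0.0, 3.0) for x in xs]

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

slope, intercept = fit_line(xs, ys)

# Residual standard deviation with n - 2 degrees of freedom (two fitted parameters).
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
s = math.sqrt(sum(r ** 2 for r in residuals) / (len(xs) - 2))

def predict_with_interval(x_new, confidence=0.95):
    """Return (lower, prediction, upper) using a normal approximation."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    y_hat = slope * x_new + intercept
    return y_hat - z * s, y_hat, y_hat + z * s

lower, y_hat, upper = predict_with_interval(35.0)
print(f"Prediction at x=35: {y_hat:.2f}, 95% interval: [{lower:.2f}, {upper:.2f}]")

# Fraction of the training points that fall inside their own interval.
inside = sum(
    1 for x, y in zip(xs, ys)
    if predict_with_interval(x)[0] <= y <= predict_with_interval(x)[2]
)
coverage = inside / len(xs)
print(f"Training coverage: {coverage:.2f}")
```

Unlike a confidence interval, which describes uncertainty in an average, a prediction interval describes the range in which a single new observation is expected to fall.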