Skip to main content

# Data Quality

  • What kinds of data quality problems?
  • How can we detect problems with the data?
  • What can we do about these problems?
  • Examples of data quality problems:
    • Noise and outliers
    • Missing values
    • Duplicate data

Noise

  • Noise refers to modification of original values
    • Examples: distortion of a person's voice when talking on a poor phone and "snow" on television screen

Outliers

  • Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set

Missing Values

  • Reasons for missing values
    • Information is not collected (e.g., people decline to give their age and weight)
    • Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
  • Handling missing values
    • Eliminate data objects
    • Estimate missing values
    • Ignore the missing value during analysis
    • Replace with all possible values (weighted by their probabilities)

Duplicate Data

  • Data set may include data objects that are duplicates, or almost duplicates of one another
    • Major issue when merging data from heterogenous sources
  • Examples:
    • Same person with multiple email addresses
  • Data cleaning
    • Process of deaing with duplicate data issues