- What kinds of data quality problems?
- How can we detect problems with the data?
- What can we do about these problems?
- Examples of data quality problems:
- Noise and outliers
- Missing values
- Duplicate data
Noise
- Noise refers to modification of original values
- Examples: distortion of a person's voice when talking on a poor phone and "snow" on television screen
Outliers
- Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set
Missing Values
- Reasons for missing values
- Information is not collected (e.g., people decline to give their age and weight)
- Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
- Handling missing values
- Eliminate data objects
- Estimate missing values
- Ignore the missing value during analysis
- Replace with all possible values (weighted by their probabilities)
Duplicate Data
- Data set may include data objects that are duplicates, or almost duplicates of one another
- Major issue when merging data from heterogenous sources
- Examples:
- Same person with multiple email addresses
- Data cleaning
- Process of deaing with duplicate data issues