Evaluation
- Generalization and overfitting
- Overfitting in decision trees
- Validation set
- Metrics to evaluate model performance
- Machine learning tool
- Classification in Spark
Generalization and Overfitting
Errors in Classification
- Recall that a machine learning model maps the input it receives to an output. For a classification model, the output is the predicted class label for the input variables, and the true class label is the target.
- If the classifier predicts the correct class label for a sample, that is a success. If the predicted class label is different from the true class label, that is an error.
- The error rate is the percentage of errors made over the entire data set. That is, it is the number of errors divided by the total number of samples in the data set.
- Error rate is also known as misclassification rate, or simply error.
- The model is built using training data and evaluated on test data. The training and test data are two different data sets. The goal in building a machine learning model is to have the model perform well on the training data, as well as on the test data.
- The error on the training data is referred to as training error, and the error on the test data is referred to as test error. The test error is an indication of how well the classifier will perform on new data.
Generalization
- Generalization refers to how well your model performs on new data, that is, data not used to train the model
- You want your model to generalize well to new data. If your model generalizes well, then it will perform well on data sets that are similar in structure to the training data but do not contain exactly the same samples as the training set
- Since the test error indicates how well your model generalizes to new data, the test error is also called generalization error
Overfitting
- A related concept to generalization is overfitting. If your model has very low training error but high generalization error, then it is overfitting
- This means that the model has learned to model the noise in the training data, instead of learning the underlying structure of the data
Connection between overfitting and generalization
- A model that overfits will not generalize well to new data
- So the model will do well on just the data it was trained on, but given a new data set, it will perform poorly
- A classifier that performs well on just the training data set will not be very useful. So it is essential that the goal of good generalization performance is kept in mind when building a model
Overfitting and Underfitting
- Overfitting occurs when the model is fitting to the noise in the training data. This results in low training error and high test error
- Underfitting on the other hand, occurs when the model has not learned the structure of the data. This results in high training error and high test error
- Both are undesirable, since both mean that the model will not generalize well to new data
- Overfitting generally occurs when a model is too complex, that is, it has too many parameters relative to the number of training samples. So to avoid overfitting, the model needs to be kept as simple as possible while still capturing the input/output mapping for the given data set
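The train/test error pattern described above can be sketched with a toy curve-fitting example (a minimal numpy sketch with made-up data, not a real modeling workflow): a model with one parameter per training sample drives the training error toward zero by fitting the noise, while a simple model captures the underlying structure.

```python
import numpy as np

rng = np.random.default_rng(42)

# Underlying structure is y = x; the noise term is what an overfit model memorizes
x_train = np.linspace(0.0, 1.0, 10)
y_train = x_train + rng.normal(0.0, 0.1, size=10)
x_test = np.linspace(0.03, 0.97, 10)
y_test = x_test + rng.normal(0.0, 0.1, size=10)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

simple = np.polyfit(x_train, y_train, deg=1)       # matches the true structure
complex_fit = np.polyfit(x_train, y_train, deg=9)  # one parameter per sample

# The complex model has (near) zero training error because it fits the noise...
print(mse(simple, x_train, y_train), mse(complex_fit, x_train, y_train))
# ...but a much larger error on the held-out test data
print(mse(simple, x_test, y_test), mse(complex_fit, x_test, y_test))
```

Low training error with high test error is the overfitting signature; high error on both would indicate underfitting (for example, fitting a constant here).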
What causes overfitting
- In summary, overfitting occurs when your model has learned the noise in the training data instead of the underlying structure of the data. You want to avoid overfitting so that your model will generalize well to new data
Overfitting in Decision Trees
- In building a decision tree, also referred to as tree induction, the tree repeatedly splits the data in a node in order to get successively purer subsets of data
- Note that a decision tree classifier can potentially expand its nodes until it can perfectly classify samples in the training data
- But if the tree grows its nodes to fit the noise in the training data, then it will not classify new samples well
- This is because the tree has partitioned the input space according to the noise in the data instead of the true structure of the data. In other words, it has overfit
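The bullets above can be illustrated with a minimal hand-rolled 1-D decision tree (a toy sketch, not how a real library implements tree induction): with no depth limit the tree keeps splitting until every leaf is pure, so it classifies the training data perfectly, noisy labels included, while a depth limit forces it to keep only the dominant structure. The data, labels, and `max_depth` limit are all made up for the example.

```python
def build_tree(points, depth=0, max_depth=None):
    """points: list of (x, label) with distinct x values. Returns a nested tuple tree."""
    labels = [lab for _, lab in points]
    majority = max(set(labels), key=labels.count)
    # Stop if the node is pure or the depth limit is reached
    if len(set(labels)) == 1 or (max_depth is not None and depth >= max_depth):
        return ("leaf", majority)
    # Candidate thresholds: midpoints between consecutive distinct x values
    xs = sorted({x for x, _ in points})
    best_t, best_err = None, None
    for a, b in zip(xs, xs[1:]):
        t = (a + b) / 2
        left = [lab for x, lab in points if x <= t]
        right = [lab for x, lab in points if x > t]
        # Misclassifications if each side predicts its majority label
        err = (len(left) - left.count(max(set(left), key=left.count))
               + len(right) - right.count(max(set(right), key=right.count)))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    if best_t is None:  # all x values identical: cannot split further
        return ("leaf", majority)
    left_pts = [p for p in points if p[0] <= best_t]
    right_pts = [p for p in points if p[0] > best_t]
    return ("split", best_t,
            build_tree(left_pts, depth + 1, max_depth),
            build_tree(right_pts, depth + 1, max_depth))

def predict(tree, x):
    while tree[0] == "split":
        tree = tree[2] if x <= tree[1] else tree[3]
    return tree[1]

# True structure: label "a" for x < 5, "b" otherwise, plus one noisy label
data = [(float(i), "a" if i < 5 else "b") for i in range(10)]
data[2] = (2.0, "b")  # the noisy training sample

full = build_tree(data)                  # no depth limit: memorizes the noise
pruned = build_tree(data, max_depth=1)   # one split: keeps only the structure

train_acc_full = sum(predict(full, x) == lab for x, lab in data) / len(data)
train_acc_pruned = sum(predict(pruned, x) == lab for x, lab in data) / len(data)
print(train_acc_full, train_acc_pruned)  # 1.0 0.9
```

The unlimited tree reaches 100% training accuracy only by carving out a region around the noisy point, so a new sample falling in that region would be misclassified; the depth-limited tree accepts one training error in exchange for a partition that matches the true structure.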