Boosting Model Performance Through Error Analysis

By Mark Sabini

Introduction

Error analysis is the process of examining image examples that your algorithm misclassified. This allows you to understand the underlying causes of the errors. With this analysis you can prioritize the best steps to fix your model and improve performance.

After training your machine learning model, you will calculate metrics on the development set. At this point, two things could happen. In the ideal case, your model does well on the development set and achieves your metric goals. Congratulations! However, more often than not, there will be a performance gap, and you’ll need to improve your model. You can choose from a wide variety of tools and techniques, including data augmentation, hyperparameter optimization, or new architectures. The fastest way to close the gap is to apply changes that are small, isolated, and aimed at your largest errors. Rather than making big, sweeping changes, choose techniques that are quick to test and that address the specific type of error contributing the most to your overall error.

We will use a simple form of error analysis to do this:

  1. Examine approximately 100 images from the development set where the model made a mistake.
  2. Place each image into one or more buckets that represent different potential sources of error. For example, the buckets for an automated visual inspection task might include: “the image was out of focus,” “the defect was small,” and “the lighting changed.”
  3. Select the one or two buckets containing the largest number of images.
  4. Propose a change to directly address that error.
  5. Retrain the model.
  6. Compare the new model’s performance to that of the current model.
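
Steps 2 and 3 above can be sketched as a simple tally. This is a minimal illustration, assuming the review notes are kept as a mapping from image name to hand-assigned buckets; the file names, bucket names, and the `rank_buckets` helper are hypothetical and not part of the original workflow.

```python
from collections import Counter

# Hypothetical review notes: each misclassified development-set image is
# tagged by hand with one or more error buckets (names are illustrative).
reviewed_errors = {
    "img_001.png": ["out of focus"],
    "img_002.png": ["small defect", "lighting changed"],
    "img_003.png": ["small defect"],
    "img_004.png": ["out of focus", "small defect"],
}


def rank_buckets(errors_by_image):
    """Count how many misclassified images fall into each bucket, largest first."""
    counts = Counter()
    for buckets in errors_by_image.values():
        # set() guards against the same bucket being listed twice for one image.
        counts.update(set(buckets))
    return counts.most_common()


for bucket, count in rank_buckets(reviewed_errors):
    share = count / len(reviewed_errors)
    print(f"{bucket}: {count} images ({share:.0%} of reviewed errors)")
```

Because an image can sit in more than one bucket, the shares can add up to more than 100%. The one or two buckets at the top of the list are the candidates for the next experiment.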

Error analysis improves model performance by showing you where to intervene. You use the results of the analysis to apply targeted corrections and retrain the model. The result is a model that generalizes better to new images.

Analyze Model Errors on Development Set

When building a model, we divide our data into three datasets:

  • Training set: the data you run your learning algorithm on.
  • Development set (also known as the validation set): the data you use to tune hyperparameters, select features, and make other decisions regarding the learning algorithm.
  • Test set: the data you use to evaluate the real-world performance of the algorithm, but not to make any decisions regarding what learning algorithm or hyperparameters to use.

During error analysis we use the first two types, namely the training and development sets. The test set is kept independent of both. We must not use test images for training or for making modeling decisions. Doing so would risk overfitting to the test set, so its metrics would no longer reflect how the model performs on new data. The goal is a solution that generalizes, developed using only the training and development data.

Improve False Negative Cases

Our model is used to detect manufacturing defects. In this case, a positive result means the model identifies a part as defective, whereas a negative result means it identifies a part as not defective. False negatives are cases where the model predicts that a part is not defective when it in fact is. False negatives are the worst possible outcome because they mean sending a defective part to the customer.

To reduce false negatives, we need a model with high recall, that is, a model that correctly identifies a high percentage of the actual defects.
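
As a quick illustration, recall is the number of true positives divided by the sum of true positives and false negatives. The sketch below computes it for a binary defect/no-defect case; the `recall` helper, the label strings, and the toy data are assumptions made for the example, not values from the project.

```python
def recall(true_labels, predicted_labels, positive="defect"):
    """Recall = true positives / (true positives + false negatives)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("no positive examples; recall is undefined")
    return tp / (tp + fn)


# Illustrative labels only, not data from the project.
truth = ["defect", "defect", "ok", "defect", "ok", "defect"]
preds = ["defect", "ok", "ok", "defect", "defect", "defect"]

# Three of the four real defects were caught, one was missed.
print(recall(truth, preds))  # prints 0.75
```

Note that the false positive in the fifth position does not affect recall; it only counts defects that were missed.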

We’ll examine two scenarios where we used error analysis to quickly increase performance. We were training an object-detection model to detect fin defects and tube defects. Each part yielded several images, which we used for training and evaluation. Fin defects were relatively common. Tube defects were very rare, representing less than 1%.

Our baseline model achieved only 74% recall on tube defects. Error analysis on the development set showed that the model missed the severe tube defects. Since tube defects were rare, we could manually inspect their distribution across the datasets.

Surprisingly, we found that there were no severe tube defects in the training set. All of them were spread across the development and test sets. This happened because of how we created the training set.
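
A check like this can also be scripted. The following is a minimal sketch, assuming each split is a list of (image name, label) pairs; the split names, label names, and both helper functions are hypothetical.

```python
from collections import Counter


def label_counts_by_split(splits):
    """Return {split_name: Counter(label -> count)} for every split."""
    return {
        name: Counter(label for _, label in examples)
        for name, examples in splits.items()
    }


def missing_from_training(splits, train_name="train"):
    """List labels that appear in another split but never in the training split."""
    counts = label_counts_by_split(splits)
    train_labels = set(counts[train_name])
    other_labels = set()
    for name, label_counts in counts.items():
        if name != train_name:
            other_labels |= set(label_counts)
    return sorted(other_labels - train_labels)


# Toy data that mimics the situation described above.
splits = {
    "train": [("a.png", "fin"), ("b.png", "tube_minor"), ("c.png", "fin")],
    "dev": [("d.png", "tube_severe"), ("e.png", "fin")],
    "test": [("f.png", "tube_severe"), ("g.png", "tube_minor")],
}

print(missing_from_training(splits))  # prints ['tube_severe']
```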

Typically, we will split our image data randomly between the training, development, and test sets in order to get similar distributions. However, since severe tube defects were very rare, there was a high chance of getting unlucky due to random selection. This caused the training set to have no severe tube defects.
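
One common way to reduce this risk is to split each class separately, often called a stratified split. The article does not say this was done here; the sketch below is only an illustration of the idea, and the function name, split fractions, and seed are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(examples, fractions=(0.7, 0.15, 0.15), seed=0):
    """Split (image_id, label) pairs into train/dev/test, one label at a time."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)

    train, dev, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n = len(group)
        # Guarantee that every label has at least one training example.
        n_train = max(1, round(n * fractions[0]))
        n_dev = min(round(n * fractions[1]), n - n_train)
        train.extend(group[:n_train])
        dev.extend(group[n_train:n_train + n_dev])
        test.extend(group[n_train + n_dev:])
    return train, dev, test


# Toy dataset: a common class and a very rare one.
examples = [(f"fin_{i}.png", "fin") for i in range(100)]
examples += [(f"severe_{i}.png", "tube_severe") for i in range(3)]

train, dev, test = stratified_split(examples)
print(sum(label == "tube_severe" for _, label in train), "severe examples in train")
```

With only a handful of examples in a class, some splits may still receive none of them, so this sketch only guarantees coverage in the training set.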

Ideally we would collect more examples of severe tube defects to distribute into the training set. This wasn’t feasible due to their rarity.

If we were publishing a research paper, the datasets would typically be fixed at the beginning of the project. We’d do all our development with that fixed dataset. Then, we’d report final results using the test set. This constraint comes from the need to compare our methods with those of other researchers using the same datasets. In this case, however, we were not working on a research project.

We are not constrained to fixed datasets while working on a customer’s model. A customer dataset is always growing. The end goal is to deliver the best possible working solution, so it’s acceptable to move images between the training and development sets. What matters is leaving the test set untouched, so that its metrics stay trustworthy, and achieving the best possible performance in production.

Our simple solution here was to move some images of severe tube defects from the development set to the training set. Upon retraining, tube defect recall increased to 91%.
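
Moving examples between the two sets can be sketched as follows. This is a minimal illustration, assuming each set is a list of unique (image name, label) pairs; the `move_examples` helper, the 50% fraction, and the label names are assumptions, since the article only says that some images were moved.

```python
import random


def move_examples(dev, train, label, fraction=0.5, seed=0):
    """Move a fraction of the dev examples with `label` into train.

    Returns new (dev, train) lists and leaves the inputs unchanged.
    The test set is deliberately not involved.
    """
    rng = random.Random(seed)
    matching = [example for example in dev if example[1] == label]
    n_move = max(1, int(len(matching) * fraction)) if matching else 0
    to_move = rng.sample(matching, n_move)
    moved = set(to_move)
    new_dev = [example for example in dev if example not in moved]
    new_train = list(train) + to_move
    return new_dev, new_train


# Toy data: all severe tube defects start in the development set.
dev_set = [
    ("d1.png", "tube_severe"), ("d2.png", "tube_severe"), ("d3.png", "fin"),
    ("d4.png", "tube_severe"), ("d5.png", "tube_severe"),
]
train_set = [("t1.png", "fin")]

dev_set, train_set = move_examples(dev_set, train_set, "tube_severe")
print(len(dev_set), "dev examples,", len(train_set), "train examples")
```

Keeping some of the rare examples in the development set matters, because otherwise there is nothing left to measure recall on after retraining.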

Apply Data Augmentation

After retraining our model, we re-examined it for errors. Analysis now showed it missed tube defects primarily near the edges of the image. This could be easily solved by applying data augmentation. We randomly cropped existing images to simulate more cases where defects end up on the edges of images.

We retrained the model with the random crop applied at probabilities of 0.25, 0.5, and 0.75. After data augmentation, the best model achieved a tube defect recall of 100%.
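
A probabilistic random crop can be sketched in a few lines. This is a minimal illustration using a nested list as a stand-in for an image; it assumes the probability is the per-image chance of applying the crop, which the article does not spell out. The function names and crop size are hypothetical, and the bounding-box adjustment an object-detection pipeline also needs is omitted.

```python
import random


def random_crop(image, crop_h, crop_w, rng):
    """Return a crop_h x crop_w window at a random position (image = list of rows)."""
    height, width = len(image), len(image[0])
    if crop_h > height or crop_w > width:
        raise ValueError("crop is larger than the image")
    top = rng.randint(0, height - crop_h)
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def maybe_crop(image, crop_h, crop_w, p, rng):
    """Apply random_crop with probability p; otherwise return the image unchanged."""
    if rng.random() < p:
        return random_crop(image, crop_h, crop_w, rng)
    return image


rng = random.Random(0)
image = [[row * 8 + col for col in range(8)] for row in range(6)]  # toy 6x8 "image"

for p in (0.25, 0.5, 0.75):
    cropped = sum(len(maybe_crop(image, 4, 4, p, rng)) == 4 for _ in range(1000))
    print(f"p={p}: {cropped} of 1000 samples were cropped")
```

Cropping moves the image boundary, so a defect that was near the center of the original can end up near the edge of the crop, which is the case the model was missing.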

Costly Changes to the Training Data

We considered more elaborate options, such as switching to a segmentation model, which would provide more information to the model when learning the defects. We even considered increasing the size of our dataset by generating synthetic defects. Such changes would have incurred significant labeling and engineering costs, and none of them would have directly addressed the model’s errors.

Instead, we made two small and isolated changes. First, after finding that our training set did not include the severe defects we needed to detect, we moved some of them over from the development set. Second, we augmented the training set by randomly cropping existing images. Opting for a simpler, and often more practical, method can quickly improve the model’s performance.

Small and Isolated Experiments

We’ve compiled a short table of suggestions for resolving errors in visual inspection. Techniques like these should be the first thing to try before attempting something more advanced or involved. While they might not completely close performance gaps, you should find you quickly narrow the possible choices.

Error Bucket | Small and Isolated Experiment

Images are blurry
  • Add Random Blur data augmentation.

Images are brighter or darker
  • Add Random Brightness data augmentation.
  • Normalize image brightness.

Defects are in corners or edges of the image
  • Add Random Crop data augmentation.
  • Add Random Shift data augmentation.

High false positive rate for defect X
  • Increase the confidence threshold for defect X.

Rare defect being missed
  • Add more examples of defects to the training set.
  • Oversample rare defects during training.
  • Add class weights to loss.
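
Two of the experiments above are small enough to sketch directly: a per-class confidence threshold and class weights for the loss. This is a minimal illustration under stated assumptions: the detection dictionaries, the threshold values, and both helper names are hypothetical, and inverse-frequency weighting is just one common heuristic for choosing class weights.

```python
from collections import Counter


def filter_detections(detections, thresholds, default_threshold=0.5):
    """Keep detections whose score meets the threshold for their own class."""
    return [
        d for d in detections
        if d["score"] >= thresholds.get(d["label"], default_threshold)
    ]


def inverse_frequency_weights(labels):
    """Weight each class by total / (num_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / (len(counts) * n) for label, n in counts.items()}


# Raise the threshold only for the class with too many false positives.
detections = [
    {"label": "fin", "score": 0.55},
    {"label": "tube", "score": 0.55},
    {"label": "fin", "score": 0.90},
]
kept = filter_detections(detections, thresholds={"fin": 0.8})
print([(d["label"], d["score"]) for d in kept])

# Rare classes receive larger weights, so their errors cost more in the loss.
training_labels = ["fin"] * 90 + ["tube"] * 10
print(inverse_frequency_weights(training_labels))
```

Raising a threshold trades recall for precision on that class, so it is worth re-checking recall on the development set after the change.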

We recommend that you create and share your own Small and Isolated Experiments table. Make it specific to your use case, and update it as you iterate on your models.

Conclusion

To recap, error analysis will allow you to easily surface a model’s most common error types. As a result, you will be able to develop a series of small and isolated experiments to quickly and efficiently close gaps in performance.

Key Takeaways

  • When evaluating a model, group mistakes into buckets to find the largest sources of error.
  • Close performance gaps quickly with small and isolated changes to directly address the largest sources of error.