Key Tips When Comparing Deep Learning Models

Landing AI

April 05, 2023

Deep Learning model comparison presents challenges.

Clean data provides a clear path forward

Companies across many industries rely on computer vision systems to improve productivity, enhance efficiency, and drive revenue. By augmenting a computer vision system with deep learning (DL), they can take their automation systems to new heights, thanks to the technology’s ability to deliver accurate, effective, and flexible automated inspection capabilities. Several computer vision platforms are available, and when comparing them, you must weigh several considerations up front.

In the ever-evolving world of DL, new model architectures are always being developed, offering increasingly better performance on benchmark datasets. These datasets are fixed, however, unlike the datasets of most companies deploying DL on the factory floor. As a result, it can be difficult to truly compare DL models to one another. A company that compares two models against each other must follow several guidelines to ensure a fair and transparent comparison.

Tip 1: Level the Playing Field

To ensure a fair comparison between two DL models, start by considering the data. Use exactly the same datasets and data splits, so that both models are trained and evaluated on the same images. Test datasets must also be representative of actual production data. Performing agreement-based labeling can reduce ambiguity and help ensure a clean dataset.
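One way to guarantee that both models see the same images is to compute the split once, deterministically, and reuse it for every model. The following is a minimal sketch, not a prescribed method: the function name `make_shared_split`, the file names, the 20% test fraction, and the seed are all assumptions made for illustration.

```python
import random

def make_shared_split(image_ids, test_fraction=0.2, seed=0):
    """Return (train_ids, test_ids); the same inputs always give the same split."""
    ids = sorted(image_ids)      # sort so the input order cannot change the split
    rng = random.Random(seed)    # local generator, independent of global random state
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]

# Hypothetical image names, used only for illustration.
image_ids = [f"img_{i:03d}.png" for i in range(100)]
train_ids, test_ids = make_shared_split(image_ids, test_fraction=0.2, seed=42)

# Both models under comparison would be trained on train_ids and scored on test_ids.
print(len(train_ids), len(test_ids))
```

Saving the resulting lists to a file and loading that file for each model is a simple way to keep the split identical even if the training pipelines differ.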

Eliminating duplicate data across the training and test sets ensures that the models can’t “cheat” and deliver falsely high results. Comparable dataset sizes must also be used to train all models under evaluation, because a model trained on a significantly larger dataset will typically yield better results. At the same time, the test sets must be large enough to allow a meaningful comparison, with low metric variance. For instance, if a test set has only 10 defect images, then the difference between an 80% and a 90% recall is a single image, which could easily be random chance.
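To put a number on the small-test-set problem, a confidence interval can be attached to each recall value. The sketch below uses the Wilson score interval at roughly 95% confidence; this is one common choice picked here for illustration, not a method named in the article, and the function name `wilson_interval` is an assumption.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a proportion such as recall."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width

# 10 defect images: one model finds 8 of them, the other finds 9.
low_a, high_a = wilson_interval(8, 10)
low_b, high_b = wilson_interval(9, 10)
print(f"8/10 recall: {low_a:.2f} to {high_a:.2f}")
print(f"9/10 recall: {low_b:.2f} to {high_b:.2f}")

# True when the two intervals share some range of plausible recall values.
intervals_overlap = low_b <= high_a
print(intervals_overlap)
```

With only 10 defect images, the two intervals come out at roughly 0.49 to 0.94 and 0.60 to 0.98, so they overlap almost entirely and the test set cannot separate the two models.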

Tip 2: Consistent Models and Metrics

In addition to ensuring that datasets are on an even playing field, make sure that the models under comparison are set up consistently. For instance, the models must use comparable preprocessing and post-processing. If one model uses a custom step, such as background subtraction, image registration, or morphological operations, the same step should be applied to the other model. Both models should also have a comparable input image size. If a customer compares a model using 256 x 256 images to one using 512 x 512 images, the performance will differ for reasons that have nothing to do with the models themselves.

Each model should be trained several times to account for randomness during training, and neither model should be overfit to the test set. Overfitting may occur when a long time is spent tuning hyperparameters, such as the batch size, learning rate schedule, number of epochs, and dropout probability, to improve development/test set performance.
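Repeated training runs are only useful if their spread is reported alongside the average. The sketch below summarizes one metric over several runs per model; the recall values are hypothetical and the helper name `summarize_runs` is an assumption made for this example.

```python
import statistics

def summarize_runs(scores):
    """Return (mean, sample standard deviation) of one metric over repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate the spread")
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical recall values from five training runs per model (illustrative only).
model_a_runs = [0.81, 0.84, 0.79, 0.83, 0.82]
model_b_runs = [0.83, 0.80, 0.85, 0.82, 0.84]

for name, runs in [("model A", model_a_runs), ("model B", model_b_runs)]:
    mean, spread = summarize_runs(runs)
    print(f"{name}: {mean:.3f} +/- {spread:.3f}")
```

If the gap between the two means is smaller than the run-to-run spread, the comparison does not yet support declaring one model better.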

Furthermore, ensure that the same metrics are calculated in the same way during evaluation, as implementations of a given metric can differ. Common points of divergence include the threshold value for defining true positives, false positives, and false negatives, and the format of the output predictions that feed into the metrics code. Models should be compared using the appropriate metrics for the particular use case, as many visual inspection applications have inherent class imbalance. In these cases, metrics tied to precision and recall are preferred over those tied to accuracy, since accuracy can be biased toward more frequent classes.
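A small worked example shows why accuracy misleads under class imbalance. The sketch below assumes a hypothetical test set of 1,000 images with 10 defects and a degenerate model that labels everything as OK; the function name `binary_metrics` and the convention of reporting 0.0 when precision or recall is undefined are assumptions for this example.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = defect, 0 = OK)."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("no predictions to score")
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Assumed convention: report 0.0 when the denominator is zero.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical imbalanced test set: 10 defect images among 1,000.
y_true = [1] * 10 + [0] * 990
# A degenerate model that never predicts a defect.
y_pred = [0] * 1000

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

The model scores 99% accuracy while finding none of the defects, which is exactly the failure that recall exposes.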

Tip 3: Real-Life Example: Ditch the Duplicates

Seeking to evaluate its internal DL model against that of LandingLens, a global telecommunications company sent Landing AI a set of 113 images with individual bounding boxes, along with defect classes and results from its internal model — some of which were suboptimal. In a comparison of the company’s model against the LandingLens model, the initial LandingLens results were poor, and the company seemed content to stick with its existing solution.

Investigating the cause, Landing AI found that the company had run its model using all its data, while Landing AI had used just two of 15 different sets of images, causing LandingLens to underperform. Landing AI then took one individual dataset and ran LandingLens against the existing model to compare. Again, the LandingLens model underperformed, with a mean average precision (mAP) of 0.27 against 0.67 for the existing model.

Digging deeper, Landing AI engineers noticed that the company’s model was nearly perfect, with predictions that matched the ground truth almost exactly. After additional research, the Landing AI team found that the two models used different augmentations and metrics, and calculated mAP differently, so the engineers decided to take a deeper dive into the data. Upon closer inspection, the Landing AI team found that after removing duplicates and highly similar images, only 41 of the 113 images were left, and only two bounding boxes. This meant that 72 images, or about 64% of the data, were duplicates or near-duplicates. The existing model had memorized the training data, which was very similar to the development data, producing artificially high metrics.
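A basic duplicate check is cheap to run before any model comparison. The sketch below is not the procedure Landing AI used, which the article does not describe; it only catches byte-identical files by hashing their contents, so highly similar images would need a separate similarity check. The function names and the toy byte strings standing in for image files are assumptions.

```python
import hashlib

def content_hash(data: bytes) -> str:
    """SHA-256 digest of the raw file contents."""
    return hashlib.sha256(data).hexdigest()

def find_exact_duplicates(images):
    """Group image names by identical content; return groups with more than one name."""
    groups = {}
    for name, data in images.items():
        groups.setdefault(content_hash(data), []).append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]

def leaked_test_images(train_images, test_images):
    """Names of test images whose exact contents also appear in the training set."""
    train_hashes = {content_hash(data) for data in train_images.values()}
    return sorted(name for name, data in test_images.items()
                  if content_hash(data) in train_hashes)

# Toy byte strings standing in for image files (illustrative only).
train = {"a.png": b"\x00\x01", "b.png": b"\x02\x03", "c.png": b"\x00\x01"}
test = {"d.png": b"\x02\x03", "e.png": b"\x04\x05"}

print(find_exact_duplicates(train))     # duplicates within the training set
print(leaked_test_images(train, test))  # test images also present in training

# The share of images removed in the case above: 113 images reduced to 41.
duplicate_fraction = (113 - 41) / 113
print(f"{duplicate_fraction:.0%}")
```

In practice the byte strings would come from reading each image file from disk, and any leaked test image should be removed from one side of the split before either model is scored.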

Essentially, the company’s existing model was doing a better job of overfitting. The company was unaware that its model was using duplicate data, and the project helped everyone realize that models don’t really matter when the data is insufficient. Starting with a clean dataset without duplicates would have produced much better results, much faster. So the company began using LandingLens to label images, reach consensus, and quickly build a model based on good data to avoid such issues in the future.

Summary: Data Lights the Way

When evaluating different deep learning options for automated inspection, the checklist should begin with data. A data-centric approach focuses on the quality of the data used to train the DL model, rather than trying to tweak the model by changing the values or statistical methods used to sample the images and to create the model. It also means that you should make a concerted effort to accurately classify, grade, and label defect images rather than just focus on the number of images.

AI theory states that if 10% of data is mislabeled, you will need 1.88 times as much new data to achieve a certain level of accuracy. If 30% of data is mislabeled, you need 8.4 times as much new data compared to a situation with clean data. Using a data-centric deep learning platform allows you to save significant time and energy when it comes to producing quality data.

A data-centric deep learning platform like LandingLens software provides tools such as the digital Label Book, which allows multiple employees — regardless of location — to objectively define various defect categories that might be identified during a particular use case. The tool automatically identifies outliers in the labeled data, which helps build consensus among quality inspectors and improves final system performance.

Following the right set of guidelines and focusing on obtaining clean, accurately labeled data will set you up to train and deploy DL models that succeed in your computer vision applications.