Machine learning model comparison presents challenges. Clean data provides a clear path forward.
Manufacturers around the globe and across industries of all types rely on machine vision systems to improve productivity, enhance efficiency, and drive revenue. By augmenting a machine vision system with deep learning, businesses can take their automation systems to new heights, due to the technology’s ability to deliver accurate, effective, and flexible automated inspection capabilities. Several deep learning platforms are available, and when comparing the options, companies must weigh several considerations up front.
In the ever-evolving world of machine learning (ML), new model architectures are always being developed, offering increasingly better performance on benchmark datasets. These datasets are fixed, however, unlike the datasets of most companies deploying ML on the factory floor. As a result, it can be difficult to truly compare ML models to one another. If a company compares two models against each other, it must follow several guidelines to ensure a fair and transparent comparison.
Level the Playing Field
To ensure a fair comparison between two ML models, companies should start by considering the data. The same exact datasets and data splits should be used, so that both models are trained and evaluated on the exact same images. Test datasets must also be representative of actual production data. To ensure a clean dataset, companies evaluating ML models should ask their subject matter experts to perform agreement-based labeling on test sets to reduce any ambiguity.
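One way to guarantee that both models see exactly the same split is to derive each image's split from its identifier rather than from a random shuffle. The following is a minimal sketch of that idea, not a description of any particular platform's method; the function names and the 20% test fraction are assumptions chosen for illustration.

```python
import hashlib


def assign_split(image_id: str, test_fraction: float = 0.2) -> str:
    """Map an image ID to 'train' or 'test' deterministically.

    The assignment depends only on the ID, so every team and every
    model evaluation gets the same answer for the same image.
    """
    digest = hashlib.sha256(image_id.encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as a number in [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "test" if bucket < test_fraction else "train"


def split_dataset(image_ids, test_fraction: float = 0.2) -> dict:
    """Return {'train': [...], 'test': [...]} with a stable ordering."""
    splits = {"train": [], "test": []}
    for image_id in sorted(image_ids):
        splits[assign_split(image_id, test_fraction)].append(image_id)
    return splits
```

Because the split is a pure function of the image ID, adding new images later does not move existing images between the training and test sets.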
Companies should also eliminate duplicate data in the training and test sets to ensure that the models can’t “cheat” and deliver falsely high results. Comparable dataset sizes must also be used to train all models under evaluation, because if one dataset is significantly larger, it will typically yield better results. At the same time, the test sets must be large enough to allow a meaningful comparison, with a low metric variance. For instance, if a test set has only 10 defect images, then the difference between an 80% and a 90% recall could just be from random chance.
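The point about small test sets can be made concrete with a confidence interval on recall. The sketch below uses the Wilson score interval for a binomial proportion, which is one common choice among several; the 95% level (z = 1.96) is an assumption.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a proportion such as recall.

    successes: defects the model found; trials: total defect images.
    z = 1.96 corresponds to roughly 95% confidence.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# With 10 defect images, 8/10 and 9/10 correct give heavily overlapping ranges.
low_80, high_80 = wilson_interval(8, 10)
low_90, high_90 = wilson_interval(9, 10)
print(f"80% recall on 10 images: {low_80:.2f} to {high_80:.2f}")
print(f"90% recall on 10 images: {low_90:.2f} to {high_90:.2f}")
```

With 10 defect images the two intervals overlap substantially, so the test cannot separate the models. With 1,000 defect images the same 80% and 90% recall values produce intervals that no longer overlap.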
Consistent Models and Metrics
In addition to ensuring that datasets are on an even playing field, companies must make sure that models under comparison are consistent. For instance, models must use comparable preprocessing and post-processing steps. If one model uses custom preprocessing or post-processing, the same step should be applied to the other model. Examples include background subtraction, image registration, and morphology. Both models should also have comparable input image sizes. If a company compares a model using 256 x 256 images to one using 512 x 512 images, the performance will differ.
Each model should be trained several times to account for randomness during training, and neither model should be overfit to the test set. Overfitting may occur when a long time is spent tuning hyper-parameters, such as the batch size, learning rate schedule, number of epochs, and dropout probability, to improve development/test set performance.
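To account for training randomness, the metric from each repeated run can be summarized as a mean and a spread before the models are compared. The sketch below is a rough heuristic under stated assumptions, not a formal statistical test: the function names are hypothetical, and the rule "gap larger than the sum of the standard deviations" is only one simple way to flag a difference that is probably not noise.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) for repeated training runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(scores), statistics.stdev(scores)


def gap_exceeds_noise(scores_a, scores_b) -> bool:
    """Rough check: is the gap between models larger than run-to-run noise?"""
    mean_a, std_a = summarize_runs(scores_a)
    mean_b, std_b = summarize_runs(scores_b)
    return abs(mean_a - mean_b) > (std_a + std_b)
```

Each list would hold one metric value per training run, for example the recall measured on the shared test set with a different random seed each time.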
Furthermore, companies must ensure that the same metrics are being calculated during evaluation, as some metrics can have different implementations. Examples include the threshold value for defining true positives, false positives, and false negatives, and the format of output predictions that feed into the metrics code. Models should be compared using the appropriate metrics for the particular use case, as many visual inspection applications have inherent class imbalance. In these cases, metrics tied to precision and recall are preferred over those tied to accuracy, since accuracy can be biased toward more frequent classes.
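The effect of class imbalance on accuracy is easy to demonstrate. The sketch below computes accuracy, precision, and recall from scratch for a binary defect/no-defect case; the labels (1 for defect, 0 for good) and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1) -> dict:
    """Compute accuracy, precision, and recall for one positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Convention (assumed here): report 0.0 when nothing was predicted
        # positive or when no positives exist.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# 95 good parts and 5 defects; a model that always predicts "good".
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
print(binary_metrics(y_true, y_pred))
```

In this made-up example the model never flags a defect, yet its accuracy is 95% because good parts dominate the data. Its recall of zero is the number that exposes the failure.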
Real-Life Example: Ditch the Duplicates
Seeking to evaluate its internal ML model against that of LandingLens, a global telecommunications company sent LandingAI a set of 113 images with individual bounding boxes, along with defect classes and results from its internal model — some of which were suboptimal. In a comparison of the company’s model against the LandingLens model, the initial LandingLens results were poor, and the company seemed content to stick with its existing solution.
Asking why this had happened, LandingAI found that the company ran its model using all its data, while LandingAI used just two of 15 different sets of images, causing LandingLens to underperform. LandingAI then took one individual dataset and ran LandingLens against the existing model to compare. Again, the LandingLens model underperformed, with a mean average precision (mAP) of 0.27 against the existing model's 0.67.
Digging deeper, LandingAI engineers noticed that the company’s model was nearly perfect in terms of ground truth versus predictions. After additional research, the LandingAI team found that the two models used different augmentations and metrics, and calculated mAP differently, so the engineers decided to take a deeper dive into the data. Upon closer inspection, the LandingAI team found that after removing duplicates and highly similar images, only 41 of the 113 images were left, and only two bounding boxes. This meant that 64% of the data was duplicated. The existing model had memorized the training data, which was very similar to the development data, producing artificially high metrics.
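A basic duplicate check can be sketched in a few lines. The example below catches only byte-identical files by hashing their contents; finding the "highly similar" images mentioned above would need a perceptual or embedding-based similarity measure, which is outside this sketch. The function names are hypothetical and do not describe the tooling LandingAI used.

```python
import hashlib


def deduplicate(images: dict) -> list:
    """Return the names of images to keep, dropping byte-identical copies.

    images maps a file name to that file's raw bytes.
    """
    seen = set()
    kept = []
    for name in sorted(images):
        digest = hashlib.sha256(images[name]).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(name)
    return kept


def duplicate_fraction(total: int, unique: int) -> float:
    """Share of a dataset that was removed as duplicate."""
    return (total - unique) / total


# Figures from the example above: 113 images, 41 left after cleaning.
print(f"{duplicate_fraction(113, 41):.0%} of the data was duplicated")
```

Running the same check across the training and test sets together, rather than on each set separately, is what reveals images that leak from one set into the other.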
Essentially, the company’s existing model was doing a better job of overfitting. The company was unaware that its model was using duplicate data, and the project helped everyone realize that models don’t really matter when the data is insufficient. Starting with a clean dataset without duplicates would have produced much better results, much faster. So the company began using LandingLens to label images, reach consensus, and quickly build a model based on good data to avoid such issues in the future.
Data Lights the Way
When evaluating different deep learning options for automated inspection, the checklist should begin with data. A data-centric approach to AI means focusing on the quality of the data used to train the AI model, rather than trying to tweak the model by changing the values or statistical methods used to sample the images and to create the model. It also means that manufacturers should make a concerted effort to accurately classify, grade, and label defect images rather than just focus on the number of images.
AI theory states that if 10% of data is mislabeled, manufacturers will need 1.88 times as much new data to achieve a certain level of accuracy. If 30% of data is mislabeled, manufacturers need 8.4 times as much new data compared to a situation with clean data. Using a data-centric deep learning platform that is machine learning operations (MLOps) compliant will allow manufacturers to save significant time and energy when it comes to producing quality data.
A data-centric deep learning platform like LandingAI’s LandingLens software provides tools such as the digital defect book, which allows multiple employees — regardless of location — to objectively define various defect categories that might be identified during a particular use case. The tool automatically identifies outliers in the labeled data, which helps build consensus among quality inspectors and improves final system performance.
Ultimately, comparing ML models to one another presents a unique set of challenges. Companies must truly understand how metrics are calculated. When companies follow the right set of guidelines, model comparison is possible, but regardless of which models are being compared, they should focus on obtaining clean, labeled data with which to train and deploy a deep learning model.