
How to Improve Segmentation Inference Speed

Dillon Laird
January 30, 2024

The Need for Fast & Light Architectures

At Landing AI, we are continuously looking for ways to give our customers access to a wide variety of model options. In many real-world customer applications, we have found a strong need for high-throughput models that can run on smaller edge devices. For example, you may need to inspect a very large image, such as a wafer die.



A common trick for this is to divide the image into smaller patches and process each patch separately so as not to run out of memory. Doing this while still processing the entire image quickly requires a model with very high throughput. In addition, these models often run on smaller edge devices, so they cannot consume too much memory.
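As a minimal sketch of the patching idea, the helper below enumerates tile coordinates that cover a large image. The function name `tile_boxes`, the 4096×4096 image size, and the optional `overlap` parameter are assumptions made for illustration; only the 1024×1024 tile size echoes the experiments later in this post.

```python
def tile_boxes(height, width, tile, overlap=0):
    """Yield (top, left, bottom, right) boxes that cover the whole image."""
    step = tile - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than tile")
    tops = list(range(0, max(height - tile, 0) + 1, step))
    lefts = list(range(0, max(width - tile, 0) + 1, step))
    # Add a final tile flush with the edge if the stride leaves a remainder.
    if tops[-1] + tile < height:
        tops.append(height - tile)
    if lefts[-1] + tile < width:
        lefts.append(width - tile)
    for top in tops:
        for left in lefts:
            yield top, left, min(top + tile, height), min(left + tile, width)


# Hypothetical 4096x4096 image split into 1024x1024 patches.
boxes = list(tile_boxes(4096, 4096, tile=1024))
print(len(boxes))
```

Each box can then be cropped and passed to the model one at a time, so peak memory depends on the tile size rather than the full image size.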

To better handle these situations, Landing AI recently introduced a new segmentation architecture that significantly improves model inference speed and reduces memory usage. In this blog, we will delve into the technical background behind the architecture, explore its benefits, and explain how users can easily train these faster model architectures and speed up model inference time.

Technical Description

First, let’s examine the FastViT architecture from Apple. The architecture utilizes both convolutional blocks and attention blocks, taking advantage of the latency savings from convolutional blocks and the performance increase from attention blocks.

(figure from FastViT)

There are three main tricks employed in this architecture:

  1. Splitting convolutions into depthwise and pointwise convolutions to reduce the number of operations.
  2. Removing skip connections and merging batch normalization layers into previous convolutional layers to decrease inference latency.
  3. Overparameterizing during training to improve accuracy, then merging the extra branches at inference time so they add no latency.

Depthwise and Pointwise Convolutions

A common trick used by many latency-efficient architectures is to split a convolution into two cheaper convolutions. I’m going to assume you know the details of how a convolution works; in short, you take an input image, say HxWxC, and convolve it with D kernels of size KxKxC to produce an H’xW’xD output. Ignoring strides for now, this means you must do D*(K*K*C) operations for every output pixel. Since D is often large, we can reduce the number of operations by ensuring that D does not get multiplied by K*K, using a depthwise and then a pointwise convolution. You can think of a depthwise convolution as breaking the kernel into C different parts, each applied to its own channel, so a depthwise convolution between an HxWxC image and a KxKxC kernel produces an H’xW’xC output where the number of channels stays the same. A pointwise convolution is just a regular convolution with 1×1 spatial dimensions, so its kernel is 1x1xCxD. Now, instead of K*K*C*D operations per output pixel, we have K*K*C + 1*1*C*D.
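As a rough check of these operation counts, the sketch below compares the two formulas for one illustrative layer. The function names and the sizes K=3, C=64, D=128 are assumptions chosen for the example, not values taken from FastViT.

```python
def standard_conv_macs(k: int, c_in: int, c_out: int) -> int:
    """Multiply-accumulates per output pixel for a regular KxK convolution."""
    return k * k * c_in * c_out


def separable_conv_macs(k: int, c_in: int, c_out: int) -> int:
    """Multiply-accumulates per output pixel for a depthwise convolution
    (KxK per channel) followed by a pointwise convolution (1x1 across channels)."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Illustrative layer sizes (assumed, not taken from FastViT).
K, C, D = 3, 64, 128
standard = standard_conv_macs(K, C, D)
separable = separable_conv_macs(K, C, D)
ratio = standard / separable
print(standard, separable, round(ratio, 1))  # 73728 8768 8.4
```

For these sizes the split needs roughly 8 times fewer operations per output pixel, and the gap widens as K or D grows.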

Merging Convolution and BatchNorm Layers and Removing Skip Connections

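The second trick is merging convolution layers with batch normalization layers and removing skip connections during inference. To see why this is possible, consider a simple 1D case where you can think of x_i as a pixel. At inference time, batch normalization is a fixed scale and shift, so a convolution followed by batch normalization is still an affine function of x_i and can be folded into a single convolution with rescaled weights and a shifted bias. A skip connection simply adds x_i back to the output, which can be absorbed by adding 1 to the kernel weight that multiplies x_i.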
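The 1D case can also be checked numerically. The sketch below uses a single scalar weight per layer, and the function names and numeric values are arbitrary assumptions for illustration rather than FastViT's actual code. It folds the batch normalization statistics and the skip connection into one weight and one bias, then compares against the unmerged computation.

```python
import math


def conv_bn(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Unmerged graph: a 1-tap convolution followed by batch norm,
    using fixed running statistics as at inference time."""
    y = w * x + b
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch-norm scale and shift into the convolution."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta


def fold_skip(w, b):
    """Absorb y = conv(x) + x into one convolution: (w + 1) * x + b."""
    return w + 1.0, b


# Arbitrary illustrative values.
w, b = 0.8, 0.1
gamma, beta, mean, var = 1.5, -0.2, 0.3, 4.0
pixels = [-1.0, 0.0, 0.5, 2.0]

w_bn, b_bn = fold_bn(w, b, gamma, beta, mean, var)
w_all, b_all = fold_skip(w_bn, b_bn)

# Reference: conv + batch norm, plus the skip connection.
reference = [conv_bn(x, w, b, gamma, beta, mean, var) + x for x in pixels]
# Merged: a single multiply-add per pixel.
merged = [w_all * x + b_all for x in pixels]
max_error = max(abs(r - m) for r, m in zip(reference, merged))
print(max_error)
```

The two computations agree up to floating-point rounding, while the merged version avoids the separate normalization and addition steps.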

Just by removing these skip connections we can get large speedups, up to 40% on 1024×1024 images (figure from FastViT):

Train-Time Overparameterization

The depthwise and pointwise convolutions described earlier reduced not only our computation but also our parameter count. To compensate for some of this lost capacity, we can overparameterize the model during training. The trick is very similar to the one we used to reparameterize our skip connections. Say you have a 3×3 and a 1×1 convolution whose outputs are added together:

If we keep track of the strides and padding on both of the convolutions, we can see that they actually overlap each other. Given this, during inference time we can simply pad the 1×1 kernel into a 3×3 kernel and add it to the other 3×3 kernel:
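The kernel merge can be sketched in plain Python for a single channel with stride 1. The helper names and the small 3×3 test image are assumptions for illustration, and a real model would do the same per channel on framework tensors. The 3×3 branch uses zero padding of 1 and the 1×1 branch uses no padding, so both produce outputs of the same size.

```python
def conv2d(image, kernel):
    """Single-channel, stride-1 cross-correlation with 'same' zero padding."""
    k = len(kernel)
    pad = k // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:
                        total += image[y][x] * kernel[di][dj]
            out[i][j] = total
    return out


def merge_kernels(k3, k1):
    """Zero-pad the 1x1 kernel to 3x3 (weight in the center) and add it."""
    merged = [row[:] for row in k3]
    merged[1][1] += k1[0][0]
    return merged


# Arbitrary illustrative values.
image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k3 = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
k1 = [[2.0]]

# Training-time graph: two branches whose outputs are added.
two_branch = [
    [a + b for a, b in zip(row3, row1)]
    for row3, row1 in zip(conv2d(image, k3), conv2d(image, k1))
]
# Inference-time graph: one convolution with the merged kernel.
one_branch = conv2d(image, merge_kernels(k3, k1))
max_error = max(
    abs(a - b)
    for row_a, row_b in zip(two_branch, one_branch)
    for a, b in zip(row_a, row_b)
)
print(max_error)
```

Because convolution is linear in the kernel, the single merged convolution reproduces the sum of the two branches up to floating-point rounding.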

This trick slightly increases training time but also gives us a boost in performance (figure from FastViT):

Putting Everything Together

Combining all of these tricks together we can significantly simplify the building blocks of the model during inference time:

For LandingLens

To build a FastViT-based Segmentation model in LandingLens, we added a semantic FPN head to the SA12 backbone. Since the original paper did not release this model, we trained our own on ADE20K and achieved similar performance to the paper:

  • LandingLens:  39 mIoU 
  • FastViT paper: 38 mIoU

We also tuned the hyperparameters, such as learning rate schedulers and regularization, to work better with our customers’ datasets which tend to be smaller. This tuning led to a 1-2% improvement in mIoU on our internal benchmark datasets, which cover everything from satellite imagery, to healthcare, to manufacturing.

Note that all the experiments below were performed on 1024×1024 images. Most reported latencies are from much smaller images, like 256×256, which have 1/16th as many pixels. We report results at 1024×1024 because many real-world applications, such as chip manufacturing or satellite imagery, require larger images to find smaller features.

Comparing the reparameterized model to the non-reparameterized model, we get a ~26% speedup on CPU and a 46% speedup on GPU.

We can also examine the latency across different hardware. The first column shows the results on a 16-core AMD Ryzen Threadripper PRO CPU. The second column shows the results on an NVIDIA Jetson Orin Nano, which is a much smaller embedded device. The last column shows the results on an NVIDIA RTX A5000, a larger workstation-class GPU.

We can take it even further by using mixed precision with TensorRT to get 7.4ms inference latency on 1024×1024 images with the RTX A5000.


In this blog, we showed three tricks from the FastViT paper to speed up your model inference time without sacrificing too much performance:

  • Using depthwise and pointwise convolutions to lower the computational burden of convolution operations. 
  • Merging batch normalization layers and skip connections into single convolutions.
  • Overparameterizing convolutions during training. 

Finally, we used these tricks and added an FPN head to train a Segmentation model and showed we can achieve 7.4ms inference time on 1024×1024 images.

Check out the segmentation project in LandingLens today!
