DeepLabv3+ is a semantic segmentation architecture that builds on DeepLabv3 by adding a simple yet effective decoder module to enhance segmentation results.

Repeated downsampling in a CNN shrinks the feature map resolution, which lowers prediction accuracy and loses boundary information in semantic segmentation. At the same time, aggregating context around a feature helps segment it better, which is what atrous convolutions accomplish. DeepLabv3+ addresses both issues.

Downsampling is widely adopted in deep convolutional neural networks (CNN) for reducing memory consumption while preserving the transformation invariance to some degree.

Atrous rate

Atrous convolution (also called dilated convolution) is a tool for adjusting the effective field of view of a convolution. It modifies the field of view through a parameter termed the atrous rate. It is a simple yet powerful approach for enlarging the field of view of filters without increasing computation or the number of parameters.

Atrous/dilated convolution has a wider field of view with the same number of parameters as a normal convolution
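As a minimal sketch of this property: a k×k kernel with atrous rate r spans k + (k − 1)(r − 1) input positions per dimension, while still holding only k×k weights. The helper below illustrates this; the function name is ours, not part of any library.

```python
# Effective kernel size of an atrous (dilated) convolution:
# a k x k kernel with atrous rate r covers k + (k - 1) * (r - 1) input
# positions per dimension, while still holding only k * k weights.
def effective_kernel_size(k: int, rate: int) -> int:
    """Span of input positions covered along one kernel dimension."""
    return k + (k - 1) * (rate - 1)

for rate in (1, 2, 4):
    span = effective_kernel_size(3, rate)
    print(f"3x3 kernel, rate {rate}: field of view {span}x{span}, "
          f"parameters: {3 * 3}")  # parameter count never changes
```

With rate 1 this reduces to a normal 3×3 convolution; rate 2 sees a 5×5 region and rate 4 a 9×9 region, all with nine weights.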

DeepLabV3+ adds a decoder module on top of DeepLabV3 to address the previously noted problem that DeepLabV3 is computationally expensive when processing high-resolution images.
The application of the depthwise separable convolution to both atrous spatial pyramid pooling and decoder modules results in a faster and stronger encoder-decoder network for semantic segmentation.
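To see why depthwise separable convolution makes the network faster, here is a rough parameter-count comparison (bias terms ignored; the function names are ours for illustration):

```python
# A standard convolution mixes space and channels in one step, so its
# parameter count is k*k * c_in * c_out. A depthwise separable convolution
# splits this into a per-channel k x k depthwise step plus a 1x1 pointwise
# step, which is far cheaper.
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    return k * k * c_in * c_out

def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution mixing channels
    return depthwise + pointwise

# e.g. a 3x3 convolution with 256 input and 256 output channels
print(standard_conv_params(3, 256, 256))        # 589824
print(depthwise_separable_params(3, 256, 256))  # 67840
```

For this typical layer the separable variant needs roughly 8–9x fewer parameters (and proportionally fewer multiply-adds), which is the saving DeepLabv3+ exploits in both the ASPP and decoder modules.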

Output stride

Output stride is the ratio of the input image size to the output feature map size. It specifies how much spatial reduction the input signal undergoes as it passes through the network.

In Model Playground, the output stride can be set to 8 or 16.

In the architecture below, the encoder is based on an output stride of 16, i.e. the input image is down-sampled by a factor of 16.
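A quick sketch of what these output strides mean for the encoder's feature map, assuming a 512×512 input (the helper function is ours for illustration):

```python
# Output stride = input size / feature-map size, so the encoder output
# under a given output stride is simply the input size divided by it.
def feature_map_size(input_size: int, output_stride: int) -> int:
    return input_size // output_stride

for os_ in (8, 16):
    side = feature_map_size(512, os_)
    print(f"output stride {os_}: {side} x {side} feature map")
```

So an output stride of 16 turns a 512×512 input into a 32×32 feature map, while an output stride of 8 keeps a denser 64×64 map at a higher computational cost.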


Architecture proposed in the original DeepLabv3+ paper by Chen et al.

Encoder network

DeepLabV3+ employs Aligned Xception as its main feature extractor (encoder), although with substantial modifications: all max pooling operations are replaced by depthwise separable convolutions.

Thanks to the encoder-decoder structure in DeepLabv3+, you can arbitrarily control the resolution of the extracted encoder features via atrous convolution to trade off precision against runtime.

In Model Playground, the feature extraction (encoder) network can be selected as either ResNet or EfficientNet.


This setting specifies the weights used for model initialization; in Model Playground, ResNet101 COCO weights are used.

Code Implementation

  # Install the dependencies first (shell):
  # pip3 install tensorflow
  # pip3 install pixellib --upgrade

  # Semantic segmentation with a DeepLabv3+ model trained on the ADE20K dataset
  import pixellib
  from pixellib.semantic import semantic_segmentation

  segment_image = semantic_segmentation()

  # Load the Xception model trained on ADE20K for segmenting objects
  segment_image.load_ade20k_model("deeplabv3_xception65_ade20k.h5")

  segment_image.segmentAsAde20k("path_to_image", output_image_name="path_to_output_image")

PASCAL VOC 2012 test set results with SOTA approaches

As seen above, DeepLabv3+ surpasses various SOTA techniques, including LC, ResNet-DUC-HDC (TuSimple), GCN (Large Kernel Matters), RefineNet, ResNet-38, PSPNet, IDW-CNN, SDN, DIS, and DeepLabv3.

Further Resources

Wiki entry for U-Net

Wiki entry for U-Net+

Last updated on Jun 01, 2022
