CascadeMask R-CNN

CascadeMask R-CNN extends Cascade R-CNN by adding an extra mask head to the cascade.

Here we can see that the segmentation branch "S" is placed parallel to the detection branch.

Cascade Mask R-CNN in Model Playground

Hyperparameters

Typically, the following hyperparameters are tweaked when using Cascade Mask R-CNN:

Backbone network

‌Specifying the architecture for the network on which Cascade Mask R-CNN is built.

Mostly, the backbone network is a ResNet variation.

IoU thresholds

These thresholds are used to decide if an anchor box generated contains an object or is part of the background.

Everything that is above the upper IoU threshold of the proposed anchor box and ground truth label will be classified as an object and forwarded. Everything below the lower threshold will be classified as background and the network will be penalized. For all the anchor boxes with an IoU between the thresholds, we're not sure if it's for- or background and we'll just ignore them.

Empirically, setting the lower bound to 0.3 and the upper to 0.7 leads to robust results.

Number of convolution filters in the ROI box head

T‌his is identifying how many convolution filters the final layer to make the classification contains. To a certain degree, increasing the number of filters will enable the network to learn more complex features, but the effect vanishes if you add too many filters and the network will perform worse (see the original ResNet paper to understand why you cannot endlessly chain convolution filters).

A good default value is 4 conv filters.

Number of fully connected layers in the ROI box head

T‌his is identifying how many fully connected layers (FC) the last part of the network contains. Increasing the number of FCs can increase performance for a computational cost, but you might overfit the sub-network if you add too many.

Often, 2 FCs are used as starting point.

Weights

It's the weights to use for model initialization, and in Model Playground R50-Cascade-COCO weights are used.

NMS number of proposals

‌Pre NMS

‌The maximum of proposals that are taken into consideration by NMS. The proposals are sorted descending after confidence and only the ones with the highest confidence are chosen.

‌Post NMS

‌The maximum of proposals that will be forwarded to the ROI box head. Again, the proposals are sorted descending after confidence and only the ones with the highest confidence are chosen.

‌Config for training

‌Low numbers of NMS proposals in training will result in a lower recall, but higher precision. Vice versa.

The default for post NMS number of proposals in training is 1000, pre NMS 4000.

‌Config for testing

‌Here, the number of NMS proposals for the simple forward pass, e.g., inference, is defined. Less NMS proposals will increase inference speed, but higher numbers yield greater performance.

If you don't need super-fast inference, a good default is 1000 for post NMS number of proposals and 2000 for pre NMS.

Advanced Options

Stages to Freeze

Freezing the stages in a Neural Network is a technique that was introduced to reduce the computation. After the stages have been frozen, the architecture doesn't have to backpropagate to it. Freezing many stages will help the computation be faster but aggressive freezing can degrade the model output and can result in sub-optimal predictions.

Pooler resolution (Box Head)

It is the spatial size to pool proposals before feeding them to the box predictor, in the model playground default value is set as 7.

Pooler resolution (Mask Head)

It is the spatial size to pool proposals before feeding them to the mask predictor, in the model playground default value is set as 14.

Pooler Sampling Ratio

After extracting the Region of Interests from the feature map, they should be adjusted to a certain dimension before feeding them to the fully connected layer that will later do the actual object detection. For this, ROI Align is used which makes use of points that would be sampled from a defined grid, to resize the ROIs. The number of points that we use is defined by Pooler Sampling Ratio.

If Pooler Sampling ratio is set to 2, then 2 * 2 = 4 points are used for the interpolation.

Boost model performance quickly with AI-powered labeling and 100% QA.

Learn more

Last modified 9d ago

Previous - Computer Vision model architectures

FBNetV3OD

Next - Computer Vision model architectures

HybridTask Cascade