CascadeMask R-CNN extends Cascade R-CNN by adding an extra mask head to the cascade.
Here we can see that the segmentation branch "S" is placed parallel to the detection branch.
Typically, the following hyperparameters are tweaked when using Cascade Mask R-CNN:
Specifying the architecture for the network on which Cascade Mask R-CNN is built.
These thresholds are used to decide if an anchor box generated contains an object or is part of the background.
Everything that is above the upper IoU threshold of the proposed anchor box and ground truth label will be classified as an object and forwarded. Everything below the lower threshold will be classified as background and the network will be penalized. For all the anchor boxes with an IoU between the thresholds, we're not sure if it's for- or background and we'll just ignore them.
This is identifying how many convolution filters the final layer to make the classification contains. To a certain degree, increasing the number of filters will enable the network to learn more complex features, but the effect vanishes if you add too many filters and the network will perform worse (see the original ResNet paper to understand why you cannot endlessly chain convolution filters).
This is identifying how many fully connected layers (FC) the last part of the network contains. Increasing the number of FCs can increase performance for a computational cost, but you might overfit the sub-network if you add too many.
It's the weights to use for model initialization, and in Model Playground R50-Cascade-COCO weights are used.
The maximum of proposals that are taken into consideration by NMS. The proposals are sorted descending after confidence and only the ones with the highest confidence are chosen.
The maximum of proposals that will be forwarded to the ROI box head. Again, the proposals are sorted descending after confidence and only the ones with the highest confidence are chosen.
Freezing the stages in a Neural Network is a technique that was introduced to reduce the computation. After the stages have been frozen, the architecture doesn't have to backpropagate to it. Freezing many stages will help the computation be faster but aggressive freezing can degrade the model output and can result in sub-optimal predictions.
It is the spatial size to pool proposals before feeding them to the box predictor, in the model playground default value is set as 7.
It is the spatial size to pool proposals before feeding them to the mask predictor, in the model playground default value is set as 14.
After extracting the Region of Interests from the feature map, they should be adjusted to a certain dimension before feeding them to the fully connected layer that will later do the actual object detection. For this, ROI Align is used which makes use of points that would be sampled from a defined grid, to resize the ROIs. The number of points that we use is defined by Pooler Sampling Ratio.