This semantic segmentation model also uses an encoder-decoder structure. A regular convolutional neural network, with its fully connected layers removed, is used as the encoder. The encoder extracts a low-resolution feature map.
The feature map produced by the encoder is low-resolution because the strided convolutions and pooling layers in the convolutional network each reduce the spatial size of their input.
This feature map then passes through a decoder module to produce a segmented image at the original resolution. The decoder module uses the Feature Pyramid Network (FPN).
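As a rough illustration of how strides and pooling shrink the feature map, the sketch below applies the standard output-size formula, floor((size + 2 * padding - kernel) / stride) + 1, layer by layer. The layer stack `ENCODER_LAYERS` is a hypothetical ResNet-like example chosen for illustration; it is not taken from the model described here.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial size after a convolution or pooling layer (one dimension)."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical encoder layers: (description, kernel, stride, padding).
ENCODER_LAYERS = [
    ("conv 7x7, stride 2", 7, 2, 3),
    ("max-pool 3x3, stride 2", 3, 2, 1),
    ("conv 3x3, stride 2", 3, 2, 1),
    ("conv 3x3, stride 2", 3, 2, 1),
    ("conv 3x3, stride 2", 3, 2, 1),
]


def trace_resolution(input_size, layers):
    """Return the spatial size before the first layer and after each layer."""
    sizes = [input_size]
    for _name, kernel, stride, padding in layers:
        sizes.append(conv_output_size(sizes[-1], kernel, stride, padding))
    return sizes


print(trace_resolution(224, ENCODER_LAYERS))  # [224, 112, 56, 28, 14, 7]
```

Under these assumptions, a 224-pixel-wide input ends up as a 7-pixel-wide feature map, a 32-fold reduction that the decoder has to undo.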
The diagram given above depicts the inner workings of the FPN at a very high level. The bottom-up path is the encoding part, where the image is converted to a low-resolution feature map.
In the decoding part, the top-down path upsamples the low-resolution feature map, which has semantically strong features, and combines it with the higher-resolution feature map from the corresponding encoder level, which has semantically weaker features.
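A minimal sketch of one such merge step, assuming nearest-neighbour upsampling by a factor of 2 followed by element-wise addition, on a single channel. The function names are illustrative. In the original FPN design the encoder map first passes through a 1×1 convolution so that channel counts match; that step is omitted here for brevity.

```python
def upsample_nearest_2x(fmap):
    """Double the height and width of a 2-D map by repeating each value."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def merge_top_down(coarse, lateral):
    """Upsample the coarse (semantically strong) map and add the lateral map."""
    upsampled = upsample_nearest_2x(coarse)
    if len(upsampled) != len(lateral) or len(upsampled[0]) != len(lateral[0]):
        raise ValueError("lateral map must be exactly twice the coarse size")
    return [
        [u + l for u, l in zip(up_row, lat_row)]
        for up_row, lat_row in zip(upsampled, lateral)
    ]


coarse = [[1.0, 2.0],
          [3.0, 4.0]]                       # low resolution, strong semantics
lateral = [[0.5] * 4 for _ in range(4)]     # higher resolution, weaker semantics
merged = merge_top_down(coarse, lateral)
print(merged[0])  # [1.5, 1.5, 2.5, 2.5]
```

Repeating this step from the coarsest level down to the finest one yields a merged map at every resolution.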
We obtain semantically stronger features as we move deeper into the neural network. For example, the first layer of the CNN might detect features such as lines or simple edges of an object, while deeper layers might detect features that describe the content of the image, such as a bus or a car.
Because the decoder passes these strong, deep-layer features down to every resolution, the feature pyramid has rich semantic features at all levels.
Since all the levels are rich in semantic features, they can be combined to produce the final segmented image. Note that the above picture depicts an additional 3×3 convolution from the 2nd column to the 3rd column. This convolution is applied to each merged feature map to reduce the aliasing effect of upsampling.
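To see why a 3×3 convolution can soften upsampling artifacts, the sketch below applies one to a blocky single-channel map. In the network the kernel weights are learned; the uniform averaging kernel `BOX_KERNEL` used here is only a stand-in assumed for illustration, as is the zero padding at the borders.

```python
def conv3x3_same(fmap, kernel):
    """3x3 convolution (cross-correlation, as in most deep learning
    frameworks) with zero padding, so the output keeps the input size."""
    height, width = len(fmap), len(fmap[0])
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < height and 0 <= x < width:
                        acc += kernel[di + 1][dj + 1] * fmap[y][x]
            out[i][j] = acc
    return out


# Stand-in for learned weights: a uniform averaging kernel.
BOX_KERNEL = [[1.0 / 9.0] * 3 for _ in range(3)]

# A blocky map such as nearest-neighbour upsampling produces: a hard 0 -> 1 step.
blocky = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
smoothed = conv3x3_same(blocky, BOX_KERNEL)

# The hard step between columns 1 and 2 becomes a gentler ramp.
print(round(smoothed[1][1], 3), round(smoothed[1][2], 3))  # 0.333 0.667
```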
This defines the depth of the encoder network. Note that deeper networks generally produce results with lower error but are more computationally expensive.
These are the weights with which the encoder network is initialized. They are the weights obtained by training the respective architecture on the ImageNet dataset.
These are the weights with which the entire FPN architecture is initialized. Here, they are initialized randomly.
It is the spatial dropout rate used in FPN.
When the feature maps are strongly correlated with each other, regular dropout of individual activations does not regularize the network and only results in an effective learning rate decrease. Spatial dropout instead drops entire feature maps. This regularizes networks with strongly correlated feature maps and makes the training computationally efficient.
If the dropout rate is 0.7, then each feature map is dropped with probability 0.7, so on average 70% of the feature maps are dropped.
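The behaviour described above can be sketched as follows. This is an illustrative training-time implementation, not the one used by the model: the function name is hypothetical, and rescaling the surviving maps by 1 / (1 - rate) is an assumption borrowed from the common "inverted dropout" convention.

```python
import random


def spatial_dropout(feature_maps, rate, rng):
    """Zero out whole feature maps, each with probability `rate`.

    Surviving maps are scaled by 1 / (1 - rate) (assumed convention) so the
    expected activation stays the same as without dropout.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - rate)
    result = []
    for fmap in feature_maps:
        if rng.random() < rate:
            # Drop the entire map, not individual activations.
            result.append([[0.0 for _ in row] for row in fmap])
        else:
            result.append([[value * keep_scale for value in row] for row in fmap])
    return result


rng = random.Random(0)  # fixed seed so the run is repeatable
maps = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(1000)]
dropped_out = spatial_dropout(maps, rate=0.7, rng=rng)
num_dropped = sum(
    1 for fmap in dropped_out if all(v == 0.0 for row in fmap for v in row)
)
print(num_dropped / len(maps))  # close to 0.7
```

Each map is kept or dropped as a unit, so strongly correlated neighbouring activations within a map are never left to compensate for one another.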