Scene parsing is the task of segmenting an image into regions that correspond to semantic categories such as sky, road, person, and bed.
From the figure above we see that there are several issues with complex-scene parsing. The first row shows the issue of mismatched relationships: cars are far less likely than boats to appear on water, yet the boat is labeled as a car. The second row shows confusion categories, where the class “building” is easily confused with “skyscraper”. The third row illustrates inconspicuous classes: in terms of color and texture, the pillow in this case is very similar to the bedsheet. Such inconspicuous objects are easily misclassified by a Fully Convolutional Network (FCN).
A deep network with a suitable global scene-level prior can substantially improve scene parsing performance. This is where PSPNet comes in: by capturing the context of the whole image, it can, for example, classify the object in the first row as a boat.
Given an input image, PSPNet uses a pre-trained ResNet model with the dilated network strategy to extract the feature map of the last convolutional layer. On top of this feature map, the pyramid pooling module is applied to harvest different sub-region representations. The pooled features are then upsampled to the same size as the original feature map. Afterward, the upsampled maps are concatenated with the original feature map, so that the result carries both local and global context information. This combined representation is passed to the decoder, where a convolution layer produces the final per-pixel prediction.
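The pooling, upsampling, and concatenation steps described above can be traced on a dummy tensor. This is a minimal PyTorch sketch, not the Model Playground implementation: the channel count (512), the 60×60 feature map, and the bin sizes (1, 2, 3, 6) are illustrative assumptions, and the 1×1 convolution that normally follows each pooling step is left out for brevity.

```python
import torch
import torch.nn.functional as F

# Assumed stand-in for the backbone output: batch of 1, 512 channels, 60x60.
feature_map = torch.randn(1, 512, 60, 60)
bin_sizes = (1, 2, 3, 6)  # assumed pyramid levels

branches = [feature_map]  # keep the original map for local detail
for size in bin_sizes:
    # Average-pool the map into a size x size grid of sub-regions.
    pooled = F.adaptive_avg_pool2d(feature_map, output_size=size)
    # Upsample each pooled map back to the feature-map resolution.
    upsampled = F.interpolate(pooled, size=(60, 60),
                              mode="bilinear", align_corners=False)
    branches.append(upsampled)

# Concatenate along the channel axis: 512 * (1 + 4) = 2560 channels.
fused = torch.cat(branches, dim=1)
print(fused.shape)  # torch.Size([1, 2560, 60, 60])
```

The 1×1 bin summarizes the whole image, while the larger grids keep progressively more spatial detail. Concatenating all of them with the original map is what gives each pixel access to both local and global context.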
The PSPNet encoder contains the CNN backbone with dilated convolutions along with the pyramid pooling module. The encoder transforms the raw input image into an intermediate representation that the rest of the network can process. For PSPNet, this intermediate representation is the feature map of the image, generated by the CNN backbone.
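To illustrate why dilated convolutions are used in the backbone, the sketch below compares a strided convolution with a dilated one on a dummy tensor. The tensor and layer sizes are assumptions chosen for illustration; the point is only that dilation enlarges the area each filter covers without reducing the resolution of the feature map.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)  # dummy feature map (sizes are assumptions)

# Ordinary 3x3 convolution with stride 2: halves the resolution.
strided = nn.Conv2d(8, 8, kernel_size=3, stride=2, padding=1)
# Dilated 3x3 convolution: stride 1, filter taps spaced 2 apart, with
# padding matched to the dilation so the resolution is preserved.
dilated = nn.Conv2d(8, 8, kernel_size=3, stride=1, padding=2, dilation=2)

print(strided(x).shape)  # torch.Size([1, 8, 16, 16])
print(dilated(x).shape)  # torch.Size([1, 8, 32, 32])

# Extent covered by the dilated kernel: k + (k - 1) * (d - 1).
effective_kernel = 3 + (3 - 1) * (2 - 1)
print(effective_kernel)  # 5
```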
In Model Playground, several encoder architectures can be used to generate this feature map. They are:
ResNet is a feature extractor with very deep layers and skip connections. The main idea behind ResNet is to make much deeper models trainable by adding skip (shortcut) connections, which let the input of a block bypass its layers and be added to the block's output.
For this encoder network, the depth of the ResNet and the weights can be selected.
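A skip connection can be sketched as a block whose output is its input plus a learned residual. The class below is a hypothetical, simplified block written for illustration (the name `BasicResidualBlock` and the channel count are assumptions); it is not the exact block used by any particular ResNet depth.

```python
import torch
import torch.nn as nn


class BasicResidualBlock(nn.Module):
    """Hypothetical minimal block computing relu(F(x) + x)."""

    def __init__(self, channels):
        super().__init__()
        # F(x): two 3x3 convolutions that keep the shape unchanged.
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The skip connection adds the unmodified input to the residual.
        return self.relu(self.body(x) + x)


block = BasicResidualBlock(16)
out = block(torch.randn(2, 16, 24, 24))
print(out.shape)  # torch.Size([2, 16, 24, 24])
```

Because the input is added back unchanged, gradients have a direct path through the addition, which is what makes stacking many such blocks practical.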
EfficientNet, an architecture found using NAS (Neural Architecture Search), can also be used as a feature extractor.
The EfficientNet subtype and the weights of the network can be chosen for this architecture.
MobileNetV2 was introduced as a mobile architecture that improved the state-of-the-art performance of mobile models. It is based on depthwise separable convolution (DSC), which is far more computationally efficient than standard convolution. DSC factorizes a standard convolution into a depthwise convolution, which applies one k×k filter to each of the M input channels, followed by a pointwise convolution with 1×1×M filters that combines the channels.
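The efficiency gain can be checked with simple arithmetic on the number of weights. The sketch below assumes a k×k kernel, M input channels, and N output channels, and ignores bias terms; the concrete values (3, 32, 64) are chosen only for illustration.

```python
def standard_conv_params(k, m, n):
    # Each of the n output filters has k * k * m weights.
    return k * k * m * n


def depthwise_separable_params(k, m, n):
    depthwise = k * k * m  # one k x k filter per input channel
    pointwise = m * n      # n filters of shape 1 x 1 x m
    return depthwise + pointwise


k, m, n = 3, 32, 64
standard = standard_conv_params(k, m, n)         # 18432
separable = depthwise_separable_params(k, m, n)  # 2336
print(standard, separable, round(standard / separable, 2))  # 18432 2336 7.89
```

For these sizes the separable version needs roughly eight times fewer weights, and the multiply-add count per output position shrinks by the same ratio.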
This encoder is initialized with MobileNetV2 ImageNet weights.
Users can also specify the width multiplier for this encoder network. The width multiplier is a factor that uniformly scales the number of channels in each layer: values below 1 give a thinner, faster network, and values above 1 give a wider one.
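As a rough illustration of the width multiplier, the sketch below scales a list of per-stage channel counts. The base channel list and the rounding to a multiple of 8 follow a convention common in MobileNet implementations, but both are assumptions here and may differ from what Model Playground does internally; the helper names are hypothetical.

```python
def make_divisible(value, divisor=8):
    # Round to the nearest multiple of `divisor` (assumed convention),
    # without dropping more than 10% below the requested value.
    rounded = max(divisor, int(value + divisor / 2) // divisor * divisor)
    if rounded < 0.9 * value:
        rounded += divisor
    return rounded


def scale_channels(base_channels, width_multiplier):
    # Apply the same multiplier to every layer's channel count.
    return [make_divisible(c * width_multiplier) for c in base_channels]


base = [32, 16, 24, 32, 64, 96, 160, 320]  # assumed per-stage channels
print(scale_channels(base, 1.0))  # [32, 16, 24, 32, 64, 96, 160, 320]
print(scale_channels(base, 0.5))  # [16, 8, 16, 16, 32, 48, 80, 160]
```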
SWIN is based on the Transformer architecture, which is predominantly used for Natural Language Processing tasks. SWIN nevertheless scales well to image inputs and has outperformed the dominant CNNs in some scenarios.
The weights setting specifies which weights are used to initialize the model. In Model Playground, random initialization of weights is used.
Dropout randomly ignores neurons during training, which makes the network less sensitive to the specific weights of individual neurons. The resulting network generalizes better and is less likely to overfit the training data. The dropout value is the probability that an element of the decoder output (right before the segmentation head) is zeroed.
In Model Playground PSPNet dropout can be set between 0 and 1.
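The effect of the dropout probability can be seen on a tensor of ones. This sketch uses PyTorch's generic `nn.Dropout` layer with an assumed probability of 0.3; it illustrates the mechanism only and is not taken from the Model Playground decoder.

```python
import torch
import torch.nn as nn

p = 0.3  # assumed dropout probability
drop = nn.Dropout(p=p)
x = torch.ones(1, 4, 8, 8)

# Training mode: each element is zeroed with probability p, and the
# survivors are scaled by 1 / (1 - p) to keep the expected value.
drop.train()
y_train = drop(x)

# Evaluation mode: dropout is disabled and the input passes through.
drop.eval()
y_eval = drop(x)

print(y_train.unique())        # a subset of {0, 1 / (1 - p)}
print(torch.equal(y_eval, x))  # True
```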
In Model Playground, the number of output channels in the PSP decoder is configurable and can be increased.
import torch
import torch.nn as nn
import torch.nn.functional as F

import extractors  # backbone definitions (e.g. resnet34); this module is not shown here


class PSPModule(nn.Module):
    """Pyramid pooling: pool at several grid sizes, upsample, concatenate, fuse."""

    def __init__(self, features, out_features=1024, sizes=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList([self._make_stage(features, size) for size in sizes])
        # 1x1 convolution fusing the input map with one pooled map per size
        self.bottleneck = nn.Conv2d(features * (len(sizes) + 1), out_features, kernel_size=1)
        self.relu = nn.ReLU()

    def _make_stage(self, features, size):
        prior = nn.AdaptiveAvgPool2d(output_size=(size, size))
        conv = nn.Conv2d(features, features, kernel_size=1, bias=False)
        return nn.Sequential(prior, conv)

    def forward(self, feats):
        h, w = feats.size(2), feats.size(3)
        # F.interpolate replaces the deprecated F.upsample
        priors = [F.interpolate(stage(feats), size=(h, w), mode='bilinear', align_corners=False)
                  for stage in self.stages] + [feats]
        bottle = self.bottleneck(torch.cat(priors, 1))
        return self.relu(bottle)


class PSPUpsample(nn.Module):
    """Doubles the spatial resolution, then applies conv + batch norm + PReLU."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.PReLU()
        )

    def forward(self, x):
        h, w = 2 * x.size(2), 2 * x.size(3)
        p = F.interpolate(x, size=(h, w), mode='bilinear', align_corners=False)
        return self.conv(p)


class PSPNet(nn.Module):
    # psp_size and deep_features_size must match the channel counts of the
    # two feature maps returned by the chosen backbone.
    def __init__(self, n_classes=18, sizes=(1, 2, 3, 6), psp_size=2048,
                 deep_features_size=1024, backend='resnet34', pretrained=True):
        super().__init__()
        self.feats = getattr(extractors, backend)(pretrained)
        self.psp = PSPModule(psp_size, 1024, sizes)
        self.drop_1 = nn.Dropout2d(p=0.3)

        self.up_1 = PSPUpsample(1024, 256)
        self.up_2 = PSPUpsample(256, 64)
        self.up_3 = PSPUpsample(64, 64)

        self.drop_2 = nn.Dropout2d(p=0.15)
        # Per-pixel log-probabilities over the class (channel) dimension
        self.final = nn.Sequential(
            nn.Conv2d(64, n_classes, kernel_size=1),
            nn.LogSoftmax(dim=1)
        )
        # Auxiliary image-level classifier on the intermediate features
        self.classifier = nn.Sequential(
            nn.Linear(deep_features_size, 256),
            nn.ReLU(),
            nn.Linear(256, n_classes)
        )

    def forward(self, x):
        f, class_f = self.feats(x)
        p = self.psp(f)
        p = self.drop_1(p)

        p = self.up_1(p)
        p = self.drop_2(p)

        p = self.up_2(p)
        p = self.drop_2(p)

        p = self.up_3(p)
        p = self.drop_2(p)

        auxiliary = F.adaptive_max_pool2d(input=class_f, output_size=(1, 1)).view(-1, class_f.size(1))

        return self.final(p), self.classifier(auxiliary)