Scene Parsing

Scene parsing is the process of segmenting and parsing an image into various visual areas that correspond to semantic categories such as sky, road, person, and bed.

Scene parsing on ADE20K dataset.

The figure above illustrates several issues in complex-scene parsing. The first row shows a mismatched relationship: cars are far less likely than boats to appear over water. The second row shows confusable categories, where the class “building” is easily confused with “skyscraper”. The third row illustrates inconspicuous classes: the pillow in this case is very similar to the bedsheet in colour and texture. Such inconspicuous objects are easily misclassified by a Fully Convolutional Network (FCN).

A deep network with a suitable global-scene-level prior can considerably improve scene parsing performance. This is where PSPNet comes in: by capturing the context of the whole image, it is able to classify the object in the first row as a boat.

Encoder network

The PSPNet encoder contains the CNN backbone with dilated convolutions along with the pyramid pooling module. The last two blocks of the backbone use dilated convolution layers with dilation rates of 2 and 4, respectively.
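To make the effect of the dilation rate concrete, here is a minimal sketch using PyTorch's `nn.Conv2d`. The tensor sizes, channel counts, and the helper `effective_kernel` are illustrative assumptions, not part of PSPNet itself. It shows that a 3x3 kernel with dilation 2 or 4 covers a wider window while the output keeps the same spatial size.

```python
import torch
from torch import nn

# Toy feature map: batch of 1, 4 channels, 16x16 pixels (illustrative values).
x = torch.randn(1, 4, 16, 16)

# Standard 3x3 convolution: each output pixel sees a 3x3 window.
standard = nn.Conv2d(4, 4, kernel_size=3, padding=1, dilation=1)

# Dilated 3x3 convolutions: same number of weights, wider window.
# Setting padding equal to the dilation keeps the spatial size unchanged.
dilated_2 = nn.Conv2d(4, 4, kernel_size=3, padding=2, dilation=2)
dilated_4 = nn.Conv2d(4, 4, kernel_size=3, padding=4, dilation=4)


def effective_kernel(kernel_size, dilation):
    """Side length of the window covered by a dilated kernel (hypothetical helper)."""
    return dilation * (kernel_size - 1) + 1


out_shapes = [tuple(conv(x).shape) for conv in (standard, dilated_2, dilated_4)]
windows = [effective_kernel(3, d) for d in (1, 2, 4)]
```

Because the resolution is preserved, the backbone can enlarge its receptive field without further downsampling the feature map that is handed to the pyramid pooling module.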

In Model Playground, we can select the feature extraction (encoding) network to be either ResNet or EfficientNet. We can also select the depth and weights of the ResNet variant.


Dropout

Dropout randomly ignores neurons during training, which makes the network less sensitive to the specific weights of individual neurons. The result is a network that generalizes better and is less likely to overfit the training data. The dropout value is the probability that an element of the decoder output (right before the segmentation head) is zeroed.

In Model Playground, the PSPNet dropout can be set between 0 and 1.
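As a rough illustration of what the dropout probability controls, the sketch below applies PyTorch's `nn.Dropout` to a tensor of ones. The value `p = 0.5` and the tensor shape are assumptions chosen for readability; how Model Playground wires dropout into the decoder is not shown here.

```python
import torch
from torch import nn

torch.manual_seed(0)  # fixed seed so repeated runs behave the same

p = 0.5                    # dropout probability (illustrative value)
dropout = nn.Dropout(p=p)
x = torch.ones(1, 8)

# Training mode: each element is zeroed with probability p, and the
# survivors are scaled by 1 / (1 - p) so the expected value is unchanged.
dropout.train()
train_out = dropout(x)

# Evaluation mode: dropout does nothing and the input passes through.
dropout.eval()
eval_out = dropout(x)
```

With `p = 0`, nothing is dropped; as `p` approaches 1, almost every element is zeroed, so very high values tend to starve the segmentation head of signal.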

Pyramid Pooling Module

Proposed architecture design of PSPNet by Zhao et al.

Given an input image, PSPNet uses a pretrained ResNet model with the dilated network strategy to extract the feature map of the last convolutional layer. On top of this feature map, the pyramid pooling module is applied to harvest different sub-region representations. The pooled features are then upsampled to the same size as the original feature map. Afterward, these upsampled maps are concatenated with the original feature map, so that the result carries both local and global context information. This concatenated representation is passed to the decoder, where a convolution layer produces the final per-pixel prediction.
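The pool, upsample, and concatenate steps described above can be traced on a toy tensor. This is a minimal sketch under stated assumptions: the feature map size, the 8 input channels, and the 16 output channels are made up for illustration, and the per-branch 1x1 convolutions are left out to keep the shapes easy to follow. The bin sizes (1, 2, 3, 6) are the ones used in the implementation in this section.

```python
import torch
from torch import nn
import torch.nn.functional as F

feats = torch.randn(1, 8, 12, 12)  # toy backbone output: 8 channels, 12x12
bin_sizes = (1, 2, 3, 6)           # pyramid levels

branches = []
for size in bin_sizes:
    # Average-pool the map into a size x size grid of sub-regions.
    pooled = F.adaptive_avg_pool2d(feats, output_size=(size, size))
    # Upsample the grid back to the resolution of the original feature map.
    upsampled = F.interpolate(pooled, size=feats.shape[2:], mode="bilinear",
                              align_corners=False)
    branches.append(upsampled)

# Concatenate global context (branches) with local detail (feats) on the
# channel axis: 8 channels * (4 branches + 1 original) = 40 channels.
fused = torch.cat(branches + [feats], dim=1)

# A 1x1 convolution reduces the fused map to the chosen number of output channels.
bottleneck = nn.Conv2d(fused.shape[1], 16, kernel_size=1)
out = bottleneck(fused)
```

The 1x1 branch is a single global average broadcast over every pixel, which is the scene-level prior; the finer grids add progressively more local context.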

PSP out channels

In Model Playground, the number of output channels of the PSP decoder can be set.


Weights

These are the weights used for model initialization. In Model Playground, random initialization of weights is used.
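As a small illustration of random initialization, the sketch below builds the same layer under different random seeds. It relies on PyTorch's default layer initializer, which is an assumption for the sake of the example; the exact scheme used by Model Playground is not stated here, and `make_head` is a hypothetical helper.

```python
import torch
from torch import nn


def make_head(seed):
    """Build a 1x1 conv layer whose weights are drawn randomly under `seed` (hypothetical helper)."""
    torch.manual_seed(seed)
    return nn.Conv2d(64, 18, kernel_size=1)


head_a = make_head(0)
head_b = make_head(0)  # same seed: identical starting weights
head_c = make_head(1)  # different seed: different starting weights

same_seed_match = torch.equal(head_a.weight, head_b.weight)
different_seed_match = torch.equal(head_a.weight, head_c.weight)
```

Fixing the seed makes a randomly initialized run repeatable, which helps when comparing experiments that differ in only one setting.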

Code Implementation

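import torch
from torch import nn
from torch.nn import functional as F

import extractors  # assumed local module that provides the backbone constructors (e.g. resnet34)


class PSPModule(nn.Module):
    """Pyramid pooling module: pools the feature map at several grid sizes and fuses the results."""

    def __init__(self, features, out_features=1024, sizes=(1, 2, 3, 6)):
        super().__init__()
        # One pooling branch per pyramid level
        self.stages = nn.ModuleList([self._make_stage(features, size) for size in sizes])
        # 1x1 convolution that fuses all branches plus the original feature map
        self.bottleneck = nn.Conv2d(features * (len(sizes) + 1), out_features, kernel_size=1)
        self.relu = nn.ReLU()

    def _make_stage(self, features, size):
        prior = nn.AdaptiveAvgPool2d(output_size=(size, size))
        conv = nn.Conv2d(features, features, kernel_size=1, bias=False)
        return nn.Sequential(prior, conv)

    def forward(self, feats):
        h, w = feats.size(2), feats.size(3)
        # Upsample every pooled branch back to the input resolution (F.upsample is deprecated)
        priors = [F.interpolate(stage(feats), size=(h, w), mode='bilinear', align_corners=False)
                  for stage in self.stages] + [feats]
        bottle = self.bottleneck(torch.cat(priors, 1))
        return self.relu(bottle)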

class PSPUpsample(nn.Module):
    """Doubles the spatial resolution, then applies a 3x3 convolution."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
        )

    def forward(self, x):
        h, w = 2 * x.size(2), 2 * x.size(3)
        p = F.interpolate(x, size=(h, w), mode='bilinear', align_corners=False)
        return self.conv(p)

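class PSPNet(nn.Module):
    def __init__(self, n_classes=18, sizes=(1, 2, 3, 6), psp_size=2048, deep_features_size=1024, backend='resnet34',
                 pretrained=True):  # `pretrained=True` is an assumed default
        super().__init__()
        # Backbone is expected to return (feature map, auxiliary features);
        # psp_size must match the channel count of the returned feature map
        self.feats = getattr(extractors, backend)(pretrained)
        self.psp = PSPModule(psp_size, 1024, sizes)
        self.drop_1 = nn.Dropout2d(p=0.3)

        self.up_1 = PSPUpsample(1024, 256)
        self.up_2 = PSPUpsample(256, 64)
        self.up_3 = PSPUpsample(64, 64)

        self.drop_2 = nn.Dropout2d(p=0.15)
        # Segmentation head; the attribute name `final` is an assumption
        self.final = nn.Sequential(
            nn.Conv2d(64, n_classes, kernel_size=1),
        )

        # Auxiliary image-level classifier
        self.classifier = nn.Sequential(
            nn.Linear(deep_features_size, 256),
            nn.Linear(256, n_classes),
        )

    def forward(self, x):
        f, class_f = self.feats(x)
        p = self.psp(f)
        p = self.drop_1(p)

        p = self.up_1(p)
        p = self.drop_2(p)

        p = self.up_2(p)
        p = self.drop_2(p)

        p = self.up_3(p)
        p = self.drop_2(p)

        # Global max pooling over the auxiliary features, flattened to (batch, channels)
        auxiliary = F.adaptive_max_pool2d(input=class_f, output_size=(1, 1)).view(-1, class_f.size(1))

        # Per-pixel prediction and auxiliary image-level prediction
        return self.final(p), self.classifier(auxiliary)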

Further Resources:

PSPNet by Zhao et al.
