Faster R-CNN is an object detection architecture that achieves strong results on most benchmark data sets. It builds directly on the R-CNN and Fast R-CNN architectures but is more accurate because, unlike those two, it uses a deep network for region proposal.
The breakthrough of Faster R-CNN is that it computes the region proposals and the classification predictions on the same feature map, whereas its predecessors generated the proposals with a separate method and then classified them in a second step.
First, the architecture uses a backbone network to extract features from the input image. Any classification architecture can be used, e.g., a ResNet variant combined with a Feature Pyramid Network (FPN).
Then, an anchor is generated for each position on the feature map, and for each anchor, a set of anchor boxes with different sizes and aspect ratios is created.
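As a minimal sketch of this step, the snippet below places one anchor at the centre of every feature-map cell and creates one box per size and aspect ratio. The function name `generate_anchors`, the stride of 16, and the chosen sizes and ratios are illustrative assumptions, not values prescribed by the text above.

```python
import math

def generate_anchors(feature_h, feature_w, stride, sizes, aspect_ratios):
    """Return anchor boxes as (x1, y1, x2, y2) tuples in image coordinates."""
    anchors = []
    for row in range(feature_h):
        for col in range(feature_w):
            # Anchor point: centre of the feature-map cell, mapped back to the image.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for size in sizes:
                for ratio in aspect_ratios:
                    # ratio = height / width; the box area stays size * size.
                    w = size / math.sqrt(ratio)
                    h = size * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

# Hypothetical example: a 2 x 3 feature map with stride 16.
anchors = generate_anchors(feature_h=2, feature_w=3, stride=16,
                           sizes=(32, 64), aspect_ratios=(0.5, 1.0, 2.0))
print(len(anchors))  # 6 cells * 2 sizes * 3 ratios = 36 anchor boxes
```

Note that the number of anchor boxes grows with the product of feature-map cells, sizes, and aspect ratios, which is why the later filtering steps matter.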
The Region Proposal Network (RPN) detects the "good" anchor boxes, which will be forwarded to the next layer. The RPN consists of a classifier and a regressor.
The classifier predicts if an anchor box contains an object (IoU of anchor box with ground truth label is above a certain threshold) or contains parts of the background (IoU of anchor box with ground truth label is below a certain threshold).
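The IoU (Intersection over Union) used here can be sketched as follows. This is a plain-Python illustration that assumes boxes are given as `(x1, y1, x2, y2)` tuples; the function name `iou` and the example boxes are made up for the demonstration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clamped at zero for boxes that do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

ground_truth = (0, 0, 10, 10)
print(iou((0, 0, 10, 10), ground_truth))    # identical boxes
print(iou((5, 0, 15, 10), ground_truth))    # box shifted by half its width
print(iou((20, 20, 30, 30), ground_truth))  # disjoint boxes
```

An IoU of 1 means a perfect match and 0 means no overlap; the half-shifted box shares 50 of 150 units of combined area with the ground truth.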
Then, the regressor predicts offsets for the anchor boxes which contain objects to fit them as tightly as possible to the ground truth labels.
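To illustrate what such offsets look like, here is a sketch of the centre/size parametrization commonly used for Faster R-CNN: the centre shift is expressed relative to the anchor size, and the width and height change as a log ratio. The helper names `to_center`, `encode`, and `decode` are assumptions for this example.

```python
import math

def to_center(box):
    """Convert (x1, y1, x2, y2) to (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)

def encode(anchor, gt):
    """Offsets (tx, ty, tw, th) that move `anchor` onto the ground truth `gt`."""
    ax, ay, aw, ah = to_center(anchor)
    gx, gy, gw, gh = to_center(gt)
    return ((gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah))

def decode(anchor, offsets):
    """Apply predicted offsets to an anchor and return an (x1, y1, x2, y2) box."""
    ax, ay, aw, ah = to_center(anchor)
    tx, ty, tw, th = offsets
    cx, cy = ax + tx * aw, ay + ty * ah
    w, h = aw * math.exp(tw), ah * math.exp(th)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

anchor = (0, 0, 10, 10)
gt = (2, 2, 14, 8)
offsets = encode(anchor, gt)      # regression target during training
refined = decode(anchor, offsets) # recovers the ground truth box (up to rounding)
print(offsets)
print(refined)
```

During training, the regressor learns to output `encode(anchor, gt)`; at inference time, `decode` turns its predictions back into boxes.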
The RPN outputs a lot of noise, i.e., multiple bounding boxes for the same object. To reduce the noise and improve the performance of the overall model, Non-Max Suppression (NMS) is applied.
NMS works by identifying the bounding boxes with the highest confidence and then discarding the ones with high overlap:
Step 1: Select the box with the highest confidence (= objectness) score and pass it forward. \
Step 2: Then, compare the overlap (IoU) of this box with other boxes. \
Step 3: Remove the bounding boxes with high overlap (IoU > threshold, often 0.5). \
Step 4: Then, move to the box with the next highest confidence score. \
Step 5: Repeat steps 2-4 until all boxes have been checked.
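The steps above can be sketched in plain Python as follows. This is an illustrative greedy implementation, not the code of any particular library; the function names and the example boxes are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive, highest score first."""
    # Candidate indices sorted by descending confidence.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # Step 1 / Step 4: take the most confident remaining box
        keep.append(best)
        # Steps 2-3: drop every remaining box that overlaps it too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep  # Step 5: the loop ends once every box is kept or removed

# Two pairs of heavily overlapping boxes (IoU of about 0.68 within each pair).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (21, 21, 31, 31)]
scores = [0.9, 0.8, 0.7, 0.95]
print(nms(boxes, scores))  # [3, 0]
```

Only the most confident box of each overlapping pair remains, so the four proposals collapse to two.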
Finally, the RoI pooling layer converts the generated proposals of variable sizes to a fixed size so that a classifier and a bounding box regressor can be run on top of them.
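A simplified sketch of RoI max pooling on a single-channel feature map is shown below. It assumes the RoI is already given in integer feature-map coordinates with exclusive end indices; the function name `roi_max_pool` and the toy feature map are made up for illustration.

```python
import math

def roi_max_pool(feature_map, roi, output_size):
    """Max-pool the region roi = (x1, y1, x2, y2) into output_size x output_size values."""
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(output_size):
        # Rows covered by bin i; floor/ceil guarantees at least one row per bin.
        r0 = y1 + math.floor(i * roi_h / output_size)
        r1 = y1 + math.ceil((i + 1) * roi_h / output_size)
        row = []
        for j in range(output_size):
            c0 = x1 + math.floor(j * roi_w / output_size)
            c1 = x1 + math.ceil((j + 1) * roi_w / output_size)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled

# Toy 4 x 6 feature map whose value encodes its position (10 * row + column).
feature_map = [[10 * r + c for c in range(6)] for r in range(4)]
print(roi_max_pool(feature_map, (1, 0, 5, 4), output_size=2))  # [[12, 14], [32, 34]]
print(roi_max_pool(feature_map, (0, 0, 6, 3), output_size=2))  # [[12, 15], [22, 25]]
```

Both proposals have different shapes (4 x 4 and 6 x 3 cells), yet each is reduced to the same 2 x 2 output, which is what lets the following layers work with a fixed input size.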
Typically, the following hyperparameters are tweaked when using Faster R-CNN:
The backbone network specifies the architecture of the network on which Faster R-CNN is built.
The IoU thresholds are used to decide whether a generated anchor box contains an object or is part of the background.
Every anchor box whose IoU with a ground truth label is above the upper threshold will be classified as an object and forwarded. Every anchor box whose IoU is below the lower threshold will be classified as background, and the network will be penalized. For all the anchor boxes with an IoU between the thresholds, we're not sure whether they are foreground or background, so we just ignore them.
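The labelling rule can be sketched as below. The function name `label_anchor` is hypothetical, and the thresholds of 0.3 and 0.7 are example values (the ones used in the original Faster R-CNN paper), not necessarily the defaults of any given tool.

```python
def label_anchor(max_iou, lower=0.3, upper=0.7):
    """Label an anchor box from its best IoU with any ground truth box.

    Returns 1 for object, 0 for background, and -1 for "ignore".
    """
    if max_iou > upper:
        return 1   # confident foreground: used as a positive example
    if max_iou < lower:
        return 0   # confident background: used as a negative example
    return -1      # ambiguous: excluded from the RPN loss

# Hypothetical best-IoU values for five anchor boxes.
max_ious = [0.85, 0.10, 0.50, 0.72, 0.29]
labels = [label_anchor(v) for v in max_ious]
print(labels)  # [1, 0, -1, 1, 0]
```

Moving the two thresholds closer together leaves fewer anchors in the ignored band but makes the training labels noisier.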
The number of convolution filters that the final layer used for classification contains. Up to a point, increasing the number of filters enables the network to learn more complex features, but the effect vanishes if you add too many filters, and the network will perform worse (see the original ResNet paper to understand why you cannot endlessly chain convolution filters).
The number of fully connected (FC) layers that the last part of the network contains. Increasing the number of FC layers can improve performance at a computational cost, but you might overfit the sub-network if you add too many.
The maximum number of proposals that are taken into consideration by NMS. The proposals are sorted by confidence in descending order, and only the ones with the highest confidence are chosen.
The maximum number of proposals that will be forwarded to the RoI box head. Again, the proposals are sorted by confidence in descending order, and only the ones with the highest confidence are chosen.
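How the two limits interact can be sketched as follows. The names `filter_proposals`, `pre_nms_top_n`, `post_nms_top_n`, and `keep_all` are hypothetical, and `keep_all` is only a stand-in for a real NMS routine so that the example stays short.

```python
def filter_proposals(proposals, pre_nms_top_n, post_nms_top_n, nms_fn):
    """Select proposals given as (box, score) pairs.

    nms_fn takes the score-sorted candidates and returns the ones it keeps,
    still sorted by score.
    """
    # Sort by confidence, highest first.
    ranked = sorted(proposals, key=lambda p: p[1], reverse=True)
    # Limit 1: only the top candidates are considered by NMS.
    candidates = ranked[:pre_nms_top_n]
    kept = nms_fn(candidates)
    # Limit 2: only the top survivors are forwarded to the RoI box head.
    return kept[:post_nms_top_n]

def keep_all(candidates):
    """Placeholder for NMS that removes nothing (assumption for the demo)."""
    return list(candidates)

proposals = [((0, 0, 4, 4), 0.2), ((1, 1, 5, 5), 0.9), ((2, 2, 6, 6), 0.5),
             ((3, 3, 7, 7), 0.7), ((4, 4, 8, 8), 0.1)]
selected = filter_proposals(proposals, pre_nms_top_n=3, post_nms_top_n=2,
                            nms_fn=keep_all)
print(selected)  # the two proposals with scores 0.9 and 0.7
```

The first limit mainly bounds the cost of NMS, while the second bounds how many regions the later, more expensive stage has to process.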
Config for training
A low number of NMS proposals in training will result in lower recall but higher precision, and vice versa.
Config for testing
After the Regions of Interest (RoIs) are extracted from the feature map, they have to be resized to a fixed dimension before they are fed to the fully connected layer that will later do the actual object detection. For this, RoI Align is used, which resizes the RoIs by sampling points from a regular grid inside each RoI. The number of sampling points is defined by the Pooler Sampling Ratio.
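The role of the sampling ratio can be illustrated with a simplified RoI Align for a single-channel feature map. In this sketch, each output bin averages `sampling_ratio x sampling_ratio` bilinearly interpolated points. The function names are hypothetical, and the coordinate convention (integer coordinate k sits at the centre of cell k) is a simplifying assumption that may differ from specific library implementations.

```python
import math

def bilinear(feature_map, y, x):
    """Bilinearly interpolate a 2-D list at the continuous point (y, x)."""
    h, w = len(feature_map), len(feature_map[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = math.floor(y), math.floor(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(feature_map, roi, output_size, sampling_ratio):
    """Resize roi = (x1, y1, x2, y2) to output_size x output_size values."""
    x1, y1, x2, y2 = roi
    bin_h = (y2 - y1) / output_size
    bin_w = (x2 - x1) / output_size
    pooled = []
    for i in range(output_size):
        row = []
        for j in range(output_size):
            samples = []
            # Regular grid of sampling points inside bin (i, j).
            for sy in range(sampling_ratio):
                for sx in range(sampling_ratio):
                    y = y1 + (i + (sy + 0.5) / sampling_ratio) * bin_h
                    x = x1 + (j + (sx + 0.5) / sampling_ratio) * bin_w
                    samples.append(bilinear(feature_map, y, x))
            row.append(sum(samples) / len(samples))
        pooled.append(row)
    return pooled

# Toy 4 x 4 feature map whose value encodes its position (10 * row + column).
feature_map = [[10 * r + c for c in range(4)] for r in range(4)]
print(roi_align(feature_map, (0.5, 0.5, 2.5, 2.5), output_size=2, sampling_ratio=2))
```

Unlike RoI pooling, the RoI coordinates are not rounded to whole cells, so fractional box positions are preserved; a higher sampling ratio means more interpolated points per bin at a higher computational cost.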
The pooler resolution is the size to which proposals are pooled before they are fed to the box head; in Model Playground, the default value is 7.
The depth is the variant of ResNet to use as the backbone feature extractor; in Model Playground, the depth can be set to 18, 50, 101, or 152.
The weights are used for model initialization; in Model Playground, the R50-FPN COCO weights are used.
```python
# Import the necessary libraries
from PIL import Image
import matplotlib.pyplot as plt
import torch
import torchvision
import torchvision.transforms as T
import cv2

# Get the pretrained model from torchvision.models.
# Note: pretrained=True loads the pretrained weights for the model
# (newer torchvision releases use the `weights` argument instead).
# model.eval() puts the model in inference mode.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# Class labels from the official PyTorch documentation for the pretrained model.
# Note that there are some N/A's.
# For the complete list, check
# https://tech.amikelive.com/node-718/what-object-categories-labels-are-in-coco-dataset/
COCO_INSTANCE_CATEGORY_NAMES = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
    'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
    'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
    'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
    'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
    'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
    'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
    'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]

# Functions to run inference on the image and display the output bounding boxes on the image.
def get_prediction(img_path, threshold):
    """
    get_prediction
      parameters:
        - img_path - path of the input image
        - threshold - threshold value for prediction score
      method:
        - Image is obtained from the image path
        - the image is converted to image tensor using PyTorch's Transforms
        - image is passed through the model to get the predictions
        - class, box coordinates are obtained, but only prediction score > threshold
          are chosen.
    """
    img = Image.open(img_path).convert("RGB")
    transform = T.Compose([T.ToTensor()])
    img = transform(img)
    with torch.no_grad():
        pred = model([img])
    # The model returns one dict per input image; only one image is passed in.
    pred_class = [COCO_INSTANCE_CATEGORY_NAMES[i] for i in pred[0]['labels'].numpy()]
    # OpenCV expects integer pixel coordinates: [(x1, y1), (x2, y2)].
    pred_boxes = [[(int(b[0]), int(b[1])), (int(b[2]), int(b[3]))]
                  for b in pred[0]['boxes'].numpy()]
    pred_score = list(pred[0]['scores'].numpy())
    # Scores are sorted in descending order, so keep the leading predictions
    # whose score is above the threshold (possibly none).
    num_kept = len([s for s in pred_score if s > threshold])
    return pred_boxes[:num_kept], pred_class[:num_kept]


def object_detection_api(img_path, threshold=0.5, rect_th=3, text_size=3, text_th=3):
    """
    object_detection_api
      parameters:
        - img_path - path of the input image
        - threshold - threshold value for prediction score
        - rect_th - thickness of bounding box
        - text_size - size of the class label text
        - text_th - thickness of the text
      method:
        - prediction is obtained from get_prediction method
        - for each prediction, bounding box is drawn and text is written
          with opencv
        - the final image is displayed
    """
    boxes, pred_cls = get_prediction(img_path, threshold)
    img = cv2.imread(img_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    for i in range(len(boxes)):
        # Draw the box from its top-left to its bottom-right corner.
        cv2.rectangle(img, boxes[i][0], boxes[i][1], color=(0, 255, 0), thickness=rect_th)
        # Write the class label at the top-left corner of the box.
        cv2.putText(img, pred_cls[i], boxes[i][0], cv2.FONT_HERSHEY_SIMPLEX,
                    text_size, (0, 255, 0), thickness=text_th)
    plt.figure(figsize=(20, 30))
    plt.imshow(img)
    plt.xticks([])
    plt.yticks([])
    plt.show()


# Testing on an image
object_detection_api('/content/Hasty_Founders.jpg', threshold=0.8)
```