We have already discussed Cross-Entropy Loss and Binary Cross-Entropy Loss for the classification problem. Focal Loss is an extension of these losses, with new parameters that give more weight to "hard" samples and less weight to "easy" samples. Let us first discuss what hard and easy samples refer to.
Easy samples are the samples that are correctly classified as foreground or background with high confidence, and hard samples are the ones that are misclassified, or classified with low confidence.
Class imbalance occurs when one class in a dataset is heavily over-represented relative to the rest of the classes. Training a network on such an imbalanced dataset can make it biased towards learning the dominant class.
This type of imbalance can happen in two ways. First, it can happen between the foreground classes themselves, for example having 20 times more pictures of dogs than of other animals in an animal detection problem. Second, there can be far more background anchors than foreground anchors.
Figure: the two cases of imbalance. Source: https://arxiv.org/pdf/1708.02002.pdf
Let us look at the first figure. Since there are almost a thousand times more background examples than foreground ones, the background examples are mostly easy samples, because they are easier to classify correctly, while the foreground examples tend to be the hard samples. The loss of a single background classification is therefore considerably lower than that of a foreground classification. But if we sum all the errors over an image, the background loss can overwhelm the foreground loss, even though each individual background loss is small.
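To make this concrete, here is a small calculation with purely illustrative numbers (the anchor counts and probabilities below are assumptions, not values from the paper), showing how many tiny background losses can dominate the total:

```python
import math

# Illustrative numbers only: one image with 100,000 easy background anchors
# and 100 hard foreground anchors.
p_background = 0.99   # easy: correct class predicted with high confidence
p_foreground = 0.30   # hard: correct class predicted with low confidence

ce_background = -math.log(p_background)   # ~0.01 per easy anchor
ce_foreground = -math.log(p_foreground)   # ~1.20 per hard anchor

total_background = 100_000 * ce_background   # ~1005
total_foreground = 100 * ce_foreground       # ~120

# The summed background loss is several times larger than the foreground loss,
# even though each individual background loss is tiny.
print(total_background, total_foreground)
```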
To counteract this problem posed by class imbalance, we introduce Focal Loss.
We use a scaling factor along with the Cross-Entropy Loss to down-weight the easy samples and give more weight to the hard samples. In other words, we modulate the cross-entropy so that training can focus on the hard samples.
The first new term is the weighting factor $\alpha_t$ in the following formula for Focal Loss:

$$FL(p_t) = -\alpha_t \, (1 - p_t)^{\gamma} \, \log(p_t)$$

This $\alpha_t$ offsets the imbalance in the number of examples per class. It is set inversely proportional to the number of examples for a particular class, or treated as a hyperparameter and chosen by cross-validation.
The second term, the modulating factor $(1 - p_t)^{\gamma}$, is used to focus training on the hard examples. Hard and easy examples are distinguished by $p_t$, the model's estimated probability for the correct class. If $p_t$ is low, we are dealing with a hard sample: the modulating factor remains near 1 and the loss is almost unaffected. If $p_t$ is high, the modulating factor vanishes towards 0, which gives a much lower weight to the easy samples.
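To make the effect concrete, take $\gamma = 2$ as an example: an easy sample with $p_t = 0.9$ is down-weighted by $(1 - 0.9)^2 = 0.01$, a 100x reduction of its cross-entropy loss, while a hard sample with $p_t = 0.1$ keeps a factor of $(1 - 0.1)^2 = 0.81$, close to 1.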
The value of $\gamma$ is set to 2, which the authors found to perform well in their experiments.
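Putting the pieces together, below is a minimal sketch of a binary (sigmoid-based) focal loss written in PyTorch. The function name `focal_loss` and the structure of the code are our own illustration, not the paper's reference implementation; the defaults $\alpha = 0.25$ and $\gamma = 2$ are the values the paper reports to work well.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss (illustrative sketch, not the paper's reference code).

    logits:  raw model outputs, shape (N,)
    targets: binary labels, 1 = foreground, 0 = background, shape (N,)
    alpha:   weighting factor for the foreground class
    gamma:   focusing parameter that down-weights easy samples
    """
    targets = targets.float()

    # Per-example cross-entropy, kept un-reduced so each sample can be re-weighted.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")

    # p_t: the predicted probability of the *true* class for each sample.
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)

    # alpha_t mirrors p_t: alpha for foreground samples, (1 - alpha) for background.
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)

    # Modulating factor (1 - p_t)^gamma shrinks the loss of easy (high p_t) samples.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```

Note that the per-example cross-entropy is computed first and only scaled afterwards, so this loss reduces to plain (alpha-weighted) cross-entropy when $\gamma = 0$.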