Box2Seg: Attention Weighted Loss and Discriminative Feature Learning for Weakly Supervised Segmentation

August 19, 2020

Here we present our work, Box2Seg: Attention Weighted Loss and Discriminative Feature Learning for Weakly Supervised Segmentation; we call the model Box2Seg. The paper was accepted at ECCV 2020. We focus on weakly supervised semantic segmentation and propose an approach that uses only bounding box annotations at training time.

We treat bounding boxes as noisy labels for the foreground objects. Like earlier approaches, we pre-process the bounding boxes with GrabCut to recover pseudo segmentation ground truth. One could directly train a fully supervised segmentation model on the GrabCut outputs. However, while the bounding box is a superset of the foreground pixels, GrabCut can either over-estimate or under-estimate the foreground, so we cannot ignore the bounding box annotations. To address this, we propose a novel loss function called the Attention Weighted Loss.
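To make the superset property concrete, here is a minimal NumPy sketch. It is an illustration only, not the Box2Seg pipeline: the helper names `box_mask` and `clip_to_box` and the `(x0, y0, x1, y1)` box convention are assumptions, and the GrabCut output is a hand-made toy mask rather than the result of running GrabCut.

```python
import numpy as np

def box_mask(height, width, box):
    """Boolean mask that is True inside box = (x0, y0, x1, y1), with x1/y1 exclusive."""
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=bool)
    mask[y0:y1, x0:x1] = True
    return mask

def clip_to_box(grabcut_fg, box):
    """Remove foreground pixels that fall outside the box.

    The box is a superset of the true foreground, so anything GrabCut
    marks outside it must be an over-estimate.
    """
    height, width = grabcut_fg.shape
    return grabcut_fg & box_mask(height, width, box)

# Toy "GrabCut" output on a 6x6 image: a 2x2 blob plus one leaked pixel.
grabcut_fg = np.zeros((6, 6), dtype=bool)
grabcut_fg[2:4, 2:4] = True   # plausible foreground inside the box
grabcut_fg[0, 0] = True       # over-estimate outside the box

box = (1, 1, 5, 5)            # hypothetical annotation
clipped = clip_to_box(grabcut_fg, box)
```

Clipping only handles over-estimation outside the box. Errors inside the box, including under-estimation, cannot be fixed this way, which is why a loss that tolerates noisy labels is still needed.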

The Attention Weighted Loss deals with label noise in bounding box annotations. We modify the standard cross-entropy loss by introducing a multiplicative attention score per pixel. We predict a per-class attention map that guides the per-pixel cross-entropy loss to focus on foreground pixels and refine the segmentation boundaries. This avoids propagating erroneous gradients caused by incorrect foreground labels on the background. We regularize the attention map with a soft filling-rate constraint, which is obtained from the GrabCut output.
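The idea of a per-pixel multiplicative attention on the cross-entropy can be sketched as follows. This is a hedged NumPy illustration, not the exact formulation from the paper: normalizing by the sum of attention weights, the squared-error form of the filling-rate penalty, and the function names are all assumptions made for this example.

```python
import numpy as np

def softmax(logits, axis=0):
    """Numerically stable softmax along the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention_weighted_ce(logits, labels, attention):
    """Cross-entropy where each pixel is scaled by an attention score.

    logits:    (C, H, W) raw segmentation scores
    labels:    (H, W) integer pseudo labels (possibly noisy)
    attention: (C, H, W) per-class attention in [0, 1]
    """
    probs = softmax(logits, axis=0)
    rows, cols = np.indices(labels.shape)
    p_label = probs[labels, rows, cols]       # probability of the labelled class
    a_label = attention[labels, rows, cols]   # attention for the labelled class
    ce = -np.log(p_label + 1e-12)
    # Assumed normalization: weighted mean over pixels.
    return (a_label * ce).sum() / (a_label.sum() + 1e-12)

def filling_rate_penalty(attention_c, box, target_rate):
    """Assumed soft constraint: mean attention in the box should match
    the fraction of the box that GrabCut marks as foreground."""
    x0, y0, x1, y1 = box
    return (attention_c[y0:y1, x0:x1].mean() - target_rate) ** 2

# Toy case: the network confidently predicts class 0 everywhere,
# but a 2x2 patch is (wrongly) labelled as class 1.
logits = np.zeros((2, 4, 4))
logits[0] = 5.0
labels = np.zeros((4, 4), dtype=int)
labels[:2, :2] = 1

full_attention = np.ones((2, 4, 4))
masked_attention = full_attention.copy()
masked_attention[1, :2, :2] = 0.0   # ignore the suspicious class-1 pixels

loss_full = attention_weighted_ce(logits, labels, full_attention)
loss_masked = attention_weighted_ce(logits, labels, masked_attention)
```

With attention set to zero on the mislabelled patch, those pixels contribute nothing to the loss, so `loss_masked` is much smaller than `loss_full`. Without a constraint such as the filling-rate penalty, the attention could collapse to zero everywhere, which is one reason to regularize it.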

Additionally, we learn pixel embeddings that simultaneously optimize for high intra-class feature affinity and increased discrimination between features of different classes. The goal is to train our network to produce features that help us identify which pixels are likely to take the same label. For this, we define a pairwise affinity matrix A, such that the affinity between pixels i and j is the cosine similarity between their features. Pixels that should take the same label should therefore have similar features. Before the main training procedure, we pretrain these embeddings by minimizing the cosine distance between the embeddings of pixels that should take the same label according to the GrabCut output. We then add a loss term to the training objective that minimizes disagreements between the affinities produced by the discriminative features and the actual segmentation output. By minimizing this loss, we try to harmonize the information coming from the two sources.
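A small NumPy sketch of the pairwise affinity idea follows. It uses cosine similarity as the affinity, so similar features give values near 1. The consistency term shown here, a mean squared difference between the affinity matrix and the probability that two pixels share a label under the segmentation output, is an assumption chosen for illustration and is not necessarily the loss used in the paper.

```python
import numpy as np

def cosine_affinity(features):
    """Pairwise cosine similarity for features of shape (N, D) -> (N, N)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    return unit @ unit.T

def affinity_consistency_loss(features, probs):
    """Penalize disagreement between feature affinity and segmentation.

    features: (N, D) pixel embeddings
    probs:    (N, C) per-pixel class probabilities
    probs @ probs.T is the probability that two pixels share a label
    if their labels are drawn independently from the predictions.
    """
    affinity = cosine_affinity(features)
    same_label = probs @ probs.T
    return np.mean((affinity - same_label) ** 2)

# Three pixels: the first two have parallel embeddings, the third is orthogonal.
features = np.array([[1.0, 0.0],
                     [2.0, 0.0],
                     [0.0, 3.0]])
A = cosine_affinity(features)

agreeing = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])     # matches the embeddings
disagreeing = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])  # splits pixels 0 and 1

loss_agree = affinity_consistency_loss(features, agreeing)
loss_disagree = affinity_consistency_loss(features, disagreeing)
```

When the segmentation groups pixels the same way the embeddings do, the loss is zero; when the segmentation separates two pixels with near-identical embeddings, the loss is positive, which pushes the two branches toward agreement.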

Once trained, we can discard the branches producing attention maps and discriminative features, and directly use the segmentation branch's output as the prediction at test time.

We evaluated our approach against other state-of-the-art approaches for weakly supervised semantic segmentation on the PASCAL VOC 2012 dataset. We outperform the latest papers in this domain, pushing the state of the art by more than 2%. Our weakly supervised approach is also comparable to recent fully supervised methods.

The Attention Weighted Loss also helps fully supervised segmentation on the PASCAL VOC 2012 dataset. We analysed the VOC dataset and found that roughly 10% of the segmentation ground truth in VOC is missing foregrounds. However, the bounding boxes are exhaustive and cover all objects in the dataset. Because our Attention Weighted Loss automatically discovers attention maps, it can compensate for the missing annotations. As a result, simply introducing the Attention Weighted Loss yields a 1.5% improvement in mean IoU over the fully supervised baseline.

Our other qualitative results and ablation studies show the benefit of different loss terms on the overall performance of our model Box2Seg.

Code will be released here


Copyright 2015 Viveka Kulharia. Last update (ET): 2020-11-03 11:24:14 +0000.