Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Metadata

Slug: faster_rcnn_2015
Year: 2015
Venue: NeurIPS
Authors: Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun
Reading status: read complete
Compute regime: Single-GPU deep learning
Primary sources: PDF, extracted text

Compute Setup

Table 5 is the key hardware statement: all inference timings are in milliseconds on a K40 GPU, except Selective Search proposal generation, which is evaluated on CPU. The full VGG-16 Faster R-CNN system runs at 198 ms per image (5 fps), and the ZF version runs at 59 ms per image (17 fps).

For the larger COCO training runs, the paper states that the models are trained on an 8-GPU implementation. The effective mini-batch size becomes 8 for RPN (one image per GPU) and 16 for Fast R-CNN (two images per GPU). The COCO RPN and Fast R-CNN steps are each trained for 240k iterations at learning rate 0.003 and then 80k more at 0.0003. The paper starts at 0.003 rather than the 0.001 used on PASCAL VOC because the mini-batch size changes.
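The COCO schedule above can be written down as a small step function. This is a minimal sketch of the numbers quoted in these notes, not the authors' solver configuration; the function names `coco_learning_rate` and `effective_batch` are hypothetical.

```python
def coco_learning_rate(iteration: int) -> float:
    """Step schedule from the notes: 0.003 for 240k iterations, then 0.0003 for 80k."""
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    if iteration < 240_000:
        return 0.003
    if iteration < 320_000:
        return 0.0003
    raise ValueError("the schedule ends after 320k iterations")


def effective_batch(num_gpus: int, images_per_gpu: int) -> int:
    """Effective mini-batch size under synchronous multi-GPU training."""
    return num_gpus * images_per_gpu


rpn_batch = effective_batch(num_gpus=8, images_per_gpu=1)        # RPN step
fast_rcnn_batch = effective_batch(num_gpus=8, images_per_gpu=2)  # Fast R-CNN step
print(rpn_batch, fast_rcnn_batch, coco_learning_rate(250_000))
```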

Bottleneck

After Fast R-CNN, region proposal generation becomes the bottleneck. Fast R-CNN had already reduced the detector's repeated convolution cost by sharing image-level convolutional features, so CPU proposal algorithms became the slowest remaining stage. The paper reports that Selective Search takes about 2 seconds per image on CPU, that EdgeBoxes takes about 0.2 seconds per image, and that the proposal step can consume as much time as the detection network.

This is a device mismatch problem. The detector is GPU-accelerated and dense-convolution heavy, but widely used proposal methods are CPU algorithms based on superpixels or hand-designed edge cues. Faster R-CNN's core move is to make proposal generation a GPU CNN computation that reuses the already-computed convolutional feature map. In the paper's phrasing, the marginal cost of proposals becomes small once the full-image features exist.

Method Adaptation

Faster R-CNN adapts the detector pipeline by:

Adding a Region Proposal Network that shares full-image convolutional features with the detector.
Using anchors at multiple scales and aspect ratios.
Reducing proposal count to about 300.
Making proposal generation a small marginal GPU cost.
Training proposal and detection stages with shared convolutional features.

The RPN is a small fully convolutional network over the detector's feature map. At each sliding-window location it predicts objectness scores and box-regression offsets for anchors of multiple scales and aspect ratios. This fits GPU computation because the expensive convolutional backbone is shared over the image, while the proposal head is a dense local operation over the feature grid. The paper's default anchor setting uses three scales and three aspect ratios, which keeps the proposal vocabulary regular enough to evaluate everywhere.
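The anchor layout described above can be sketched in a few lines of plain Python. The scale values (128, 256, 512 pixels), the ratios (1:2, 1:1, 2:1), and the stride of 16 are the paper's defaults as I recall them and are not stated in these notes, so treat them as assumptions. The helper names `anchors_at` and `grid_anchors` and the cell-centre convention are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image pixels


def anchors_at(cx: float, cy: float,
               scales: Sequence[float] = (128, 256, 512),
               ratios: Sequence[float] = (0.5, 1.0, 2.0)) -> List[Box]:
    """Return one anchor per (scale, ratio) pair, all centred on (cx, cy).

    Each anchor has area scale**2 and height/width equal to ratio.
    """
    boxes = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


def grid_anchors(feat_h: int, feat_w: int, stride: int = 16,
                 scales: Sequence[float] = (128, 256, 512),
                 ratios: Sequence[float] = (0.5, 1.0, 2.0)) -> List[Box]:
    """Tile the same anchor set over every feature-map location."""
    out: List[Box] = []
    for iy in range(feat_h):
        for ix in range(feat_w):
            # Assumed convention: anchor centre sits at the centre of the cell.
            cx = ix * stride + stride / 2
            cy = iy * stride + stride / 2
            out.extend(anchors_at(cx, cy, scales, ratios))
    return out


anchors = grid_anchors(feat_h=40, feat_w=60)
print(len(anchors))  # 40 * 60 * 9 = 21600
```

Because the anchor set is identical at every location, the proposal head only needs a fixed number of outputs per position (2 objectness scores and 4 box offsets per anchor), which is what makes it a dense convolutional operation.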

The design is also an inference-budget adaptation. R-CNN evaluated about 2000 warped regions through a CNN. Fast R-CNN still commonly used 2000 Selective Search boxes. Faster R-CNN cuts the test-time proposal count to 300 while preserving detection mAP. That reduces the region-wise work, which Table 5 defines as non-maximum suppression, RoI pooling, fully connected layers, and softmax classification. The region-wise VGG cost falls from 174 ms with Selective Search proposals to 47 ms with RPN proposals, which the paper attributes to the smaller number of proposals (300 per image).
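The reduction to 300 proposals can be illustrated with greedy non-maximum suppression followed by a top-N cut. This is a minimal sketch, not the authors' implementation. The IoU threshold of 0.7 is the value I recall from the paper and is not stated in these notes, and the function names `iou` and `select_proposals` are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_proposals(boxes: Sequence[Box], scores: Sequence[float],
                     iou_threshold: float = 0.7, top_n: int = 300) -> List[int]:
    """Greedy NMS in descending score order, keeping at most top_n boxes.

    Stopping early at top_n gives the same result as running NMS to
    completion and then taking the top_n survivors, because boxes are
    visited in score order.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if len(keep) == top_n:
                break
    return keep


boxes = [(0, 0, 10, 10), (0, 0, 10, 9), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(select_proposals(boxes, scores))  # [0, 2]: box 1 overlaps box 0 with IoU 0.9
```

Every box that survives this step is pooled and classified separately, so the downstream region-wise cost scales roughly with `top_n`.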

Training is not fully end-to-end in the reported recipe. The paper adopts a pragmatic four-step alternating algorithm: train the RPN; train Fast R-CNN on the RPN's proposals; re-initialize the RPN from the detector and fine-tune only the RPN-specific layers with the shared convolutional layers fixed; then fine-tune only the detector-specific layers, again with the shared convolutional layers fixed. The paper also describes approximate joint training, which cuts training time by about 25-50% compared with alternating training but ignores gradients with respect to proposal box coordinates. The reported experiments keep the more conservative alternating recipe.
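The four alternating steps can be summarized as a table of which parameter groups are updated at each step. This is only a bookkeeping sketch under my reading of the paper: the group names (`backbone`, `rpn_head`, `detector_head`), the `Step` class, and the assumption that step 4 consumes proposals from the step 3 RPN are illustrative, not taken from the authors' code.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Step:
    name: str
    backbone_init: str           # where the convolutional layers are initialized from
    trainable: Tuple[str, ...]   # parameter groups updated in this step
    proposals_from: str          # proposal source for detector training ("" if unused)


SCHEDULE = (
    Step("1: train RPN", "ImageNet-pretrained", ("backbone", "rpn_head"), ""),
    Step("2: train Fast R-CNN", "ImageNet-pretrained", ("backbone", "detector_head"), "step 1 RPN"),
    Step("3: fine-tune RPN", "step 2 detector", ("rpn_head",), ""),
    # Assumption: the final detector fine-tune uses proposals from the step 3 RPN.
    Step("4: fine-tune Fast R-CNN", "step 2 detector", ("detector_head",), "step 3 RPN"),
)


def backbone_frozen_after(schedule: Sequence[Step], first_shared_step: int) -> bool:
    """True if no step from first_shared_step onward (1-indexed) updates the backbone."""
    return all("backbone" not in s.trainable for s in schedule[first_shared_step - 1:])


for step in SCHEDULE:
    print(f"{step.name}: init={step.backbone_init}, trains={', '.join(step.trainable)}")
print(backbone_frozen_after(SCHEDULE, first_shared_step=3))
```

The point of the layout is that steps 1 and 2 produce two networks with different convolutional weights, and only from step 3 onward do both heads sit on the same frozen backbone.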

Evidence

The timing table is the clearest compute evidence. On the K40 setup, VGG with Selective Search plus Fast R-CNN spends 146 ms in convolution, 1510 ms in proposals, 174 ms region-wise, and 1830 ms total, for 0.5 fps. VGG with RPN plus Fast R-CNN spends 141 ms in convolution, 10 ms in proposals, 47 ms region-wise, and 198 ms total, for 5 fps. The lighter ZF network reaches 31 ms convolution, 3 ms proposal, 25 ms region-wise, and 59 ms total, for 17 fps.
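The totals, frame rates, and proposal share follow directly from the per-stage timings quoted above. The sketch below only re-derives them from those numbers; the dictionary layout and the `summarize` helper are hypothetical.

```python
from typing import Dict

# Per-stage timings in milliseconds, as quoted in the notes from Table 5.
TIMINGS_MS: Dict[str, Dict[str, int]] = {
    "VGG, Selective Search + Fast R-CNN": {"conv": 146, "proposal": 1510, "region_wise": 174},
    "VGG, RPN + Fast R-CNN": {"conv": 141, "proposal": 10, "region_wise": 47},
    "ZF, RPN + Fast R-CNN": {"conv": 31, "proposal": 3, "region_wise": 25},
}


def summarize(stages: Dict[str, int]) -> Dict[str, float]:
    """Total latency, frames per second, and the fraction spent on proposals."""
    total = sum(stages.values())
    return {
        "total_ms": total,
        "fps": 1000 / total,
        "proposal_share": stages["proposal"] / total,
    }


for name, stages in TIMINGS_MS.items():
    s = summarize(stages)
    print(f"{name}: {s['total_ms']} ms, {s['fps']:.1f} fps, "
          f"proposals {s['proposal_share']:.0%} of total")
```

With Selective Search, proposals account for more than four fifths of the per-image time. With RPN they drop to roughly a twentieth, and the convolutional backbone becomes the largest single stage.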

The accuracy tables show that the speedup is not bought by losing detection accuracy. On PASCAL VOC 2007 with VGG-16, shared RPN features and 300 proposals reach 69.9 mAP with VOC07 training, 73.2 with VOC07+12, and 78.8 with COCO+VOC07+12. On VOC 2012, RPN with 300 proposals reaches 70.4 mAP with VOC07++12 and 75.9 with COCO+VOC07++12. The anchor ablation supports the compute design: a single scale with a single ratio gives 65.8 or 66.7 mAP depending on which scale is used, while the default three scales and three ratios reach 69.9.

The COCO section adds larger-data evidence. Faster R-CNN with VGG-16 and 300 RPN proposals reaches 42.1 mAP@0.5 and 21.5 mAP@[.5,.95] on COCO test-dev when trained on COCO train, versus the paper's Fast R-CNN baseline at 39.3 and 19.3. Training on COCO trainval gives 42.7 and 21.9, with testing still about 200 ms per image.

The one-stage comparison also matters. A dense sliding-window emulation with 20,000 proposals reaches 53.8-53.9 mAP using ZF, while the two-stage RPN plus Fast R-CNN system reaches 58.7 mAP with only 300 proposals.

Historical Effect

Faster R-CNN unified proposal generation and object detection under shared CNN computation. Historically, it marks the point where object detection stopped treating proposal generation as a separate CPU preprocessor and made it part of the neural network. The bottleneck moved from external proposal algorithms to GPU feature extraction and region-wise classification.

It became a foundation for two-stage detection systems and for ILSVRC/COCO-era dense vision pipelines. The paper reports that RPN was a building block in winning 2015 ILSVRC and COCO entries, including systems that replaced VGG-16 with ResNet-101.

Limits

Training is still multi-stage in the reported recipe, and approximate joint training ignores gradients with respect to proposal box coordinates.
VGG-16 still dominates runtime (convolution accounts for 141 of the 198 ms per image in Table 5) and memory; RPN removes the proposal bottleneck but not the cost of a heavy backbone.
Anchor scales and aspect ratios are manual design choices.
The design keeps a two-stage cascade, which improves accuracy in the paper but leaves more moving parts than a single dense detector.
Selective Search is still used as a timing baseline, but its CPU implementation makes the comparison a cross-device pipeline comparison rather than a pure GPU algorithm comparison.