Fast R-CNN

Metadata

Slug: fast_rcnn_2015
Year: 2015
Venue: ICCV
Author: Ross Girshick
Reading status: read complete
Compute regime: Single-GPU deep learning
Primary sources: PDF, extracted text

Compute Setup

The paper explicitly states that all timing measurements use one NVIDIA K40 GPU overclocked to 875 MHz. This is a single-GPU detector paper: the timing tables compare Fast R-CNN, R-CNN, and SPPnet with the same VGG16 backbone and include training hours, per-image test rate, speedups, and VOC07 mAP. Proposal time is excluded from the fastest detector timings, so the GPU measurements isolate the ConvNet and detector head rather than the full selective-search pipeline.

Fine-tuning uses N = 2 images and R = 128 RoIs per minibatch, sampling 64 RoIs from each image. RoIs from the same image share the convolutional feature map, so their forward and backward passes share computation and memory: one minibatch needs 2 backbone passes rather than 128. The paper says this is roughly 64x faster than sampling one RoI from each of 128 images. For VGG16, scale choices are memory-constrained: the single-scale setting resizes each image so its shortest side is s = 600 pixels with the longest side capped at 1000, and multi-scale VGG16 is limited by GPU memory.
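The hierarchical sampling described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names, the toy proposal dictionary, and the fallback to sampling with replacement are assumptions, and the sketch ignores any foreground/background balancing of RoIs.

```python
import random


def sample_minibatch(rois_by_image, n_images=2, rois_per_batch=128, seed=0):
    """Pick n_images images first, then an equal share of RoIs from each."""
    rng = random.Random(seed)
    per_image = rois_per_batch // n_images  # 64 when N = 2 and R = 128
    image_ids = rng.sample(sorted(rois_by_image), n_images)
    batch = []
    for image_id in image_ids:
        rois = rois_by_image[image_id]
        if len(rois) >= per_image:
            # Enough proposals: sample without replacement.
            chosen = rng.sample(rois, per_image)
        else:
            # Assumption: fall back to sampling with replacement.
            chosen = [rng.choice(rois) for _ in range(per_image)]
        batch.extend((image_id, roi) for roi in chosen)
    return batch


def conv_passes(batch):
    """Distinct images in the batch, i.e. backbone forward passes needed."""
    return len({image_id for image_id, _ in batch})


# Toy data (hypothetical): 200 images with 300 proposal ids each.
proposals = {image_id: list(range(300)) for image_id in range(200)}
batch = sample_minibatch(proposals)
print(len(batch), conv_passes(batch))  # 128 2
```

With one RoI drawn from each of 128 images, the same 128 RoIs would need 128 backbone passes instead of 2, which is where the roughly 64x figure comes from.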

Bottleneck

The bottleneck in R-CNN is redundant convolution. Each image has thousands of proposals, and R-CNN runs the ConvNet separately on each warped proposal. With VGG16, the paper reports 47 seconds per image on a GPU. Training is also multi-stage and I/O-heavy: features are extracted from each proposal, written to disk, then used to train SVMs and regressors. For VOC07 trainval, VGG16 feature extraction takes 2.5 GPU-days, and cached features occupy hundreds of gigabytes.

SPPnet removed some test-time redundancy by pooling from a shared feature map, but its training still freezes the convolutional layers because backpropagation through the spatial pyramid pooling layer is inefficient when each RoI comes from a different image. Fast R-CNN therefore needs to share convolutional work within an image and keep the whole network trainable end to end.

Method Adaptation

Fast R-CNN moves computation from per-proposal to per-image. The network processes the whole image once, then RoI pooling crops each proposal's region of the shared feature map and max-pools it into a fixed-size grid, 7 x 7 for VGG16. The head branches into a softmax classifier and class-specific bounding-box regressors. Because gradients flow through RoI pooling back to the shared feature map, the model can fine-tune the convolutional backbone and train localization and classification jointly.
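A minimal pure-Python sketch of RoI max pooling on a single-channel feature map may make the operation concrete. The function names and the floor/ceil rule for bin boundaries are assumptions chosen for illustration; real implementations work on multi-channel tensors and also record the argmax positions needed for the backward pass.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_h=7, out_w=7):
    """Max-pool one RoI of a 2-D feature map into an out_h x out_w grid.

    feature_map: list of rows (one channel).
    roi: (r0, c0, r1, c1), inclusive corners in feature-map coordinates.
    """
    r0, c0, r1, c1 = roi
    roi_h, roi_w = r1 - r0 + 1, c1 - c0 + 1
    pooled = []
    for i in range(out_h):
        # Bin i spans rows [floor(i*h/H), ceil((i+1)*h/H)) inside the RoI,
        # which always contains at least one row.
        top = r0 + (i * roi_h) // out_h
        bottom = r0 + ceil_div((i + 1) * roi_h, out_h)
        row = []
        for j in range(out_w):
            left = c0 + (j * roi_w) // out_w
            right = c0 + ceil_div((j + 1) * roi_w, out_w)
            row.append(max(feature_map[r][c]
                           for r in range(top, bottom)
                           for c in range(left, right)))
        pooled.append(row)
    return pooled


# Toy 6 x 8 map whose value at (r, c) is 10*r + c; pool a 4 x 5 RoI to 2 x 2.
feature_map = [[10 * r + c for c in range(8)] for r in range(6)]
pooled = roi_max_pool(feature_map, (1, 2, 4, 6), out_h=2, out_w=2)
print(pooled)  # [[24, 26], [44, 46]]
```

Whatever the RoI's size, the output grid has a fixed shape, which is what lets proposals of different sizes feed the same fully connected head.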

The minibatch scheme is a direct adaptation to GPU memory and bandwidth. With N = 2 images and R = 128 RoIs, the expensive convolutional activations for an image are reused by 64 proposals instead of being recomputed or loaded repeatedly. This reduces both arithmetic and activation movement. The scheme accepts correlated RoIs within a minibatch as the cost of compute sharing; the authors report that this correlation does not slow convergence in practice.

The final adaptation is truncated SVD for the fully connected layers. Detection evaluates thousands of RoIs per image, so fc6 and fc7 remain expensive even after convolution sharing. The paper keeps the top 1024 singular values of VGG16's 25088 x 4096 fc6 matrix and the top 256 of fc7's 4096 x 4096 matrix, replacing each large matrix with a product of two smaller ones.
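The saving from truncation can be checked with simple arithmetic. Factoring a u x v weight matrix at rank t replaces u*v weights with t*(u + v). The sketch below applies that count to the two layers; the helper name is hypothetical, and the counts ignore biases.

```python
def svd_param_counts(u, v, t):
    """Weights before (u*v) and after (t*(u+v)) a rank-t factorization."""
    return u * v, t * (u + v)


# (u, v, t) for VGG16's two large fully connected layers.
layers = {"fc6": (25088, 4096, 1024), "fc7": (4096, 4096, 256)}
for name, (u, v, t) in layers.items():
    full, compressed = svd_param_counts(u, v, t)
    ratio = full / compressed
    print(f"{name}: {full:,} -> {compressed:,} weights ({ratio:.1f}x fewer)")
# fc6: 102,760,448 -> 29,884,416 weights (3.4x fewer)
# fc7: 16,777,216 -> 2,097,152 weights (8.0x fewer)
```

Since the head runs once per RoI, the reduction in multiply-adds applies to every one of the thousands of proposals in an image.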

Evidence

For VGG16, R-CNN training takes 84 hours and testing runs at 47.0 seconds per image. Fast R-CNN trains in 9.5 hours and tests at 0.32 seconds per image without SVD, or 0.22 seconds with SVD. The paper reports this as a roughly 9x training speedup and a test-time speedup of 146x without SVD or 213x with it. Compared with SPPnet, VGG16 Fast R-CNN trains in 9.5 rather than 25.5 hours and tests 7x faster without SVD or 10x faster with it. The on-disk feature cache is also no longer needed.
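The quoted speedups follow directly from the timings above, as the short check below shows. The variable and function names are illustrative; the numbers are the ones cited in these notes.

```python
def speedup(baseline, improved):
    """How many times faster `improved` is than `baseline`."""
    return baseline / improved


rcnn_train_h, rcnn_test_s = 84.0, 47.0
sppnet_train_h = 25.5
fast_train_h, fast_test_s, fast_test_svd_s = 9.5, 0.32, 0.22

print(f"train vs R-CNN:   {speedup(rcnn_train_h, fast_train_h):.1f}x")
print(f"train vs SPPnet:  {speedup(sppnet_train_h, fast_train_h):.1f}x")
print(f"test, no SVD:     {speedup(rcnn_test_s, fast_test_s):.1f}x")
print(f"test, with SVD:   {speedup(rcnn_test_s, fast_test_svd_s):.1f}x")
```

The test-time ratios come out near 146.9 and 213.6, so the 146x and 213x figures are the truncated values, and the 9x training figure is a rounded 8.8x.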

Accuracy is not sacrificed. On VOC07 with VGG16, Fast R-CNN reaches 66.9 mAP, compared with 66.0 for R-CNN and 63.1 for SPPnet. On VOC12, it reaches 65.7 mAP, and 68.4 with extra data. Freezing the convolutional layers and fine-tuning only fc6 and above drops VOC07 mAP from 66.9 to 61.4. The default fine-tunes conv3_1 and up: starting from conv2_1 instead slows training by 1.3x (12.5 versus 9.5 hours), and starting from conv1_1 overruns GPU memory, so conv3_1-and-up is a trade-off between accuracy, training time, and memory.

SVD confirms the detector-head bottleneck: before compression, fc6 and fc7 take 45% of the forward time. After SVD, VGG16 timing drops from 320 ms/image to 223 ms/image, a reduction of about 30%, with mAP changing only from 66.9 to 66.6.

Historical Effect

Fast R-CNN made region-based CNN detection fit the practical single-GPU regime. It converted object detection from an offline feature-caching pipeline into an end-to-end trainable network whose convolutional work is shared across proposals. Once the image-level feature map is the shared substrate, proposals become indexed views rather than separate ConvNet examples.

That shift also clarified the next bottleneck. The paper's fastest timings exclude object proposal time, so after Fast R-CNN accelerates the ConvNet path, selective search remains outside the GPU detector and becomes the obvious remaining latency source. This is the direct setup for Faster R-CNN, which moves proposal generation into the network.

Limits

The main limit is that Fast R-CNN still depends on external object proposals, and the 0.22 to 0.32 second detector timings exclude proposal generation. It is therefore not a complete real-time detector by itself. The memory constraints are also visible throughout the paper: VGG16 cannot use the full multi-scale setup in the implementation, conv1_1 fine-tuning overruns GPU memory, and image side lengths are capped to fit training.

Fast R-CNN also leaves dense RoI evaluation in the fully connected head. Truncated SVD reduces the cost with little accuracy loss, but that compression is an extra approximation layered on top of the detector. Finally, the hierarchical minibatch strategy depends on many proposals per image; it is excellent for proposal-based detection, but it is not a general batching solution for all dense prediction workloads.