Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

Metadata

Slug: rcnn_2014
Year: 2014
Venue: CVPR
Authors: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
Reading status: read complete
Compute regime: Single-GPU deep learning
Primary sources: PDF, extracted text

Compute Setup

The paper reports test-time runtime on both GPU and CPU, and it names the GPU used for one training run. At test time, R-CNN runs Selective Search, warps each proposal, and forwards each region through a Caffe CNN. The source reports that computing region proposals and CNN features takes 13 seconds per image on a GPU or 53 seconds per image on a CPU, a cost that is shared across all object classes. In the ILSVRC2013 section, the paper also states that CNN fine-tuning for 50k SGD iterations on a single NVIDIA Tesla K20 took 13 hours using Caffe. The acknowledgments separately note NVIDIA GPU support.

Device basis: explicit GPU/CPU runtime comparison and an explicit Tesla K20 fine-tuning run. The exact GPU used for the 13 seconds per image test-time number is not separately named in the extracted text, but the paper's reported implementation is clearly a Caffe-era single-GPU detection pipeline with CPU comparison.

Bottleneck

The bottleneck is applying an ImageNet-scale CNN to object detection before convolutional feature sharing across proposals existed. Dense sliding windows are poorly matched to precise localization in a deep network whose upper-layer units have large receptive fields and strides, but proposal-level detection requires roughly 2000 CNN forward passes per image. R-CNN therefore replaces hand-designed features with a much stronger learned representation, while exposing the cost of per-region CNN evaluation and the storage cost of proposal features.

The compute structure is hybrid. Selective Search is an external bottom-up proposal generator. The GPU is used as a fixed-shape CNN feature extractor by warping each proposal to 227 x 227 RGB pixels. The resulting 4096-dimensional feature vectors make class scoring cheap: the paper notes that all dot products for one image can be batched into a 2000 x 4096 by 4096 x N matrix multiplication. That batching is why the class-specific part scales to many classes, while the expensive part remains proposal feature extraction.
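The batched scoring step can be sketched in NumPy. This is a minimal illustration, not the paper's code: the random arrays stand in for cached CNN features and learned SVM parameters, and N = 20 is an assumed class count chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

num_proposals = 2000   # roughly the per-image proposal count cited above
feat_dim = 4096        # CNN feature dimension
num_classes = 20       # hypothetical N, assumed for illustration

# Stand-ins for cached CNN features and learned SVM parameters.
features = rng.standard_normal((num_proposals, feat_dim), dtype=np.float32)
svm_weights = rng.standard_normal((feat_dim, num_classes), dtype=np.float32)
svm_bias = np.zeros(num_classes, dtype=np.float32)

# One matrix-matrix product scores every proposal against every class.
scores = features @ svm_weights + svm_bias   # shape: (2000, N)

# Equivalent per-class loop, shown only to make the batching explicit.
loop_scores = np.stack(
    [features @ svm_weights[:, c] + svm_bias[c] for c in range(num_classes)],
    axis=1,
)
```

Adding classes only widens the weight matrix; the 2000 x 4096 feature matrix, which is the expensive part to produce, is unchanged.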

Method Adaptation

R-CNN adapts CNNs to the single-GPU detection setup by:

Generating bottom-up region proposals.
Warping each proposal to 227 x 227 pixels.
Extracting 4096-dimensional CNN features.
Training class-specific linear SVMs.
Using ImageNet supervised pretraining followed by detection fine-tuning.

This turns the GPU CNN into a reusable feature extractor for a proposal pipeline. The design is compute-aware in several ways. First, it avoids a dense image pyramid of class-specific sliding windows and evaluates only category-independent candidate regions. Second, it forces irregular proposal boxes into the fixed input shape required by the AlexNet-style Caffe CNN, accepting geometric warping to keep the GPU kernel path simple. Third, it separates the slow CNN pass from cheap linear classifiers, allowing the 4096-dimensional features to be cached and reused across SVM training, bounding-box regression, and later analysis.
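The fixed-shape warping step can be sketched as follows. This is a simplified illustration under stated assumptions: `warp_proposal` is a hypothetical helper, nearest-neighbour sampling is an arbitrary choice made here, and the paper's own warping details (such as any context padding around the box) are not reproduced.

```python
import numpy as np

def warp_proposal(image, box, size=227):
    """Crop box = (x1, y1, x2, y2) from an H x W x 3 image and stretch it
    to size x size, ignoring aspect ratio (nearest-neighbour sampling)."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    h, w = crop.shape[:2]
    # Map each output pixel back to a source row/column in the crop.
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return crop[rows[:, None], cols[None, :]]

image = np.zeros((375, 500, 3), dtype=np.uint8)     # hypothetical image
boxes = [(10, 20, 210, 120), (300, 50, 340, 350)]   # hypothetical proposals

# Irregular boxes become one fixed-shape batch for the CNN.
batch = np.stack([warp_proposal(image, b) for b in boxes])
```

Whatever the box shape, every proposal ends up as the same 227 x 227 x 3 tensor, so the CNN never sees a variable-size input.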

The batching choice matters. Rather than run one classifier per class per proposal as small scalar operations, the paper emphasizes a single matrix-matrix product over all proposals and classes. That is a good fit for CPU BLAS and for the dense linear-algebra regime of the period. It also explains why R-CNN could contemplate thousands or even 100k classes after feature extraction: the paper estimates 100k linear predictors would take 1.5GB with R-CNN's features, versus 134GB for a much larger bag-of-visual-words feature set.
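The memory figures can be checked with simple arithmetic. The sketch below assumes 4-byte float32 weights, one weight vector per class, and sizes reported in units of 2^30 bytes. The 360k feature dimension for the bag-of-visual-words system is an assumption not stated in these notes, used only to show that the 134GB figure is consistent.

```python
BYTES_PER_FLOAT32 = 4
GIB = 1024 ** 3

def predictor_memory_gib(num_classes, feat_dim):
    """Memory for one dense linear predictor per class, in GiB."""
    return num_classes * feat_dim * BYTES_PER_FLOAT32 / GIB

# R-CNN: 100k classes x 4096-dimensional features.
rcnn_gib = predictor_memory_gib(100_000, 4096)       # about 1.53

# Assumed ~360k-dimensional bag-of-visual-words features.
bovw_gib = predictor_memory_gib(100_000, 360_000)    # about 134.1
```

Under these assumptions the gap comes entirely from feature dimensionality: the predictor count is the same in both cases.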

Training also reflects memory limits. Detection fine-tuning samples 32 positive windows and 96 background windows for a mini-batch of 128. In the OverFeat comparison appendix, the authors report using smaller minibatches for O-Net because they were required to fit within GPU memory. The method is therefore not just "CNN plus proposals"; it is a way to fit transfer learning, proposal warping, feature caching, and batched linear classification into the memory and throughput envelope of early commodity GPU deep learning.
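The 32/96 mini-batch composition can be sketched as below. This is an illustrative sampler, not the paper's implementation: `sample_minibatch` is a hypothetical helper, the label array is synthetic, and positives are drawn uniformly over positive windows, which may differ from how the authors balance classes.

```python
import numpy as np

def sample_minibatch(labels, rng, num_pos=32, num_bg=96):
    """labels: 1-D int array, 0 = background, >0 = object class.
    Returns indices of a 128-window mini-batch (32 positive, 96 background)."""
    pos_idx = np.flatnonzero(labels > 0)
    bg_idx = np.flatnonzero(labels == 0)
    # Sample with replacement only if a pool is too small to fill its quota.
    pos = rng.choice(pos_idx, size=num_pos, replace=len(pos_idx) < num_pos)
    bg = rng.choice(bg_idx, size=num_bg, replace=len(bg_idx) < num_bg)
    batch = np.concatenate([pos, bg])
    rng.shuffle(batch)
    return batch

rng = np.random.default_rng(0)
labels = np.zeros(5000, dtype=np.int64)
labels[:200] = rng.integers(1, 21, size=200)   # 200 synthetic positive windows
batch = sample_minibatch(labels, rng)
```

Fixing the positive fraction at one quarter biases each batch toward positives, which are rare relative to background among proposals.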

Evidence

The runtime evidence is direct. The paper reports 13 seconds per image on GPU and 53 seconds per image on CPU for proposal and feature computation. It then argues that, once features are extracted, the remaining class scoring is inexpensive because all dot products are batched. The ILSVRC setup gives a concrete training scale: 50k fine-tuning SGD iterations on a single Tesla K20 took 13 hours.
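A few derived quantities follow from these reported numbers. The sketch below is back-of-envelope arithmetic only; the per-proposal figure is an upper bound because the 13 seconds also includes proposal generation, and 2000 is the approximate proposal count rather than an exact one.

```python
gpu_s_per_image = 13.0
cpu_s_per_image = 53.0
proposals_per_image = 2000   # approximate count
finetune_hours = 13.0
finetune_iters = 50_000

# GPU speedup over CPU for proposals plus features: about 4.1x.
gpu_speedup = cpu_s_per_image / gpu_s_per_image

# Upper bound on GPU time per proposal: 6.5 ms.
ms_per_proposal_bound = 1000 * gpu_s_per_image / proposals_per_image

# Average wall-clock time per fine-tuning iteration on the K20: 0.936 s.
s_per_iter = finetune_hours * 3600 / finetune_iters
```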

The accuracy evidence explains why the expensive pipeline was historically important. On PASCAL VOC 2010, R-CNN reaches 53.7% mAP, compared with 35.1% for the UVA Selective Search system (the prior region-based approach) and 33.4% for DPM in the paper's comparison. On VOC 2011/12, it reports 53.3% mAP. Fine-tuning the CNN for detection improves mAP by about 8 percentage points, showing that supervised ImageNet pretraining plus domain-specific adaptation was not merely a convenience for small datasets but a measurable performance step.

On the 200-class ILSVRC2013 detection task, R-CNN reaches 31.4% mAP, ahead of OverFeat at 24.3%. The source also reports that Selective Search on ILSVRC validation produced an average of 2403 proposals per image with 91.6% recall at 0.5 IoU, lower than the approximately 98% recall on PASCAL. That exposes a second bottleneck: proposal quality and proposal count interact with runtime and recall, so detection accuracy is limited by both the CNN representation and the external proposal stage.
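Proposal recall at a 0.5 IoU threshold can be computed as in the sketch below. It is a minimal illustration with made-up boxes: `iou` and `proposal_recall` are hypothetical helpers, and coordinates are treated as continuous (x1, y1, x2, y2) values, so pixel-inclusive conventions used by some benchmarks would give slightly different numbers.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def proposal_recall(gt_boxes, proposals, thresh=0.5):
    """Fraction of ground-truth boxes covered by at least one proposal."""
    hits = [iou(gt, proposals).max() >= thresh for gt in gt_boxes]
    return float(np.mean(hits))

gt = np.array([[0, 0, 100, 100], [200, 200, 300, 300]], dtype=float)
props = np.array([[10, 10, 110, 110], [400, 400, 450, 450]], dtype=float)
recall = proposal_recall(gt, props)   # first box is covered, second is not
```

A ground-truth object that no proposal overlaps at the threshold cannot be detected by any downstream classifier, which is why proposal recall caps detection accuracy.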

Historical Effect

R-CNN made CNN features dominant in object detection and showed that ImageNet pretraining could transfer to smaller structured vision datasets. Its historical role also includes exposing the next bottleneck. Once the CNN representation won on accuracy, the cost of running the same convolutional stack thousands of times per image became indefensible. That is the line from R-CNN to SPPnet, Fast R-CNN, and Faster R-CNN: share convolutional features over the image, pool per region, and eventually learn proposals on the same feature map.

The card belongs in the single-GPU deep-learning regime because the method is organized around what a 2013-2014 GPU could do well: dense CNN forward passes and dense matrix products. It also shows what that regime could not yet do conveniently: end-to-end detection with shared proposal and classification computation. R-CNN's feature-cache/SVM architecture was a pragmatic bridge.

Limits

Multi-stage training and inference: proposal generation, CNN fine-tuning, feature extraction, SVM training, and bounding-box regression are separate pieces.
Slow per-proposal CNN computation: proposal generation plus about 2000 warped CNN forward passes cost 13 seconds per image on a GPU.
External proposal generation: Selective Search recall and proposal count constrain both accuracy and runtime.
Large feature caches: the method trades GPU time for stored 4096-dimensional proposal features.
Exact test-time GPU SKU for the 13 seconds per image number is not named in the extracted text, even though the fine-tuning GPU is identified as a Tesla K20.
