Fast R-CNN

Ross Girshick
Microsoft Research
rbg@microsoft.com

arXiv:1504.08083v2 [cs.CV] 27 Sep 2015

Abstract

This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9× faster than R-CNN, is 213× faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3× faster, tests 10× faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.

1. Introduction

Recently, deep ConvNets [14, 16] have significantly improved image classification [14] and object detection [9, 19] accuracy. Compared to image classification, object detection is a more challenging task that requires more complex methods to solve. Due to this complexity, current approaches (e.g., [9, 11, 19, 25]) train models in multi-stage pipelines that are slow and inelegant.

Complexity arises because detection requires the accurate localization of objects, creating two primary challenges. First, numerous candidate object locations (often called “proposals”) must be processed. Second, these candidates provide only rough localization that must be refined to achieve precise localization. Solutions to these problems often compromise speed, accuracy, or simplicity.

In this paper, we streamline the training process for state-of-the-art ConvNet-based object detectors [9, 11]. We propose a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.

The resulting method can train a very deep detection network (VGG16 [20]) 9× faster than R-CNN [9] and 3× faster than SPPnet [11]. At runtime, the detection network processes images in 0.3s (excluding object proposal time) while achieving top accuracy on PASCAL VOC 2012 [7] with a mAP of 66% (vs. 62% for R-CNN).¹

1.1. R-CNN and SPPnet

The Region-based Convolutional Network method (R-CNN) [9] achieves excellent object detection accuracy by using a deep ConvNet to classify object proposals. R-CNN, however, has notable drawbacks:

 1. Training is a multi-stage pipeline. R-CNN first fine-tunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learnt by fine-tuning. In the third training stage, bounding-box regressors are learned.
 2. Training is expensive in space and time. For SVM and bounding-box regressor training, features are extracted from each object proposal in each image and written to disk. With very deep networks, such as VGG16, this process takes 2.5 GPU-days for the 5k images of the VOC07 trainval set. These features require hundreds of gigabytes of storage.
 3. Object detection is slow. At test-time, features are extracted from each object proposal in each test image. Detection with VGG16 takes 47s / image (on a GPU).

R-CNN is slow because it performs a ConvNet forward pass for each object proposal, without sharing computation. Spatial pyramid pooling networks (SPPnets) [11] were proposed to speed up R-CNN by sharing computation. The SPPnet method computes a convolutional feature map for the entire input image and then classifies each object proposal using a feature vector extracted from the shared feature map. Features are extracted for a proposal by max-pooling the portion of the feature map inside the proposal into a fixed-size output (e.g., 6 × 6). Multiple output sizes are pooled and then concatenated as in spatial pyramid pooling [15]. SPPnet accelerates R-CNN by 10 to 100× at test time. Training time is also reduced by 3× due to faster proposal feature extraction.

¹ All timings use one Nvidia K40 GPU overclocked to 875 MHz.

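The SPPnet feature extraction described in Section 1.1 (max-pooling the part of the shared feature map that lies inside a proposal into fixed-size grids, then concatenating several grid sizes) can be sketched in NumPy. This is only an illustration under stated assumptions, not the SPPnet implementation: the function names `pool_region` and `spp_features`, the (r, c, h, w) window convention on the feature map, the floor/ceil bin boundaries, and the pyramid levels (1, 2, 4) are all choices made for this example.

```python
import math
import numpy as np

def pool_region(fmap, roi, n):
    """Max-pool the roi window of fmap (shape C x H x W) into an n x n grid per channel."""
    r, c, h, w = roi  # assumed convention: top-left corner and size in feature-map cells
    out = np.empty((fmap.shape[0], n, n), dtype=fmap.dtype)
    for i in range(n):
        # Assumed bin rule: floor/ceil boundaries keep every bin non-empty
        # (for h, w >= 1); neighbouring bins may overlap by one cell.
        r0 = r + math.floor(i * h / n)
        r1 = r + math.ceil((i + 1) * h / n)
        for j in range(n):
            c0 = c + math.floor(j * w / n)
            c1 = c + math.ceil((j + 1) * w / n)
            out[:, i, j] = fmap[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out

def spp_features(fmap, roi, levels=(1, 2, 4)):
    """Concatenate the pooled grids of every pyramid level into one fixed-length vector."""
    return np.concatenate([pool_region(fmap, roi, n).ravel() for n in levels])

# Toy feature map: 2 channels, 6 x 8 cells, values 0..95.
fmap = np.arange(2 * 6 * 8, dtype=float).reshape(2, 6, 8)
feats = spp_features(fmap, roi=(1, 2, 4, 5))
print(feats.shape)  # length is C * (1 + 4 + 16), independent of the window size
```

The output length depends only on the channel count and the pyramid levels, which is what lets proposals of any size feed fixed-size fully connected layers. The convolutional feature map is computed once per image and reused for every proposal.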
SPPnet also has notable drawbacks. Like R-CNN, training is a multi-stage pipeline that involves extracting features, fine-tuning a network with log loss, training SVMs, and finally fitting bounding-box regressors. Features are also written to disk. But unlike R-CNN, the fine-tuning algorithm proposed in [11] cannot update the convolutional layers that precede the spatial pyramid pooling. Unsurprisingly, this limitation (fixed convolutional layers) limits the accuracy of very deep networks.

1.2. Contributions

We propose a new training algorithm that fixes the disadvantages of R-CNN and SPPnet, while improving on their speed and accuracy. We call this method Fast R-CNN because it’s comparatively fast to train and test. The Fast R-CNN method has several advantages:

 1. Higher detection quality (mAP) than R-CNN, SPPnet
 2. Training is single-stage, using a multi-task loss
 3. Training can update all network layers
 4. No disk storage is required for feature caching

Fast R-CNN is written in Python and C++ (Caffe [13]) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.

2. Fast R-CNN architecture and training

Fig. 1 illustrates the Fast R-CNN architecture. A Fast R-CNN network takes as input an entire image and a set of object proposals. The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all “background” class and another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.

[Figure 1 diagram: Deep ConvNet → conv feature map → RoI projection → RoI pooling layer → FCs → RoI feature vector → two FC output branches (softmax, bbox regressor), for each RoI.]

Figure 1. Fast R-CNN architecture. An input image and multiple regions of interest (RoIs) are input into a fully convolutional network. Each RoI is pooled into a fixed-size feature map and then mapped to a feature vector by fully connected layers (FCs). The network has two output vectors per RoI: softmax probabilities and per-class bounding-box regression offsets. The architecture is trained end-to-end with a multi-task loss.

2.1. The RoI pooling layer

The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H × W (e.g., 7 × 7), where H and W are layer hyper-parameters that are independent of any particular RoI. In this paper, an RoI is a rectangular window into a conv feature map. Each RoI is defined by a four-tuple (r, c, h, w) that specifies its top-left corner (r, c) and its height and width (h, w).

RoI max pooling works by dividing the h × w RoI window into an H × W grid of sub-windows of approximate size h/H × w/W and then max-pooling the values in each sub-window into the corresponding output grid cell. Pooling is applied independently to each feature map channel, as in standard max pooling. The RoI layer is simply the special-case of the spatial pyramid pooling layer used in SPPnets [11] in which there is only one pyramid level. We use the pooling sub-window calculation given in [11].

2.2. Initializing from pre-trained networks

We experiment with three pre-trained ImageNet [4] networks, each with five max pooling layers and between five and thirteen conv layers (see Section 4.1 for network details). When a pre-trained network initializes a Fast R-CNN network, it undergoes three transformations.

First, the last max pooling layer is replaced by a RoI pooling layer that is configured by setting H and W to be compatible with the net’s first fully connected layer (e.g., H = W = 7 for VGG16).

Second, the network’s last fully connected layer and softmax (which were trained for 1000-way ImageNet classification) are replaced with the two sibling layers described earlier (a fully connected layer and softmax over K + 1 categories and category-specific bounding-box regressors).

Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images.

2.3. Fine-tuning for detection

Training all network weights with back-propagation is an important capability of Fast R-CNN. First, let’s elucidate why SPPnet is unable to update weights below the spatial pyramid pooling layer.

The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (i.e. RoI) comes from a different image, which is exactly how R-CNN and SPPnet networks are trained. The inefficiency
stems from the fact that each RoI may have a very large receptive field, often spanning the entire input image. Since the forward pass must process the entire receptive field, the training inputs are large (often the entire image).

We propose a more efficient training method that takes advantage of feature sharing during training. In Fast R-CNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes. Making N small decreases mini-batch computation. For example, when using N = 2 and R = 128, the proposed training scheme is roughly 64× faster than sampling one RoI from 128 different images (i.e., the R-CNN and SPPnet strategy).

One concern over this strategy is it may cause slow training convergence because RoIs from the same image are correlated. This concern does not appear to be a practical issue and we achieve good results with N = 2 and R = 128 using fewer SGD iterations than R-CNN.

In addition to hierarchical sampling, Fast R-CNN uses a streamlined training process with one fine-tuning stage that jointly optimizes a softmax classifier and bounding-box regressors, rather than training a softmax classifier, SVMs, and regressors in three separate stages [9, 11]. The components of this procedure (the loss, mini-batch sampling strategy, back-propagation through RoI pooling layers, and SGD hyper-parameters) are described below.

Multi-task loss. A Fast R-CNN network has two sibling output layers. The first outputs a discrete probability distribution (per RoI), $p = (p_0, \ldots, p_K)$, over K + 1 categories. As usual, p is computed by a softmax over the K + 1 outputs of a fully connected layer. The second sibling layer outputs bounding-box regression offsets, $t^k = (t^k_x, t^k_y, t^k_w, t^k_h)$, for each of the K object classes, indexed by k. We use the parameterization for $t^k$ given in [9], in which $t^k$ specifies a scale-invariant translation and log-space height/width shift relative to an object proposal.

Each training RoI is labeled with a ground-truth class u and a ground-truth bounding-box regression target v. We use a multi-task loss L on each labeled RoI to jointly train for classification and bounding-box regression:

$$L(p, u, t^u, v) = L_{\mathrm{cls}}(p, u) + \lambda [u \ge 1] L_{\mathrm{loc}}(t^u, v), \tag{1}$$

in which $L_{\mathrm{cls}}(p, u) = -\log p_u$ is log loss for true class u.

The second task loss, $L_{\mathrm{loc}}$, is defined over a tuple of true bounding-box regression targets for class u, $v = (v_x, v_y, v_w, v_h)$, and a predicted tuple $t^u = (t^u_x, t^u_y, t^u_w, t^u_h)$, again for class u. The Iverson bracket indicator function $[u \ge 1]$ evaluates to 1 when u ≥ 1 and 0 otherwise. By convention the catch-all background class is labeled u = 0. For background RoIs there is no notion of a ground-truth bounding box and hence $L_{\mathrm{loc}}$ is ignored. For bounding-box regression, we use the loss

$$L_{\mathrm{loc}}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t^u_i - v_i), \tag{2}$$

in which

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise,} \end{cases} \tag{3}$$

is a robust $L_1$ loss that is less sensitive to outliers than the $L_2$ loss used in R-CNN and SPPnet. When the regression targets are unbounded, training with $L_2$ loss can require careful tuning of learning rates in order to prevent exploding gradients. Eq. 3 eliminates this sensitivity.

The hyper-parameter λ in Eq. 1 controls the balance between the two task losses. We normalize the ground-truth regression targets $v_i$ to have zero mean and unit variance. All experiments use λ = 1.

We note that [6] uses a related loss to train a class-agnostic object proposal network. Different from our approach, [6] advocates for a two-network system that separates localization and classification. OverFeat [19], R-CNN [9], and SPPnet [11] also train classifiers and bounding-box localizers, however these methods use stage-wise training, which we show is suboptimal for Fast R-CNN (Section 5.1).

Mini-batch sampling. During fine-tuning, each SGD mini-batch is constructed from N = 2 images, chosen uniformly at random (as is common practice, we actually iterate over permutations of the dataset). We use mini-batches of size R = 128, sampling 64 RoIs from each image. As in [9], we take 25% of the RoIs from object proposals that have intersection over union (IoU) overlap with a ground-truth bounding box of at least 0.5. These RoIs comprise the examples labeled with a foreground object class, i.e. u ≥ 1. The remaining RoIs are sampled from object proposals that have a maximum IoU with ground truth in the interval [0.1, 0.5), following [11]. These are the background examples and are labeled with u = 0. The lower threshold of 0.1 appears to act as a heuristic for hard example mining [8]. During training, images are horizontally flipped with probability 0.5. No other data augmentation is used.

Back-propagation through RoI pooling layers. Back-propagation routes derivatives through the RoI pooling layer. For clarity, we assume only one image per mini-batch (N = 1), though the extension to N > 1 is straightforward because the forward pass treats all images independently.

Let $x_i \in \mathbb{R}$ be the i-th activation input into the RoI pooling layer and let $y_{rj}$ be the layer’s j-th output from the r-th RoI. The RoI pooling layer computes $y_{rj} = x_{i^*(r,j)}$, in which $i^*(r, j) = \operatorname{argmax}_{i' \in \mathcal{R}(r,j)} x_{i'}$. $\mathcal{R}(r, j)$ is the index
set of inputs in the sub-window over which the output unit $y_{rj}$ max pools. A single $x_i$ may be assigned to several different outputs $y_{rj}$.

The RoI pooling layer’s backwards function computes the partial derivative of the loss function with respect to each input variable $x_i$ by following the argmax switches:

$$\frac{\partial L}{\partial x_i} = \sum_r \sum_j [i = i^*(r, j)] \frac{\partial L}{\partial y_{rj}}. \tag{4}$$

In words, for each mini-batch RoI r and for each pooling output unit $y_{rj}$, the partial derivative $\partial L / \partial y_{rj}$ is accumulated if i is the argmax selected for $y_{rj}$ by max pooling. In back-propagation, the partial derivatives $\partial L / \partial y_{rj}$ are already computed by the backwards function of the layer on top of the RoI pooling layer.

SGD hyper-parameters. The fully connected layers used for softmax classification and bounding-box regression are initialized from zero-mean Gaussian distributions with standard deviations 0.01 and 0.001, respectively. Biases are initialized to 0. All layers use a per-layer learning rate of 1 for weights and 2 for biases and a global learning rate of 0.001. When training on VOC07 or VOC12 trainval we run SGD for 30k mini-batch iterations, and then lower the learning rate to 0.0001 and train for another 10k iterations. When we train on larger datasets, we run SGD for more iterations, as described later. A momentum of 0.9 and parameter decay of 0.0005 (on weights and biases) are used.

2.4. Scale invariance

We explore two ways of achieving scale invariant object detection: (1) via “brute force” learning and (2) by using image pyramids. These strategies follow the two approaches in [11]. In the brute-force approach, each image is processed at a pre-defined pixel size during both training and testing. The network must directly learn scale-invariant object detection from the training data.

The multi-scale approach, in contrast, provides approximate scale-invariance to the network through an image pyramid. At test-time, the image pyramid is used to approximately scale-normalize each object proposal. During multi-scale training, we randomly sample a pyramid scale each time an image is sampled, following [11], as a form of data augmentation. We experiment with multi-scale training for smaller networks only, due to GPU memory limits.

3. Fast R-CNN detection

Once a Fast R-CNN network is fine-tuned, detection amounts to little more than running a forward pass (assuming object proposals are pre-computed). The network takes as input an image (or an image pyramid, encoded as a list of images) and a list of R object proposals to score. At test-time, R is typically around 2000, although we will consider cases in which it is larger (≈ 45k). When using an image pyramid, each RoI is assigned to the scale such that the scaled RoI is closest to $224^2$ pixels in area [11].

For each test RoI r, the forward pass outputs a class posterior probability distribution p and a set of predicted bounding-box offsets relative to r (each of the K classes gets its own refined bounding-box prediction). We assign a detection confidence to r for each object class k using the estimated probability $\Pr(\text{class} = k \mid r) \triangleq p_k$. We then perform non-maximum suppression independently for each class using the algorithm and settings from R-CNN [9].

3.1. Truncated SVD for faster detection

For whole-image classification, the time spent computing the fully connected layers is small compared to the conv layers. On the contrary, for detection the number of RoIs to process is large and nearly half of the forward pass time is spent computing the fully connected layers (see Fig. 2). Large fully connected layers are easily accelerated by compressing them with truncated SVD [5, 23].

In this technique, a layer parameterized by the u × v weight matrix W is approximately factorized as

$$W \approx U \Sigma_t V^T \tag{5}$$

using SVD. In this factorization, U is a u × t matrix comprising the first t left-singular vectors of W, $\Sigma_t$ is a t × t diagonal matrix containing the top t singular values of W, and V is a v × t matrix comprising the first t right-singular vectors of W. Truncated SVD reduces the parameter count from uv to t(u + v), which can be significant if t is much smaller than min(u, v). To compress a network, the single fully connected layer corresponding to W is replaced by two fully connected layers, without a non-linearity between them. The first of these layers uses the weight matrix $\Sigma_t V^T$ (and no biases) and the second uses U (with the original biases associated with W). This simple compression method gives good speedups when the number of RoIs is large.

4. Main results

Three main results support this paper’s contributions:

 1. State-of-the-art mAP on VOC07, 2010, and 2012
 2. Fast training and testing compared to R-CNN, SPPnet
 3. Fine-tuning conv layers in VGG16 improves mAP

4.1. Experimental setup

Our experiments use three pre-trained ImageNet models that are available online.² The first is the CaffeNet (essentially AlexNet [14]) from R-CNN [9]. We alternatively refer

² https://github.com/BVLC/caffe/wiki/Model-Zoo

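The truncated SVD compression of Section 3.1 (Eq. 5) can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the released implementation: `compress_fc` is a hypothetical helper name, the layer is assumed to compute y = W x + b with a u × v weight matrix, and the toy weight matrix is built to have rank t so that the truncation is exact here. For a real layer the factorization is only an approximation.

```python
import numpy as np

def compress_fc(W, b, t):
    """Split y = W @ x + b into two layers built from the top-t singular triplets of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W1 = s[:t, None] * Vt[:t]  # Sigma_t V^T, shape (t, v); this first layer has no bias
    W2 = U[:, :t]              # U, shape (u, t); this second layer keeps the original bias
    return W1, W2, b

rng = np.random.default_rng(0)
u, v, t = 64, 128, 16
# Toy assumption: a weight matrix of exact rank t, so nothing is lost by truncating.
W = rng.standard_normal((u, t)) @ rng.standard_normal((t, v))
b = rng.standard_normal(u)
x = rng.standard_normal(v)

W1, W2, b2 = compress_fc(W, b, t)
y_full = W @ x + b
y_comp = W2 @ (W1 @ x) + b2  # two layers, no non-linearity in between

params_before = u * v        # weights in the original layer
params_after = t * (u + v)   # weights in the two replacement layers
print(np.allclose(y_full, y_comp), params_before, params_after)
```

With these toy sizes the weight count drops from uv = 8192 to t(u + v) = 3072. The saving is paid once per RoI, which is why the text notes that the method helps most when the number of RoIs is large.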
method            train set  aero   bike   bird   boat   bottle bus    car    cat    chair  cow    table  dog    horse  mbike  persn  plant  sheep  sofa   train  tv     mAP
SPPnet BB [11]†   07 \ diff  73.9   72.3   62.5   51.5   44.4   74.4   73.0   74.4   42.3   73.6   57.7   70.3   74.6   74.3   54.2   34.0   56.4   56.4   67.9   73.5   63.1
R-CNN BB [10]     07         73.4   77.0   63.4   45.4   44.6   75.1   78.1   79.8   40.5   73.7   62.2   79.4   78.1   73.1   64.2   35.6   66.8   67.2   70.4   71.1   66.0
FRCN [ours]       07         74.5   78.3   69.2   53.2   36.6   77.3   78.2   82.0   40.7   72.7   67.9   79.6   79.2   73.0   69.0   30.1   65.4   70.2   75.8   65.8   66.9
FRCN [ours]       07 \ diff  74.6   79.0   68.6   57.0   39.3   79.5   78.6   81.9   48.0   74.0   67.4   80.5   80.7   74.1   69.6   31.8   67.1   68.4   75.3   65.5   68.1
FRCN [ours]       07+12      77.0   78.1   69.3   59.4   38.3   81.6   78.6   86.7   42.8   78.8   68.9   84.7   82.0   76.6   69.9   31.8   70.1   74.8   80.4   70.4   70.0

Table 1. VOC 2007 test detection average precision (%). All methods use VGG16. Training set key: 07: VOC07 trainval, 07 \ diff: 07 without “difficult” examples, 07+12: union of 07 and VOC12 trainval. † SPPnet results were prepared by the authors of [11].
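As a reading aid for Table 1, the mAP column is the mean of the 20 per-class average precisions in the same row. The short Python check below recomputes it for the FRCN row trained on 07. The variable names are illustrative, and the values are copied from the table.

```python
# Per-class AP (%) for FRCN trained on VOC07 trainval, in Table 1 column order
# (aero, bike, bird, ..., train, tv).
frcn_07_ap = [
    74.5, 78.3, 69.2, 53.2, 36.6, 77.3, 78.2, 82.0, 40.7, 72.7,
    67.9, 79.6, 79.2, 73.0, 69.0, 30.1, 65.4, 70.2, 75.8, 65.8,
]

# mAP is the unweighted mean over the 20 classes.
map_07 = sum(frcn_07_ap) / len(frcn_07_ap)
print(round(map_07, 1))  # matches the 66.9 reported in the mAP column
```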

method            train set  aero   bike   bird   boat   bottle bus    car    cat    chair  cow    table  dog    horse  mbike  persn  plant  sheep  sofa   train  tv     mAP
BabyLearning      Prop.      77.7   73.8   62.3   48.8   45.4   67.3   67.0   80.3   41.3   70.8   49.7   79.5   74.7   78.6   64.5   36.0   69.9   55.7   70.4   61.7   63.8
R-CNN BB [10]     12         79.3   72.4   63.1   44.0   44.4   64.6   66.3   84.9   38.8   67.3   48.4   82.3   75.0   76.7   65.7   35.8   66.2   54.8   69.1   58.8   62.9
SegDeepM          12+seg     82.3   75.2   67.1   50.7   49.8   71.1   69.6   88.2   42.5   71.2   50.0   85.7   76.6   81.8   69.3   41.5   71.9   62.2   73.2   64.6   67.2
FRCN [ours]       12         80.1   74.4   67.7   49.4   41.4   74.2   68.8   87.8   41.9   70.1   50.2   86.1   77.3   81.1   70.4   33.3   67.0   63.3   77.2   60.0   66.1
FRCN [ours]       07++12     82.0   77.8   71.6   55.3   42.4   77.3   71.7   89.3   44.5   72.1   53.7   87.7   80.0   82.5   72.7   36.6   68.7   65.4   81.1   62.7   68.8

Table 2. VOC 2010 test detection average precision (%). BabyLearning uses a network based on [17]. All other methods use VGG16. Training set key: 12: VOC12 trainval, Prop.: proprietary dataset, 12+seg: 12 with segmentation annotations, 07++12: union of VOC07 trainval, VOC07 test, and VOC12 trainval.

method            train set aero bike bird boat bottle bus car cat chair cow table dog horse mbike persn plant sheep sofa train tv        mAP
BabyLearning  Prop.         78.0 74.2 61.3 45.7 42.7 68.2 66.8 80.2 40.6 70.0 49.8 79.0 74.5 77.9 64.0 35.3 67.9 55.7 68.7 62.6           63.2
NUS NIN c2000 Unk.          80.2 73.8 61.9 43.7 43.0 70.3 67.6 80.7 41.9 69.7 51.7 78.2 75.2 76.9 65.1 38.6 68.3 58.0 68.7 63.3           63.8
R-CNN BB [10] 12            79.6 72.7 61.9 41.2 41.9 65.9 66.4 84.6 38.5 67.2 46.7 82.0 74.8 76.0 65.2 35.6 65.4 54.2 67.4 60.3           62.4
FRCN [ours]       12       80.3 74.7 66.9 46.9 37.7 73.9 68.6 87.7 41.7 71.1 51.1 86.0 77.8    79.8   69.8   32.1   65.5   63.8 76.4 61.7 65.7
FRCN [ours]       07++12   82.3 78.4 70.8 52.3 38.7 77.8 71.6 89.3 44.2 73.0 55.0 87.5 80.5    80.8   72.0   35.1   68.3   65.7 80.4 64.2 68.4

Table 3. VOC 2012 test detection average precision (%). BabyLearning and NUS NIN c2000 use networks based on [17]. All other
methods use VGG16. Training set key: see Table 2, Unk.: unknown.
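
The mAP column in these tables is the unweighted mean of the 20 per-class APs. As a quick consistency check, the sketch below recomputes it for the two FRCN rows of Table 3. The variable and function names are illustrative only, and small differences from the printed mAP are expected because the per-class values are themselves rounded to one decimal.

```python
from statistics import mean

# Per-class AP (%) for the FRCN rows of Table 3, in the table's column order
# (aero, bike, bird, ..., train, tv).
FRCN_12 = [80.3, 74.7, 66.9, 46.9, 37.7, 73.9, 68.6, 87.7, 41.7, 71.1,
           51.1, 86.0, 77.8, 79.8, 69.8, 32.1, 65.5, 63.8, 76.4, 61.7]
FRCN_07PP12 = [82.3, 78.4, 70.8, 52.3, 38.7, 77.8, 71.6, 89.3, 44.2, 73.0,
               55.0, 87.5, 80.5, 80.8, 72.0, 35.1, 68.3, 65.7, 80.4, 64.2]


def mean_ap(per_class_ap):
    """Mean average precision: the unweighted mean of the per-class APs."""
    return mean(per_class_ap)


for name, aps, reported in [("12", FRCN_12, 65.7), ("07++12", FRCN_07PP12, 68.4)]:
    print(f"train set {name}: recomputed {mean_ap(aps):.2f}, reported {reported}")
```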


to this CaffeNet as model S, for “small.” The second network is VGG_CNN_M_1024 from [3], which has
the same depth as S, but is wider. We call this network model M, for “medium.” The final network is
the very deep VGG16 model from [20]. Since this model is the largest, we call it model L. In this
section, all experiments use single-scale training and testing (s = 600; see Section 5.2 for
details).

4.2. VOC 2010 and 2012 results

On these datasets, we compare Fast R-CNN (FRCN, for short) against the top methods on the comp4
(outside data) track from the public leaderboard (Table 2, Table 3).³ For the NUS NIN c2000 and
BabyLearning methods, there are no associated publications at this time and we could not find exact
information on the ConvNet architectures used; they are variants of the Network-in-Network design
[17]. All other methods are initialized from the same pre-trained VGG16 network.

Fast R-CNN achieves the top result on VOC12 with a mAP of 65.7% (and 68.4% with extra data). It is
also two orders of magnitude faster than the other methods, which are all based on the “slow” R-CNN
pipeline. On VOC10, SegDeepM [25] achieves a higher mAP than Fast R-CNN (67.2% vs. 66.1%). SegDeepM
is trained on VOC12 trainval plus segmentation annotations; it is designed to boost R-CNN accuracy
by using a Markov random field to reason over R-CNN detections and segmentations from the O2P [1]
semantic-segmentation method. Fast R-CNN can be swapped into SegDeepM in place of R-CNN, which may
lead to better results. When using the enlarged 07++12 training set (see Table 2 caption), Fast
R-CNN’s mAP increases to 68.8%, surpassing SegDeepM.

³ http://host.robots.ox.ac.uk:8080/leaderboard (accessed April 18, 2015)

4.3. VOC 2007 results

On VOC07, we compare Fast R-CNN to R-CNN and SPPnet. All methods start from the same pre-trained
VGG16 network and use bounding-box regression. The VGG16 SPPnet results were computed by the authors
of [11]. SPPnet uses five scales during both training and testing. The improvement of Fast R-CNN
over SPPnet illustrates that even though Fast R-CNN uses single-scale training and testing,
fine-tuning the conv layers provides a large improvement in mAP (from 63.1% to 66.9%). R-CNN
achieves a mAP of 66.0%. As a minor point, SPPnet was trained without examples marked as “difficult”
in PASCAL. Removing these examples improves Fast R-CNN mAP to 68.1%. All other experiments use
“difficult” examples.

4.4. Training and testing time

Fast training and testing times are our second main result. Table 4 compares training time (hours),
testing rate (seconds per image), and mAP on VOC07 between Fast R-CNN, R-CNN, and SPPnet. For VGG16,
Fast R-CNN processes images 146× faster than R-CNN without truncated SVD and 213× faster with it.
Training time is reduced by 9×, from 84 hours to 9.5. Compared to SPPnet, Fast R-CNN trains VGG16
2.7× faster (in 9.5 vs. 25.5 hours) and tests 7× faster without truncated SVD or 10× faster with it.
Fast R-CNN also eliminates hundreds of gigabytes of disk storage, because it does not cache features.

                     Fast R-CNN              R-CNN                 SPPnet
                     S       M       L       S      M      L       †L
train time (h)       1.2     2.0     9.5     22     28     84      25
train speedup        18.3×   14.0×   8.8×    1×     1×     1×      3.4×
test rate (s/im)     0.10    0.15    0.32    9.8    12.1   47.0    2.3
  ▷ with SVD         0.06    0.08    0.22    -      -      -       -
test speedup         98×     80×     146×    1×     1×     1×      20×
  ▷ with SVD         169×    150×    213×    -      -      -       -
VOC07 mAP            57.1    59.2    66.9    58.5   60.2   66.0    63.1
  ▷ with SVD         56.5    58.7    66.6    -      -      -       -

Table 4. Runtime comparison between the same models in Fast R-CNN, R-CNN, and SPPnet. Fast R-CNN uses
single-scale mode. SPPnet uses the five scales specified in [11]. † Timing provided by the authors of
[11]. Times were measured on an Nvidia K40 GPU.

Truncated SVD. Truncated SVD can reduce detection time by more than 30% with only a small (0.3
percentage point) drop in mAP and without needing to perform additional fine-tuning after model
compression. Fig. 2 illustrates how using the top 1024 singular values from the 25088 × 4096 matrix
in VGG16’s fc6 layer and the top 256 singular values from the 4096 × 4096 fc7 layer reduces runtime
with little loss in mAP. Further speed-ups are possible with smaller drops in mAP if one fine-tunes
again after compression.

[Figure 2: two pie charts of per-layer forward-pass time for VGG16.
 Forward pass timing (mAP 66.9% @ 320ms / image): conv 46.3% (146ms), fc6 38.7% (122ms),
 fc7 6.2% (20ms), roi_pool5 5.4% (17ms), other 3.5% (11ms).
 Forward pass timing (SVD) (mAP 66.6% @ 223ms / image): conv 67.8% (143ms), fc6 17.5% (37ms),
 roi_pool5 7.9% (17ms), other 5.1% (11ms), fc7 1.7% (4ms).]

Figure 2. Timing for VGG16 before and after truncated SVD. Before SVD, fully connected layers fc6 and
fc7 take 45% of the time.

4.5. Which layers to fine-tune?

For the less deep networks considered in the SPPnet paper [11], fine-tuning only the fully connected
layers appeared to be sufficient for good accuracy. We hypothesized that this result would not hold
for very deep networks. To validate that fine-tuning the conv layers is important for VGG16, we use
Fast R-CNN to fine-tune, but freeze the thirteen conv layers so that only the fully connected layers
learn. This ablation emulates single-scale SPPnet training and decreases mAP from 66.9% to 61.4%
(Table 5). This experiment verifies our hypothesis: training through the RoI pooling layer is
important for very deep nets.

                     layers that are fine-tuned in model L      SPPnet L
                     ≥ fc6       ≥ conv3_1     ≥ conv2_1        ≥ fc6
VOC07 mAP            61.4        66.9          67.2             63.1
test rate (s/im)     0.32        0.32          0.32             2.3

Table 5. Effect of restricting which layers are fine-tuned for VGG16. Fine-tuning ≥ fc6 emulates the
SPPnet training algorithm [11], but using a single scale. SPPnet L results were obtained using five
scales, at a significant (7×) speed cost.

Does this mean that all conv layers should be fine-tuned? In short, no. In the smaller networks (S
and M) we find that conv1 is generic and task independent (a well-known fact [14]). Allowing conv1 to
learn, or not, has no meaningful effect on mAP. For VGG16, we found it only necessary to update
layers from conv3_1 and up (9 of the 13 conv layers). This observation is pragmatic: (1) updating
from conv2_1 slows training by 1.3× (12.5 vs. 9.5 hours) compared to learning from conv3_1; and (2)
updating from conv1_1 over-runs GPU memory. The difference in mAP when learning from conv2_1 up was
only +0.3 points (Table 5, last column). All Fast R-CNN results in this paper using VGG16 fine-tune
layers conv3_1 and up; all experiments with models S and M fine-tune layers conv2 and up.

5. Design evaluation

We conducted experiments to understand how Fast R-CNN compares to R-CNN and SPPnet, as well as to
evaluate design decisions. Following best practices, we performed these experiments on the PASCAL
VOC07 dataset.

5.1. Does multi-task training help?

Multi-task training is convenient because it avoids managing a pipeline of sequentially-trained
tasks. But it also has the potential to improve results because the tasks influence each other
through a shared representation (the ConvNet) [2]. Does multi-task training improve object detection
accuracy in Fast R-CNN?

To test this question, we train baseline networks that use only the classification loss, L_cls, in
Eq. 1 (i.e., setting
λ = 0). These baselines are printed for models S, M, and L in the first column of each group in
Table 6. Note that these models do not have bounding-box regressors. Next (second column per group),
we take networks that were trained with the multi-task loss (Eq. 1, λ = 1), but we disable
bounding-box regression at test time. This isolates the networks’ classification accuracy and allows
an apples-to-apples comparison with the baseline networks.

                      S                           M                           L
multi-task training?         ✓             ✓             ✓             ✓             ✓             ✓
stage-wise training?                ✓                           ✓                           ✓
test-time bbox reg?                 ✓      ✓                    ✓      ✓                    ✓      ✓
VOC07 mAP             52.2   53.3   54.6   57.1   54.7   55.5   56.6   59.2   62.6   63.4   64.0   66.9

Table 6. Multi-task training (fourth column per group) improves mAP over piecewise training (third
column per group).

                     SPPnet ZF       S               M               L
scales               1       5       1       5       1       5       1
test rate (s/im)     0.14    0.38    0.10    0.39    0.15    0.64    0.32
VOC07 mAP            58.0    59.2    57.1    58.4    59.2    60.7    66.9

Table 7. Multi-scale vs. single scale. SPPnet ZF (similar to model S) results are from [11]. Larger
networks with a single-scale offer the best speed / accuracy tradeoff. (L cannot use multi-scale in
our implementation due to GPU memory constraints.)
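
As a reading aid for Table 6, the following sketch recomputes two gaps for each model: multi-task training versus classification-only training (both without test-time bounding-box regression), and multi-task training versus stage-wise training (both with it). The mAP values are copied from Table 6; the dictionary layout and the names `TABLE6` and `gains` are illustrative choices, not part of the paper.

```python
# VOC07 mAP from Table 6, in column order per model:
# (classification only, multi-task without test-time bbox reg,
#  stage-wise with bbox reg, multi-task with bbox reg)
TABLE6 = {
    "S": (52.2, 53.3, 54.6, 57.1),
    "M": (54.7, 55.5, 56.6, 59.2),
    "L": (62.6, 63.4, 64.0, 66.9),
}


def gains(row):
    """Return the two mAP gaps (in points) discussed for one model group."""
    cls_only, mt_no_bbox, stagewise, multitask = row
    return {
        # Effect of the multi-task loss on pure classification accuracy.
        "multi-task vs. cls-only": round(mt_no_bbox - cls_only, 1),
        # Effect of joint training vs. training the regressor afterwards.
        "multi-task vs. stage-wise": round(multitask - stagewise, 1),
    }


for model, row in TABLE6.items():
    print(model, gains(row))
```

With these inputs, the first gap comes out between 0.8 and 1.1 points and the second between 2.5 and 2.9 points.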

Across all three networks we observe that multi-task training improves pure classification accuracy
relative to training for classification alone. The improvement ranges from +0.8 to +1.1 mAP points,
showing a consistent positive effect from multi-task learning.

Finally, we take the baseline models (trained with only the classification loss), tack on the
bounding-box regression layer, and train them with L_loc while keeping all other network parameters
frozen. The third column in each group shows the results of this stage-wise training scheme: mAP
improves over column one, but stage-wise training underperforms multi-task training (fourth column
per group).

5.2. Scale invariance: to brute force or finesse?

We compare two strategies for achieving scale-invariant object detection: brute-force learning
(single scale) and image pyramids (multi-scale). In either case, we define the scale s of an image to
be the length of its shortest side.

All single-scale experiments use s = 600 pixels; s may be less than 600 for some images as we cap the
longest image side at 1000 pixels and maintain the image’s aspect ratio. These values were selected
so that VGG16 fits in GPU memory during fine-tuning. The smaller models are not memory bound and can
benefit from larger values of s; however, optimizing s for each model is not our main concern. We
note that PASCAL images are 384 × 473 pixels on average and thus the single-scale setting typically
upsamples images by a factor of 1.6. The average effective stride at the RoI pooling layer is thus
≈ 10 pixels.

In the multi-scale setting, we use the same five scales specified in [11] (s ∈ {480, 576, 688, 864,
1200}) to facilitate comparison with SPPnet. However, we cap the longest side at 2000 pixels to avoid
exceeding GPU memory.

Table 7 shows models S and M when trained and tested with either one or five scales. Perhaps the most
surprising result in [11] was that single-scale detection performs almost as well as multi-scale
detection. Our findings confirm their result: deep ConvNets are adept at directly learning scale
invariance. The multi-scale approach offers only a small increase in mAP at a large cost in compute
time (Table 7). In the case of VGG16 (model L), we are limited to using a single scale by
implementation details. Yet it achieves a mAP of 66.9%, which is slightly higher than the 66.0%
reported for R-CNN [10], even though R-CNN uses “infinite” scales in the sense that each proposal is
warped to a canonical size.

Since single-scale processing offers the best tradeoff between speed and accuracy, especially for
very deep models, all experiments outside of this sub-section use single-scale training and testing
with s = 600 pixels.

5.3. Do we need more training data?

A good object detector should improve when supplied with more training data. Zhu et al. [24] found
that DPM [8] mAP saturates after only a few hundred to thousand training examples. Here we augment
the VOC07 trainval set with the VOC12 trainval set, roughly tripling the number of images to 16.5k,
to evaluate Fast R-CNN. Enlarging the training set improves mAP on VOC07 test from 66.9% to 70.0%
(Table 1). When training on this dataset we use 60k mini-batch iterations instead of 40k.

We perform similar experiments for VOC10 and 2012, for which we construct a dataset of 21.5k images
from the union of VOC07 trainval, test, and VOC12 trainval. When training on this dataset, we use
100k SGD iterations and lower the learning rate by 0.1× each 40k iterations (instead of each 30k).
For VOC10 and 2012, mAP improves from 66.1% to 68.8% and from 65.7% to 68.4%, respectively.

5.4. Do SVMs outperform softmax?

Fast R-CNN uses the softmax classifier learnt during fine-tuning instead of training one-vs-rest
linear SVMs
post-hoc, as was done in R-CNN and SPPnet. To understand the impact of this choice, we implemented
post-hoc SVM training with hard negative mining in Fast R-CNN. We use the same training algorithm and
hyper-parameters as in R-CNN.

method           classifier    S       M       L
R-CNN [9, 10]    SVM           58.5    60.2    66.0
FRCN [ours]      SVM           56.3    58.7    66.8
FRCN [ours]      softmax       57.1    59.2    66.9

Table 8. Fast R-CNN with softmax vs. SVM (VOC07 mAP).

Table 8 shows softmax slightly outperforming SVM for all three networks, by +0.1 to +0.8 mAP points.
This effect is small, but it demonstrates that “one-shot” fine-tuning is sufficient compared to
previous multi-stage training approaches. We note that softmax, unlike one-vs-rest SVMs, introduces
competition between classes when scoring a RoI.

5.5. Are more proposals always better?

There are (broadly) two types of object detectors: those that use a sparse set of object proposals
(e.g., selective search [21]) and those that use a dense set (e.g., DPM [8]). Classifying sparse
proposals is a type of cascade [22] in which the proposal mechanism first rejects a vast number of
candidates leaving the classifier with a small set to evaluate. This cascade improves detection
accuracy when applied to DPM detections [21]. We find evidence that the proposal-classifier cascade
also improves Fast R-CNN accuracy.

Using selective search’s quality mode, we sweep from 1k to 10k proposals per image, each time
re-training and re-testing model M. If proposals serve a purely computational role, increasing the
number of proposals per image should not harm mAP.

[Figure 3: plot of mAP (left axis) and Average Recall (right axis) against the number of object
 proposals (from 10^3 to 10^4). Series: Sel. Search (SS); SS (2k) + Rand Dense; SS replace Dense;
 45k Dense Softmax; 45k Dense SVM; SS Avg. Recall.]

Figure 3. VOC07 test mAP and AR for various proposal schemes.

We find that mAP rises and then falls slightly as the proposal count increases (Fig. 3, solid blue
line). This experiment shows that swamping the deep classifier with more proposals does not help, and
even slightly hurts, accuracy.

This result is difficult to predict without actually running the experiment. The state-of-the-art for
measuring object proposal quality is Average Recall (AR) [12]. AR correlates well with mAP for
several proposal methods using R-CNN, when using a fixed number of proposals per image. Fig. 3 shows
that AR (solid red line) does not correlate well with mAP as the number of proposals per image is
varied. AR must be used with care; higher AR due to more proposals does not imply that mAP will
increase. Fortunately, training and testing with model M takes less than 2.5 hours. Fast R-CNN thus
enables efficient, direct evaluation of object proposal mAP, which is preferable to proxy metrics.

We also investigate Fast R-CNN when using densely generated boxes (over scale, position, and aspect
ratio), at a rate of about 45k boxes / image. This dense set is rich enough that when each selective
search box is replaced by its closest (in IoU) dense box, mAP drops only 1 point (to 57.7%, Fig. 3,
blue triangle).

The statistics of the dense boxes differ from those of selective search boxes. Starting with 2k
selective search boxes, we test mAP when adding a random sample of 1000 × {2, 4, 6, 8, 10, 32, 45}
dense boxes. For each experiment we re-train and re-test model M. When these dense boxes are added,
mAP falls more strongly than when adding more selective search boxes, eventually reaching 53.0%.

We also train and test Fast R-CNN using only dense boxes (45k / image). This setting yields a mAP of
52.9% (blue diamond). Finally, we check if SVMs with hard negative mining are needed to cope with the
dense box distribution. SVMs do even worse: 49.3% (blue circle).

5.6. Preliminary MS COCO results

We applied Fast R-CNN (with VGG16) to the MS COCO dataset [18] to establish a preliminary baseline.
We trained on the 80k image training set for 240k iterations and evaluated on the “test-dev” set
using the evaluation server. The PASCAL-style mAP is 35.9%; the new COCO-style AP, which also
averages over IoU thresholds, is 19.7%.

6. Conclusion

This paper proposes Fast R-CNN, a clean and fast update to R-CNN and SPPnet. In addition to reporting
state-of-the-art detection results, we present detailed experiments that we hope provide new
insights. Of particular note, sparse object proposals appear to improve detector quality. This issue
was too costly (in time) to probe in the past, but becomes practical with Fast R-CNN. Of course,
there may exist yet undiscovered techniques that allow dense boxes to perform as well as sparse
proposals. Such methods, if developed, may help further accelerate object detection.

Acknowledgements. I thank Kaiming He, Larry Zitnick, and Piotr Dollár for helpful discussions and
encouragement.

References

 [1] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with
     second-order pooling. In ECCV, 2012.
 [2] R. Caruana. Multitask learning. Machine learning, 28(1), 1997.
 [3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details:
     Delving deep into convolutional nets. In BMVC, 2014.
 [4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale
     hierarchical image database. In CVPR, 2009.
 [5] E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within
     convolutional networks for efficient evaluation. In NIPS, 2014.
 [6] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural
     networks. In CVPR, 2014.
 [7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual
     Object Classes (VOC) Challenge. IJCV, 2010.
 [8] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with
     discriminatively trained part based models. TPAMI, 2010.
 [9] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object
     detection and semantic segmentation. In CVPR, 2014.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Region-based convolutional networks for
     accurate object detection and segmentation. TPAMI, 2015.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for
     visual recognition. In ECCV, 2014.
[12] J. H. Hosang, R. Benenson, P. Dollár, and B. Schiele. What makes for effective detection
     proposals? arXiv preprint arXiv:1502.05082, 2015.
[13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and
     T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proc. of the ACM
     International Conf. on Multimedia, 2014.
[14] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional
     neural networks. In NIPS, 2012.
[15] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for
     recognizing natural scene categories. In CVPR, 2006.
[16] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel.
     Backpropagation applied to handwritten zip code recognition. Neural Comp., 1989.
[17] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.
[18] T. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan,
     P. Dollár, and C. L. Zitnick. Microsoft COCO: common objects in context. arXiv e-prints,
     arXiv:1405.0312 [cs.CV], 2014.
[19] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated
     Recognition, Localization and Detection using Convolutional Networks. In ICLR, 2014.
[20] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image
     recognition. In ICLR, 2015.
[21] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object
     recognition. IJCV, 2013.
[22] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In
     CVPR, 2001.
[23] J. Xue, J. Li, and Y. Gong. Restructuring of deep neural network acoustic models with singular
     value decomposition. In Interspeech, 2013.
[24] X. Zhu, C. Vondrick, D. Ramanan, and C. Fowlkes. Do we need more training data or better models
     for object detection? In BMVC, 2012.
[25] Y. Zhu, R. Urtasun, R. Salakhutdinov, and S. Fidler. segDeepM: Exploiting segmentation and
     context in deep neural networks for object detection. In CVPR, 2015.