Source notes and reports

Sources

← Home

                                                                                                                                                                               1




                                         Faster R-CNN: Towards Real-Time Object
                                         Detection with Region Proposal Networks
                                                                  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun

                                         Abstract—State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations.
                                         Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region
                                         proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image
                                         convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional
                                         network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to




arXiv:1506.01497v3 [cs.CV] 6 Jan 2016
                                         generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN
                                         into a single network by sharing their convolutional features—using the recently popular terminology of neural networks with
                                         “attention” mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model [3],
                                         our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection
                                         accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO
                                         2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been
                                         made publicly available.

                                         Index Terms—Object Detection, Region Proposal, Convolutional Neural Network.

                                                                                                            F



                                    1   I NTRODUCTION                                                              One may note that fast region-based CNNs take
                                    Recent advances in object detection are driven by                           advantage of GPUs, while the region proposal meth-
                                    the success of region proposal methods (e.g., [4])                          ods used in research are implemented on the CPU,
                                    and region-based convolutional neural networks (R-                          making such runtime comparisons inequitable. An ob-
                                    CNNs) [5]. Although region-based CNNs were com-                             vious way to accelerate proposal computation is to re-
                                    putationally expensive as originally developed in [5],                      implement it for the GPU. This may be an effective en-
                                    their cost has been drastically reduced thanks to shar-                     gineering solution, but re-implementation ignores the
                                    ing convolutions across proposals [1], [2]. The latest                      down-stream detection network and therefore misses
                                    incarnation, Fast R-CNN [2], achieves near real-time                        important opportunities for sharing computation.
                                    rates using very deep networks [3], when ignoring the                          In this paper, we show that an algorithmic change—
                                    time spent on region proposals. Now, proposals are the                      computing proposals with a deep convolutional neu-
                                    test-time computational bottleneck in state-of-the-art                      ral network—leads to an elegant and effective solution
                                    detection systems.                                                          where proposal computation is nearly cost-free given
                                                                                                                the detection network’s computation. To this end, we
                                       Region proposal methods typically rely on inex-
                                                                                                                introduce novel Region Proposal Networks (RPNs) that
                                    pensive features and economical inference schemes.
                                                                                                                share convolutional layers with state-of-the-art object
                                    Selective Search [4], one of the most popular meth-
                                                                                                                detection networks [1], [2]. By sharing convolutions at
                                    ods, greedily merges superpixels based on engineered
                                                                                                                test-time, the marginal cost for computing proposals
                                    low-level features. Yet when compared to efficient
                                                                                                                is small (e.g., 10ms per image).
                                    detection networks [2], Selective Search is an order of
                                                                                                                   Our observation is that the convolutional feature
                                    magnitude slower, at 2 seconds per image in a CPU
                                                                                                                maps used by region-based detectors, like Fast R-
                                    implementation. EdgeBoxes [6] currently provides the
                                                                                                                CNN, can also be used for generating region pro-
                                    best tradeoff between proposal quality and speed,
                                                                                                                posals. On top of these convolutional features, we
                                    at 0.2 seconds per image. Nevertheless, the region
                                                                                                                construct an RPN by adding a few additional con-
                                    proposal step still consumes as much running time
                                                                                                                volutional layers that simultaneously regress region
                                    as the detection network.
                                                                                                                bounds and objectness scores at each location on a
                                   • S. Ren is with University of Science and Technology of China, Hefei,
                                                                                                                regular grid. The RPN is thus a kind of fully convo-
                                     China. This work was done when S. Ren was an intern at Microsoft           lutional network (FCN) [7] and can be trained end-to-
                                     Research. Email: sqren@mail.ustc.edu.cn                                    end specifically for the task for generating detection
                                   • K. He and J. Sun are with Visual Computing Group, Microsoft
                                     Research. E-mail: {kahe,jiansun}@microsoft.com
                                                                                                                proposals.
                                   • R. Girshick is with Facebook AI Research. The majority of this work           RPNs are designed to efficiently predict region pro-
                                     was done when R. Girshick was with Microsoft Research. E-mail:             posals with a wide range of scales and aspect ratios. In
                                     rbg@fb.com
                                                                                                                contrast to prevalent methods [8], [9], [1], [2] that use
                                                                                                                                               2




                                                                       multiple filter sizes
                                                                                                                             multiple references

             feature map                                                      feature map                       feature map




                                     multiple scaled images
            image                                                            image                             image

                             (a)                                                     (b)                               (c)

Figure 1: Different schemes for addressing multiple scales and sizes. (a) Pyramids of images and feature maps
are built, and the classifier is run at all scales. (b) Pyramids of filters with multiple scales/sizes are run on
the feature map. (c) We use pyramids of reference boxes in the regression functions.


pyramids of images (Figure 1, a) or pyramids of filters               mercial systems such as at Pinterests [17], with user
(Figure 1, b), we introduce novel “anchor” boxes                      engagement improvements reported.
that serve as references at multiple scales and aspect                   In ILSVRC and COCO 2015 competitions, Faster
ratios. Our scheme can be thought of as a pyramid                     R-CNN and RPN are the basis of several 1st-place
of regression references (Figure 1, c), which avoids                  entries [18] in the tracks of ImageNet detection, Ima-
enumerating images or filters of multiple scales or                   geNet localization, COCO detection, and COCO seg-
aspect ratios. This model performs well when trained                  mentation. RPNs completely learn to propose regions
and tested using single-scale images and thus benefits                from data, and thus can easily benefit from deeper
running speed.                                                        and more expressive features (such as the 101-layer
   To unify RPNs with Fast R-CNN [2] object detec-                    residual nets adopted in [18]). Faster R-CNN and RPN
tion networks, we propose a training scheme that                      are also used by several other leading entries in these
alternates between fine-tuning for the region proposal                competitions2 . These results suggest that our method
task and then fine-tuning for object detection, while                 is not only a cost-efficient solution for practical usage,
keeping the proposals fixed. This scheme converges                    but also an effective way of improving object detec-
quickly and produces a unified network with convo-                    tion accuracy.
lutional features that are shared between both tasks.1
   We comprehensively evaluate our method on the
PASCAL VOC detection benchmarks [11] where RPNs                       2     R ELATED W ORK
with Fast R-CNNs produce detection accuracy bet-                      Object Proposals. There is a large literature on object
ter than the strong baseline of Selective Search with                 proposal methods. Comprehensive surveys and com-
Fast R-CNNs. Meanwhile, our method waives nearly                      parisons of object proposal methods can be found in
all computational burdens of Selective Search at                      [19], [20], [21]. Widely used object proposal methods
test-time—the effective running time for proposals                    include those based on grouping super-pixels (e.g.,
is just 10 milliseconds. Using the expensive very                     Selective Search [4], CPMC [22], MCG [23]) and those
deep models of [3], our detection method still has                    based on sliding windows (e.g., objectness in windows
a frame rate of 5fps (including all steps) on a GPU,                  [24], EdgeBoxes [6]). Object proposal methods were
and thus is a practical object detection system in                    adopted as external modules independent of the de-
terms of both speed and accuracy. We also report                      tectors (e.g., Selective Search [4] object detectors, R-
results on the MS COCO dataset [12] and investi-                      CNN [5], and Fast R-CNN [2]).
gate the improvements on PASCAL VOC using the
                                                                      Deep Networks for Object Detection. The R-CNN
COCO data. Code has been made publicly available
                                                                      method [5] trains CNNs end-to-end to classify the
at https://github.com/shaoqingren/faster_
                                                                      proposal regions into object categories or background.
rcnn (in MATLAB) and https://github.com/
                                                                      R-CNN mainly plays as a classifier, and it does not
rbgirshick/py-faster-rcnn (in Python).
                                                                      predict object bounds (except for refining by bounding
   A preliminary version of this manuscript was pub-
                                                                      box regression). Its accuracy depends on the perfor-
lished previously [10]. Since then, the frameworks of
                                                                      mance of the region proposal module (see compar-
RPN and Faster R-CNN have been adopted and gen-
                                                                      isons in [20]). Several papers have proposed ways of
eralized to other methods, such as 3D object detection
                                                                      using deep networks for predicting object bounding
[13], part-based detection [14], instance segmentation
                                                                      boxes [25], [9], [26], [27]. In the OverFeat method [9],
[15], and image captioning [16]. Our fast and effective
                                                                      a fully-connected layer is trained to predict the box
object detection system has also been built in com-
                                                                      coordinates for the localization task that assumes a
  1. Since the publication of the conference version of this paper
                                                                      single object. The fully-connected layer is then turned
[10], we have also found that RPNs can be trained jointly with Fast
R-CNN networks leading to less training time.                             2. http://image-net.org/challenges/LSVRC/2015/results
                                                                                                                                      3


                                       classifier
                                                                   single, unified network for object detection (Figure 2).
                                                                   Using the recently popular terminology of neural
                                               RoI pooling         networks with ‘attention’ [31] mechanisms, the RPN
                                                                   module tells the Fast R-CNN module where to look.
                                                                   In Section 3.1 we introduce the designs and properties
    proposals
                                                                   of the network for region proposal. In Section 3.2 we
                                                                   develop algorithms for training both modules with
                                                                   features shared.
 Region Proposal Network                                           3.1 Region Proposal Networks
                                                    feature maps
                                                                   A Region Proposal Network (RPN) takes an image
                                                                   (of any size) as input and outputs a set of rectangular
                                                                   object proposals, each with an objectness score.3 We
                                                                   model this process with a fully convolutional network
                                                                   [7], which we describe in this section. Because our ulti-
                         conv layers                               mate goal is to share computation with a Fast R-CNN
                                                                   object detection network [2], we assume that both nets
                                                                   share a common set of convolutional layers. In our ex-
                           image
                                                                   periments, we investigate the Zeiler and Fergus model
Figure 2: Faster R-CNN is a single, unified network                [32] (ZF), which has 5 shareable convolutional layers
for object detection. The RPN module serves as the                 and the Simonyan and Zisserman model [3] (VGG-16),
‘attention’ of this unified network.                               which has 13 shareable convolutional layers.
                                                                      To generate region proposals, we slide a small
                                                                   network over the convolutional feature map output
into a convolutional layer for detecting multiple class-           by the last shared convolutional layer. This small
specific objects. The MultiBox methods [26], [27] gen-             network takes as input an n × n spatial window of
erate region proposals from a network whose last                   the input convolutional feature map. Each sliding
fully-connected layer simultaneously predicts mul-                 window is mapped to a lower-dimensional feature
tiple class-agnostic boxes, generalizing the “single-              (256-d for ZF and 512-d for VGG, with ReLU [33]
box” fashion of OverFeat. These class-agnostic boxes               following). This feature is fed into two sibling fully-
are used as proposals for R-CNN [5]. The MultiBox                  connected layers—a box-regression layer (reg) and a
proposal network is applied on a single image crop or              box-classification layer (cls). We use n = 3 in this
multiple large image crops (e.g., 224×224), in contrast            paper, noting that the effective receptive field on the
to our fully convolutional scheme. MultiBox does not               input image is large (171 and 228 pixels for ZF and
share features between the proposal and detection                  VGG, respectively). This mini-network is illustrated
networks. We discuss OverFeat and MultiBox in more                 at a single position in Figure 3 (left). Note that be-
depth later in context with our method. Concurrent                 cause the mini-network operates in a sliding-window
with our work, the DeepMask method [28] is devel-                  fashion, the fully-connected layers are shared across
oped for learning segmentation proposals.                          all spatial locations. This architecture is naturally im-
   Shared computation of convolutions [9], [1], [29],              plemented with an n × n convolutional layer followed
[7], [2] has been attracting increasing attention for ef-          by two sibling 1 × 1 convolutional layers (for reg and
ficient, yet accurate, visual recognition. The OverFeat            cls, respectively).
paper [9] computes convolutional features from an                  3.1.1 Anchors
image pyramid for classification, localization, and de-
                                                                   At each sliding-window location, we simultaneously
tection. Adaptively-sized pooling (SPP) [1] on shared
                                                                   predict multiple region proposals, where the number
convolutional feature maps is developed for efficient
                                                                   of maximum possible proposals for each location is
region-based object detection [1], [30] and semantic
                                                                   denoted as k. So the reg layer has 4k outputs encoding
segmentation [29]. Fast R-CNN [2] enables end-to-end
                                                                   the coordinates of k boxes, and the cls layer outputs
detector training on shared convolutional features and
                                                                   2k scores that estimate probability of object or not
shows compelling accuracy and speed.
                                                                   object for each proposal4 . The k proposals are param-
                                                                   eterized relative to k reference boxes, which we call
3    FASTER R-CNN
                                                                      3. “Region” is a generic term and in this paper we only consider
Our object detection system, called Faster R-CNN, is               rectangular regions, as is common for many methods (e.g., [27], [4],
composed of two modules. The first module is a deep                [6]). “Objectness” measures membership to a set of object classes
fully convolutional network that proposes regions,                 vs. background.
                                                                      4. For simplicity we implement the cls layer as a two-class
and the second module is the Fast R-CNN detector [2]               softmax layer. Alternatively, one may use logistic regression to
that uses the proposed regions. The entire system is a             produce k scores.
                                                                                                                                                                                                                              4


                                                                                                                    person : 0.992
        2k scores           4k coordinates                   k anchor boxes                                                                           dog : 0.994
                                                                                                            horse : 0.993


  cls layer                           reg layer                                 car : 1.000                                                                                              cat : 0.982




                                                                                                   dog : 0.997                       person : 0.979



                    256-d
                            intermediate layer

                                                                                              bus : 0.996
                                                                                                                                                            boat : 0.970

                                                                                                                                                                       person : 0.983
                                                                                                 person : 0.736                                                                         person : 0.983
                                                                                                                                                                                                 person : 0.925



                                                                                                                                                                                                             person : 0.989




                sliding window
                                  conv feature map

Figure 3: Left: Region Proposal Network (RPN). Right: Example detections using RPN proposals on PASCAL
VOC 2007 test. Our method detects objects in a wide range of scales and aspect ratios.


anchors. An anchor is centered at the sliding window                    Multi-Scale Anchors as Regression References
in question, and is associated with a scale and aspect                     Our design of anchors presents a novel scheme
ratio (Figure 3, left). By default we use 3 scales and                  for addressing multiple scales (and aspect ratios). As
3 aspect ratios, yielding k = 9 anchors at each sliding                 shown in Figure 1, there have been two popular ways
position. For a convolutional feature map of a size                     for multi-scale predictions. The first way is based on
W × H (typically ∼2,400), there are W Hk anchors in                     image/feature pyramids, e.g., in DPM [8] and CNN-
total.                                                                  based methods [9], [1], [2]. The images are resized at
                                                                        multiple scales, and feature maps (HOG [8] or deep
Translation-Invariant Anchors
                                                                        convolutional features [9], [1], [2]) are computed for
   An important property of our approach is that it                     each scale (Figure 1(a)). This way is often useful but
is translation invariant, both in terms of the anchors                  is time-consuming. The second way is to use sliding
and the functions that compute proposals relative to                    windows of multiple scales (and/or aspect ratios) on
the anchors. If one translates an object in an image,                   the feature maps. For example, in DPM [8], models
the proposal should translate and the same function                     of different aspect ratios are trained separately using
should be able to predict the proposal in either lo-                    different filter sizes (such as 5×7 and 7×5). If this way
cation. This translation-invariant property is guaran-                  is used to address multiple scales, it can be thought
teed by our method5 . As a comparison, the MultiBox                     of as a “pyramid of filters” (Figure 1(b)). The second
method [27] uses k-means to generate 800 anchors,                       way is usually adopted jointly with the first way [8].
which are not translation invariant. So MultiBox does                      As a comparison, our anchor-based method is built
not guarantee that the same proposal is generated if                    on a pyramid of anchors, which is more cost-efficient.
an object is translated.                                                Our method classifies and regresses bounding boxes
   The translation-invariant property also reduces the                  with reference to anchor boxes of multiple scales and
model size. MultiBox has a (4 + 1) × 800-dimensional                    aspect ratios. It only relies on images and feature
fully-connected output layer, whereas our method has                    maps of a single scale, and uses filters (sliding win-
a (4 + 2) × 9-dimensional convolutional output layer                    dows on the feature map) of a single size. We show by
in the case of k = 9 anchors. As a result, our output                   experiments the effects of this scheme for addressing
layer has 2.8 × 104 parameters (512 × (4 + 2) × 9                       multiple scales and sizes (Table 8).
for VGG-16), two orders of magnitude fewer than                            Because of this multi-scale design based on anchors,
MultiBox’s output layer that has 6.1 × 106 parameters                   we can simply use the convolutional features com-
(1536 × (4 + 1) × 800 for GoogleNet [34] in MultiBox                    puted on a single-scale image, as is also done by
[27]). If considering the feature projection layers, our                the Fast R-CNN detector [2]. The design of multi-
proposal layers still have an order of magnitude fewer                  scale anchors is a key component for sharing features
parameters than MultiBox6 . We expect our method                        without extra cost for addressing scales.
to have less risk of overfitting on small datasets, like
PASCAL VOC.                                                             3.1.2 Loss Function
                                                                        For training RPNs, we assign a binary class label
  5. As is the case of FCNs [7], our network is translation invariant   (of being an object or not) to each anchor. We as-
up to the network’s total stride.                                       sign a positive label to two kinds of anchors: (i) the
  6. Considering the feature projection layers, our proposal layers’    anchor/anchors with the highest Intersection-over-
parameter count is 3 × 3 × 512 × 512 + 512 × 6 × 9 = 2.4 × 106 ;
MultiBox’s proposal layers’ parameter count is 7 × 7 × (64 + 96 +       Union (IoU) overlap with a ground-truth box, or (ii) an
64 + 64) × 1536 + 1536 × 5 × 800 = 27 × 106 .                           anchor that has an IoU overlap higher than 0.7 with
                                                                                                                         5



any ground-truth box. Note that a single ground-truth           be thought of as bounding-box regression from an
box may assign positive labels to multiple anchors.             anchor box to a nearby ground-truth box.
Usually the second condition is sufficient to determine            Nevertheless, our method achieves bounding-box
the positive samples; but we still adopt the first              regression by a different manner from previous RoI-
condition for the reason that in some rare cases the            based (Region of Interest) methods [1], [2]. In [1],
second condition may find no positive sample. We                [2], bounding-box regression is performed on features
assign a negative label to a non-positive anchor if its         pooled from arbitrarily sized RoIs, and the regression
IoU ratio is lower than 0.3 for all ground-truth boxes.         weights are shared by all region sizes. In our formula-
Anchors that are neither positive nor negative do not           tion, the features used for regression are of the same
contribute to the training objective.                           spatial size (3 × 3) on the feature maps. To account
   With these definitions, we minimize an objective             for varying sizes, a set of k bounding-box regressors
function following the multi-task loss in Fast R-CNN            are learned. Each regressor is responsible for one scale
[2]. Our loss function for an image is defined as:              and one aspect ratio, and the k regressors do not share
                                                                weights. As such, it is still possible to predict boxes of
                            1 X
          L({pi }, {ti }) =      Lcls (pi , p∗i )               various sizes even though the features are of a fixed
                          Ncls i                                size/scale, thanks to the design of anchors.
                                                          (1)
                         1 X ∗
                     +λ        p Lreg (ti , t∗i ).
                        Nreg i i                                3.1.3 Training RPNs
                                                                The RPN can be trained end-to-end by back-
Here, i is the index of an anchor in a mini-batch and           propagation and stochastic gradient descent (SGD)
pi is the predicted probability of anchor i being an            [35]. We follow the “image-centric” sampling strategy
object. The ground-truth label p∗i is 1 if the anchor           from [2] to train this network. Each mini-batch arises
is positive, and is 0 if the anchor is negative. ti is a        from a single image that contains many positive and
vector representing the 4 parameterized coordinates             negative example anchors. It is possible to optimize
of the predicted bounding box, and t∗i is that of the           for the loss functions of all anchors, but this will
ground-truth box associated with a positive anchor.             bias towards negative samples as they are dominate.
The classification loss Lcls is log loss over two classes       Instead, we randomly sample 256 anchors in an image
(object vs. not object). For the regression loss, we use        to compute the loss function of a mini-batch, where
Lreg (ti , t∗i ) = R(ti − t∗i ) where R is the robust loss      the sampled positive and negative anchors have a
function (smooth L1 ) defined in [2]. The term p∗i Lreg         ratio of up to 1:1. If there are fewer than 128 positive
means the regression loss is activated only for positive        samples in an image, we pad the mini-batch with
anchors (p∗i = 1) and is disabled otherwise (p∗i = 0).          negative ones.
The outputs of the cls and reg layers consist of {pi }             We randomly initialize all new layers by drawing
and {ti } respectively.                                         weights from a zero-mean Gaussian distribution with
   The two terms are normalized by Ncls and Nreg                standard deviation 0.01. All other layers (i.e., the
and weighted by a balancing parameter λ. In our                 shared convolutional layers) are initialized by pre-
current implementation (as in the released code), the           training a model for ImageNet classification [36], as
cls term in Eqn.(1) is normalized by the mini-batch             is standard practice [5]. We tune all layers of the
size (i.e., Ncls = 256) and the reg term is normalized          ZF net, and conv3 1 and up for the VGG net to
by the number of anchor locations (i.e., Nreg ∼ 2, 400).        conserve memory [2]. We use a learning rate of 0.001
By default we set λ = 10, and thus both cls and                 for 60k mini-batches, and 0.0001 for the next 20k
reg terms are roughly equally weighted. We show                 mini-batches on the PASCAL VOC dataset. We use a
by experiments that the results are insensitive to the          momentum of 0.9 and a weight decay of 0.0005 [37].
values of λ in a wide range (Table 9). We also note             Our implementation uses Caffe [38].
that the normalization as above is not required and
could be simplified.
   For bounding box regression, we adopt the param-             3.2   Sharing Features for RPN and Fast R-CNN
eterizations of the 4 coordinates following [5]:                Thus far we have described how to train a network
                                                                for region proposal generation, without considering
        tx = (x − xa )/wa ,     ty = (y − ya )/ha ,
                                                                the region-based object detection CNN that will utilize
        tw = log(w/wa ),      th = log(h/ha ),                  these proposals. For the detection network, we adopt
                                                          (2)
        t∗x = (x∗ − xa )/wa ,    t∗y = (y ∗ − ya )/ha ,         Fast R-CNN [2]. Next we describe algorithms that
        t∗w = log(w∗ /wa ),     t∗h = log(h∗ /ha ),             learn a unified network composed of RPN and Fast
                                                                R-CNN with shared convolutional layers (Figure 2).
where x, y, w, and h denote the box’s center coordi-               Both RPN and Fast R-CNN, trained independently,
nates and its width and height. Variables x, xa , and           will modify their convolutional layers in different
x∗ are for the predicted box, anchor box, and ground-           ways. We therefore need to develop a technique that
truth box respectively (likewise for y, w, h). This can         allows for sharing convolutional layers between the
                                                                                                                        6


     Table 1: the learned average proposal size for each anchor using the ZF net (numbers for s = 600).
          anchor 1282 , 2:1 1282 , 1:1 1282 , 1:2 2562 , 2:1 2562 , 1:1 2562 , 1:2 5122 , 2:1 5122 , 1:1 5122 , 1:2
         proposal 188×111 113×114 70×92           416×229 261×284 174×332 768×437 499×501 355×715



two networks, rather than learning two separate net-            fix the shared convolutional layers and only fine-tune
works. We discuss three ways for training networks              the layers unique to RPN. Now the two networks
with features shared:                                           share convolutional layers. Finally, keeping the shared
   (i) Alternating training. In this solution, we first train   convolutional layers fixed, we fine-tune the unique
RPN, and use the proposals to train Fast R-CNN.                 layers of Fast R-CNN. As such, both networks share
The network tuned by Fast R-CNN is then used to                 the same convolutional layers and form a unified
initialize RPN, and this process is iterated. This is the       network. A similar alternating training can be run
solution that is used in all experiments in this paper.         for more iterations, but we have observed negligible
   (ii) Approximate joint training. In this solution, the       improvements.
RPN and Fast R-CNN networks are merged into one
                                                                3.3 Implementation Details
network during training as in Figure 2. In each SGD
iteration, the forward pass generates region propos-            We train and test both region proposal and object
als which are treated just like fixed, pre-computed             detection networks on images of a single scale [1], [2].
proposals when training a Fast R-CNN detector. The              We re-scale the images such that their shorter side
backward propagation takes place as usual, where for            is s = 600 pixels [2]. Multi-scale feature extraction
the shared layers the backward propagated signals               (using an image pyramid) may improve accuracy but
from both the RPN loss and the Fast R-CNN loss                  does not exhibit a good speed-accuracy trade-off [2].
are combined. This solution is easy to implement. But           On the re-scaled images, the total stride for both ZF
this solution ignores the derivative w.r.t. the proposal        and VGG nets on the last convolutional layer is 16
boxes’ coordinates that are also network responses,             pixels, and thus is ∼10 pixels on a typical PASCAL
so is approximate. In our experiments, we have em-              image before resizing (∼500×375). Even such a large
pirically found this solver produces close results, yet         stride provides good results, though accuracy may be
reduces the training time by about 25-50% comparing             further improved with a smaller stride.
with alternating training. This solver is included in              For anchors, we use 3 scales with box areas of 1282 ,
our released Python code.                                       2562 , and 5122 pixels, and 3 aspect ratios of 1:1, 1:2,
                                                                and 2:1. These hyper-parameters are not carefully cho-
   (iii) Non-approximate joint training. As discussed
                                                                sen for a particular dataset, and we provide ablation
above, the bounding boxes predicted by RPN are
                                                                experiments on their effects in the next section. As dis-
also functions of the input. The RoI pooling layer
                                                                cussed, our solution does not need an image pyramid
[2] in Fast R-CNN accepts the convolutional features
                                                                or filter pyramid to predict regions of multiple scales,
and also the predicted bounding boxes as input, so
                                                                saving considerable running time. Figure 3 (right)
a theoretically valid backpropagation solver should
                                                                shows the capability of our method for a wide range
also involve gradients w.r.t. the box coordinates. These
                                                                of scales and aspect ratios. Table 1 shows the learned
gradients are ignored in the above approximate joint
                                                                average proposal size for each anchor using the ZF
training. In a non-approximate joint training solution,
                                                                net. We note that our algorithm allows predictions
we need an RoI pooling layer that is differentiable
                                                                that are larger than the underlying receptive field.
w.r.t. the box coordinates. This is a nontrivial problem
                                                                Such predictions are not impossible—one may still
and a solution can be given by an “RoI warping” layer
                                                                roughly infer the extent of an object if only the middle
as developed in [15], which is beyond the scope of this
                                                                of the object is visible.
paper.
                                                                   The anchor boxes that cross image boundaries need
4-Step Alternating Training. In this paper, we adopt            to be handled with care. During training, we ignore
a pragmatic 4-step training algorithm to learn shared           all cross-boundary anchors so they do not contribute
features via alternating optimization. In the first step,       to the loss. For a typical 1000 × 600 image, there
we train the RPN as described in Section 3.1.3. This            will be roughly 20000 (≈ 60 × 40 × 9) anchors in
network is initialized with an ImageNet-pre-trained             total. With the cross-boundary anchors ignored, there
model and fine-tuned end-to-end for the region pro-             are about 6000 anchors per image for training. If the
posal task. In the second step, we train a separate             boundary-crossing outliers are not ignored in training,
detection network by Fast R-CNN using the proposals             they introduce large, difficult to correct error terms in
generated by the step-1 RPN. This detection net-                the objective, and training does not converge. During
work is also initialized by the ImageNet-pre-trained            testing, however, we still apply the fully convolutional
model. At this point the two networks do not share              RPN to the entire image. This may generate cross-
convolutional layers. In the third step, we use the             boundary proposal boxes, which we clip to the image
detector network to initialize RPN training, but we             boundary.
                                                                                                                               7


Table 2: Detection results on PASCAL VOC 2007 test set (trained on VOC 2007 trainval). The detectors are
Fast R-CNN with ZF, but using various proposal methods for training and testing.
                      train-time region proposals           test-time region proposals
                           method          # boxes            method           # proposals      mAP (%)
                           SS                  2000            SS                  2000            58.7
                           EB                  2000            EB                  2000            58.6
                      RPN+ZF, shared           2000       RPN+ZF, shared           300             59.9
                     ablation experiments follow below
                     RPN+ZF, unshared          2000      RPN+ZF, unshared           300            58.7
                           SS                  2000          RPN+ZF                 100            55.1
                           SS                  2000          RPN+ZF                 300            56.8
                           SS                  2000          RPN+ZF                1000            56.3
                           SS                  2000      RPN+ZF (no NMS)           6000            55.2
                           SS                  2000       RPN+ZF (no cls)           100            44.6
                           SS                  2000       RPN+ZF (no cls)           300            51.4
                           SS                  2000       RPN+ZF (no cls)          1000            55.8
                           SS                  2000       RPN+ZF (no reg)           300            52.1
                           SS                  2000       RPN+ZF (no reg)          1000            51.3
                           SS                  2000         RPN+VGG                300             59.2



  Some RPN proposals highly overlap with each                   IoU. SS has an mAP of 58.7% and EB has an mAP
other. To reduce redundancy, we adopt non-maximum               of 58.6% under the Fast R-CNN framework. RPN
suppression (NMS) on the proposal regions based on              with Fast R-CNN achieves competitive results, with
their cls scores. We fix the IoU threshold for NMS              an mAP of 59.9% while using up to 300 proposals8 .
at 0.7, which leaves us about 2000 proposal regions             Using RPN yields a much faster detection system than
per image. As we will show, NMS does not harm the               using either SS or EB because of shared convolutional
ultimate detection accuracy, but substantially reduces          computations; the fewer proposals also reduce the
the number of proposals. After NMS, we use the                  region-wise fully-connected layers’ cost (Table 5).
top-N ranked proposal regions for detection. In the             Ablation Experiments on RPN. To investigate the be-
following, we train Fast R-CNN using 2000 RPN pro-              havior of RPNs as a proposal method, we conducted
posals, but evaluate different numbers of proposals at          several ablation studies. First, we show the effect of
test-time.                                                      sharing convolutional layers between the RPN and
                                                                Fast R-CNN detection network. To do this, we stop
4     E XPERIMENTS                                              after the second step in the 4-step training process.
4.1    Experiments on PASCAL VOC                                Using separate networks reduces the result slightly to
We comprehensively evaluate our method on the                   58.7% (RPN+ZF, unshared, Table 2). We observe that
PASCAL VOC 2007 detection benchmark [11]. This                  this is because in the third step when the detector-
dataset consists of about 5k trainval images and 5k             tuned features are used to fine-tune the RPN, the
test images over 20 object categories. We also provide          proposal quality is improved.
results on the PASCAL VOC 2012 benchmark for a                     Next, we disentangle the RPN’s influence on train-
few models. For the ImageNet pre-trained network,               ing the Fast R-CNN detection network. For this pur-
we use the “fast” version of ZF net [32] that has               pose, we train a Fast R-CNN model by using the
5 convolutional layers and 3 fully-connected layers,            2000 SS proposals and ZF net. We fix this detector
and the public VGG-16 model7 [3] that has 13 con-               and evaluate the detection mAP by changing the
volutional layers and 3 fully-connected layers. We              proposal regions used at test-time. In these ablation
primarily evaluate detection mean Average Precision             experiments, the RPN does not share features with
(mAP), because this is the actual metric for object             the detector.
detection (rather than focusing on object proposal                 Replacing SS with 300 RPN proposals at test-time
proxy metrics).                                                 leads to an mAP of 56.8%. The loss in mAP is because
   Table 2 (top) shows Fast R-CNN results when                  of the inconsistency between the training/testing pro-
trained and tested using various region proposal                posals. This result serves as the baseline for the fol-
methods. These results use the ZF net. For Selective            lowing comparisons.
Search (SS) [4], we generate about 2000 proposals by               Somewhat surprisingly, the RPN still leads to a
the “fast” mode. For EdgeBoxes (EB) [6], we generate            competitive result (55.1%) when using the top-ranked
the proposals by the default EB setting tuned for 0.7             8. For RPN, the number of proposals (e.g., 300) is the maximum
                                                                number for an image. RPN may produce fewer proposals after
    7. www.robots.ox.ac.uk/∼vgg/research/very deep/             NMS, and thus the average number of proposals is smaller.
                                                                                                                         8


Table 3: Detection results on PASCAL VOC 2007 test set. The detector is Fast R-CNN and VGG-16. Training
data: “07”: VOC 2007 trainval, “07+12”: union set of VOC 2007 trainval and VOC 2012 trainval. For RPN,
the train-time proposals for Fast R-CNN are 2000. † : this number was reported in [2]; using the repository
provided by this paper, this result is higher (68.1).
                                 method              # proposals             data           mAP (%)
                               SS                          2000               07             66.9†
                               SS                          2000             07+12            70.0
                         RPN+VGG, unshared                 300                07             68.5
                          RPN+VGG, shared                   300               07             69.9
                          RPN+VGG, shared                   300             07+12            73.2
                          RPN+VGG, shared                   300           COCO+07+12         78.8


Table 4: Detection results on PASCAL VOC 2012 test set. The detector is Fast R-CNN and VGG-16. Training
data: “07”: VOC 2007 trainval, “07++12”: union set of VOC 2007 trainval+test and VOC 2012 trainval. For
RPN, the train-time proposals for Fast R-CNN are 2000. † : http://host.robots.ox.ac.uk:8080/anonymous/HZJTQA.html. ‡ :
http://host.robots.ox.ac.uk:8080/anonymous/YNPLXB.html. § : http://host.robots.ox.ac.uk:8080/anonymous/XEDH10.html.
                                method             # proposals               data           mAP (%)
                               SS                      2000                12                 65.7
                               SS                      2000              07++12               68.4
                         RPN+VGG, shared†               300                12                 67.0
                         RPN+VGG, shared‡               300              07++12               70.4
                         RPN+VGG, shared§               300            COCO+07++12            75.9


Table 5: Timing (ms) on a K40 GPU, except SS proposal is evaluated in a CPU. “Region-wise” includes NMS,
pooling, fully-connected, and softmax layers. See our released code for the profiling of running time.
              model              system             conv          proposal    region-wise      total     rate
               VGG         SS + Fast R-CNN           146           1510             174        1830    0.5 fps
               VGG        RPN + Fast R-CNN           141            10              47          198     5 fps
                ZF        RPN + Fast R-CNN           31             3               25          59     17 fps



100 proposals at test-time, indicating that the top-               (using RPN+ZF) to 59.2% (using RPN+VGG). This is a
ranked RPN proposals are accurate. On the other                    promising result, because it suggests that the proposal
extreme, using the top-ranked 6000 RPN proposals                   quality of RPN+VGG is better than that of RPN+ZF.
(without NMS) has a comparable mAP (55.2%), sug-                   Because proposals of RPN+ZF are competitive with
gesting NMS does not harm the detection mAP and                    SS (both are 58.7% when consistently used for training
may reduce false alarms.                                           and testing), we may expect RPN+VGG to be better
   Next, we separately investigate the roles of RPN’s              than SS. The following experiments justify this hy-
cls and reg outputs by turning off either of them                  pothesis.
at test-time. When the cls layer is removed at test-
                                                                   Performance of VGG-16. Table 3 shows the results
time (thus no NMS/ranking is used), we randomly
                                                                   of VGG-16 for both proposal and detection. Using
sample N proposals from the unscored regions. The
                                                                   RPN+VGG, the result is 68.5% for unshared features,
mAP is nearly unchanged with N = 1000 (55.8%), but
                                                                   slightly higher than the SS baseline. As shown above,
degrades considerably to 44.6% when N = 100. This
                                                                   this is because the proposals generated by RPN+VGG
shows that the cls scores account for the accuracy of
                                                                   are more accurate than SS. Unlike SS that is pre-
the highest ranked proposals.
                                                                   defined, the RPN is actively trained and benefits from
  On the other hand, when the reg layer is removed                 better networks. For the feature-shared variant, the
at test-time (so the proposals become anchor boxes),               result is 69.9%—better than the strong SS baseline, yet
the mAP drops to 52.1%. This suggests that the high-               with nearly cost-free proposals. We further train the
quality proposals are mainly due to the regressed box              RPN and detection network on the union set of PAS-
bounds. The anchor boxes, though having multiple                   CAL VOC 2007 trainval and 2012 trainval. The mAP
scales and aspect ratios, are not sufficient for accurate          is 73.2%. Figure 5 shows some results on the PASCAL
detection.                                                         VOC 2007 test set. On the PASCAL VOC 2012 test set
  We also evaluate the effects of more powerful net-               (Table 4), our method has an mAP of 70.4% trained
works on the proposal quality of RPN alone. We use                 on the union set of VOC 2007 trainval+test and VOC
VGG-16 to train the RPN, and still use the above                   2012 trainval. Table 6 and Table 7 show the detailed
detector of SS+ZF. The mAP improves from 56.8%                     numbers.
                                                                                                                                                                                            9


Table 6: Results on PASCAL VOC 2007 test set with Fast R-CNN detectors and VGG-16. For RPN, the train-time
proposals for Fast R-CNN are 2000. RPN∗ denotes the unsharing feature version.
 method   # box      data           mAP     areo   bike    bird   boat    bottle    bus   car    cat   chair   cow   table   dog   horse   mbike person plant   sheep   sofa   train   tv

  SS      2000        07            66.9   74.5 78.3 69.2 53.2 36.6 77.3 78.2 82.0 40.7 72.7 67.9 79.6 79.2 73.0 69.0 30.1 65.4 70.2 75.8 65.8
  SS      2000       07+12          70.0   77.0 78.1 69.3 59.4 38.3 81.6 78.6 86.7 42.8 78.8 68.9 84.7 82.0 76.6 69.9 31.8 70.1 74.8 80.4 70.4
 RPN∗      300        07            68.5   74.1 77.2 67.7 53.9 51.0 75.1 79.2 78.9 50.7 78.0 61.1 79.1 81.9 72.2 75.9 37.2 71.4 62.5 77.4 66.4
 RPN       300        07            69.9   70.0 80.6 70.1 57.3 49.9 78.2 80.4 82.0 52.2 75.3 67.2 80.3 79.8 75.0 76.3 39.1 68.3 67.3 81.1 67.6
 RPN       300       07+12          73.2   76.5 79.0 70.9 65.5 52.1 83.1 84.7 86.4 52.0 81.9 65.7 84.8 84.6 77.5 76.7 38.8 73.6 73.9 83.0 72.6
 RPN       300    COCO+07+12        78.8   84.3 82.0 77.7 68.9 65.7 88.1 88.4 88.9 63.6 86.3 70.8 85.9 87.6 80.1 82.3 53.6 80.4 75.8 86.6 78.9



Table 7: Results on PASCAL VOC 2012 test set with Fast R-CNN detectors and VGG-16. For RPN, the train-time
proposals for Fast R-CNN are 2000.
 method   # box      data           mAP     areo   bike    bird   boat    bottle    bus   car    cat   chair   cow   table   dog   horse   mbike person plant   sheep   sofa   train   tv

  SS      2000        12            65.7   80.3 74.7 66.9 46.9 37.7 73.9 68.6 87.7 41.7 71.1 51.1 86.0 77.8 79.8 69.8 32.1 65.5 63.8 76.4 61.7
  SS      2000      07++12          68.4   82.3 78.4 70.8 52.3 38.7 77.8 71.6 89.3 44.2 73.0 55.0 87.5 80.5 80.8 72.0 35.1 68.3 65.7 80.4 64.2
 RPN       300        12            67.0   82.3 76.4 71.0 48.4 45.2 72.1 72.3 87.3 42.2 73.7 50.0 86.8 78.7 78.4 77.4 34.5 70.1 57.1 77.1 58.9
 RPN       300      07++12          70.4   84.9 79.8 74.3 53.9 49.8 77.5 75.9 88.5 45.6 77.1 55.3 86.9 81.7 80.9 79.6 40.1 72.6 60.9 81.2 61.5
 RPN       300    COCO+07++12       75.9   87.4 83.6 76.8 62.9 59.6 81.9 82.0 91.3 54.9 82.6 59.0 89.0 85.5 84.7 84.1 52.2 78.9 65.5 85.4 70.2




Table 8: Detection results of Faster R-CNN on PAS-                                              3 scales and 3 aspect ratios (69.9% mAP in Table 8).
CAL VOC 2007 test set using different settings of                                               If using just one anchor at each position, the mAP
anchors. The network is VGG-16. The training data                                               drops by a considerable margin of 3-4%. The mAP
is VOC 2007 trainval. The default setting of using 3                                            is higher if using 3 scales (with 1 aspect ratio) or 3
scales and 3 aspect ratios (69.9%) is the same as that                                          aspect ratios (with 1 scale), demonstrating that using
in Table 3.                                                                                     anchors of multiple sizes as the regression references
       settings         anchor scales              aspect ratios mAP (%)                        is an effective solution. Using just 3 scales with 1
                                1282                       1:1                     65.8         aspect ratio (69.8%) is as good as using 3 scales with
 1 scale, 1 ratio
                           2562                1:1                                 66.7         3 aspect ratios on this dataset, suggesting that scales
                           1282          {2:1, 1:1, 1:2}                           68.8
1 scale, 3 ratios                                                                               and aspect ratios are not disentangled dimensions for
                           2562          {2:1, 1:1, 1:2}                           67.9
                                                                                                the detection accuracy. But we still adopt these two
3 scales, 1 ratio {128 , 2562 , 5122 }
                       2                       1:1                                 69.8
                                                                                                dimensions in our designs to keep our system flexible.
3 scales, 3 ratios {1282 , 2562 , 5122 } {2:1, 1:1, 1:2}                           69.9
                                                                                                   In Table 9 we compare different values of λ in Equa-
                                                                                                tion (1). By default we use λ = 10 which makes the
Table 9: Detection results of Faster R-CNN on PAS-                                              two terms in Equation (1) roughly equally weighted
CAL VOC 2007 test set using different values of λ                                               after normalization. Table 9 shows that our result is
in Equation (1). The network is VGG-16. The training                                            impacted just marginally (by ∼ 1%) when λ is within
data is VOC 2007 trainval. The default setting of using                                         a scale of about two orders of magnitude (1 to 100).
λ = 10 (69.9%) is the same as that in Table 3.                                                  This demonstrates that the result is insensitive to λ in
            λ                 0.1           1              10            100                    a wide range.
          mAP (%)            67.2          68.9           69.9           69.1                   Analysis of Recall-to-IoU. Next we compute the
                                                                                                recall of proposals at different IoU ratios with ground-
                                                                                                truth boxes. It is noteworthy that the Recall-to-IoU
                                                                                                metric is just loosely [19], [20], [21] related to the
  In Table 5 we summarize the running time of the
                                                                                                ultimate detection accuracy. It is more appropriate to
entire object detection system. SS takes 1-2 seconds
                                                                                                use this metric to diagnose the proposal method than
depending on content (on average about 1.5s), and
                                                                                                to evaluate it.
Fast R-CNN with VGG-16 takes 320ms on 2000 SS
                                                                                                  In Figure 4, we show the results of using 300, 1000,
proposals (or 223ms if using SVD on fully-connected
                                                                                                and 2000 proposals. We compare with SS and EB, and
layers [2]). Our system with VGG-16 takes in total
                                                                                                the N proposals are the top-N ranked ones based on
198ms for both proposal and detection. With the con-
                                                                                                the confidence generated by these methods. The plots
volutional features shared, the RPN alone only takes
                                                                                                show that the RPN method behaves gracefully when
10ms computing the additional layers. Our region-
                                                                                                the number of proposals drops from 2000 to 300. This
wise computation is also lower, thanks to fewer pro-
                                                                                                explains why the RPN has a good ultimate detection
posals (300 per image). Our system has a frame-rate
                                                                                                mAP when using as few as 300 proposals. As we
of 17 fps with the ZF net.
                                                                                                analyzed before, this property is mainly attributed to
Sensitivities to Hyper-parameters. In Table 8 we                                                the cls term of the RPN. The recall of SS and EB drops
investigate the settings of anchors. By default we use                                          more quickly than RPN when the proposals are fewer.
                                                                                                                                                           10


                                       ϯϬϬƉƌŽƉŽƐĂůƐ                           ϭϬϬϬƉƌŽƉŽƐĂůƐ                           ϮϬϬϬƉƌŽƉŽƐĂůƐ
                           ϭ                                           ϭ                                        ϭ

                          Ϭ͘ϴ                                         Ϭ͘ϴ                                      Ϭ͘ϴ

                          Ϭ͘ϲ                                         Ϭ͘ϲ                                      Ϭ͘ϲ
                 ZĞĐĂůů          ^^                                          ^^                                       ^^
                          Ϭ͘ϰ                                         Ϭ͘ϰ                                      Ϭ͘ϰ
                                                                                                                  
                          Ϭ͘Ϯ    ZWE&                               Ϭ͘Ϯ    ZWE&                            Ϭ͘Ϯ    ZWE&
                                 ZWEs''                                     ZWEs''                                  ZWEs''
                           Ϭ                                           Ϭ                                        Ϭ
                           Ϭ͘ϱ   Ϭ͘ϲ      Ϭ͘ϳ         Ϭ͘ϴ   Ϭ͘ϵ   ϭ    Ϭ͘ϱ   Ϭ͘ϲ   Ϭ͘ϳ         Ϭ͘ϴ   Ϭ͘ϵ   ϭ    Ϭ͘ϱ   Ϭ͘ϲ   Ϭ͘ϳ         Ϭ͘ϴ   Ϭ͘ϵ    ϭ
                                                /Žh                                      /Žh                                      /Žh

                      Figure 4: Recall vs. IoU overlap ratio on the PASCAL VOC 2007 test set.

Table 10: One-Stage Detection vs. Two-Stage Proposal + Detection. Detection results are on the PASCAL
VOC 2007 test set using the ZF model and Fast R-CNN. RPN uses unshared features.
                                                            proposals                                            detector                       mAP (%)
             Two-Stage               RPN + ZF, unshared                                 300          Fast R-CNN + ZF, 1 scale                       58.7
             One-Stage           dense, 3 scales, 3 aspect ratios                      20000         Fast R-CNN + ZF, 1 scale                       53.8
             One-Stage           dense, 3 scales, 3 aspect ratios                      20000         Fast R-CNN + ZF, 5 scales                      53.9



One-Stage Detection vs. Two-Stage Proposal + De-                                          region proposals with sliding windows leads to ∼6%
tection. The OverFeat paper [9] proposes a detection                                      degradation in both papers. We also note that the one-
method that uses regressors and classifiers on sliding                                    stage system is slower as it has considerably more
windows over convolutional feature maps. OverFeat                                         proposals to process.
is a one-stage, class-specific detection pipeline, and ours
is a two-stage cascade consisting of class-agnostic pro-
posals and class-specific detections. In OverFeat, the                                    4.2        Experiments on MS COCO
region-wise features come from a sliding window of                                        We present more results on the Microsoft COCO
one aspect ratio over a scale pyramid. These features                                     object detection dataset [12]. This dataset involves 80
are used to simultaneously determine the location and                                     object categories. We experiment with the 80k images
category of objects. In RPN, the features are from                                        on the training set, 40k images on the validation set,
square (3×3) sliding windows and predict proposals                                        and 20k images on the test-dev set. We evaluate the
relative to anchors with different scales and aspect                                      mAP averaged for IoU ∈ [0.5 : 0.05 : 0.95] (COCO’s
ratios. Though both methods use sliding windows, the                                      standard metric, simply denoted as mAP@[.5, .95])
region proposal task is only the first stage of Faster R-                                 and mAP@0.5 (PASCAL VOC’s metric).
CNN—the downstream Fast R-CNN detector attends                                               There are a few minor changes of our system made
to the proposals to refine them. In the second stage of                                   for this dataset. We train our models on an 8-GPU
our cascade, the region-wise features are adaptively                                      implementation, and the effective mini-batch size be-
pooled [1], [2] from proposal boxes that more faith-                                      comes 8 for RPN (1 per GPU) and 16 for Fast R-CNN
fully cover the features of the regions. We believe                                       (2 per GPU). The RPN step and Fast R-CNN step are
these features lead to more accurate detections.                                          both trained for 240k iterations with a learning rate
   To compare the one-stage and two-stage systems,                                        of 0.003 and then for 80k iterations with 0.0003. We
we emulate the OverFeat system (and thus also circum-                                     modify the learning rates (starting with 0.003 instead
vent other differences of implementation details) by                                      of 0.001) because the mini-batch size is changed. For
one-stage Fast R-CNN. In this system, the “proposals”                                     the anchors, we use 3 aspect ratios and 4 scales
are dense sliding windows of 3 scales (128, 256, 512)                                     (adding 642 ), mainly motivated by handling small
and 3 aspect ratios (1:1, 1:2, 2:1). Fast R-CNN is                                        objects on this dataset. In addition, in our Fast R-CNN
trained to predict class-specific scores and regress box                                  step, the negative samples are defined as those with
locations from these sliding windows. Because the                                         a maximum IoU with ground truth in the interval of
OverFeat system adopts an image pyramid, we also                                          [0, 0.5), instead of [0.1, 0.5) used in [1], [2]. We note
evaluate using convolutional features extracted from                                      that in the SPPnet system [1], the negative samples
5 scales. We use those 5 scales as in [1], [2].                                           in [0.1, 0.5) are used for network fine-tuning, but the
   Table 10 compares the two-stage system and two                                         negative samples in [0, 0.5) are still visited in the SVM
variants of the one-stage system. Using the ZF model,                                     step with hard-negative mining. But the Fast R-CNN
the one-stage system has an mAP of 53.9%. This is                                         system [2] abandons the SVM step, so the negative
lower than the two-stage system (58.7%) by 4.8%.                                          samples in [0, 0.1) are never visited. Including these
This experiment justifies the effectiveness of cascaded                                   [0, 0.1) samples improves mAP@0.5 on the COCO
region proposals and object detection. Similar obser-                                     dataset for both Fast R-CNN and Faster R-CNN sys-
vations are reported in [2], [39], where replacing SS                                     tems (but the impact is negligible on PASCAL VOC).
                                                                                                                      11


           Table 11: Object detection results (%) on the MS COCO dataset. The model is VGG-16.
                                                                 COCO val                   COCO test-dev
  method                        proposals training data      mAP@.5 mAP@[.5, .95]         mAP@.5 mAP@[.5, .95]
  Fast R-CNN [2]                   SS, 2000 COCO train           -             -             35.9          19.7
  Fast R-CNN [impl. in this paper] SS, 2000 COCO train          38.6          18.9           39.3          19.3
  Faster R-CNN                     RPN, 300 COCO train          41.5          21.2           42.1          21.5
  Faster R-CNN                     RPN, 300 COCO trainval        -             -             42.7          21.9



  The rest of the implementation details are the same       Table 12: Detection mAP (%) of Faster R-CNN on
as on PASCAL VOC. In particular, we keep using              PASCAL VOC 2007 test set and 2012 test set us-
300 proposals and single-scale (s = 600) testing. The       ing different training data. The model is VGG-16.
testing time is still about 200ms per image on the          “COCO” denotes that the COCO trainval set is used
COCO dataset.                                               for training. See also Table 6 and Table 7.
                                                                  training data           2007 test    2012 test
  In Table 11 we first report the results of the Fast             VOC07                      69.9         67.0
R-CNN system [2] using the implementation in this                 VOC07+12                   73.2          -
paper. Our Fast R-CNN baseline has 39.3% mAP@0.5                  VOC07++12                   -           70.4
on the test-dev set, higher than that reported in [2].            COCO (no VOC)              76.1         73.0
We conjecture that the reason for this gap is mainly              COCO+VOC07+12              78.8          -
due to the definition of the negative samples and also            COCO+VOC07++12              -           75.9
the changes of the mini-batch sizes. We also note that
the mAP@[.5, .95] is just comparable.
                                                            4.3 From MS COCO to PASCAL VOC
  Next we evaluate our Faster R-CNN system. Using
the COCO training set to train, Faster R-CNN has            Large-scale data is of crucial importance for improv-
42.1% mAP@0.5 and 21.5% mAP@[.5, .95] on the                ing deep neural networks. Next, we investigate how
COCO test-dev set. This is 2.8% higher for mAP@0.5          the MS COCO dataset can help with the detection
and 2.2% higher for mAP@[.5, .95] than the Fast R-          performance on PASCAL VOC.
CNN counterpart under the same protocol (Table 11).            As a simple baseline, we directly evaluate the
This indicates that RPN performs excellent for im-          COCO detection model on the PASCAL VOC dataset,
proving the localization accuracy at higher IoU thresh-     without fine-tuning on any PASCAL VOC data. This
olds. Using the COCO trainval set to train, Faster R-       evaluation is possible because the categories on
CNN has 42.7% mAP@0.5 and 21.9% mAP@[.5, .95] on            COCO are a superset of those on PASCAL VOC. The
the COCO test-dev set. Figure 6 shows some results          categories that are exclusive on COCO are ignored in
on the MS COCO test-dev set.                                this experiment, and the softmax layer is performed
                                                            only on the 20 categories plus background. The mAP
Faster R-CNN in ILSVRC & COCO 2015 compe-                   under this setting is 76.1% on the PASCAL VOC 2007
titions We have demonstrated that Faster R-CNN              test set (Table 12). This result is better than that trained
benefits more from better features, thanks to the fact      on VOC07+12 (73.2%) by a good margin, even though
that the RPN completely learns to propose regions by        the PASCAL VOC data are not exploited.
neural networks. This observation is still valid even          Then we fine-tune the COCO detection model on
when one increases the depth substantially to over          the VOC dataset. In this experiment, the COCO model
100 layers [18]. Only by replacing VGG-16 with a 101-       is in place of the ImageNet-pre-trained model (that
layer residual net (ResNet-101) [18], the Faster R-CNN      is used to initialize the network weights), and the
system increases the mAP from 41.5%/21.2% (VGG-             Faster R-CNN system is fine-tuned as described in
16) to 48.4%/27.2% (ResNet-101) on the COCO val             Section 3.2. Doing so leads to 78.8% mAP on the
set. With other improvements orthogonal to Faster R-        PASCAL VOC 2007 test set. The extra data from
CNN, He et al. [18] obtained a single-model result of       the COCO set increases the mAP by 5.6%. Table 6
55.7%/34.9% and an ensemble result of 59.0%/37.4%           shows that the model trained on COCO+VOC has
on the COCO test-dev set, which won the 1st place           the best AP for every individual category on PASCAL
in the COCO 2015 object detection competition. The          VOC 2007. Similar improvements are observed on the
same system [18] also won the 1st place in the ILSVRC       PASCAL VOC 2012 test set (Table 12 and Table 7). We
2015 object detection competition, surpassing the sec-      note that the test-time speed of obtaining these strong
ond place by absolute 8.5%. RPN is also a building          results is still about 200ms per image.
block of the 1st-place winning entries in ILSVRC 2015
localization and COCO 2015 segmentation competi-            5    C ONCLUSION
tions, for which the details are available in [18] and      We have presented RPNs for efficient and accurate
[15] respectively.                                          region proposal generation. By sharing convolutional
                                                                                                                                                                                                                                                                                                                                    12

                                                                                                                                                               person : 0.918              cow : 0.995
                                                                                                                             bird : 0.902

                                                   person : 0.988
                                                                                                                                                                                                                                                                                  person : 0.992
                           car : 0.745                                 person : 0.797                       bird : 0.978
                  car : 0.955      horse : 0.991




                                                                                                                                                                 bird : 0.972                                                                                    cow : 0.998

                                                                                                         bird : 0.941




                                                                                                                                                                                                                                                                                                                 bottle : 0.726




                                                                                                                         person : 0.964       person : 0.988                                                                            person : 0.986
                   car : 0.999                                                                                                                                                                                                                person : 0 .993 : 0.959
                                                                                                                                                                                                                                                     person
                                                                                                                                   person : 0.976
                                                            person : 0.929                                    person : 0.994
                                                                                                   person : 0.991                                                                                        car : 0.997                                   car : 0.980




                                                                                                                                                                                                                                                                                         dog : 0.981




                                                                                                                                                                                                                           cow : 0.979              person : 0.998
                                           person : 0.961                                                                                                                                       cow : 0.974
                                   person : 0.958
                                                                                                                                                                                                                    cow : 0.979
                                                                                                                   bus : 0.999
                                                                  person : 0.960                                                                                                            cow : 0.892
                                                                                                                                                                                                          cow : 0.985
                                                                                                              person : 0.985                            person : 0.995
                                                                                                                                                                    person : 0.996
                                                         person : 0.757                            person : 0.994


                  dog : 0.697




                  cat : 0.998


                                                                                                                                     person : 0.917
                                                                                                                                                                                                 boat : 0.671
                                                                                                        car : 1.000                                                                                                     boat : 0.895                boat : 0.749

                                                                                                                                                                                                                           boat : 0.877


                                                                                                                                                                                                                                                 person : 0.988


                                                                                                                                                                                                                                                                                                       person : 0.995




                                                                                                                                                                                                                                                                                               bicycle :: 0.981
                                                                                                                                                                                                                                                                                 person : 0.994person     0.987
                                                                                                                                                                       person : 0.930     person : 0.940
                                                                                                                                                                                                     person : 0.893
                                                                                                                                                                                                                                                                                               bicycle : 0.972
                                                                                                                                                                                                                                                                                 bicycle : 0.977
                                             boat : 0.992
                                                                                                                                person : 0.962
                                                                                                                                                                                                                    dog : 0.987
                                                                                                 pottedplant : 0.951



                                                                                                                    bottle : 0.851
                                                                                                            bottle : 0.962


                                   boat : 0.693                                                   diningtable : 0.791
                  boat : 0.846
                                                                                                                                                                                                                                                                                                                   person : 0.948
                                                                                                                                                                                                                                                                                  person : 0.972    person : 0.919


                                                                                                                                                              pottedplant : 0.728
                          car : 1.000                  car : 0.880
                  car : 0.981
                                                                                 car : 0.982       chair : 0.630
                                                                                                                                                                                         boat : 0.995
                                                                                                                                                                                                                                                                  boat : 0.948
                                                                                                        diningtable : 0.862
                                                                                                                                                                     bottle : 0.826




                                                                                                                                                                                                                                                 boat : 0.692
                                                                                                                                                                                                                                            boat : 0.808

                                                                                                                                                                                                                                                                                          person : 0.975




                  aeroplane : 0.992                                                                                                                                                                      bird : 0.998
                                                             aeroplane : 0.986
                                                                                                                                  sheep : 0.970

                                                                                                                                                                                                                                                 bird : 0.980


                                                                                                                                                                                                                         bird : 0.806                                                   person : 0.670

                                                                                                                                                                                                                                                                                  horse : 0.984




                                                                                                  aeroplane : 0.998


                                                               pottedplant : 0.820

                                                                                                                                                                                          chair : 0.984
                                                                                                                                                                                                     diningtable : 0.997
                                 pottedplant : 0.993                                                                                                                                                                                           chair : 0.978
                                                                                                                                                                                                        chair : 0.962
                                                                                                                                                                                                                               chair : 0.976
                     pottedplant : 0.715                                                           car : 0.907
                                                                                                            person : 0.993                                           person : 0.987




                                 pottedplant : 0.940

                     pottedplant : 0.869




                         tvmonitor : 0.945
                                                                                                                                                                                              person : 0.983

                                                                                                     aeroplane : 0.978                                                                                                                                                                            bird : 0.997
                                           tvmonitor : 0.993


                                                                                 chair : 0.723
                                                                  person  : 0.968                                                                                                                          chair : 0.982         tvmonitor : 0.993             person : 0.959
                                                             bottle : 0.789


                                                                                                                                       person : 0.988
                  diningtable : 0.903                                                                                                                                                                                    bottle : 0.858
                                                                                                                                                                                         chair : 0.852             bottle : 0.616 bottle : 0.903
                                                                                                                                                                                                                                         person : 0.897
                                                                                                                                                                                           person : 0.870


                                                                                                                                                                                                                                                bottle : 0.884
                                                                                                                                                                                                                                                                                                                 bird : 0.727




Figure 5: Selected examples of object detection results on the PASCAL VOC 2007 test set using the Faster
R-CNN system. The model is VGG-16 and the training data is 07+12 trainval (73.2% mAP on the 2007 test
set). Our method detects objects of a wide range of scales and aspect ratios. Each output box is associated
with a category label and a softmax score in [0, 1]. A score threshold of 0.6 is used to display these images.
The running time for obtaining these results is 198ms per image, including all steps.


features with the down-stream detection network, the                                                                                                                                    R EFERENCES
region proposal step is nearly cost-free. Our method
enables a unified, deep-learning-based object detec-                                                                                                                                    [1]               K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling
tion system to run at near real-time frame rates. The                                                                                                                                                     in deep convolutional networks for visual recognition,” in
learned RPN also improves region proposal quality                                                                                                                                                         European Conference on Computer Vision (ECCV), 2014.
                                                                                                                                                                                        [2]               R. Girshick, “Fast R-CNN,” in IEEE International Conference on
and thus the overall object detection accuracy.                                                                                                                                                           Computer Vision (ICCV), 2015.
                                                                                                                                                                                        [3]               K. Simonyan and A. Zisserman, “Very deep convolutional
                                                                                                                                                                                                                                                                                                                                                                                                                                                     13


                                                                                         person : 0.975958                                                                                                                                                                                             traffic light : 0.802
                                    person : 0.941
                                               person : 0.673               person : 0.928 person : 0.
                                                                                                   person : 0.823
                                                                                                                                        airplane : 0.997
                                                                                         person : 0.759
                                           person : 0.766               backpack : 0.756                                                                                                                                                       person : 0.772
                                                               person : 0.939
                                                     person : 0.976                                                                                                                                                                                person  : 0.842                                     person : 0.841
                                                                                                                                                                                  person : 0.867                                                   umbrella : 0.824     person : 0.897                                    car : 0.957
                                                                                     person : 0.950
                           handbag : 0.848                                               person : 0.805                                                                                                                                                                                                                                                            clock : 0.986
                                                                                                                                                                                                                                                          person : 0.950    person : 0.931
                                                                                                                                                                                                                                                                                                      person : 0.970                                                                 clock : 0.981
                                                                                                                                                                                                                                                        person : 0.916

                                                                                                                                                                                                                                               motorcycle : 0.713
                                                                  dog : 0.996
                                                                                                                                                                                                                                                                       bicycle : 0.891
                                                   dog : 0.691                                                                                                                                                                                     bicycle : 0.639
                              person : 0.996
                                                                                                                                                                                                                                                                                                                                                                                                                       person : 0.800
                                                                                                                                                                                                                                                                                                    motorcycle : 0.827




                                                                                                                                                                                                                                                                                                                                                                                                      person : 0.808
                                                                                                                                                                                                                                                                                                                                                                 pizza : 0.985
                                                                          person : 0.998
                                                                                                                                                                                                                                                                                                                                                       dining table : 0.956
                                                                                                                                                                                                                                                                                                                                                       pizza : 0.938
                                                                                                                                                                                                                                                              bed : 0.999
                                                                                                                                                                                                                                                                                                                                                           pizza : 0.995
                                                                                                                                                                                                                                                                                                                                                                                       pizza : 0.982
                                                                                                                                                                                  clock : 0.982




                                                                     skis : 0.919                                                                                                                                                                                                                                                                                                               bottle : 0.627




                                                                                                                                                      bowl : 0.759



                                                        giraffe : 0.989      giraffe : 0.993
                                    giraffe : 0.988                                                                                                                                                                                                                           person : 0.999



                                                                                                                                                                   broccoli : 0.953
                                                                                                                                                                                                                                                                                                                                                                                                   boat : 0.992
                                                                                                                                                                                                                                                                      person : 0.934



                                                                                                                                                                                                                                                                                             surfboard : 0.979                                                                                         umbrella : 0.885
                                                                                                                                                                                                                                                                                                                                                                 person : 0.691 person : 0.716
                                                                                                                                                                                                                                                                                                                                                       person : 0.940
                                                                                                                                                                                                                                                                                                                                                                 person : 0.854       person : 0.927
                                                                                                                                                                                                                                                                                                                                                                                                  person : 0.665
                                                                                                                                                                                                                                                                                                                                                                                                       person : 0.692          person : 0.618
                                                                                                                                                                                                                                                                                                                                                        person : 0.825person : 0.813             person : 0.864




                                                                                                                                                    teddy bear : 0.999

                                     bus : 0.999


                                                                                                                                       teddy bear : 0.738                                                       teddy bear : 0.802
                                                                                                                                                                                                                                                                                            potted plant : 0.769
                                                                                                                                                                                             teddy bear : 0.890


                                                                                                      person
                                                                                                    person : : 0.869
                                                                                                            0.970

                                                                                                                                                                                                                                                                                               bowl : 0.602                                                                                                  sink : 0.976 sink : 0.938
                                                                                                                                                                                                                                                                                                                                                                                         sink : 0.994
                                                                                                                                                                                                                                                                                                       toilet : 0.921                                                sink : 0.992
                                                                                                                                                                                                                                                                         sink : 0.969




                                                                                                                                                                                                                        book : 0.611


                                                                                                                                                                                          tv : 0.964

                                                                                                                                                                         bottle : 0.768                                                                                                                                                                                                                                      traffic light : 0.713
                                                                                                                                                                                                       laptop : 0.986                                                                                                                                                   traffic light : 0.869
                                                                                                      couch : 0.627
                                                                                                                                                                                                                                                                                                                                                                              train : 0.965
                                   couch : 0.991                                                                                                                                          mouse : 0.871                                                                                          boat : 0.613           boat : 0.746
                                                                                           couch : 0.719                                 tv : 0.959                                                                                                                  boat : 0.758
                                                                                                                                                                     keyboard : 0.956
                                                              dining table : 0.637
                                                                                                                                                                                             mouse : 0.677
                           chair : 0.631                                                                                                                                                                                                                                                 bench : 0.971

                                                                                                                                        chair : 0.644                                                                                                                                        person : 0.986

                                                                                                                                                                                                  cup : 0.720




                                                                                                                                                                                      frisbee : 0.998
                                                                                                                                                                                                                                                                                                                                                                        person : 0.723

                                                                                                                                                                                                                                                                                                                                                                              cup : 0.931
                                                                                                                                                                                                                                                                                                                                                       dining table : 0.941                       cup : 0.986
                                                                                                                                                                                                                                                                                bird : 0.968
                                                                                                                                                                                                   dog : 0.966
                                                                                                                                                                                                                                                                                                                                                                                                                                   bowl : 0.958
                                                                                            zebra : 0.996
                                                                            zebra : 0.970
                                                                                       zebra : 0.848
                                        zebra : 0.993                                                                                                                                                                                                                                                                                                                         sandwich : 0.629



                                                                                                                                                                                                                                                    bird : 0.987


                                                                                                                                                                                                                                                             bird : 0.894




                                                                                                                                                   person :tv
                                                                                                                                                            0 :.792
                                                                                                                                                                 0.711                                          person : 0.917
                                                                     refrigerator : 0.699
                                               person : 0.993

                                                                                                           bottle : 0.982
                                                                                                                                                           laptop : 0.973
                                                                                                                                                                                                                                                                                     tennis :racket
                                                                                                                                                                                                                                                                                    person    0.999 : 0.960                                                                   horse : 0.990
                                                                                                                                                                                                                                                                                                                                                                                                              bird : 0.746
                                                                                oven : 0.655                                                                                                                                                                                                                                                                                                  bird : 0.956

                                                                                                                                                              keyboard : 0.638
                                                                                                                                                                                                                                                                                                                                                             bird : 0.906
                                                                                                                                         keyboard : 0.615
                                                                                                                                                                                                            mouse : 0.981




                                                                                                              dining table : 0.888   cup : 0.990                                                       car : 0.816                                                                                                                                                                                                   toothbrush : 0.668
                                                                                                                                                                                                              person : 0.984


                                           refrigerator : 0.631                                                    pizza : 0.919
                                                                                                                                                                                                                                                                                                                                        kite : 0.934

                                                                                                                                                                                                                                                                                         clock : 0.988
                     bowl : 0.744
                                                                      bowl : 0.816
                                                      bowl : 0.710                                                                                                                                                                                                                                                                                                person : 0.998
                             bowl : 0.847

                                                                          cup : 0.807

                                                                                                                                                                                                                               pizza : 0.965



                                                                           chair : 0.772
                    oven : 0.969
                                              dining table : 0.618




Figure 6: Selected examples of object detection results on the MS COCO test-dev set using the Faster R-CNN
system. The model is VGG-16 and the training data is COCO trainval (42.7% mAP@0.5 on the test-dev set).
Each output box is associated with a category label and a softmax score in [0, 1]. A score threshold of 0.6 is
used to display these images. For each image, one color represents one object category in that image.


      networks for large-scale image recognition,” in International                                                                                                                                                                            [7]  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional
      Conference on Learning Representations (ICLR), 2015.                                                                                                                                                                                          networks for semantic segmentation,” in IEEE Conference on
[4]   J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeul-                                                                                                                                                                               Computer Vision and Pattern Recognition (CVPR), 2015.
      ders, “Selective search for object recognition,” International                                                                                                                                                                           [8] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ra-
      Journal of Computer Vision (IJCV), 2013.                                                                                                                                                                                                      manan, “Object detection with discriminatively trained part-
[5]   R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature                                                                                                                                                                              based models,” IEEE Transactions on Pattern Analysis and Ma-
      hierarchies for accurate object detection and semantic seg-                                                                                                                                                                                   chine Intelligence (TPAMI), 2010.
      mentation,” in IEEE Conference on Computer Vision and Pattern                                                                                                                                                                            [9] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus,
      Recognition (CVPR), 2014.                                                                                                                                                                                                                     and Y. LeCun, “Overfeat: Integrated recognition, localization
[6]   C. L. Zitnick and P. Dollár, “Edge boxes: Locating object                                                                                                                                                                                    and detection using convolutional networks,” in International
      proposals from edges,” in European Conference on Computer                                                                                                                                                                                     Conference on Learning Representations (ICLR), 2014.
      Vision (ECCV), 2014.                                                                                                                                                                                                                     [10] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards
                                                                                                                                              14



     real-time object detection with region proposal networks,” in          [36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
     Neural Information Processing Systems (NIPS), 2015.                         Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg,
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and                 and L. Fei-Fei, “ImageNet Large Scale Visual Recognition
     A. Zisserman, “The PASCAL Visual Object Classes Challenge                   Challenge,” in International Journal of Computer Vision (IJCV),
     2007 (VOC2007) Results,” 2007.                                              2015.
[12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ra-           [37] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classi-
     manan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Com-                 fication with deep convolutional neural networks,” in Neural
     mon Objects in Context,” in European Conference on Computer                 Information Processing Systems (NIPS), 2012.
     Vision (ECCV), 2014.                                                   [38] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Gir-
[13] S. Song and J. Xiao, “Deep sliding shapes for amodal 3d object              shick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional
     detection in rgb-d images,” arXiv:1511.02300, 2015.                         architecture for fast feature embedding,” arXiv:1408.5093, 2014.
[14] J. Zhu, X. Chen, and A. L. Yuille, “DeePM: A deep part-based           [39] K. Lenc and A. Vedaldi, “R-CNN minus R,” in British Machine
     model for object detection and semantic part localization,”                 Vision Conference (BMVC), 2015.
     arXiv:1511.07131, 2015.
[15] J. Dai, K. He, and J. Sun, “Instance-aware semantic segmenta-
     tion via multi-task network cascades,” arXiv:1512.04412, 2015.
[16] J. Johnson, A. Karpathy, and L. Fei-Fei, “Densecap: Fully
     convolutional localization networks for dense captioning,”
     arXiv:1511.07571, 2015.
[17] D. Kislyuk, Y. Liu, D. Liu, E. Tzeng, and Y. Jing, “Human cu-
     ration and convnets: Powering item-to-item recommendations
     on pinterest,” arXiv:1511.04003, 2015.
[18] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning
     for image recognition,” arXiv:1512.03385, 2015.
[19] J. Hosang, R. Benenson, and B. Schiele, “How good are de-
     tection proposals, really?” in British Machine Vision Conference
     (BMVC), 2014.
[20] J. Hosang, R. Benenson, P. Dollár, and B. Schiele, “What makes
     for effective detection proposals?” IEEE Transactions on Pattern
     Analysis and Machine Intelligence (TPAMI), 2015.
[21] N. Chavali, H. Agrawal, A. Mahendru, and D. Batra,
     “Object-Proposal Evaluation Protocol is ’Gameable’,” arXiv:
     1505.05836, 2015.
[22] J. Carreira and C. Sminchisescu, “CPMC: Automatic ob-
     ject segmentation using constrained parametric min-cuts,”
     IEEE Transactions on Pattern Analysis and Machine Intelligence
     (TPAMI), 2012.
[23] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik,
     “Multiscale combinatorial grouping,” in IEEE Conference on
     Computer Vision and Pattern Recognition (CVPR), 2014.
[24] B. Alexe, T. Deselaers, and V. Ferrari, “Measuring the object-
     ness of image windows,” IEEE Transactions on Pattern Analysis
     and Machine Intelligence (TPAMI), 2012.
[25] C. Szegedy, A. Toshev, and D. Erhan, “Deep neural networks
     for object detection,” in Neural Information Processing Systems
     (NIPS), 2013.
[26] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, “Scalable
     object detection using deep neural networks,” in IEEE Confer-
     ence on Computer Vision and Pattern Recognition (CVPR), 2014.
[27] C. Szegedy, S. Reed, D. Erhan, and D. Anguelov, “Scalable,
     high-quality object detection,” arXiv:1412.1441 (v1), 2015.
[28] P. O. Pinheiro, R. Collobert, and P. Dollar, “Learning to
     segment object candidates,” in Neural Information Processing
     Systems (NIPS), 2015.
[29] J. Dai, K. He, and J. Sun, “Convolutional feature masking
     for joint object and stuff segmentation,” in IEEE Conference on
     Computer Vision and Pattern Recognition (CVPR), 2015.
[30] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun, “Ob-
     ject detection networks on convolutional feature maps,”
     arXiv:1504.06066, 2015.
[31] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and
     Y. Bengio, “Attention-based models for speech recognition,”
     in Neural Information Processing Systems (NIPS), 2015.
[32] M. D. Zeiler and R. Fergus, “Visualizing and understanding
     convolutional neural networks,” in European Conference on
     Computer Vision (ECCV), 2014.
[33] V. Nair and G. E. Hinton, “Rectified linear units improve
     restricted boltzmann machines,” in International Conference on
     Machine Learning (ICML), 2010.
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov,
     D. Erhan, and A. Rabinovich, “Going deeper with convo-
     lutions,” in IEEE Conference on Computer Vision and Pattern
     Recognition (CVPR), 2015.
[35] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard,
     W. Hubbard, and L. D. Jackel, “Backpropagation applied to
     handwritten zip code recognition,” Neural computation, 1989.