U-Net: Convolutional Networks for Biomedical Image Segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox

Computer Science Department and BIOSS Centre for Biological Signalling Studies,
University of Freiburg, Germany
ronneber@informatik.uni-freiburg.de,
WWW home page: http://lmb.informatik.uni-freiburg.de/

arXiv:1505.04597v1 [cs.CV] 18 May 2015



Abstract. It is widely agreed that successful training of deep networks
requires many thousands of annotated training samples. In this paper, we present
a network and training strategy that relies on the strong use of data
augmentation to use the available annotated samples more efficiently. The
architecture consists of a contracting path to capture context and a symmetric
expanding path that enables precise localization. We show that such a network
can be trained end-to-end from very few images and outperforms the prior best
method (a sliding-window convolutional network) on the ISBI challenge for
segmentation of neuronal structures in electron microscopic stacks. Using the
same network trained on transmitted light microscopy images (phase contrast and
DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large
margin. Moreover, the network is fast. Segmentation of a 512x512 image takes
less than a second on a recent GPU. The full implementation (based on Caffe) and
the trained networks are available at
http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.


1    Introduction

In the last two years, deep convolutional networks have outperformed the state of
the art in many visual recognition tasks, e.g. [7,3]. While convolutional networks
have already existed for a long time [8], their success was limited due to the
size of the available training sets and the size of the considered networks. The
breakthrough by Krizhevsky et al. [7] was due to supervised training of a large
network with 8 layers and millions of parameters on the ImageNet dataset with
1 million training images. Since then, even larger and deeper networks have been
trained [12].
    The typical use of convolutional networks is on classification tasks, where
the output to an image is a single class label. However, in many visual tasks,
especially in biomedical image processing, the desired output should include
localization, i.e., a class label is supposed to be assigned to each pixel.
Moreover, thousands of training images are usually beyond reach in biomedical tasks.
Hence, Ciresan et al. [1] trained a network in a sliding-window setup to predict
the class label of each pixel by providing a local region (patch) around that pixel

[Figure 1: architecture diagram. Input image tile 572 x 572 (1 channel); output
segmentation map 388 x 388 (2 channels). Legend: conv 3x3, ReLU; copy and crop;
max pool 2x2; up-conv 2x2; conv 1x1.]

Fig. 1. U-net architecture (example for 32x32 pixels in the lowest resolution). Each blue
box corresponds to a multi-channel feature map. The number of channels is denoted
on top of the box. The x-y-size is provided at the lower left edge of the box. White
boxes represent copied feature maps. The arrows denote the different operations.




as input. First, this network can localize. Secondly, the training data in terms
of patches is much larger than the number of training images. The resulting
network won the EM segmentation challenge at ISBI 2012 by a large margin.
    Obviously, the strategy in Ciresan et al. [1] has two drawbacks. First, it
is quite slow because the network must be run separately for each patch, and
there is a lot of redundancy due to overlapping patches. Secondly, there is a
trade-off between localization accuracy and the use of context. Larger patches
require more max-pooling layers that reduce the localization accuracy, while
small patches allow the network to see only little context. More recent approaches
[11,4] proposed a classifier output that takes into account the features from
multiple layers. Good localization and the use of context are possible at the
same time.
    In this paper, we build upon a more elegant architecture, the so-called “fully
convolutional network” [9]. We modify and extend this architecture such that it
works with very few training images and yields more precise segmentations; see
Figure 1. The main idea in [9] is to supplement a usual contracting network by
successive layers, where pooling operators are replaced by upsampling operators.
Hence, these layers increase the resolution of the output. In order to localize, high
resolution features from the contracting path are combined with the upsampled

Fig. 2. Overlap-tile strategy for seamless segmentation of arbitrarily large images (here
segmentation of neuronal structures in EM stacks). Prediction of the segmentation in
the yellow area requires image data within the blue area as input. Missing input data
is extrapolated by mirroring.



output. A successive convolution layer can then learn to assemble a more precise
output based on this information.
    One important modification in our architecture is that in the upsampling
part we have also a large number of feature channels, which allow the network
to propagate context information to higher resolution layers. As a consequence,
the expansive path is more or less symmetric to the contracting path, and yields
a u-shaped architecture. The network does not have any fully connected layers
and only uses the valid part of each convolution, i.e., the segmentation map only
contains the pixels, for which the full context is available in the input image.
This strategy allows the seamless segmentation of arbitrarily large images by an
overlap-tile strategy (see Figure 2). To predict the pixels in the border region
of the image, the missing context is extrapolated by mirroring the input image.
This tiling strategy is important to apply the network to large images, since
otherwise the resolution would be limited by the GPU memory.
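
The mirroring step can be illustrated with a minimal sketch. It assumes a single-channel image stored as a list of rows; the helper names `mirror_index` and `mirror_pad` are hypothetical, and the sketch reflects about the border pixel without repeating it, which is one of several possible mirroring conventions (the text does not say which one is used). The margin of 92 pixels is derived from the tile sizes in Figure 1, (572 - 388) / 2.

```python
def mirror_index(i, n):
    """Map an out-of-range index i onto [0, n) by reflecting at the borders."""
    if n == 1:
        return 0
    period = 2 * (n - 1)
    i = i % period  # Python's % is non-negative for a positive period
    return i if i < n else period - i


def mirror_pad(image, margin):
    """Return `image` extended by `margin` pixels on every side via mirroring."""
    h, w = len(image), len(image[0])
    return [
        [image[mirror_index(y, h)][mirror_index(x, w)]
         for x in range(-margin, w + margin)]
        for y in range(-margin, h + margin)
    ]


# Border implied by Figure 1: a 572 x 572 input tile yields a 388 x 388 output.
MARGIN = (572 - 388) // 2  # 92 pixels per side

tiny = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
padded = mirror_pad(tiny, 1)  # 5 x 5 image with a mirrored one-pixel border
```

Under these assumptions, padding a full image by `MARGIN` on every side provides the context needed to predict output tiles that touch the image border.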
    Since very little training data is available for our tasks, we use extensive
data augmentation by applying elastic deformations to the available training
images. This allows the network to learn invariance to such deformations, without
the need to see these transformations in the annotated image corpus. This is
particularly important in biomedical segmentation, since deformation is
the most common variation in tissue and realistic deformations can be simulated
efficiently. The value of data augmentation for learning invariance has been
shown in Dosovitskiy et al. [2] in the scope of unsupervised feature learning.
    Another challenge in many cell segmentation tasks is the separation of touch-
ing objects of the same class; see Figure 3. To this end, we propose the use of
a weighted loss, where the separating background labels between touching cells
obtain a large weight in the loss function.
    The resulting network is applicable to various biomedical segmentation prob-
lems. In this paper, we show results on the segmentation of neuronal structures
in EM stacks (an ongoing competition started at ISBI 2012), where we
outperformed the network of Ciresan et al. [1]. Furthermore, we show results for
cell segmentation in light microscopy images from the ISBI cell tracking chal-
lenge 2015. Here we won with a large margin on the two most challenging 2D
transmitted light datasets.

2   Network Architecture
The network architecture is illustrated in Figure 1. It consists of a contracting
path (left side) and an expansive path (right side). The contracting path follows
the typical architecture of a convolutional network. It consists of the repeated
application of two 3x3 convolutions (unpadded convolutions), each followed by
a rectified linear unit (ReLU), and then a 2x2 max pooling operation with stride 2
for downsampling. At each downsampling step we double the number of feature
channels. Every step in the expansive path consists of an upsampling of the
feature map followed by a 2x2 convolution (“up-convolution”) that halves the
number of feature channels, a concatenation with the correspondingly cropped
feature map from the contracting path, and two 3x3 convolutions, each fol-
lowed by a ReLU. The cropping is necessary due to the loss of border pixels in
every convolution. At the final layer a 1x1 convolution is used to map each 64-
component feature vector to the desired number of classes. In total the network
has 23 convolutional layers.
    To allow a seamless tiling of the output segmentation map (see Figure 2), it
is important to select the input tile size such that all 2x2 max-pooling operations
are applied to a layer with an even x- and y-size.
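
The size arithmetic behind this constraint can be traced with a short sketch. It assumes square tiles and the four-level layout of Figure 1; the function names `unet_output_size` and `valid_input_sizes` are hypothetical and not part of the released implementation.

```python
def unet_output_size(input_size, depth=4):
    """Trace the side length of a square tile through the network of Figure 1.

    Each resolution step applies two unpadded 3x3 convolutions (each removes
    2 pixels). The contracting path then halves the size with 2x2 max pooling;
    the expansive path doubles it with the 2x2 up-convolution.
    Raises ValueError if a max pooling would see an odd size.
    """
    size = input_size
    for level in range(depth):          # contracting path
        size -= 4                       # two 3x3 valid convolutions
        if size <= 0 or size % 2 != 0:
            raise ValueError(f"level {level}: max pooling would see size {size}")
        size //= 2                      # 2x2 max pooling, stride 2
    size -= 4                           # two convolutions at the lowest resolution
    for _ in range(depth):              # expansive path
        size = 2 * size - 4             # up-convolution, then two 3x3 convolutions
    if size <= 0:
        raise ValueError("input tile too small")
    return size


def valid_input_sizes(lo, hi):
    """List the input sizes in [lo, hi] for which every pooling sees an even size."""
    sizes = []
    for s in range(lo, hi + 1):
        try:
            unet_output_size(s)
        except ValueError:
            continue
        sizes.append(s)
    return sizes


# One way to arrive at the 23 convolutional layers stated in the text:
# 2 per resolution step on 5 levels, (1 up-conv + 2 convs) on 4 levels, 1 final 1x1.
N_CONV_LAYERS = 2 * 5 + 4 * (1 + 2) + 1
```

For the tile of Figure 1, `unet_output_size(572)` gives 388; under these assumptions the valid input sizes are spaced 16 pixels apart.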

3   Training
The input images and their corresponding segmentation maps are used to train
the network with the stochastic gradient descent implementation of Caffe [6].
Due to the unpadded convolutions, the output image is smaller than the input
by a constant border width. To minimize the overhead and make maximum use
of the GPU memory, we favor large input tiles over a large batch size and hence
reduce the batch to a single image. Accordingly we use a high momentum (0.99)
such that a large number of the previously seen training samples determine the
update in the current optimization step.
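
The effect of a momentum of 0.99 can be made concrete with a small sketch of a classical momentum update. This is a generic formulation, not a transcription of the Caffe solver; the learning rate is not given in this section, so `lr` is a placeholder, and both function names are hypothetical.

```python
def sgd_momentum_step(weights, grads, velocity, lr=0.01, momentum=0.99):
    """One classical momentum update on flat lists of parameters.

    lr is a placeholder value; the text only specifies momentum = 0.99.
    """
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


def gradient_weight(age, momentum=0.99):
    """Relative weight of a gradient seen `age` steps ago in the current update."""
    return momentum ** age


# The weights form a geometric series whose sum is 1 / (1 - momentum),
# so roughly the last 100 single-image batches shape each update.
effective_horizon = 1.0 / (1.0 - 0.99)
```

With a batch of a single image, this long memory plays the role that averaging over a larger batch would otherwise play.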
    The energy function is computed by a pixel-wise soft-max over the final
feature map combined with the cross entropy loss function. The soft-max is
defined as $p_k(x) = \exp(a_k(x)) / \sum_{k'=1}^{K} \exp(a_{k'}(x))$, where $a_k(x)$ denotes the
activation in feature channel $k$ at the pixel position $x \in \Omega$ with $\Omega \subset \mathbb{Z}^2$. $K$
is the number of classes and $p_k(x)$ is the approximated maximum-function, i.e.
$p_k(x) \approx 1$ for the $k$ that has the maximum activation $a_k(x)$ and $p_k(x) \approx 0$ for
all other $k$. The cross entropy then penalizes at each position the deviation of
$p_{\ell(x)}(x)$ from 1 using

$$E = \sum_{x \in \Omega} w(x) \log(p_{\ell(x)}(x)) \qquad (1)$$




Fig. 3. HeLa cells on glass recorded with DIC (differential interference contrast) mi-
croscopy. (a) raw image. (b) overlay with ground truth segmentation. Different colors
indicate different instances of the HeLa cells. (c) generated segmentation mask (white:
foreground, black: background). (d) map with a pixel-wise loss weight to force the
network to learn the border pixels.


where $\ell : \Omega \to \{1, \dots, K\}$ is the true label of each pixel and $w : \Omega \to \mathbb{R}$ is
a weight map that we introduced to give some pixels more importance in the
training.
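
A minimal sketch of the pixel-wise soft-max and the weighted loss follows. It makes several assumptions that are not in the text: pixels are flattened into a list, labels are zero-based (0..K-1 rather than 1..K), and the returned value is the negative of the weighted sum of log-probabilities, which is the usual sign convention for a quantity that is minimized. The function names are hypothetical.

```python
import math


def softmax(activations):
    """Soft-max over the K channel activations of one pixel."""
    m = max(activations)  # subtract the maximum for numerical stability
    exps = [math.exp(a - m) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(activation_map, labels, weights):
    """Weighted pixel-wise cross entropy.

    activation_map: list of per-pixel activation lists (length K each)
    labels:         list of true labels, zero-based (assumption)
    weights:        list of per-pixel weights w(x)
    Returns the negative weighted sum of log-probabilities (assumed sign).
    """
    loss = 0.0
    for a, label, w in zip(activation_map, labels, weights):
        p = softmax(a)
        loss -= w * math.log(p[label])
    return loss


# Two pixels, two classes: the second pixel carries ten times the weight.
example_loss = weighted_cross_entropy(
    [[2.0, 0.0], [0.0, 0.0]], labels=[0, 1], weights=[1.0, 10.0]
)
```

In this example the uncertain, heavily weighted second pixel dominates the loss, which is the intended effect of the weight map.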
    We pre-compute the weight map for each ground truth segmentation to
compensate for the different frequency of pixels from a certain class in the training
data set, and to force the network to learn the small separation borders that we
introduce between touching cells (see Figure 3c and d).
    The separation border is computed using morphological operations. The
weight map is then computed as

$$w(x) = w_c(x) + w_0 \cdot \exp\left(-\frac{(d_1(x) + d_2(x))^2}{2\sigma^2}\right) \qquad (2)$$

where $w_c : \Omega \to \mathbb{R}$ is the weight map to balance the class frequencies, $d_1 : \Omega \to \mathbb{R}$
denotes the distance to the border of the nearest cell and $d_2 : \Omega \to \mathbb{R}$ the distance
to the border of the second nearest cell. In our experiments we set $w_0 = 10$ and
$\sigma \approx 5$ pixels.
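
Equation (2) can be evaluated per pixel as in the following sketch. The brute-force distance search, the representation of each cell as a list of border pixel coordinates, and the constant class-balancing weight `wc = 1.0` are illustrative assumptions; the text computes the separation border with morphological operations and uses a full class-balancing map. Both function names are hypothetical.

```python
import math


def two_nearest_distances(pixel, cell_borders):
    """Distances from `pixel` to the borders of the nearest and second nearest cell.

    cell_borders: list of cells, each a list of (y, x) border coordinates.
    Requires at least two cells.
    """
    py, px = pixel
    dists = sorted(
        min(math.hypot(py - y, px - x) for y, x in border)
        for border in cell_borders
    )
    return dists[0], dists[1]


def border_weight(d1, d2, wc=1.0, w0=10.0, sigma=5.0):
    """Weight of Eq. (2); wc stands in for the class-balancing map value."""
    return wc + w0 * math.exp(-((d1 + d2) ** 2) / (2.0 * sigma ** 2))


# A background pixel squeezed between two cells gets a large weight,
# while a pixel far from any second cell stays close to wc.
d1, d2 = two_nearest_distances((0, 2), [[(0, 0)], [(0, 5)]])
narrow_gap = border_weight(d1, d2)
far_away = border_weight(30.0, 40.0)
```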
    In deep networks with many convolutional layers and different paths through
the network, a good initialization of the weights is extremely important. Oth-
erwise, parts of the network might give excessive activations, while other parts
never contribute. Ideally the initial weights should be adapted such that each
feature map in the network has approximately unit variance. For a network with
our architecture (alternating convolution and ReLU layers) this can be achieved
by drawing the initial weights from a Gaussian distribution with a standard
deviation of $\sqrt{2/N}$, where $N$ denotes the number of incoming nodes of one neuron
[5]. E.g. for a 3x3 convolution and 64 feature channels in the previous layer
$N = 9 \cdot 64 = 576$.
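
The initialization rule can be sketched as follows, using only the standard library. The nested-list weight layout (output channel, input channel, kernel row, kernel column) and the function names are assumptions made for illustration.

```python
import math
import random


def he_std(kernel_h, kernel_w, in_channels):
    """Standard deviation sqrt(2 / N), with N the number of incoming nodes."""
    n = kernel_h * kernel_w * in_channels
    return math.sqrt(2.0 / n)


def init_conv_weights(out_channels, in_channels, kernel=3, seed=0):
    """Draw convolution weights from a zero-mean Gaussian with std sqrt(2 / N)."""
    rng = random.Random(seed)
    std = he_std(kernel, kernel, in_channels)
    return [
        [
            [[rng.gauss(0.0, std) for _ in range(kernel)] for _ in range(kernel)]
            for _ in range(in_channels)
        ]
        for _ in range(out_channels)
    ]


# Example from the text: 3x3 convolution, 64 input channels, so N = 576.
std_576 = he_std(3, 3, 64)  # sqrt(2 / 576), roughly 0.0589
weights = init_conv_weights(out_channels=64, in_channels=64)
```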


3.1   Data Augmentation

Data augmentation is essential to teach the network the desired invariance and
robustness properties, when only few training samples are available. In case of
microscopical images we primarily need shift and rotation invariance as well as
robustness to deformations and gray value variations. Especially random elas-
tic deformations of the training samples seem to be the key concept to train
a segmentation network with very few annotated images. We generate smooth
deformations using random displacement vectors on a coarse 3 by 3 grid. The
displacements are sampled from a Gaussian distribution with 10 pixels standard
deviation. Per-pixel displacements are then computed using bicubic interpola-
tion. Drop-out layers at the end of the contracting path perform further implicit
data augmentation.
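
The elastic deformation can be sketched in pure Python as below. Note the deliberate simplification: the sketch uses bilinear interpolation, both for upsampling the coarse displacement grid and for resampling the image, whereas the text describes bicubic interpolation. Clamping at the image border and all function names are likewise assumptions. Images are lists of rows with at least two rows and two columns.

```python
import random


def bilinear(grid, y, x):
    """Bilinearly interpolate a 2D list at fractional (y, x), clamped to the grid."""
    rows, cols = len(grid), len(grid[0])
    y = min(max(y, 0.0), rows - 1.0)
    x = min(max(x, 0.0), cols - 1.0)
    y0 = min(int(y), rows - 2)
    x0 = min(int(x), cols - 2)
    fy, fx = y - y0, x - x0
    top = grid[y0][x0] * (1 - fx) + grid[y0][x0 + 1] * fx
    bottom = grid[y0 + 1][x0] * (1 - fx) + grid[y0 + 1][x0 + 1] * fx
    return top * (1 - fy) + bottom * fy


def random_displacement_field(height, width, grid_size=3, sigma=10.0, rng=None):
    """Sample displacements on a coarse grid and upsample them to every pixel."""
    rng = rng or random.Random()
    coarse_dy = [[rng.gauss(0.0, sigma) for _ in range(grid_size)]
                 for _ in range(grid_size)]
    coarse_dx = [[rng.gauss(0.0, sigma) for _ in range(grid_size)]
                 for _ in range(grid_size)]
    sy = (grid_size - 1) / (height - 1)  # image row -> coarse grid row
    sx = (grid_size - 1) / (width - 1)   # image column -> coarse grid column
    dy = [[bilinear(coarse_dy, y * sy, x * sx) for x in range(width)]
          for y in range(height)]
    dx = [[bilinear(coarse_dx, y * sy, x * sx) for x in range(width)]
          for y in range(height)]
    return dy, dx


def elastic_deform(image, dy, dx):
    """Resample `image` at the displaced positions (y + dy, x + dx)."""
    h, w = len(image), len(image[0])
    return [[bilinear(image, y + dy[y][x], x + dx[y][x]) for x in range(w)]
            for y in range(h)]
```

In a training pipeline the same displacement field would presumably be applied to the image and to its segmentation map, so that the labels stay aligned with the deformed pixels.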


4   Experiments
We demonstrate the application of the u-net to three different segmentation
tasks. The first task is the segmentation of neuronal structures in electron mi-
croscopic recordings. An example of the data set and our obtained segmentation
is displayed in Figure 2. We provide the full result as Supplementary Material.
The data set is provided by the EM segmentation challenge [14] that was started
at ISBI 2012 and is still open for new contributions. The training data is a set of
30 images (512x512 pixels) from serial section transmission electron microscopy
of the Drosophila first instar larva ventral nerve cord (VNC). Each image comes
with a corresponding fully annotated ground truth segmentation map for cells
(white) and membranes (black). The test set is publicly available, but its seg-
mentation maps are kept secret. An evaluation can be obtained by sending the
predicted membrane probability map to the organizers. The evaluation is done
by thresholding the map at 10 different levels and computation of the “warping
error”, the “Rand error” and the “pixel error” [14].
    The u-net (averaged over 7 rotated versions of the input data) achieves,
without any further pre- or postprocessing, a warping error of 0.0003529 (the new
best score, see Table 1) and a Rand error of 0.0382.
    This is significantly better than the sliding-window convolutional network
result by Ciresan et al. [1], whose best submission had a warping error of 0.000420
and a Rand error of 0.0504. In terms of Rand error the only better performing


Table 1. Ranking on the EM segmentation challenge [14] (March 6th, 2015), sorted
by warping error.

      Rank   Group name           Warping Error   Rand Error   Pixel Error
             ** human values **   0.000005        0.0021       0.0010
       1.    u-net                0.000353        0.0382       0.0611
       2.    DIVE-SCI             0.000355        0.0305       0.0584
       3.    IDSIA [1]            0.000420        0.0504       0.0613
       4.    DIVE                 0.000430        0.0545       0.0582
       ...
      10.    IDSIA-SCI            0.000653        0.0189       0.1027




Fig. 4. Result on the ISBI cell tracking challenge. (a) part of an input image of the
“PhC-U373” data set. (b) Segmentation result (cyan mask) with manual ground truth
(yellow border) (c) input image of the “DIC-HeLa” data set. (d) Segmentation result
(random colored masks) with manual ground truth (yellow border).

    Table 2. Segmentation results (IOU) on the ISBI cell tracking challenge 2015.

                     Name               PhC-U373     DIC-HeLa
                     IMCB-SG (2014)     0.2669       0.2935
                     KTH-SE (2014)      0.7953       0.4607
                     HOUS-US (2014)     0.5323       -
                     second-best 2015   0.83         0.46
                     u-net (2015)       0.9203       0.7756

algorithms on this data set use highly data set specific post-processing methods¹
applied to the probability map of Ciresan et al. [1].
    We also applied the u-net to a cell segmentation task in light microscopic
images. This segmentation task is part of the ISBI cell tracking challenge 2014 and
2015 [10,13]. The first data set “PhC-U373”² contains Glioblastoma-astrocytoma
U373 cells on a polyacrylamide substrate recorded by phase contrast microscopy
(see Figure 4a,b and Supp. Material). It contains 35 partially annotated training
images. Here we achieve an average IOU (“intersection over union”) of 92%,
which is significantly better than the second best algorithm with 83% (see
Table 2). The second data set “DIC-HeLa”³ consists of HeLa cells on a flat glass recorded
by differential interference contrast (DIC) microscopy (see Figure 3, Figure 4c,d
and Supp. Material). It contains 20 partially annotated training images. Here we
achieve an average IOU of 77.5% which is significantly better than the second
best algorithm with 46%.
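
For reference, intersection over union for a pair of binary masks can be computed as in this sketch. It shows the plain mask-level measure only; the challenge's exact evaluation protocol (for example, how predicted and reference objects are matched) is not described here and is not reproduced. Returning 1.0 when both masks are empty is an assumed convention, and the function name is hypothetical.

```python
def iou(pred, truth):
    """Intersection over union of two binary masks given as lists of rows."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks agree perfectly.
    return intersection / union if union else 1.0


pred = [[1, 1, 0],
        [0, 0, 0]]
truth = [[0, 1, 1],
         [0, 0, 0]]
score = iou(pred, truth)  # one shared pixel out of three covered pixels
```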


¹ The authors of this algorithm have submitted 78 different solutions to achieve this
  result.
² Data set provided by Dr. Sanjay Kumar. Department of Bioengineering University
  of California at Berkeley. Berkeley CA (USA)
³ Data set provided by Dr. Gert van Cappellen Erasmus Medical Center. Rotterdam.
  The Netherlands


5    Conclusion
The u-net architecture achieves very good performance on very different biomedical
segmentation applications. Thanks to data augmentation with elastic
deformations, it only needs very few annotated images and has a very reasonable
training time of only 10 hours on an NVidia Titan GPU (6 GB). We provide the
full Caffe[6]-based implementation and the trained networks⁴. We are sure that
the u-net architecture can be applied easily to many more tasks.


Acknowledgements
This study was supported by the Excellence Initiative of the German Federal
and State governments (EXC 294) and by the BMBF (Fkz 0316185B).


References
 1. Ciresan, D.C., Gambardella, L.M., Giusti, A., Schmidhuber, J.: Deep neural net-
    works segment neuronal membranes in electron microscopy images. In: NIPS. pp.
    2852–2860 (2012)
 2. Dosovitskiy, A., Springenberg, J.T., Riedmiller, M., Brox, T.: Discriminative un-
    supervised feature learning with convolutional neural networks. In: NIPS (2014)
 3. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for ac-
    curate object detection and semantic segmentation. In: Proceedings of the IEEE
    Conference on Computer Vision and Pattern Recognition (CVPR) (2014)
 4. Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Hypercolumns for object
    segmentation and fine-grained localization (2014), arXiv:1411.5752 [cs.CV]
 5. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-
    level performance on imagenet classification (2015), arXiv:1502.01852 [cs.CV]
 6. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadar-
    rama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding
    (2014), arXiv:1408.5093 [cs.CV]
 7. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
    volutional neural networks. In: NIPS. pp. 1106–1114 (2012)
 8. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.,
    Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural
    Computation 1(4), 541–551 (1989)
 9. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic
    segmentation (2014), arXiv:1411.4038 [cs.CV]
10. Maska, M., (...), de Solorzano, C.O.: A benchmark for comparison of cell tracking
    algorithms. Bioinformatics 30, 1609–1617 (2014)
11. Seyedhosseini, M., Sajjadi, M., Tasdizen, T.: Image segmentation with cascaded
    hierarchical models and logistic disjunctive normal networks. In: Computer Vision
    (ICCV), 2013 IEEE International Conference on. pp. 2168–2175 (2013)
12. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
    image recognition (2014), arXiv:1409.1556 [cs.CV]
13. WWW: Web page of the cell tracking challenge,
    http://www.codesolorzano.com/celltrackingchallenge/Cell_Tracking_Challenge/Welcome.html
14. WWW: Web page of the EM segmentation challenge,
    http://brainiac2.mit.edu/isbi_challenge/

⁴ U-net implementation, trained networks and supplementary material available at
  http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net