Source material


REPORTS
Fig. 3. Theory, presented as the experiment (see Fig. 1). The SHG source is the magnetic component of the Lorentz force on metal electrons in the SRRs.

The setup for measuring the SHG is described in the supporting online material (22). We expect that the SHG strongly depends on the resonance that is excited. Obviously, the incident polarization and the detuning of the laser wavelength from the resonance are of particular interest. One possibility for controlling the detuning is to change the laser wavelength for a given sample, which is difficult because of the extremely broad tuning range required. Thus, we follow an alternative route, lithographic tuning (in which the incident laser wavelength of 1.5 μm, as well as the detection system, remains fixed), and tune the resonance positions by changing the SRR size. In this manner, we can also guarantee that the optical properties of the SRR constituent materials are identical for all configurations. The blue bars in Fig. 1 summarize the measured SHG signals. For excitation of the LC resonance in Fig. 1A (horizontal incident polarization), we find an SHG signal that is 500 times above the noise level. As expected for SHG, this signal closely scales with the square of the incident power (Fig. 2A). The polarization of the SHG emission is nearly vertical (Fig. 2B). The small angle with respect to the vertical is due to deviations from perfect mirror symmetry of the SRRs (see electron micrographs in Fig. 1). Small detuning of the LC resonance toward smaller wavelength (i.e., to 1.3-μm wavelength) reduces the SHG signal strength from 100% to 20%. For excitation of the Mie resonance with vertical incident polarization in Fig. 1D, we find a small signal just above the noise level. For excitation of the Mie resonance with horizontal incident polarization in Fig. 1C, a small but significant SHG emission is found, which is again polarized nearly vertically. For completeness, Fig. 1B shows the off-resonant case for the smaller SRRs for vertical incident polarization.

Although these results are compatible with the known selection rules of surface SHG from usual nonlinear optics (23), these selection rules do not explain the mechanism of SHG. Following our above argumentation on the magnetic component of the Lorentz force, we numerically calculate first the linear electric and magnetic field distributions (22); from these fields, we compute the electron velocities and the Lorentz-force field (fig. S1). In the spirit of a metamaterial, the transverse component of the Lorentz-force field can be spatially averaged over the volume of the unit cell of size a by a by t. This procedure delivers the driving force for the transverse SHG polarization. As usual, the SHG intensity is proportional to the square modulus of the nonlinear electron displacement. Thus, the SHG strength is expected to be proportional to the square modulus of the driving force, and the SHG polarization is directed along the driving-force vector. Corresponding results are summarized in Fig. 3 in the same arrangement as Fig. 1 to allow for a direct comparison between experiment and theory. The agreement is generally good, both for linear optics and for SHG. In particular, we find a much larger SHG signal for excitation of those two resonances (Fig. 3, A and C), which are related to a finite magnetic-dipole moment (perpendicular to the SRR plane) as compared with the purely electric Mie resonance (Figs. 1D and 3D), despite the fact that its oscillator strength in the linear spectrum is comparable. The SHG polarization in the theory is strictly vertical for all resonances. Quantitative deviations between the SHG signal strengths of experiment and theory, respectively, are probably due to the simplified SRR shape assumed in our calculations and/or due to the simplicity of our modeling. A systematic microscopic theory of the nonlinear optical properties of metallic metamaterials would be highly desirable but is currently not available.

References and Notes
1. J. B. Pendry, A. J. Holden, D. J. Robbins, W. J. Stewart, IEEE Trans. Microw. Theory Tech. 47, 2075 (1999).
2. J. B. Pendry, Phys. Rev. Lett. 85, 3966 (2000).
3. R. A. Shelby, D. R. Smith, S. Schultz, Science 292, 77 (2001).
4. T. J. Yen et al., Science 303, 1494 (2004).
5. S. Linden et al., Science 306, 1351 (2004).
6. C. Enkrich et al., Phys. Rev. Lett. 95, 203901 (2005).
7. A. N. Grigorenko et al., Nature 438, 335 (2005).
8. G. Dolling, M. Wegener, S. Linden, C. Hormann, Opt. Express 14, 1842 (2006).
9. G. Dolling, C. Enkrich, M. Wegener, C. M. Soukoulis, S. Linden, Science 312, 892 (2006).
10. J. B. Pendry, D. Schurig, D. R. Smith, Science 312, 1780; published online 25 May 2006.
11. U. Leonhardt, Science 312, 1777 (2006); published online 25 May 2006.
12. M. W. Klein, C. Enkrich, M. Wegener, C. M. Soukoulis, S. Linden, Opt. Lett. 31, 1259 (2006).
13. W. J. Padilla, A. J. Taylor, C. Highstrete, M. Lee, R. D. Averitt, Phys. Rev. Lett. 96, 107401 (2006).
14. D. R. Smith, S. Schultz, P. Markos, C. M. Soukoulis, Phys. Rev. B 65, 195104 (2002).
15. S. O’Brien, D. McPeake, S. A. Ramakrishna, J. B. Pendry, Phys. Rev. B 69, 241101 (2004).
16. J. Zhou et al., Phys. Rev. Lett. 95, 223902 (2005).
17. A. K. Popov, V. M. Shalaev, available at http://arxiv.org/abs/physics/0601055 (2006).
18. V. G. Veselago, Sov. Phys. Usp. 10, 509 (1968).
19. M. Wegener, Extreme Nonlinear Optics (Springer, Berlin, 2004).
20. H. M. Barlow, Nature 173, 41 (1954).
21. S.-Y. Chen, M. Maksimchuk, D. Umstadter, Nature 396, 653 (1998).
22. Materials and Methods are available as supporting material on Science Online.
23. P. Guyot-Sionnest, W. Chen, Y. R. Shen, Phys. Rev. B 33, 8254 (1986).
24. We thank the groups of S. W. Koch, J. V. Moloney, and C. M. Soukoulis for discussions. The research of M.W. is supported by the Leibniz award 2000 of the Deutsche Forschungsgemeinschaft (DFG), that of S.L. through a Helmholtz-Hochschul-Nachwuchsgruppe (VH-NG-232).

Supporting Online Material
www.sciencemag.org/cgi/content/full/313/5786/502/DC1
Materials and Methods
Figs. S1 and S2
References

26 April 2006; accepted 22 June 2006
10.1126/science.1129198


Reducing the Dimensionality of Data with Neural Networks

G. E. Hinton* and R. R. Salakhutdinov

High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such “autoencoder” networks, but this works well only if the initial weights are close to a good solution. We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.

Dimensionality reduction facilitates the classification, visualization, communication, and storage of high-dimensional data. A simple and widely used method is principal components analysis (PCA), which finds the directions of greatest variance in the data set and represents each data point by its coordinates along each of these directions. We describe a nonlinear generalization of PCA that uses an adaptive, multilayer “encoder” network


504                                              28 JULY 2006       VOL 313      SCIENCE        www.sciencemag.org
to transform the high-dimensional data into a low-dimensional code and a similar “decoder” network to recover the data from the code. Starting with random weights in the two networks, they can be trained together by minimizing the discrepancy between the original data and its reconstruction. The required gradients are easily obtained by using the chain rule to backpropagate error derivatives first through the decoder network and then through the encoder network (1). The whole system is called an “autoencoder” and is depicted in Fig. 1.

Department of Computer Science, University of Toronto, 6 King’s College Road, Toronto, Ontario M5S 3G4, Canada.
*To whom correspondence should be addressed; E-mail: hinton@cs.toronto.edu

[Figure 1 diagram: three panels labeled Pretraining, Unrolling, and Fine-tuning. Pretraining shows a stack of RBMs with weights W1 to W4 and layer sizes 2000, 1000, 500, and 30. Unrolling shows the encoder (W1 to W4), the 30-unit code layer, and the decoder (W4^T to W1^T). Fine-tuning shows the same network with adjusted weights W1 + ε1 to W4 + ε4 in the encoder and W4^T + ε5 to W1^T + ε8 in the decoder.]

Fig. 1. Pretraining consists of learning a stack of restricted Boltzmann machines (RBMs), each having only one layer of feature detectors. The learned feature activations of one RBM are used as the “data” for training the next RBM in the stack. After the pretraining, the RBMs are “unrolled” to create a deep autoencoder, which is then fine-tuned using backpropagation of error derivatives.

It is difficult to optimize the weights in nonlinear autoencoders that have multiple hidden layers (2–4). With large initial weights, autoencoders typically find poor local minima; with small initial weights, the gradients in the early layers are tiny, making it infeasible to train autoencoders with many hidden layers. If the initial weights are close to a good solution, gradient descent works well, but finding such initial weights requires a very different type of algorithm that learns one layer of features at a time. We introduce this “pretraining” procedure for binary data, generalize it to real-valued data, and show that it works well for a variety of data sets.

An ensemble of binary vectors (e.g., images) can be modeled using a two-layer network called a “restricted Boltzmann machine” (RBM) (5, 6) in which stochastic, binary pixels are connected to stochastic, binary feature detectors using symmetrically weighted connections. The pixels correspond to “visible” units of the RBM because their states are observed; the feature detectors correspond to “hidden” units. A joint configuration (v, h) of the visible and hidden units has an energy (7) given by

$$E(v, h) = -\sum_{i \in \text{pixels}} b_i v_i - \sum_{j \in \text{features}} b_j h_j - \sum_{i,j} v_i h_j w_{ij} \qquad (1)$$

where $v_i$ and $h_j$ are the binary states of pixel i and feature j, $b_i$ and $b_j$ are their biases, and $w_{ij}$ is the weight between them. The network assigns a probability to every possible image via this energy function, as explained in (8). The probability of a training image can be raised by

Fig. 2. (A) Top to bottom: Random samples of curves from the test data set; reconstructions produced by the six-dimensional deep autoencoder; reconstructions by “logistic PCA” (8) using six components; reconstructions by logistic PCA and standard PCA using 18 components. The average squared error per image for the last four rows is 1.44, 7.64, 2.45, 5.90. (B) Top to bottom: A random test image from each class; reconstructions by the 30-dimensional autoencoder; reconstructions by 30-dimensional logistic PCA and standard PCA. The average squared errors for the last three rows are 3.00, 8.01, and 13.87. (C) Top to bottom: Random samples from the test data set; reconstructions by the 30-dimensional autoencoder; reconstructions by 30-dimensional PCA. The average squared errors are 126 and 135.
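
To make the RBM energy of Eq. 1 concrete, the following minimal Python sketch evaluates it for a tiny network with three visible and two hidden units. The function name `rbm_energy`, the list-of-lists weight layout, and all numerical values are illustrative assumptions; they are not taken from the paper or its supporting material.

```python
def rbm_energy(v, h, b_vis, b_hid, w):
    """Energy of a joint configuration (v, h), following Eq. 1.

    v, h   : lists of binary states for visible and hidden units
    b_vis  : visible biases b_i
    b_hid  : hidden biases b_j
    w      : weights, with w[i][j] linking visible unit i to hidden unit j
    """
    visible_term = sum(b_vis[i] * v[i] for i in range(len(v)))
    hidden_term = sum(b_hid[j] * h[j] for j in range(len(h)))
    interaction_term = sum(
        v[i] * h[j] * w[i][j]
        for i in range(len(v))
        for j in range(len(h))
    )
    return -visible_term - hidden_term - interaction_term


# Toy values chosen only for illustration (assumptions).
v = [1, 0, 1]
h = [1, 1]
b_vis = [0.5, -0.2, 0.1]
b_hid = [0.0, 0.3]
w = [[0.2, -0.1], [0.4, 0.0], [-0.3, 0.5]]

energy = rbm_energy(v, h, b_vis, b_hid, w)
print(round(energy, 6))  # -1.2
```

Here the bias terms contribute 0.6 and 0.3 and the pairwise term contributes 0.3, so the energy is -1.2. Configurations with lower energy are the ones the network treats as more probable.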


adjusting the weights and biases to lower the energy of that image and to raise the energy of similar, “confabulated” images that the network would prefer to the real data. Given a training image, the binary state $h_j$ of each feature detector j is set to 1 with probability $\sigma(b_j + \sum_i v_i w_{ij})$, where $\sigma(x)$ is the logistic function $1/[1 + \exp(-x)]$, $b_j$ is the bias of j, $v_i$ is the state of pixel i, and $w_{ij}$ is the weight between i and j. Once binary states have been chosen for the hidden units, a “confabulation” is produced by setting each $v_i$ to 1 with probability $\sigma(b_i + \sum_j h_j w_{ij})$, where $b_i$ is the bias of i. The states of the hidden units are then updated once more so that they represent features of the confabulation. The change in a weight is given by

$$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \right) \qquad (2)$$

where $\varepsilon$ is a learning rate, $\langle v_i h_j \rangle_{\text{data}}$ is the fraction of times that the pixel i and feature detector j are on together when the feature detectors are being driven by data, and $\langle v_i h_j \rangle_{\text{recon}}$ is the corresponding fraction for confabulations. A simplified version of the same learning rule is used for the biases. The learning works well even though it is not exactly following the gradient of the log probability of the training data (6).

A single layer of binary features is not the best way to model the structure in a set of images. After learning one layer of feature detectors, we can treat their activities (when they are being driven by the data) as data for learning a second layer of features. The first layer of feature detectors then become the visible units for learning the next RBM. This layer-by-layer learning can be repeated as many


      Fig. 3. (A) The two-dimensional codes for 500 digits of each class produced by taking the first two principal components of all 60,000 training images. (B) The two-dimensional codes found by a 784-1000-500-250-2 autoencoder. For an alternative visualization, see (8).
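
The weight update of Eq. 2 described in the text can be sketched for a single training vector as follows. This is a minimal illustration under stated assumptions, not the authors' Matlab code: the function names are hypothetical, the two averages in Eq. 2 are estimated from one sample rather than from many training cases, and the bias updates shown are one plausible reading of the "simplified version of the same learning rule".

```python
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_hidden(v, b_hid, w, rng):
    """Set each h_j to 1 with probability sigma(b_j + sum_i v_i w_ij)."""
    states = []
    for j in range(len(b_hid)):
        p = logistic(b_hid[j] + sum(v[i] * w[i][j] for i in range(len(v))))
        states.append(1 if rng.random() < p else 0)
    return states


def sample_visible(h, b_vis, w, rng):
    """Set each v_i to 1 with probability sigma(b_i + sum_j h_j w_ij)."""
    states = []
    for i in range(len(b_vis)):
        p = logistic(b_vis[i] + sum(h[j] * w[i][j] for j in range(len(h))))
        states.append(1 if rng.random() < p else 0)
    return states


def update_step(v_data, b_vis, b_hid, w, lr, rng):
    """One data -> confabulation -> hidden pass, then the Eq. 2 update in place."""
    h_data = sample_hidden(v_data, b_hid, w, rng)
    v_recon = sample_visible(h_data, b_vis, w, rng)    # the "confabulation"
    h_recon = sample_hidden(v_recon, b_hid, w, rng)
    for i in range(len(b_vis)):
        for j in range(len(b_hid)):
            w[i][j] += lr * (v_data[i] * h_data[j] - v_recon[i] * h_recon[j])
    # Assumed form of the simplified rule for the biases.
    for i in range(len(b_vis)):
        b_vis[i] += lr * (v_data[i] - v_recon[i])
    for j in range(len(b_hid)):
        b_hid[j] += lr * (h_data[j] - h_recon[j])
    return v_recon


rng = random.Random(0)                     # fixed seed so runs are repeatable
w = [[0.0, 0.0] for _ in range(3)]         # 3 visible units, 2 hidden units
b_vis = [0.0, 0.0, 0.0]
b_hid = [0.0, 0.0]
v_recon = update_step([1, 0, 1], b_vis, b_hid, w, 0.1, rng)
```

Because the states are binary, each weight moves by exactly the learning rate, its negative, or zero in a single step; averaging over many training cases gives the fractions that appear in Eq. 2.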




      Fig. 4. (A) The fraction of retrieved documents in the same class as the query when a query document from the test set is used to retrieve other test set documents, averaged over all 402,207 possible queries. (B) The codes produced by two-dimensional LSA. (C) The codes produced by a 2000-500-250-125-2 autoencoder.
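
The retrieval measure plotted in Fig. 4A, the fraction of retrieved documents that share the query's class when codes are compared by the cosine of the angle between them, can be sketched on toy data as follows. The function names, the two-dimensional codes, and the class labels are made up for illustration; nonzero code vectors are assumed.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two code vectors (assumed nonzero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_index, codes, k):
    """Indices of the k codes most similar to the query, excluding the query."""
    scored = [
        (cosine(codes[query_index], code), index)
        for index, code in enumerate(codes)
        if index != query_index
    ]
    scored.sort(reverse=True)
    return [index for _, index in scored[:k]]


def fraction_same_class(query_index, codes, labels, k):
    """Fraction of the k retrieved items whose label matches the query's."""
    hits = retrieve(query_index, codes, k)
    same = sum(1 for index in hits if labels[index] == labels[query_index])
    return same / k


# Hypothetical codes and labels, for illustration only.
codes = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["a", "a", "b", "b"]

print(retrieve(0, codes, 3))                     # [1, 3, 2]
print(fraction_same_class(0, codes, labels, 1))  # 1.0
```

Averaging `fraction_same_class` over every possible query, for each number of retrieved documents, would give a curve of the kind shown in Fig. 4A.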




times as desired. It can be shown that adding an extra layer always improves a lower bound on the log probability that the model assigns to the training data, provided the number of feature detectors per layer does not decrease and their weights are initialized correctly (9). This bound does not apply when the higher layers have fewer feature detectors, but the layer-by-layer learning algorithm is nonetheless a very effective way to pretrain the weights of a deep autoencoder. Each layer of features captures strong, high-order correlations between the activities of units in the layer below. For a wide variety of data sets, this is an efficient way to progressively reveal low-dimensional, nonlinear structure.

After pretraining multiple layers of feature detectors, the model is “unfolded” (Fig. 1) to produce encoder and decoder networks that initially use the same weights. The global fine-tuning stage then replaces stochastic activities by deterministic, real-valued probabilities and uses backpropagation through the whole autoencoder to fine-tune the weights for optimal reconstruction.

For continuous data, the hidden units of the first-level RBM remain binary, but the visible units are replaced by linear units with Gaussian noise (10). If this noise has unit variance, the stochastic update rule for the hidden units remains the same and the update rule for visible unit i is to sample from a Gaussian with unit variance and mean $b_i + \sum_j h_j w_{ij}$.

In all our experiments, the visible units of every RBM had real-valued activities, which were in the range [0, 1] for logistic units. While training higher level RBMs, the visible units were set to the activation probabilities of the hidden units in the previous RBM, but the hidden units of every RBM except the top one had stochastic binary values. The hidden units of the top RBM had stochastic real-valued states drawn from a unit variance Gaussian whose mean was determined by the input from that RBM’s logistic visible units. This allowed the low-dimensional codes to make good use of continuous variables and facilitated comparisons with PCA. Details of the pretraining and fine-tuning can be found in (8).

To demonstrate that our pretraining algorithm allows us to fine-tune deep networks efficiently, we trained a very deep autoencoder on a synthetic data set containing images of “curves” that were generated from three randomly chosen points in two dimensions (8). For this data set, the true intrinsic dimensionality is known, and the relationship between the pixel intensities and the six numbers used to generate them is highly nonlinear. The pixel intensities lie between 0 and 1 and are very non-Gaussian, so we used logistic output units in the autoencoder, and the fine-tuning stage of the learning minimized the cross-entropy error $[-\sum_i p_i \log \hat{p}_i - \sum_i (1 - p_i) \log(1 - \hat{p}_i)]$, where $p_i$ is the intensity of pixel i and $\hat{p}_i$ is the intensity of its reconstruction.

The autoencoder consisted of an encoder with layers of size (28 × 28)-400-200-100-50-25-6 and a symmetric decoder. The six units in the code layer were linear and all the other units were logistic. The network was trained on 20,000 images and tested on 10,000 new images. The autoencoder discovered how to convert each 784-pixel image into six real numbers that allow almost perfect reconstruction (Fig. 2A). PCA gave much worse reconstructions. Without pretraining, the very deep autoencoder always reconstructs the average of the training data, even after prolonged fine-tuning (8). Shallower autoencoders with a single hidden layer between the data and the code can learn without pretraining, but pretraining greatly reduces their total training time (8). When the number of parameters is the same, deep autoencoders can produce lower reconstruction errors on test data than shallow ones, but this advantage disappears as the number of parameters increases (8).

Next, we used a 784-1000-500-250-30 autoencoder to extract codes for all the handwritten digits in the MNIST training set (11). The Matlab code that we used for the pretraining and fine-tuning is available in (8). Again, all units were logistic except for the 30 linear units in the code layer. After fine-tuning on all 60,000 training images, the autoencoder was tested on 10,000 new images and produced much better reconstructions than did PCA (Fig. 2B). A two-dimensional autoencoder produced a better visualization of the data than did the first two principal components (Fig. 3).

We also used a 625-2000-1000-500-30 autoencoder with linear input units to discover 30-dimensional codes for grayscale image patches that were derived from the Olivetti face data set (12). The autoencoder clearly outperformed PCA (Fig. 2C).

When trained on documents, autoencoders produce codes that allow fast retrieval. We represented each of 804,414 newswire stories (13) as a vector of document-specific probabilities of the 2000 commonest word stems, and we trained a 2000-500-250-125-10 autoencoder on half of the stories with the use of the multiclass cross-entropy error function $[-\sum_i p_i \log \hat{p}_i]$ for the fine-tuning. The 10 code units were linear and the remaining hidden units were logistic. When the cosine of the angle between two codes was used to measure similarity, the autoencoder clearly outperformed latent semantic analysis (LSA) (14), a well-known document retrieval method based on PCA (Fig. 4). Autoencoders (8) also outperform local linear embedding, a recent nonlinear dimensionality reduction algorithm (15).

Layer-by-layer pretraining can also be used for classification and regression. On a widely used version of the MNIST handwritten digit recognition task, the best reported error rates are 1.6% for randomly initialized backpropagation and 1.4% for support vector machines. After layer-by-layer pretraining in a 784-500-500-2000-10 network, backpropagation using steepest descent and a small learning rate achieves 1.2% (8). Pretraining helps generalization because it ensures that most of the information in the weights comes from modeling the images. The very limited information in the labels is used only to slightly adjust the weights found by pretraining.

It has been obvious since the 1980s that backpropagation through deep autoencoders would be very effective for nonlinear dimensionality reduction, provided that computers were fast enough, data sets were big enough, and the initial weights were close enough to a good solution. All three conditions are now satisfied. Unlike nonparametric methods (15, 16), autoencoders give mappings in both directions between the data and code spaces, and they can be applied to very large data sets because both the pretraining and the fine-tuning scale linearly in time and space with the number of training cases.

References and Notes
1. D. C. Plaut, G. E. Hinton, Comput. Speech Lang. 2, 35 (1987).
2. D. DeMers, G. Cottrell, Advances in Neural Information Processing Systems 5 (Morgan Kaufmann, San Mateo, CA, 1993), pp. 580–587.
3. R. Hecht-Nielsen, Science 269, 1860 (1995).
4. N. Kambhatla, T. Leen, Neural Comput. 9, 1493 (1997).
5. P. Smolensky, Parallel Distributed Processing: Volume 1: Foundations, D. E. Rumelhart, J. L. McClelland, Eds. (MIT Press, Cambridge, 1986), pp. 194–281.
6. G. E. Hinton, Neural Comput. 14, 1711 (2002).
7. J. J. Hopfield, Proc. Natl. Acad. Sci. U.S.A. 79, 2554 (1982).
8. See supporting material on Science Online.
9. G. E. Hinton, S. Osindero, Y. W. Teh, Neural Comput. 18, 1527 (2006).
10. M. Welling, M. Rosen-Zvi, G. Hinton, Advances in Neural Information Processing Systems 17 (MIT Press, Cambridge, MA, 2005), pp. 1481–1488.
11. The MNIST data set is available at http://yann.lecun.com/exdb/mnist/index.html.
12. The Olivetti face data set is available at www.cs.toronto.edu/~roweis/data.html.
13. The Reuter Corpus Volume 2 is available at http://trec.nist.gov/data/reuters/reuters.html.
14. S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, R. A. Harshman, J. Am. Soc. Inf. Sci. 41, 391 (1990).
15. S. T. Roweis, L. K. Saul, Science 290, 2323 (2000).
16. J. A. Tenenbaum, V. J. de Silva, J. C. Langford, Science 290, 2319 (2000).
17. We thank D. Rumelhart, M. Welling, S. Osindero, and S. Roweis for helpful discussions, and the Natural Sciences and Engineering Research Council of Canada for funding. G.E.H. is a fellow of the Canadian Institute for Advanced Research.

Supporting Online Material
www.sciencemag.org/cgi/content/full/313/5786/504/DC1
Materials and Methods
Figs. S1 to S5
Matlab Code

20 March 2006; accepted 1 June 2006
10.1126/science.1127647
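
As a worked illustration of the pixel-wise cross-entropy error that the text describes for fine-tuning the curves autoencoder, the short Python sketch below evaluates it for a target image and a reconstruction, both given as lists of intensities between 0 and 1. The function name and the small clipping constant `eps`, added only to avoid taking the logarithm of zero, are assumptions and do not come from the paper.

```python
import math


def cross_entropy_error(p, p_hat, eps=1e-12):
    """Return -sum_i p_i log(p_hat_i) - sum_i (1 - p_i) log(1 - p_hat_i).

    p     : target pixel intensities in [0, 1]
    p_hat : reconstructed pixel intensities in [0, 1]
    eps   : clipping constant (an assumption) that keeps the logs finite
    """
    total = 0.0
    for target, recon in zip(p, p_hat):
        recon = min(max(recon, eps), 1.0 - eps)
        total -= target * math.log(recon)
        total -= (1.0 - target) * math.log(1.0 - recon)
    return total


target = [1.0, 0.0]
uninformative = cross_entropy_error(target, [0.5, 0.5])  # equals 2 * ln 2
close = cross_entropy_error(target, [0.9, 0.1])          # smaller error
```

A reconstruction that is closer to the target gives a smaller error, which is the quantity that backpropagation reduces during fine-tuning.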


