Improved protein structure prediction using potentials from deep learning

Compute Setup

The queued Nature PDF was not directly downloadable, but OpenAlex located a green OA UCL Discovery PDF, which was downloaded and extracted. The accessible text does not list the neural-network training hardware. Under the project rule, the training setup is therefore inferred only at the era level: 2018-2019 DeepMind datacenter accelerator infrastructure for deep residual neural networks. The local source does not support an exact TPU/GPU type or count.

The source does disclose one important compute component: structure realization is CPU-based and parallelizable. Extended Data Fig. 4 says computation is measured as the product of the number of CPU-based machines and elapsed time, and that the work can be largely parallelized. That means the AlphaFold1 pipeline is not a single monolithic accelerator job. It has MSA/template feature extraction, a learned neural-network stage, potential construction, and then repeated gradient-descent structure optimization that can be spread over machines.
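The "machines times elapsed time" accounting above can be illustrated with a small sketch. It assumes that every repeat takes the same time and that each machine runs one repeat at a time; the function name and all numbers are hypothetical and are not taken from the paper.

```python
import math


def realization_cost(n_repeats, minutes_per_repeat, n_machines):
    """Return (elapsed_minutes, machine_minutes) for independent repeats.

    Assumes equal-length repeats and one repeat per machine at a time;
    both are simplifications made for illustration only.
    """
    # Repeats run in waves of at most n_machines at once.
    waves = math.ceil(n_repeats / n_machines)
    elapsed = waves * minutes_per_repeat
    # Compute is charged as machines x elapsed time, idle machines included.
    machine_minutes = n_machines * elapsed
    return elapsed, machine_minutes


if __name__ == "__main__":
    for machines in (1, 10, 100):
        print(machines, realization_cost(100, 3.0, machines))
```

When the number of repeats divides evenly across machines, adding machines shortens elapsed time while the machine-time product stays constant, which is the sense in which the work parallelizes.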

The local source also identifies the data/software inputs: PDB 2018-03-15, CATH 2018-03-16, Uniclust30 2017-10, and PSI-BLAST nr as of 2017-12-15, plus HHblits and Rosetta. The paper makes neural-network weights and CASP13 inputs available for research use, but it does not disclose the training device details.

Bottleneck

Protein structure prediction needs accurate geometry from limited homologous sequence evidence. Classical fragment assembly and sampling procedures are expensive because they search a huge conformation space. Contact predictions reduce that search, but binary contact maps discard distance detail and can still require complex sampling to produce a full structure.

AlphaFold1's bottleneck is therefore twofold. The learned model must extract pairwise geometric constraints from MSAs even when few homologous sequences exist, and the realization stage must turn those constraints into coordinates without an unbounded sampling budget. The paper emphasizes that distance distributions convey more information than contact predictions and that the resulting potential can be optimized by simple gradient descent rather than complex sampling.

There is also a data-pipeline bottleneck. Feature extraction constructs MSAs through sequence database search and computes MSA-based features. The quality of these features depends on homologous sequence depth; the extended data reports that effective number of sequences correlates with distogram accuracy. Long targets also cost more in realization time.
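To make "effective number of sequences" concrete, here is a minimal sketch of one common convention: each aligned sequence is down-weighted by the number of sequences that are near-duplicates of it. The 0.62 identity threshold, the gap handling, and the function names are assumptions for illustration; the accessible text does not spell out the exact definition used.

```python
def sequence_identity(a, b):
    """Fraction of matching columns between two aligned rows, ignoring gaps."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def effective_sequences(msa, threshold=0.62):
    """Sum of 1 / (size of each row's identity neighbourhood).

    `threshold` is an assumed redundancy cutoff, not a value from the source.
    """
    total = 0.0
    for i, row in enumerate(msa):
        # Count the row itself plus every other row at or above the cutoff.
        neighbours = 1 + sum(
            sequence_identity(row, other) >= threshold
            for j, other in enumerate(msa)
            if j != i
        )
        total += 1.0 / neighbours
    return total


if __name__ == "__main__":
    toy_msa = ["ACDE", "ACDE", "ACDF", "WYWY"]  # made-up alignment
    print(effective_sequences(toy_msa))
```

Three near-identical rows count roughly as one sequence under this weighting, so a deep but redundant alignment can still have a low effective depth.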

Method Adaptation

AlphaFold adapts protein folding to available compute by splitting the task into dense learned inference and parallelizable continuous optimization. A deep residual convolutional network predicts distributions over inter-residue distances, or distograms, rather than only contacts. The extended figure describes residual blocks with dilated convolution applied after reducing activation dimension, and skip connections that let gradients pass through very deep networks.
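The two architectural ideas named above, dilated convolution and skip connections, can be sketched in a deliberately reduced form. This is a one-dimensional, single-channel toy, whereas the network in the paper operates on two-dimensional pair features with many channels. The ELU nonlinearity, the kernel size, the dilation cycle in the usage lines, and every function name are assumptions for illustration; the channel-reduction step is omitted.

```python
import math


def elu(x):
    """Exponential linear unit (an assumed choice of nonlinearity)."""
    return x if x > 0 else math.exp(x) - 1.0


def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded 1D cross-correlation with taps spaced `dilation` apart."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + (k - half) * dilation
            if 0 <= j < len(signal):  # positions outside the signal count as zero
                acc += weight * signal[j]
        out.append(acc)
    return out


def residual_block(signal, kernel, dilation):
    """y = x + conv(elu(x)); the skip path carries x through unchanged."""
    activated = [elu(v) for v in signal]
    update = dilated_conv1d(activated, kernel, dilation)
    return [x + u for x, u in zip(signal, update)]


def receptive_field(kernel_size, dilations):
    """Input span seen by one output after stacking the given dilations."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


if __name__ == "__main__":
    x = [0.0, 1.0, -1.0, 2.0, 0.5, -0.5, 1.5, 0.0]
    for d in (1, 2, 4, 8):  # assumed dilation cycle
        x = residual_block(x, [0.1, 0.2, 0.1], d)
    print(x)
    print(receptive_field(3, [1, 2, 4, 8]))
```

Stacking blocks with growing dilation widens the receptive field quickly without adding parameters per block, and the additive skip path means a block with a zero kernel reduces to the identity.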

The predicted distograms are converted into statistical potentials. The supplement gives the distance potential as a sum over residue pairs of negative log likelihoods, with a reference-state variant, plus torsion and smoothness terms. The paper then optimizes this potential by gradient descent to generate structures. This is a compute adaptation: replace broad stochastic fragment search with differentiable optimization over a learned potential, then run multiple restarts or repeats in parallel when more CPU-machine time is available.
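The idea of minimizing a negative-log-likelihood distance potential by gradient descent with restarts can be sketched on a toy problem. The sketch below makes several assumptions that are not from the source: points live in 2D Cartesian space, each pair's predicted distance distribution is a Gaussian (so its negative log likelihood is a squared error up to a constant), and the step size, step count, and function names are arbitrary. The torsion and smoothness terms and the reference-state variant are left out.

```python
import math
import random


def target_distances(coords):
    """Pairwise distance table, used here as the 'predicted' distance means."""
    return [[math.hypot(a[0] - b[0], a[1] - b[1]) for b in coords] for a in coords]


def potential(coords, mu, sigma=1.0):
    """Sum over pairs of Gaussian negative log likelihoods (constants dropped)."""
    total = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = math.hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1])
            total += (d - mu[i][j]) ** 2 / (2.0 * sigma ** 2)
    return total


def gradient(coords, mu, sigma=1.0):
    """Analytic gradient of `potential` with respect to every coordinate."""
    grad = [[0.0, 0.0] for _ in coords]
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            dx = coords[i][0] - coords[j][0]
            dy = coords[i][1] - coords[j][1]
            d = math.hypot(dx, dy) or 1e-9  # avoid dividing by zero
            g = (d - mu[i][j]) / (sigma ** 2 * d)
            grad[i][0] += g * dx
            grad[i][1] += g * dy
            grad[j][0] -= g * dx
            grad[j][1] -= g * dy
    return grad


def descend(start, mu, steps=500, lr=0.05):
    """Plain fixed-step gradient descent from one starting configuration."""
    coords = [list(p) for p in start]
    for _ in range(steps):
        for p, g in zip(coords, gradient(coords, mu)):
            p[0] -= lr * g[0]
            p[1] -= lr * g[1]
    return coords


def best_of_restarts(mu, n_points, restarts=8, seed=0):
    """Run independent descents and keep the lowest-potential result."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        start = [[rng.uniform(-5, 5), rng.uniform(-5, 5)] for _ in range(n_points)]
        final = descend(start, mu)
        score = potential(final, mu)
        if best is None or score < best[0]:
            best = (score, final)
    return best


if __name__ == "__main__":
    mu = target_distances([[0, 0], [3, 0], [3, 3], [0, 3]])
    score, coords = best_of_restarts(mu, n_points=4)
    print(score)
```

Each restart is independent, so the loop in `best_of_restarts` is the part that would be farmed out across machines; more restarts spend more compute for a better chance of a low-potential structure.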

The method also preserves domain-specific pipeline pieces where they are compute-effective. HHblits builds MSAs, Rosetta supplies relaxation and potential terms, and T0999, a 1,589-residue target, was manually segmented based on HHpred homology matching. This is not yet AlphaFold2-style end-to-end coordinate generation; it is a hybrid system designed around existing biological search tools, residual CNN prediction, and CPU-parallel realization.

Evidence

The main benchmark evidence is CASP13. AlphaFold creates high-accuracy structures, defined in the abstract as TM score 0.7 or higher, for 24 of 43 free-modelling domains. The next best method, using sampling and contact information, reaches that threshold for 14 of 43 domains. The extended data also compares the five AlphaFold CASP13 submissions, simulated annealing with fragment assembly, and gradient-descent submissions. Gradient descent was used only for targets T0975 and later during CASP13, and the paper includes back-fill gradient-descent runs for earlier targets using the deployed system.
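For readers unfamiliar with the metric behind the 0.7 threshold, the sketch below evaluates the standard TM-score formula for one fixed superposition. The real score is the maximum over superpositions, which this sketch does not search for. The `d0` floor for short targets, the function names, and the toy inputs are assumptions for illustration.

```python
def tm_d0(target_length):
    """Length-dependent distance scale from the standard TM-score definition."""
    if target_length <= 21:
        return 0.5  # assumed floor for short chains in this sketch
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, target_length):
    """TM-score for one fixed superposition.

    `distances` holds the distance in angstroms between each aligned
    residue pair; unaligned residues simply contribute nothing.
    """
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


def high_accuracy_count(scores, threshold=0.7):
    """Number of domains at or above the high-accuracy threshold."""
    return sum(s >= threshold for s in scores)


if __name__ == "__main__":
    print(tm_score([1.0] * 100, 100))
    # Reported counts from the text: 24 of 43 versus 14 of 43 domains.
    print(24 / 43, 14 / 43)
```

Because every residue's contribution is bounded by 1 and divided by the target length, a few badly placed residues lower the score only slightly, which is why a threshold such as 0.7 reflects a correct overall fold rather than per-atom precision.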

The paper ties this quality to compute choices. Extended Data Fig. 3 says distogram accuracy predicts realized structure lDDT and TM score, and reports a test set of 377 domains for the structure-accuracy analysis. It also says the number of effective sequences in the MSA correlates with distogram accuracy, with r=0.634. Extended Data Fig. 4 states that TM score increases with the number of repeats of gradient descent and that structure realization has a modest, parallelizable CPU-machine budget.
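The correlation figure quoted above is a Pearson coefficient, which can be computed as in the following minimal sketch. The input values in the usage lines are made-up placeholders, not the paper's data, and the function name is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder values: alignment depth versus a distogram accuracy measure.
    depth = [1.2, 2.5, 3.1, 4.0, 5.3]
    accuracy = [0.31, 0.52, 0.47, 0.66, 0.71]
    print(pearson_r(depth, accuracy))
```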

The evidence therefore supports the compute thesis: learned distance distributions make the downstream optimization landscape useful enough that repeated gradient descent can replace heavier sampling for many targets.

Historical Effect

AlphaFold1 showed that learned distance potentials plus differentiable optimization could outperform traditional protein-structure pipelines. Historically, it marks a transition from contact-assisted search toward learned geometric potentials. Its compute structure is especially important: neural networks predict rich constraints once, then a parallel CPU-machine realization process spends variable compute per target.

This setup prepared the ground for AlphaFold2, but it is not the same compute regime. AlphaFold1 still relies on MSAs, template/search tools, handcrafted potential terms, and gradient-based folding outside the network. Its historical effect is to show that deep learned geometry can overcome the old sampling bottleneck before fully end-to-end coordinate prediction arrives.

Limits

Training hardware is not disclosed in the accessible text. Only the CPU-based, parallelizable nature of structure realization is stated directly. The method still depends on MSA/template pipelines, HHblits/HHpred-style search, Rosetta-related terms, hand-designed potentials, and gradient-based folding rather than a fully end-to-end structure module.

The compute budget is also target-dependent. Longer proteins take longer to optimize, and T0999 required manual segmentation. The system gains from repeated gradient descent, but that is an inference-time search budget, not a fixed single forward pass. Finally, the local source supports CASP13 and extended-data evidence, but not exact training-chip counts, wall-clock training time, batch sizes, or utilization.

Links

  • Primary source: UCL Discovery PDF, extracted text
  • Alternate source resolution: OpenAlex located the UCL green OA PDF after the queued Nature PDF endpoint failed.
  • Supplementary: Nature supplementary information
  • Compute regime: history/compute_regimes/search_simulation_science_compute/README.md
  • Queue status: read_complete.