Highly accurate protein structure prediction with AlphaFold

Metadata

Reading status: read complete
Year: 2021
Compute regime: Search, simulation, and science compute (search_simulation_science_compute)
PDF: 2021-alphafold2_2021.pdf
Extracted text: 2021-alphafold2_2021.txt
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper explicitly reports both training and inference hardware. Training runs on 128 Tensor Processing Unit (TPU) v3 cores with a batch size of 1 per core. Training examples are randomly cropped to 256 residues and assembled into batches of size 128. The model is trained to convergence on about 10 million samples, then fine-tuned with longer 384-residue crops, a larger MSA stack, and a reduced learning rate. The initial training stage takes approximately one week, and fine-tuning takes about four additional days.
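The training figures above imply a few back-of-envelope quantities. The sketch below derives the global batch size, an approximate step count, and TPU core-days. It assumes one optimizer step per global batch and that fine-tuning runs on the same 128 cores; neither assumption is stated in this card, and all variable names are illustrative.

```python
# Back-of-envelope training arithmetic from the figures reported above.
TPU_CORES = 128
BATCH_PER_CORE = 1
INITIAL_SAMPLES = 10_000_000   # "about 10 million samples"
INITIAL_DAYS = 7               # "approximately one week"
FINE_TUNE_DAYS = 4             # "about four additional days"

global_batch = TPU_CORES * BATCH_PER_CORE

# Assumption: one optimizer step per global batch.
initial_steps = INITIAL_SAMPLES / global_batch
seconds_per_step = INITIAL_DAYS * 24 * 3600 / initial_steps

initial_core_days = TPU_CORES * INITIAL_DAYS
# Assumption: fine-tuning uses the same 128 cores as initial training.
fine_tune_core_days = TPU_CORES * FINE_TUNE_DAYS
total_core_days = initial_core_days + fine_tune_core_days

print(f"global batch:        {global_batch}")
print(f"initial steps:       {initial_steps:,.0f}")
print(f"seconds per step:    {seconds_per_step:.2f}")
print(f"total TPU core-days: {total_core_days}")
```

These are rough estimates only; "approximately one week" and "about 10 million samples" carry enough uncertainty that the per-step time should not be read as a measured value.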

Inference timings are reported for a single model on an NVIDIA V100 GPU. With the CASP14 configuration and ensembling, representative neural-network times are 4.8 minutes for 256 residues, 9.2 minutes for 384 residues, and 18 hours for 2500 residues. Without ensembling, inference is about 8x faster: 0.6 minutes for 256 residues, 1.1 minutes for 384 residues, and 2.1 hours for 2500 residues. A 16 GB V100 handles proteins up to about 1300 residues without ensembling. Memory use is approximately quadratic in residue count, so a 2500-residue protein relies on unified memory: a single V100 does the computation, but four GPUs are requested to provide enough memory. MSA search and final relaxation add CPU time but do not require a GPU or TPU.
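The reported timings can be turned into two rough indicators: the ensembling speedup at each length, and an apparent scaling exponent between pairs of lengths. The sketch below computes both from the numbers quoted above. The function name `scaling_exponent` is illustrative, and fitting a power law to three points is a crude summary rather than a model of the runtime.

```python
import math

# (residues, minutes) for single-model V100 inference, as quoted above.
ENSEMBLED = [(256, 4.8), (384, 9.2), (2500, 18 * 60)]
NO_ENSEMBLE = [(256, 0.6), (384, 1.1), (2500, 2.1 * 60)]


def scaling_exponent(p1, p2):
    """Exponent k such that time ~ residues**k between two measurements."""
    (n1, t1), (n2, t2) = p1, p2
    return math.log(t2 / t1) / math.log(n2 / n1)


# Ratio of ensembled to non-ensembled time at each length.
speedups = [e[1] / n[1] for e, n in zip(ENSEMBLED, NO_ENSEMBLE)]

# Apparent exponents without ensembling, for short and long ranges.
short_k = scaling_exponent(NO_ENSEMBLE[0], NO_ENSEMBLE[1])
long_k = scaling_exponent(NO_ENSEMBLE[1], NO_ENSEMBLE[2])

print("ensembling speedups:", [round(s, 2) for s in speedups])
print(f"exponent 256 -> 384:  {short_k:.2f}")
print(f"exponent 384 -> 2500: {long_k:.2f}")
```

With these numbers the apparent exponent is steeper between 384 and 2500 residues than between 256 and 384. The 2500-residue run also uses unified memory, so the long-range figure mixes algorithmic scaling with memory-transfer overhead.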

Bottleneck

The scientific bottleneck is replacing slow experimental structure determination with prediction accurate enough to be useful. The compute bottleneck is more specific: AlphaFold2 must combine sequence database search, MSA processing, pairwise residue representations, iterative recycling, and geometry-aware structure prediction. The pair representation is Nres x Nres, so memory grows roughly quadratically with protein length, and compute grows at least as fast. That is why 2500-residue inference can take hours and require unified memory, while 256- and 384-residue cases take minutes.
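To make the quadratic growth concrete, the sketch below counts residue pairs and the bytes needed for one copy of an Nres x Nres activation tensor. The channel width of 128 and the float32 storage are assumptions that are not stated in this card, and a real forward pass holds many such tensors plus attention intermediates, so these figures are a lower bound per tensor and not a total memory estimate.

```python
BYTES_PER_FLOAT32 = 4
PAIR_CHANNELS = 128  # assumption: pair channel width, not stated in this card


def pair_tensor_bytes(n_res, channels=PAIR_CHANNELS,
                      bytes_per_value=BYTES_PER_FLOAT32):
    """Bytes for one Nres x Nres x channels activation tensor."""
    return n_res * n_res * channels * bytes_per_value


for n_res in (256, 384, 1300, 2500):
    mib = pair_tensor_bytes(n_res) / 2**20
    print(f"{n_res:>5} residues: {n_res * n_res:>9,} pairs, "
          f"{mib:,.0f} MiB per copy")
```

Doubling the residue count quadruples the size of every pair-shaped tensor, which is the mechanism behind the jump from minutes to hours described above.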

MSA and template search are also a bottleneck outside accelerators. Inputs include the primary sequence, MSA sequences built with tools such as jackhmmer and HHBlits, and template coordinates where available. The search processes are tuned for high recall, and the databases include UniRef90, BFD, Uniclust30, MGnify, and PDB/template sources. Preparing those inputs is CPU-heavy and IO-heavy, but the paper separates it from GPU/TPU neural inference.

Recycling adds a controlled serial compute loop. The network's trunk and structure module refine a structural hypothesis over repeated passes. For interpretability, the paper trains a separate structure module on each of the 48 Evoformer blocks; including recycling stages, this yields a trajectory of 192 intermediate structures. Recycling improves accuracy, but it means a prediction is not one feed-forward pass.
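The serial nature of recycling can be sketched as a loop that feeds each pass's output back into the next pass. This is a minimal illustration and not the paper's implementation: `run_with_recycling` and `toy_pass` are hypothetical names, the default of three recycles (four passes) is an assumption not stated in this card, and the toy pass simply halves the distance to a target number in place of a real trunk and structure module.

```python
def run_with_recycling(inputs, network_pass, num_recycles=3):
    """Run network_pass num_recycles + 1 times, feeding each output back in."""
    previous = None          # the first pass has no earlier prediction
    trajectory = []
    for _ in range(num_recycles + 1):
        previous = network_pass(inputs, previous)
        trajectory.append(previous)
    return previous, trajectory


def toy_pass(target, previous):
    """Hypothetical stand-in for trunk + structure module: halves the error."""
    start = 0.0 if previous is None else previous
    return start + 0.5 * (target - start)


final, trajectory = run_with_recycling(10.0, toy_pass)
print(trajectory)
```

Each pass depends on the previous one, so the passes cannot run in parallel for a single target; inference cost scales with the number of passes.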

Method Adaptation

AlphaFold2 is adapted to this compute structure by factorizing the problem into MSA, pair, and structure representations. The Evoformer operates on an Nseq x Nres MSA array and an Nres x Nres pair array. It includes mechanisms for communication between MSA columns, MSA rows, and residue-pair edges, including outer-product updates and triangle-style pair updates. This is expensive, but it matches the scientific prior that co-evolutionary signals and residue-residue geometry are central to folding.
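Two of the operations named above can be illustrated in stripped-down form. The sketch below uses a single scalar channel and omits the learned projections, gating, and normalization that the real Evoformer layers apply, so it shows only the index pattern: an outer-product mean that turns an Nseq x Nres array into an Nres x Nres array, and an "outgoing" triangle-style update in which edge (i, j) is refreshed from edges (i, k) and (j, k). Function names are illustrative.

```python
def outer_product_mean(msa):
    """Map an Nseq x Nres array to Nres x Nres by averaging outer products.

    Simplified: one scalar channel, no learned projections.
    Cost is O(Nseq * Nres**2).
    """
    n_seq, n_res = len(msa), len(msa[0])
    return [[sum(msa[s][i] * msa[s][j] for s in range(n_seq)) / n_seq
             for j in range(n_res)]
            for i in range(n_res)]


def triangle_update_outgoing(pair):
    """Update edge (i, j) from edges (i, k) and (j, k) over all k.

    Simplified: no gating or normalization. Cost is O(Nres**3).
    """
    n_res = len(pair)
    return [[sum(pair[i][k] * pair[j][k] for k in range(n_res))
             for j in range(n_res)]
            for i in range(n_res)]


toy_msa = [[1.0, 2.0],
           [3.0, 4.0]]          # 2 sequences x 2 residues
toy_pair = outer_product_mean(toy_msa)
toy_updated = triangle_update_outgoing(toy_pair)
print(toy_pair)
print(toy_updated)
```

The loop structure makes the cost visible: the outer-product step touches every residue pair once per sequence, and the triangle step touches every residue triple.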

The structure module uses invariant point attention to update residue frames and predict 3D coordinates directly. This avoids relying on a separate search over conformations as the primary solver. A final Amber relaxation step removes distracting stereochemical violations, but the neural network supplies the main structure. Confidence heads such as pLDDT and predicted TM-score are compute adaptations too: they let users decide whether to trust a prediction without running new experiments.

Training uses multiple data sources and auxiliary losses: FAPE, distogram prediction, masked MSA reconstruction, side-chain loss, and violation losses during fine-tuning. It also uses self-distillation, turning predictions for about 350,000 Uniclust30 sequences into additional training signal. CASP14 inference uses five trained models and confidence-based selection, while later non-ensembled inference is reported as 8x faster with little accuracy loss.
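Confidence-based selection across several models can be sketched as picking the model with the highest average per-residue confidence. Using mean pLDDT as the ranking score is an assumption made for illustration; the card does not specify the exact criterion. The function name and all pLDDT values below are hypothetical.

```python
from statistics import mean


def select_by_confidence(plddt_by_model):
    """Return (model_name, mean_plddt) for the highest mean per-residue pLDDT."""
    scores = {name: mean(values) for name, values in plddt_by_model.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical per-residue pLDDT values for five models on a toy 4-residue target.
example = {
    "model_1": [91.0, 88.0, 72.0, 65.0],
    "model_2": [93.0, 90.0, 80.0, 77.0],
    "model_3": [85.0, 85.0, 85.0, 81.0],
    "model_4": [70.0, 60.0, 50.0, 40.0],
    "model_5": [95.0, 94.0, 60.0, 55.0],
}

best_model, best_score = select_by_confidence(example)
print(best_model, best_score)
```

Running five models multiplies neural-network inference cost by roughly five, which is part of why ensembling and model-count choices matter for total compute.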

Evidence

CASP14 is the central benchmark evidence. On 87 CASP14 protein domains, AlphaFold achieves a median backbone accuracy of 0.96 Angstrom r.m.s.d.95 (C-alpha r.m.s.d. at 95% residue coverage), while the next best method achieves 2.8 Angstrom. All-atom accuracy is 1.5 Angstrom r.m.s.d.95 compared with 3.5 Angstrom for the best alternative. The paper emphasizes that the width of a carbon atom is about 1.4 Angstrom, making the median backbone error chemically meaningful.

The compute evidence is tied to length. A single-model V100 inference with ensembling costs 4.8 minutes at 256 residues and 9.2 minutes at 384 residues, but 18 hours at 2500 residues. Without ensembling, those become 0.6 minutes, 1.1 minutes, and 2.1 hours. The same section explains the memory cliff: a 16 GB V100 handles up to about 1300 residues without ensembling, while 2500 residues needs unified memory and extra requested GPUs for memory capacity.
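The memory cliff can be checked against the quadratic-memory statement with a simple extrapolation. The sketch below assumes that memory scales exactly as residue count squared and that a 1300-residue protein exactly fills a 16 GB V100; both are idealizations, and the function name is illustrative.

```python
import math

GPU_MEMORY_GB = 16          # single V100, as reported
MAX_RES_SINGLE_GPU = 1300   # approximate single-GPU limit without ensembling


def estimated_memory_gb(n_res):
    """Quadratic extrapolation anchored at the single-GPU limit."""
    return GPU_MEMORY_GB * (n_res / MAX_RES_SINGLE_GPU) ** 2


needed_gb = estimated_memory_gb(2500)
gpus_for_memory = math.ceil(needed_gb / GPU_MEMORY_GB)

print(f"estimated memory at 2500 residues: {needed_gb:.1f} GB")
print(f"16 GB devices needed for capacity: {gpus_for_memory}")
```

Under these assumptions the estimate for 2500 residues is roughly 59 GB, or four 16 GB devices, which is consistent with the four GPUs requested for memory in the reported setup.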

Database ablations show how much accuracy depends on MSA data. Removing BFD reduces mean CASP14 accuracy by 0.4 GDT, removing MGnify by 0.7 GDT, and removing both by 6.1 GDT; most targets change little, but some outliers lose more than 20 GDT. The MSA-depth analysis shows a threshold: below about 30 effective sequences, accuracy suffers, while gains above around 100 sequences are smaller.

Historical Effect

AlphaFold2 turned accelerator-trained neural structure prediction into scientific infrastructure. It did not merely improve an ML benchmark; it changed the practical cost curve for obtaining high-quality protein structures. Once trained, many predictions could be made in GPU minutes to GPU hours plus CPU database search, making proteome-scale prediction plausible in a way that experimental methods alone could not match.

For compute history, it is a key example of search/simulation/science workloads moving into neural architectures. AlphaFold2 uses massive sequence databases, TPU training, pairwise attention, recycling, and learned geometric modules to amortize structure inference into a trained model.

Limits

The method is still constrained by memory and input quality. Long proteins have quadratic pair-representation cost and can exceed single-GPU memory. Shallow MSAs hurt accuracy, and while templates can help, the system depends heavily on sequence databases and search pipelines. The paper also notes limitations around complexes, ligands, ions, stoichiometry, and dynamics; the model predicts static structures from sequence and homologous information rather than full biological context.

The reported training hardware is clear, but full system cost includes CPU database search, template search, storage/IO for large sequence databases, relaxation, ensembling choices, and self-distillation data generation. Those costs are not reducible to the 128 TPU v3 training statement alone.