Big Bird: Transformers for Longer Sequences

Compute Setup

BigBird is unusually explicit about the accelerator envelope. In the main experiments, the authors note that the models were trained on "16GB memory/chip" with batch sizes of 32-64, and attribute that feasibility to blocking plus the sparse attention structure. Appendix E gives the exact TPU slices. Base MLM pretraining for BigBird-ITC and BigBird-ETC uses max sequence length 4096, batch size 256, 12 layers, 12 heads, hidden size 768, and compute resources of 8 x 8 TPUv3. The large version keeps 4096 tokens, expands to 24 layers, 16 heads, hidden size 1024, and uses a larger pretraining batch size of 2048. The QA fine-tuning tables list 4 x 2, 4 x 4, or 4 x 8 TPUv3 slices depending on task and model size; summarization uses 4 x 4 TPUv3 for the base BigBird-RoBERTa model and 4 x 8 TPUv3 for the large BigBird-Pegasus model. Genomics pretraining and downstream runs also list 8 x 8 TPUv3.

Bottleneck

The paper frames full self-attention as a compute and memory wall rather than just a modeling inconvenience. Because the cost of dense attention grows quadratically with sequence length, BERT-style models are limited in practice to roughly 512 tokens, while the long-document workloads carry far more evidence than that. The QA appendix lists median and maximum instance lengths of 3258 and 77962 tokens for Natural Questions, 4900 and 32755 for TriviaQA, and 1541 and 20337 for WikiHop. In summarization, the authors point out that outputs are short, around a 200-token median, while inputs often have median length above 3000. The bottleneck is therefore concentrated on the encoder side: reading evidence or source documents costs far more than generating the output.
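To make the quadratic wall concrete, the sketch below counts query-key score entries for a single attention head at a few of the sequence lengths mentioned above. The function names and the 4-byte (float32) score size are assumptions for illustration, not figures from the paper.

```python
def dense_attention_scores(seq_len: int) -> int:
    """Query-key score entries for one head of full self-attention."""
    return seq_len * seq_len


def score_memory_mib(seq_len: int, bytes_per_score: int = 4) -> float:
    """Memory for one head's score matrix, assuming float32 scores (an assumption)."""
    return dense_attention_scores(seq_len) * bytes_per_score / 2**20


# 512 is the BERT-style limit, 4096 the BigBird training length,
# 32755 the maximum TriviaQA instance length quoted above.
for n in (512, 4096, 32755):
    print(f"{n:>6} tokens: {dense_attention_scores(n):>13,} scores, "
          f"{score_memory_mib(n):8.1f} MiB per head")

# An 8x longer sequence needs 64x the score entries under dense attention.
print(dense_attention_scores(4096) // dense_attention_scores(512))  # 64
```

These counts are per head, per layer, and per example, so the real footprint is the product of this figure with all three.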

The second bottleneck is device structure. A mathematically sparse adjacency matrix is not automatically fast on GPUs or TPUs. The implementation appendix says hardware accelerators "shine on coalesced memory operations" and that small sporadic lookups from sliding windows or random element queries are inefficient. The paper also states that sparse matrix multiplications are not efficiently implemented on GPUs in this setting, because thousands of parallel cores need regular memory access and dense tensor operations to stay occupied.

Method Adaptation

BigBird adapts Transformer attention around both asymptotic sparsity and accelerator-friendly layout. The attention pattern combines a small set of global tokens, local window attention, and random attention. The global tokens keep a route for whole-sequence communication and are also used in the theory sections to preserve universal approximation and Turing-completeness properties. The local window keeps nearby context cheap. Random edges provide graph connectivity without materializing all pairwise token interactions. For a fixed block size and fixed window, random, and global counts, each query block attends to a constant number of key blocks, so attention cost grows linearly with sequence length instead of quadratically.
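The global, window, and random components can be sketched at block granularity as follows. This is a minimal illustration, not the authors' code: the function name, parameter names, and default counts are hypothetical, and it only records which key blocks each query block attends to.

```python
import random


def bigbird_block_pattern(num_blocks, num_global=1, window=3, num_random=2, seed=0):
    """For each query block, return the sorted key blocks it attends to."""
    rng = random.Random(seed)
    half = window // 2
    global_blocks = set(range(num_global))  # assumption: first blocks are global
    pattern = []
    for q in range(num_blocks):
        if q in global_blocks:
            # Global query blocks attend to every key block.
            pattern.append(list(range(num_blocks)))
            continue
        # Every block attends to the global blocks ...
        attended = set(global_blocks)
        # ... to its local window (clipped at the sequence edges) ...
        attended.update(k for k in range(q - half, q + half + 1)
                        if 0 <= k < num_blocks)
        # ... and to a few random blocks not already covered.
        candidates = [k for k in range(num_blocks) if k not in attended]
        attended.update(rng.sample(candidates, min(num_random, len(candidates))))
        pattern.append(sorted(attended))
    return pattern


# Attended block pairs grow linearly with num_blocks; dense pairs grow quadratically.
for n_blocks in (64, 128, 256):
    sparse_pairs = sum(len(row) for row in bigbird_block_pattern(n_blocks))
    print(n_blocks, sparse_pairs, n_blocks * n_blocks)
```

With these defaults every non-global query block attends to at most 1 + 3 + 2 = 6 key blocks regardless of sequence length, which is where the linear scaling comes from.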

The hardware adaptation is the more important compute-device point. Instead of using arbitrary sparse operations, BigBird "blockifies" queries and keys. Query and key tensors are reshaped into block tensors, local attention is computed by rolling copied key-block tensors, global blocks are concatenated, and only the small random component needs gathers. The final packed key tensor has shape proportional to the number of query blocks times the small attended block budget, so the attention scores are produced by dense tensor multiplication. This is exactly the kind of operation TPU/GPU matrix units handle efficiently. In other words, BigBird does not merely remove entries from the attention matrix; it repacks the remaining entries so sparsity can be batched.
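A minimal NumPy sketch of this packing idea is shown below, under several stated simplifications: one head and no batch dimension, a window of three blocks that wraps around at the sequence boundaries, no deduplication when a key block is selected by more than one component, and no special full-attention rows for the global query blocks. The function and argument names are hypothetical, and this is not the paper's implementation.

```python
import numpy as np


def blockified_attention(q, k, v, block_size, num_global, rand_idx):
    """Block-sparse attention via dense matmuls on packed key/value blocks.

    q, k, v: arrays of shape (n, d), with n divisible by block_size.
    rand_idx: integer array of shape (num_blocks, r) naming the random
              key blocks for each query block (hypothetical input format).
    """
    n, d = q.shape
    m = n // block_size
    qb = q.reshape(m, block_size, d)
    kb = k.reshape(m, block_size, d)
    vb = v.reshape(m, block_size, d)
    r = rand_idx.shape[1]

    def pack(xb):
        # Local window: rolled copies line each block up with its left
        # neighbour, itself, and its right neighbour (circular at the edges).
        local = [np.roll(xb, shift, axis=0) for shift in (1, 0, -1)]
        # Global: the first num_global blocks, shared by every query block.
        glob = np.broadcast_to(
            xb[:num_global].reshape(1, num_global * block_size, d),
            (m, num_global * block_size, d),
        )
        # Random: the only gather, and it fetches whole blocks, not tokens.
        rand = xb[rand_idx].reshape(m, r * block_size, d)
        return np.concatenate([glob, *local, rand], axis=1)

    packed_k, packed_v = pack(kb), pack(vb)
    # Dense batched matmul: (m, b, d) x (m, budget * b, d) -> (m, b, budget * b).
    scores = np.einsum("mqd,mkd->mqk", qb, packed_k) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return np.einsum("mqk,mkd->mqd", probs, packed_v).reshape(n, d)


rng = np.random.default_rng(0)
n, d, block = 32, 4, 4
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
rand_idx = rng.integers(0, n // block, size=(n // block, 2))
out = blockified_attention(q, k, v, block, num_global=1, rand_idx=rand_idx)
print(out.shape)  # (32, 4)
```

The packed key tensor has shape (num_blocks, (num_global + 3 + r) * block_size, d), so the score computation is one regular batched matrix product whose size does not depend on which blocks were chosen.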

The encoder-decoder design also follows the compute profile. For summarization, the paper uses sparse BigBird attention only on the encoder and leaves full attention in the decoder, because output sequences are much shorter than input sequences. That choice spends the sparse-attention engineering where it buys the most memory and bandwidth relief.
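A rough count of attention score entries shows why the encoder is the place to spend the sparsity. The sketch assumes a 3000-token input and a 200-token output, in line with the medians quoted earlier, and treats all attention as dense; the function name is hypothetical, and decoder self-attention is counted without its causal mask, so that term is an upper bound.

```python
def attention_score_counts(src_len: int, tgt_len: int) -> dict:
    """Dense score entries per head for each attention type in an encoder-decoder."""
    return {
        "encoder_self": src_len * src_len,  # every source token to every source token
        "decoder_self": tgt_len * tgt_len,  # upper bound; causal masking roughly halves it
        "cross": tgt_len * src_len,         # every target token to every source token
    }


counts = attention_score_counts(src_len=3000, tgt_len=200)
total = sum(counts.values())
for name, value in counts.items():
    print(f"{name:13s} {value:>10,}  {100 * value / total:5.1f}% of score entries")
```

Under these assumptions encoder self-attention accounts for more than nine tenths of the score entries, so sparsifying the decoder would change little.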

Evidence

The scale evidence is concrete. MLM pretraining uses documents split or packed to 4096 tokens. Held-out bits per character (BPC, lower is better) improves from 1.846 (base) and 1.496 (large) for RoBERTa at sequence length 512 to 1.611 (base) and 1.274 (large) for BigBird-ETC at sequence length 4096; Longformer at the same length is reported as 1.705 (base) and 1.358 (large). Those numbers are not just accuracy claims: they show that an 8x longer context can be trained on the listed 16 GB/chip TPUv3 setup.
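The BPC figures are easier to compare as relative reductions. The short calculation below uses only the numbers quoted above; the dictionary layout and function name are illustrative choices.

```python
# Held-out BPC as quoted in the text (lower is better).
bpc = {
    "RoBERTa (512)": {"base": 1.846, "large": 1.496},
    "Longformer (4096)": {"base": 1.705, "large": 1.358},
    "BigBird-ETC (4096)": {"base": 1.611, "large": 1.274},
}


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


for size in ("base", "large"):
    for baseline in ("RoBERTa (512)", "Longformer (4096)"):
        drop = relative_reduction(bpc[baseline][size], bpc["BigBird-ETC (4096)"][size])
        print(f"{size:5s} BigBird-ETC vs {baseline}: {100 * drop:.1f}% lower BPC")
```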

Downstream tasks show the value of spending the compute on longer inputs. In document classification, BigBird reports Arxiv F1 of 92.31 versus 87.42 for RoBERTa and 87.96 for the prior state of the art; the same table shows smaller gains on IMDb, where only a small fraction of documents exceed 512 tokens. In long-document summarization, BigBird-Pegasus reaches ROUGE-1/2/L of 46.63/19.02/41.77 on Arxiv, 46.32/20.65/42.33 on PubMed, and 60.64/42.46/50.01 on BigPatent. The genomics experiments make the same compute point in another domain: DNA MLM BPC improves from BERT at 512 tokens, 1.23, to BigBird at 4096 tokens, 1.12; promoter prediction reaches F1 99.9, and chromatin histone-mark AUC improves from 85.6 for DeepSea to 88.7.

Historical Effect

BigBird made sparse Transformer attention look like a practical TPU-era scaling path rather than a purely theoretical trick. Its lasting compute lesson is that long context needs both a graph pattern and a memory layout. The random/local/global pattern addressed connectivity and expressivity; the block implementation addressed the actual TPU/GPU execution model. The paper also helped normalize the idea that long-context models should be evaluated on workloads whose inputs really exceed 512 tokens, including QA, long-document classification, summarization, and genomics.

Limits

The main limit is that BigBird is not a universal free lunch over dense attention. The sparse pattern is more complex to implement and tune than ordinary attention, and task-specific global-token choices matter. Several best results use carefully structured global tokens or the ETC variant, not just a generic random sparse mask. The experiments mostly cap at 4096 tokens, so documents with tens of thousands of tokens still require truncation, splitting, or sliding-window handling. Finally, the compute tables rely on sizable TPUv3 slices; the method reduces memory growth, but the reported training and fine-tuning setup is still an industrial accelerator setup.
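For inputs longer than the 4096-token cap, one common workaround is to split the token sequence into overlapping windows. The sketch below is a generic illustration of that idea, not a procedure taken from the paper; the function name and the 2048-token stride are assumptions.

```python
def sliding_windows(token_ids, max_len=4096, stride=2048):
    """Split a token sequence into overlapping windows of at most max_len tokens."""
    if not 0 < stride <= max_len:
        raise ValueError("stride must be in (0, max_len]")
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the document
        start += stride
    return windows


doc = list(range(10_000))  # stand-in for a tokenized document
windows = sliding_windows(doc)
print(len(windows), [len(w) for w in windows])
```

Each window is then processed independently, so evidence that spans two windows can be seen together only if it falls inside their overlap.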
