Transformers
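English source file: transformer.md
Compute interpretation
A dense attention architecture, naturally suited to accelerator matrix multiplication, batching, and sequence pretraining.
Supporting reading cards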
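To make the matrix-multiplication point concrete, here is a minimal sketch of dense (scaled dot-product) attention in plain Python. It is not taken from any of the reading cards. The helper names (`matmul`, `softmax_rows`, `dense_attention`, `batched_attention`) are hypothetical and chosen only for illustration, and nested lists stand in for accelerator tensors. Under those assumptions, the sketch shows that the whole operation reduces to two matrix products plus a row-wise softmax, and that sequences in a batch are processed independently.

```python
import math

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Numerically stable softmax applied to each row independently."""
    out = []
    for row in m:
        peak = max(row)  # subtract the row maximum to avoid overflow in exp
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def dense_attention(q, k, v):
    """Attention for one sequence: softmax(Q K^T / sqrt(d)) V.

    q is (n x d), k is (n x d), v is (n x d_v). Every query attends to
    every key, which is what makes the attention "dense".
    """
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]             # transpose K to (d x n)
    scores = matmul(q, k_t)                          # first matrix product, (n x n)
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = softmax_rows(scaled)                   # each row sums to 1
    return matmul(weights, v), weights               # second matrix product

def batched_attention(batch_q, batch_k, batch_v):
    """Apply the same attention to each sequence in a batch (illustrative only)."""
    return [dense_attention(q, k, v)[0] for q, k, v in zip(batch_q, batch_k, batch_v)]
```

For example, `dense_attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 4.0], [6.0, 8.0]])` gives uniform attention weights, because a zero query scores every key equally, so the output row is the average of the value rows. On real accelerators the per-sequence loop in `batched_attention` would be replaced by a single batched matrix multiply; the loop is kept here only to keep the sketch dependency-free.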
- Attention Is All You Need (2017, tpu_accelerator_transformer_era)
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018, tpu_accelerator_transformer_era)
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (2019, tpu_accelerator_transformer_era)
- ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (2019, tpu_accelerator_transformer_era)
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (2019, tpu_accelerator_transformer_era)
- RoBERTa: A Robustly Optimized BERT Pretraining Approach (2019, tpu_accelerator_transformer_era)
- XLNet: Generalized Autoregressive Pretraining for Language Understanding (2019, tpu_accelerator_transformer_era)
- What Does BERT Look at? An Analysis of BERT's Attention (2019, tpu_accelerator_transformer_era)
- Language Models are Unsupervised Multitask Learners (2019, hyperscale_dense_llm_training)
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (2019, hyperscale_dense_llm_training)
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (2019, efficient_edge_inference)
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020, tpu_accelerator_transformer_era)
- Language Models are Few-Shot Learners (2020, hyperscale_dense_llm_training)
- Big Bird: Transformers for Longer Sequences (2020, sparse_memory_efficient_scaling)
- Linformer: Self-Attention with Linear Complexity (2020, sparse_memory_efficient_scaling)
- Longformer: The Long-Document Transformer (2020, sparse_memory_efficient_scaling)
- Reformer: The Efficient Transformer (2020, sparse_memory_efficient_scaling)
- Zero-Shot Text-to-Image Generation (2021, generative_media_compute)
- Training Compute-Optimal Large Language Models (2022, hyperscale_dense_llm_training)
- PaLM: Scaling Language Modeling with Pathways (2022, hyperscale_dense_llm_training)
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (2022, sparse_memory_efficient_scaling)
- Scalable Diffusion Models with Transformers (2022, generative_media_compute)
- Gemini: A Family of Highly Capable Multimodal Models (2023, hyperscale_dense_llm_training)
- Mistral 7B (2023, hyperscale_dense_llm_training)
- Textbooks Are All You Need (2023, hyperscale_dense_llm_training)
- A Survey of Large Language Models (2023, hyperscale_dense_llm_training)
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (2023, sparse_memory_efficient_scaling)
- The Llama 3 Herd of Models (2024, hyperscale_dense_llm_training)
- Mixtral of Experts (2024, sparse_memory_efficient_scaling)
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (2024, sparse_memory_efficient_scaling)
- Gemma 3 Technical Report (2025, efficient_edge_inference)
- BitNet b1.58 2B4T Technical Report (2025, efficient_edge_inference)
- Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2 (2025, search_simulation_science_compute)
- Kimi K2: Open Agentic Intelligence (2025, sparse_memory_efficient_scaling)
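What became obsolete or secondary under later compute paradigms
Tracked only through the linked reading cards; this method page is not treated as an independent source of evidence.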