Mistral 7B

Metadata

Compute Setup

The paper does not list the exact training hardware, total training tokens, optimizer, or batch size. It thanks CoreWeave for cluster help and CINECA/EuroHPC Leonardo for resources, but it does not name an accelerator model. Under the project rule, the hardware setting should therefore be inferred only at the era/toolchain level: a 2023 CUDA GPU training and serving environment in the A100/H100 generation, because the paper depends on FlashAttention, xFormers, vLLM, and cloud deployment paths on AWS, GCP, and Azure. The exact device count and training duration are unsupported by the local source and should not be filled in.

The paper is much more concrete about inference-time structure. Mistral 7B is a 7B-parameter Transformer with dimension 4,096, 32 layers, 32 attention heads, head dimension 128, hidden dimension 14,336, 8 key/value heads, vocabulary size 32,000, context length 8,192, and sliding-window size 4,096. The reference implementation is designed for local or cloud deployment using vLLM and SkyPilot. The architecture is therefore documented primarily as a serving-efficient 7B model rather than as a fully disclosed pretraining run.
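As a rough cross-check on the "7B" label, the listed dimensions can be turned into a parameter count. This is a minimal sketch, not the reference implementation: the names `MistralConfig` and `count_parameters` are hypothetical, and the count assumes a gated feed-forward block with three weight matrices, two norm weight vectors per layer plus a final norm, no bias terms, and untied input/output embeddings, none of which is stated in the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MistralConfig:
    """Architecture values as listed in the paper's parameter table."""
    dim: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    head_dim: int = 128
    hidden_dim: int = 14336
    n_kv_heads: int = 8
    vocab_size: int = 32000
    context_len: int = 8192
    window_size: int = 4096


def count_parameters(cfg: MistralConfig) -> int:
    """Estimate the parameter count under the assumptions named in the lead-in."""
    # Query and output projections use all 32 heads.
    q_and_o = 2 * cfg.dim * (cfg.n_heads * cfg.head_dim)
    # Key and value projections use only the 8 KV heads (grouped-query attention).
    k_and_v = 2 * cfg.dim * (cfg.n_kv_heads * cfg.head_dim)
    # Assumption: gated feed-forward block with three dim x hidden_dim matrices.
    mlp = 3 * cfg.dim * cfg.hidden_dim
    # Assumption: two norm weight vectors per layer, no biases anywhere.
    norms = 2 * cfg.dim
    per_layer = q_and_o + k_and_v + mlp + norms
    # Assumption: input embedding and output head are separate (untied) matrices.
    embeddings = 2 * cfg.vocab_size * cfg.dim
    final_norm = cfg.dim
    return cfg.n_layers * per_layer + embeddings + final_norm


total = count_parameters(MistralConfig())
print(f"estimated parameters: {total:,} (~{total / 1e9:.2f}B)")
```

Under these assumptions the estimate lands a little above 7.2 billion, consistent with the 7B label and with the roughly 14 GB BF16 weight footprint used elsewhere in this card.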

Bottleneck

Mistral's bottleneck is serving long-context dense Transformers under latency and memory constraints. During autoregressive decoding, key/value cache memory grows linearly with the number of prompt (prefilled) and generated tokens, and each new token must attend over the available prior context. The paper states that in vanilla attention the operation count is quadratic in sequence length and memory grows linearly with the number of tokens, and that at inference time this means higher latency and lower throughput because of reduced cache availability.

The practical bottleneck is not simply "7B is small." A small model must still keep enough quality to replace larger public baselines, and it must do so while maintaining throughput for real-time applications. The paper specifically links grouped-query attention to faster decoding, lower memory requirement, larger batch sizes, and higher throughput. Sliding-window attention targets the long-context side: it bounds local attention while allowing information to move forward across layers.

The inference-memory arithmetic supports that framing. BF16 weights for a 7B model are about 14 GB. Each cached token stores one key and one value vector for every KV head in every layer: 2 x 32 layers x 8 KV heads x 128 head dimension x 2 bytes = 128 KiB per token. With an 8192-token full-context cache, the batch-1 KV cache is about 1.1 GB (1 GiB); at batch 32 it is about 34 GB, larger than the weights. The rolling sliding-window cache is therefore not cosmetic: it caps the cache at the 4096-token window, halving the configured 8192-token cache and preventing unbounded growth on longer prompts.
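The arithmetic in this paragraph can be reproduced in a few lines. This is a sketch under stated assumptions: 2-byte (BF16) cache entries, one key and one value vector per KV head per layer, and no allocator or paging overhead. The function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(cached_tokens: int, batch_size: int = 1, n_layers: int = 32,
                   n_kv_heads: int = 8, head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    """KV-cache size in bytes for a given number of cached tokens per sequence."""
    # Factor 2 = one key vector plus one value vector per KV head per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * cached_tokens * batch_size


GB = 1e9  # decimal gigabytes, matching the figures quoted in the text

full_b1 = kv_cache_bytes(8192)                  # full 8192-token context, batch 1
full_b32 = kv_cache_bytes(8192, batch_size=32)  # same context, batch 32
window_b32 = kv_cache_bytes(4096, batch_size=32)  # cache capped at the 4096 window

print(f"batch 1, 8192 tokens:  {full_b1 / GB:.2f} GB")
print(f"batch 32, 8192 tokens: {full_b32 / GB:.2f} GB")
print(f"batch 32, 4096 window: {window_b32 / GB:.2f} GB")
```

Real serving stacks add overhead for block allocation and fragmentation, so these figures are lower bounds rather than measured footprints.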

Knowledge capacity remains a separate bottleneck, one of parameter capacity rather than serving cost. The authors' "equivalent model size" discussion says Mistral 7B behaves like a Llama 2 model more than 3x its size on reasoning, comprehension, and STEM benchmarks, but the compression rate drops to about 1.9x on knowledge benchmarks, which the paper attributes to the limited parameter count available for storing facts.

Method Adaptation

Mistral adapts the Transformer around inference IO. Grouped-query attention keeps 32 query heads but only 8 key/value heads. That reduces the amount of KV data that must be stored and read during decoding, which is exactly the pressure point for batched serving. It is a memory-bandwidth and cache-size optimization, not just a parameter-count reduction.
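The head sharing described above can be sketched as a simple index mapping. This is illustrative only: the assumption that consecutive query heads share one KV head (heads 0 to 3 use KV head 0, and so on) is a common convention, not something the text above specifies, and the function names are hypothetical.

```python
N_HEADS = 32      # query heads
N_KV_HEADS = 8    # key/value heads
HEAD_DIM = 128
GROUP_SIZE = N_HEADS // N_KV_HEADS  # query heads that share one KV head


def kv_head_for_query_head(query_head: int) -> int:
    """Map a query head to the KV head it reads (assumes contiguous grouping)."""
    if not 0 <= query_head < N_HEADS:
        raise ValueError(f"query_head must be in [0, {N_HEADS})")
    return query_head // GROUP_SIZE


def kv_values_per_token_per_layer(n_kv_heads: int) -> int:
    """Number of cached scalars per token per layer: keys plus values."""
    return 2 * n_kv_heads * HEAD_DIM


mha = kv_values_per_token_per_layer(N_HEADS)     # every query head has its own KV
gqa = kv_values_per_token_per_layer(N_KV_HEADS)  # 8 shared KV heads
print(f"group size: {GROUP_SIZE}")
print(f"KV values per token per layer: MHA={mha}, GQA={gqa}, ratio={mha // gqa}x")
```

With these dimensions, each decoding step stores and reads a quarter of the KV data that full multi-head attention would need, which is where the batch-size and throughput headroom comes from.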

Sliding-window attention sets a window size W = 4096. Each token attends to at most W tokens from the previous layer, but because the model has stacked layers, information can move forward by up to W tokens per layer. With 32 layers, the paper gives a theoretical attention span of about 131K tokens, while the configured context length remains 8192. In practice, this trades global all-to-all attention for bounded local attention plus layerwise propagation.
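A small sketch of the windowed causal mask and the span arithmetic may help. It assumes the window counts the W most recent positions including the current token; the exact boundary convention in the reference implementation may differ by one position. The function names are hypothetical.

```python
def can_attend(query_pos: int, key_pos: int, window: int) -> bool:
    """True if the query position may attend to the key position.

    Assumption: causal attention restricted to the `window` most recent
    positions, counting the query's own position.
    """
    return 0 <= query_pos - key_pos < window


def sliding_window_mask(seq_len: int, window: int) -> list:
    """Boolean mask[i][j]: may position i attend to position j?"""
    return [[can_attend(i, j, window) for j in range(seq_len)]
            for i in range(seq_len)]


def theoretical_span(window: int, n_layers: int) -> int:
    """Upper bound on how far information can travel through stacked layers."""
    return window * n_layers


# Tiny example: 5 tokens, window of 3.
for row in sliding_window_mask(5, 3):
    print("".join("x" if allowed else "." for allowed in row))

print(theoretical_span(4096, 32))  # W = 4096 across 32 layers
```

The span figure is an upper bound on reachability, not a claim that information from that distance is preserved with equal fidelity.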

The serving implementation follows from that structure. A rolling buffer cache stores keys and values at position i mod W, so cache size stops growing once the sequence exceeds the window. For a 32K-token sequence, the paper reports an 8x cache-memory reduction without quality impact. Long prompts are handled with chunked prefill: the prompt is processed in smaller chunks, often the window size, with attention over the current chunk and cache rather than over the entire prompt at once. The paper also reports that changes to FlashAttention and xFormers produce a 2x speed improvement over vanilla attention for a 16K sequence with W = 4096.
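The rolling buffer and the prompt chunking can be sketched with plain Python containers. This is a toy model under stated assumptions: `RollingKVCache` and `chunk_prompt` are hypothetical names, the cached "KV" entries are opaque placeholders rather than tensors, and the real implementation's attention over cache plus chunk is not shown.

```python
class RollingKVCache:
    """Fixed-size cache where position i is stored in slot i mod window."""

    def __init__(self, window: int):
        self.window = window
        self.slots = [None] * window      # cached key/value payloads
        self.positions = [None] * window  # which absolute position each slot holds

    def store(self, position: int, kv) -> None:
        slot = position % self.window
        # Once position >= window, this overwrites the entry from
        # position - window, so the cache never grows past `window` slots.
        self.slots[slot] = kv
        self.positions[slot] = position

    def cached_positions(self) -> list:
        return sorted(p for p in self.positions if p is not None)


def chunk_prompt(tokens: list, chunk_size: int) -> list:
    """Split a prompt into consecutive chunks for chunked prefill."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


cache = RollingKVCache(window=4)
for pos in range(10):
    cache.store(pos, f"kv{pos}")
print(cache.cached_positions())  # only the 4 most recent positions remain

chunks = chunk_prompt(list(range(10)), chunk_size=4)
print([len(c) for c in chunks])

# Cache-size ratio for a 32K-token sequence with W = 4096.
print(32768 // 4096)
```

The last line shows where the reported 8x figure comes from: a cache that would otherwise hold 32,768 positions holds only 4,096.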

Evidence

The architecture table is the main compute evidence: 32 heads, 8 KV heads, 4096 sliding window, and 8192 context. The paper gives two direct efficiency results: 2x speed improvement at 16K sequence length from FlashAttention/xFormers changes, and 8x KV-cache memory reduction at 32K from the rolling buffer cache.

Benchmark evidence supports the claim that these serving-oriented changes did not simply produce a weak small model. In the table comparing public baselines, Mistral 7B reaches 60.1% MMLU, 81.3% HellaSwag, 75.3% WinoGrande, 83.0% PIQA, 55.5% ARC-Challenge, 28.8% NaturalQuestions, 69.9% TriviaQA, 30.5% HumanEval, 47.5% MBPP, 13.1% MATH, and 52.2% GSM8K. The same table gives Llama 2 13B at 55.6% MMLU, 80.7% HellaSwag, 72.9% WinoGrande, 80.8% PIQA, 48.8% ARC-Challenge, 29.0% NaturalQuestions, 69.6% TriviaQA, 18.9% HumanEval, 35.4% MBPP, 6.0% MATH, and 34.3% GSM8K. Thus the 7B model beats Llama 2 13B on 10 of the 11 metrics listed here and trails by 0.2 points on NaturalQuestions, which is effectively a tie.
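The per-benchmark gaps are easier to scan when computed directly. The sketch below only restates the scores quoted in this paragraph; the dictionary and variable names are hypothetical, and differences are rounded to one decimal to avoid floating-point noise.

```python
MISTRAL_7B = {
    "MMLU": 60.1, "HellaSwag": 81.3, "WinoGrande": 75.3, "PIQA": 83.0,
    "ARC-Challenge": 55.5, "NaturalQuestions": 28.8, "TriviaQA": 69.9,
    "HumanEval": 30.5, "MBPP": 47.5, "MATH": 13.1, "GSM8K": 52.2,
}
LLAMA2_13B = {
    "MMLU": 55.6, "HellaSwag": 80.7, "WinoGrande": 72.9, "PIQA": 80.8,
    "ARC-Challenge": 48.8, "NaturalQuestions": 29.0, "TriviaQA": 69.6,
    "HumanEval": 18.9, "MBPP": 35.4, "MATH": 6.0, "GSM8K": 34.3,
}

# Percentage-point difference, Mistral 7B minus Llama 2 13B.
deltas = {name: round(MISTRAL_7B[name] - LLAMA2_13B[name], 1)
          for name in MISTRAL_7B}
wins = [name for name, delta in deltas.items() if delta > 0]

for name, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<17}{delta:+.1f}")
print(f"Mistral 7B ahead on {len(wins)} of {len(deltas)} benchmarks")
```

The largest gaps fall on the math and code benchmarks, while the knowledge benchmarks (NaturalQuestions, TriviaQA) are the closest, which matches the knowledge-capacity caveat raised earlier in this card.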

For instruction tuning, Mistral 7B-Instruct reports MT-Bench 6.84 +/- 0.07, above Llama 2 13B Chat's 6.65 in the table. A human preference comparison as of October 6, 2023 favored Mistral 7B outputs 5,020 times versus 4,143 for Llama 2 13B. These are not training-efficiency numbers, but they show why the lower serving footprint mattered.

Historical Effect

Mistral 7B made inference-oriented architecture a visible part of open LLM competition. LLaMA had already emphasized smaller models trained longer; Mistral added that a small dense model could gain practical advantage from attention/cache design as well as from pretraining scale. GQA and sliding-window attention became part of the public recipe for models meant to be served cheaply at useful context lengths.

The paper also sharpened the three-way tradeoff among model capability, training cost, and inference cost. Its conclusion explicitly argues that the field should not associate capability only with training cost. Historically, that is why Mistral 7B mattered: it made a 7B checkpoint feel like a deployment target rather than merely a research toy.

Limits

The largest limit for this reading card is disclosure. The paper does not list training hardware, number of accelerators, training tokens, training duration, optimizer settings, or full data mixture. Any precise claim about the training cluster would be unsupported by the local source.

The architecture has limits too. Sliding windows reduce attention cost, but they also restrict direct access to far-past tokens; long-range information must propagate through layers rather than be attended globally at each layer. The authors claim no quality impact for the rolling cache in their reported setting, but the paper does not exhaustively test all long-context retrieval behaviors. Finally, the 7B parameter count limits factual storage, and the guardrail/self-reflection section is a lightweight prompt-based demonstration rather than a full safety training method.

Links

  • Compute regime: history/compute_regimes/hyperscale_dense_llm_training/README.md
  • Source PDF and extracted text are listed in metadata above.
  • Queue status: read_complete.
  • Method index: transformer
  • Ledger updates: compute bottlenecks