# Benchmarks - Lean Models

Offloading performance and model quality, measured on real hardware.

## Offloading Performance

Token generation rates with full expert offloading on an RTX 3090 (24 GB VRAM, 64 GB RAM, NVMe SSD).

| Model | Quant | Prefill | Decode | VRAM Cache Hit |
|-------|-------|---------|--------|----------------|
| lean-agent-35b | Q4_K_M | 10-15 tok/s | 6.7-7.6 tok/s | 93.1% |
| lean-agent-122b | Q4_K_M | - | 2.3 tok/s | - |
| lean-think-398b | Q4_K_M | - | Testing in progress | - |

All measurements were taken with profile-guided preloading and speculative router prefetch enabled. Model load time: under 1 s. Preload throughput: 5.9 GB/s. CopyEngine prefetch hit rate: 81.3%.
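The cache behavior behind these numbers can be illustrated with a toy model. This is a minimal sketch, not lean-engine's actual implementation: `ExpertCache`, its slot-count capacity, and the placeholder weight strings are illustrative assumptions (a real cache would track VRAM bytes and hold GPU tensors, and loads would be asynchronous).

```python
from collections import OrderedDict

class ExpertCache:
    """Toy fixed-capacity LRU cache for MoE expert weights in VRAM.

    Capacity is counted in expert slots; a real engine tracks bytes.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()  # expert_id -> weights (placeholder)
        self.hits = 0
        self.misses = 0

    def preload(self, expert_ids):
        """Profile-guided preload: warm the cache with the experts a
        profiling run saw most often, before the first token."""
        for eid in expert_ids[: self.capacity]:
            self.slots[eid] = f"weights-{eid}"

    def fetch(self, expert_id):
        """Return expert weights, simulating a host-RAM/NVMe load on miss."""
        if expert_id in self.slots:
            self.hits += 1
            self.slots.move_to_end(expert_id)   # mark most recently used
        else:
            self.misses += 1
            if len(self.slots) >= self.capacity:
                self.slots.popitem(last=False)  # evict least recently used
            self.slots[expert_id] = f"weights-{expert_id}"
        return self.slots[expert_id]

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

The reported VRAM cache hit rate is simply `hits / (hits + misses)` over a full generation run; preloading raises it by ensuring the hottest experts are already resident before decoding starts.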

## Engine Features

- **93% VRAM cache hit rate** - LRU cache with profile-guided preloading
- **81% speculative prefetch hit rate** - router predicts next-layer experts ahead of computation
- **5.9 GB/s** expert preload throughput via async background I/O
- **0.83s** model load time
- **Cross-validated vs llama.cpp** - token-for-token matches on the same GGUF weights under greedy decoding (see Methodology)
- **Multi-GPU** pipeline parallelism with bit-identical output
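The speculative prefetch idea in the list above can be sketched as follows. This is a hypothetical illustration, not the engine's router: `predict_next_experts` and `prefetch` are assumed names, and a real implementation would issue asynchronous copies on a dedicated copy engine while the current layer computes, rather than load synchronously.

```python
import heapq

def predict_next_experts(router_logits, top_k=2):
    """Pick the top-k expert ids from (speculative) next-layer router
    logits so their weights can be staged in VRAM before that layer runs.

    router_logits: list of floats, one score per expert.
    """
    return heapq.nlargest(top_k, range(len(router_logits)),
                          key=router_logits.__getitem__)

def prefetch(cache, expert_ids):
    """Stage predicted experts in the cache (dict of expert_id -> weights).
    In a real engine this overlaps with compute; here it is synchronous."""
    for eid in expert_ids:
        cache.setdefault(eid, f"weights-{eid}")
```

A prediction that turns out correct is a prefetch hit: the expert is already in VRAM when the router actually selects it, which is what the 81% prefetch hit rate measures.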

## Hardware Reference Configurations

| Tier | VRAM | RAM | NVMe | Target Models |
|------|------|-----|------|---------------|
| Minimal | 12 GB | 16 GB | 1.8 TB | lean-agent-35b, lean-coder-80b |
| Prosumer | 24 GB | 32 GB | 1.8 TB | lean-agent-122b |
| Enthusiast | 48 GB | 64 GB | 1.8 TB | lean-reason-397b, lean-think-398b |

## Model Quality Benchmarks

Results coming soon. MMLU, HumanEval, BFCL, IFEval, GSM8K, and MATH will be run via `lm-evaluation-harness` against `lean serve`.

## Methodology

**Offloading benchmarks** measure tok/s, VRAM cache hit rate, prefetch hit rate, and expert preload throughput via `lean bench`.

**Cross-validation** compares lean-engine output against llama.cpp on the same GGUF weights with greedy decoding. 4/10 prompts match token-for-token; the remaining 6/10 diverge only in free-form thinking text, which is expected from floating-point precision differences.
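The token-for-token comparison reduces to finding the first position where two greedy decodes disagree. A minimal sketch (`first_divergence` is an illustrative helper, not part of lean's tooling):

```python
def first_divergence(tokens_a, tokens_b):
    """Return the index of the first differing token between two greedy
    decodes, or None if they match token-for-token."""
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    if len(tokens_a) != len(tokens_b):
        # One sequence is a strict prefix of the other.
        return min(len(tokens_a), len(tokens_b))
    return None
```

A prompt "matches" when this returns `None`; for the diverging prompts, the returned index falls inside the thinking span rather than the final answer.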

**Quality benchmarks** will be run via `lm-evaluation-harness` against the `lean serve` OpenAI-compatible API. Models must match base model scores before release.
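The release criterion above can be sketched as a per-task score gate. This is an assumption-laden illustration: the source says scores "must match" without specifying a margin, so the `tolerance` parameter and the function name are hypothetical.

```python
def passes_release_gate(candidate, base, tolerance=0.01):
    """Check that a quantized/offloaded model's benchmark scores stay
    within `tolerance` (absolute) of the base model on every task.

    candidate, base: dicts mapping task name -> accuracy in [0, 1].
    A task missing from `candidate` counts as a score of 0 and fails.
    """
    return all(candidate.get(task, 0.0) >= score - tolerance
               for task, score in base.items())
```

Run against the harness's per-task output, this gives a single pass/fail signal for the pre-release check.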

**Hardware:** All benchmarks run locally on reference configurations. No cloud GPUs. Results are reproducible.
