# Lean Models

MoE models from 35B to 398B parameters. Run models larger than your VRAM - expert offloading handles the rest.

## lean-agent-35b

35B total parameters, 3B active per token. General-purpose agent model for tool calling, structured output, and multi-step reasoning.

- **Architecture:** GDN hybrid MoE
- **Base model:** Qwen3.5-35B-A3B
- **Total / Active:** 35B / 3B
- **Min VRAM:** 12 GB
- **GGUF sizes:** Q3_K_M 16.3 GB · Q4_K_M 21.4 GB · Q5_K_M 25.0 GB · Q6_K 30.0 GB · Q8_0 36.9 GB

A 21 GB model (Q4_K_M) that runs on 12 GB VRAM. 6.7-7.6 tok/s decode on an RTX 3090.
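
Why a 12 GB floor works: only the experts activated for the current token have to be resident. A back-of-envelope estimate, derived from the figures above rather than from any official breakdown:

```bash
# The Q4_K_M numbers imply ~4.9 bits/weight (21.4 GB / 35B params),
# so the 3B-active hot set per token is only about:
echo "3 * 10^9 * 4.9 / 8 / 10^9" | bc -l   # ~1.8 GB of active expert weights
```

The rest of the 12 GB presumably goes to attention layers, KV cache, and a cache of recently used experts.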

## lean-coder-80b

80B total parameters, 3B active per token. 512 experts per layer, 10 active. Code-specialized model for generation, debugging, and software engineering.

- **Architecture:** MoE (512 experts)
- **Base model:** Qwen3-Coder-Next
- **Total / Active:** 80B / 3B
- **Min VRAM:** 12 GB
- **GGUF sizes:** Q3_K_M 36.7 GB · Q4_K_M 48.7 GB · Q5_K_M 57.0 GB · Q6_K 65.8 GB · Q8_0 84.8 GB

A 48.7 GB model (Q4_K_M) that runs on 12 GB VRAM.
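
The downloads are large relative to typical SSDs, so it's worth checking disk space before pulling. A minimal sketch using the documented `lean pull` / `lean run` commands (which quant `lean pull` fetches by default is an assumption here; the sizes above range from 36.7 to 84.8 GB):

```bash
df -h .                  # confirm free space first (36.7-84.8 GB per quant)
lean pull lean-coder-80b
lean run lean-coder-80b
```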

## lean-agent-122b

122B total parameters, 10B active per token. 256 experts per layer, 8 active. Advanced agent model for complex orchestration and long-context workflows.

- **Architecture:** GDN hybrid MoE
- **Base model:** Qwen3.5-122B-A10B
- **Total / Active:** 122B / 10B
- **Min VRAM:** 24 GB
- **GGUF sizes:** Q3_K_M 56.6 GB · Q4_K_M 75.0 GB · Q5_K_M 87.8 GB · Q6_K 105.7 GB · Q8_0 129.9 GB

A 75 GB model (Q4_K_M) that runs on 24 GB VRAM. 2.3 tok/s decode on an RTX 3090.
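
The higher 24 GB floor tracks the larger active parameter count. Same back-of-envelope method as above, derived from the figures rather than from official numbers:

```bash
# 75.0 GB / 122B params ~ 4.9 bits/weight, so the 10B-active hot set is
# roughly three times the size of the 35b model's:
echo "10 * 10^9 * 4.9 / 8 / 10^9" | bc -l   # ~6.1 GB of active expert weights
```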

## lean-reason-397b

397B total parameters, 17B active per token. Frontier-scale reasoning from a massive expert pool.

- **Architecture:** GDN hybrid MoE
- **Base model:** Qwen3.5-397B-A17B
- **Total / Active:** 397B / 17B
- **Min VRAM:** 48 GB
- **GGUF sizes:** Q3_K_M 177.4 GB · Q4_K_M 244.1 GB · Q5_K_M 293.7 GB · Q6_K 326.6 GB · Q8_0 421.5 GB

A 244 GB model (Q4_K_M) that runs on 48 GB VRAM - frontier-scale reasoning on your own hardware.

## lean-think-398b

398B total parameters, ~13B active per token. 256 experts per MoE layer, 4 active + 1 shared. Extended reasoning with chain-of-thought and agentic RL post-training.

- **Architecture:** AFMoE (interleaved sliding-window and global attention)
- **Base model:** Arcee Trinity-Large-Thinking
- **Total / Active:** 398B / ~13B
- **Min VRAM:** 48 GB
- **License:** Apache 2.0
- **GGUF sizes:** Q3_K_M 181.4 GB · Q4_K_M 241.9 GB · Q5_K_M 283.6 GB · Q6_K 343.2 GB · Q8_0 423.7 GB

A 242 GB model (Q4_K_M) that runs on 48 GB VRAM. The "you can't run this without expert offloading" model.
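
At this scale the bottleneck shifts to the tiers behind VRAM: system RAM and NVMe have to back the bulk of the 242 GB pack that can't fit on the GPU. A quick pre-flight check (generic Linux commands, not part of the lean CLI; `nvme list` needs nvme-cli):

```bash
free -h          # how much RAM is available to back paged-out experts
df -h .          # need ~242 GB free for the Q4_K_M download alone
sudo nvme list   # optional: confirm the pack sits on NVMe, not SATA
```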

## Model Format

All models ship in `.lmpack` - a format designed for expert offloading. The lean runtime keeps the hot path in VRAM and transparently pages in the rest from RAM and NVMe as needed. Combined with speculative prefetching and profile-guided preloading, it delivers interactive speeds on hardware that would otherwise be far too small.

```bash
lean pull lean-agent-35b
lean run lean-agent-35b
```
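
One way to watch the offloading at work (an observation tip, not a lean feature): GPU memory use should hover near a model's Min VRAM figure even though the pack on disk is several times larger. On NVIDIA hardware, for instance:

```bash
# Poll VRAM once per second while a lean session is running:
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```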

Single binary, 15 MB. Download once, run forever. No cloud dependency.
