4.1 GGUF Quantization Selection Guide
Source: llama.cpp Discussions + E2E Networks Guide
Quantization Levels
Quantization Levels (GGUF):
├── Q8_0: ~8 bits/weight, minimal quality loss
├── Q6_K: ~6.5 bits/weight, excellent balance
├── Q5_K_M: ~5.5 bits/weight, good quality
├── Q4_K_M: ~4.8 bits/weight, recommended default
├── Q4_K_S: ~4.5 bits/weight, smaller but acceptable
├── Q3_K_M: ~3.8 bits/weight, noticeable degradation
└── Q2_K: ~2.5 bits/weight, significant quality loss
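To make the bits/weight figures concrete, here is a minimal sketch (in Python, with a made-up `gguf_size_gb` helper) that estimates on-disk size as parameters × bits per weight. Real GGUF files also carry metadata and keep some tensors at higher precision, so treat the result as a rough lower bound.

```python
# Back-of-envelope GGUF file size: parameters * bits-per-weight / 8 bytes.
# Bits-per-weight values mirror the list above; actual files add metadata
# and store some tensors at higher precision, so this is an approximation.

BITS_PER_WEIGHT = {
    "Q8_0": 8.0,
    "Q6_K": 6.5,
    "Q5_K_M": 5.5,
    "Q4_K_M": 4.8,
    "Q4_K_S": 4.5,
    "Q3_K_M": 3.8,
    "Q2_K": 2.5,
}

def gguf_size_gb(n_params: float, quant: str) -> float:
    """Approximate quantized model size in GB (1 GB = 1e9 bytes)."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

# Example: a 7B model at Q4_K_M lands around 4.2 GB on disk.
print(f"{gguf_size_gb(7e9, 'Q4_K_M'):.1f} GB")
```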
Model Size Selection by VRAM
| VRAM | Recommended Models |
|---|---|
| 8 GB | 7B-8B models at Q4_K_M |
| 12 GB | 7B-8B at Q8_0, or 13B at Q4_K_M |
| 16 GB | 13B at Q6_K, or 30B at Q4_K_M |
| 24 GB | 30B at Q6_K, or 70B at Q3_K_M |
| 48 GB | 70B at Q6_K |
| 80 GB+ | 70B at Q8_0, or larger models |
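The table can be encoded as a simple lookup. The sketch below is illustrative only: the `recommend` helper and its threshold list are invented for this example and mirror the rows above; in practice, leave headroom for the KV cache and activations rather than filling VRAM with weights alone.

```python
# Hypothetical helper that encodes the VRAM table above as a lookup.
# Thresholds and pairings mirror the table; leave headroom for KV cache
# and activations before committing to a configuration.

RECOMMENDATIONS = [
    (8,  "7B-8B models at Q4_K_M"),
    (12, "7B-8B at Q8_0, or 13B at Q4_K_M"),
    (16, "13B at Q6_K, or 30B at Q4_K_M"),
    (24, "30B at Q6_K, or 70B at Q3_K_M"),
    (48, "70B at Q6_K"),
    (80, "70B at Q8_0, or larger models"),
]

def recommend(vram_gb: int) -> str:
    """Return the largest table row whose VRAM requirement still fits."""
    choice = "Below 8 GB: use smaller models or heavier quantization."
    for threshold, models in RECOMMENDATIONS:
        if vram_gb >= threshold:
            choice = models
    return choice

print(recommend(24))  # "30B at Q6_K, or 70B at Q3_K_M"
```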
Key Insight
Blind testing shows Q6 sometimes beats Q8 for instruction following. Models that score higher on benchmarks also tend to hold up better at lower quantization levels.
4.2 EXL2 vs GGUF Decision Matrix
Source: llama.cpp GitHub Discussions
Use EXL2 When:
Use EXL2 when:
├── GPU-only inference (no CPU offload)
├── Need 4-bit KV cache (quarter memory for context)
├── Long context (16K+)
├── Nvidia GPU with CUDA
└── Speed is a priority over compatibility
Use GGUF When:
Use GGUF when:
├── Mixed CPU+GPU inference (GPU-poor)
├── Apple Silicon / M-series
├── Need broad compatibility
├── Prefer simpler tooling (llama.cpp, Ollama)
└── Want to leverage both CPU and GPU
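A small decision helper can encode both checklists. This is only a sketch: the `Setup` fields and `pick_format` function are invented for illustration, and the branch order reflects one reasonable reading of the criteria above rather than a definitive rule.

```python
# Illustrative decision helper encoding the two checklists above.
# Field names are invented for this sketch; the criteria come from the lists.

from dataclasses import dataclass

@dataclass
class Setup:
    gpu_only: bool             # whole model fits in VRAM, no CPU offload
    nvidia_cuda: bool          # NVIDIA GPU with CUDA available
    apple_silicon: bool        # M-series Mac
    long_context: bool         # 16K+ tokens of context
    needs_compatibility: bool  # broad tooling support matters most

def pick_format(s: Setup) -> str:
    # GGUF whenever CPU offload, Apple Silicon, or compatibility dominates.
    if s.apple_silicon or s.needs_compatibility or not s.gpu_only:
        return "GGUF"
    # EXL2 needs CUDA and pays off most for GPU-only, long-context runs.
    if s.nvidia_cuda and (s.gpu_only or s.long_context):
        return "EXL2"
    return "GGUF"

print(pick_format(Setup(gpu_only=True, nvidia_cuda=True, apple_silicon=False,
                        long_context=True, needs_compatibility=False)))  # EXL2
```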
Performance Note
EXL2 can be ~2x faster than GGUF for GPU-only inference, but GGUF offers superior compatibility.