4.1 GGUF Quantization Selection Guide

Source: llama.cpp Discussions + E2E Networks Guide

Quantization Levels

Quantization Levels (GGUF):
├── Q8_0:   ~8 bits/weight,   minimal quality loss
├── Q6_K:   ~6.5 bits/weight, excellent balance
├── Q5_K_M: ~5.5 bits/weight, good quality
├── Q4_K_M: ~4.8 bits/weight, recommended default
├── Q4_K_S: ~4.5 bits/weight, smaller but acceptable
├── Q3_K_M: ~3.8 bits/weight, noticeable degradation
└── Q2_K:   ~2.5 bits/weight, significant quality loss
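
The bits/weight column translates directly into file size. As a rough check, the Python sketch below multiplies parameter count by bits/weight; the dictionary values simply restate the list above, and the function name and the decision to ignore metadata overhead are illustrative assumptions, not anything from llama.cpp.

```python
# Back-of-the-envelope GGUF file-size estimate from parameter count and the
# bits/weight figures listed above. Ignores metadata/tokenizer overhead.
BITS_PER_WEIGHT = {
    "Q8_0": 8.0,
    "Q6_K": 6.5,
    "Q5_K_M": 5.5,
    "Q4_K_M": 4.8,
    "Q4_K_S": 4.5,
    "Q3_K_M": 3.8,
    "Q2_K": 2.5,
}

def estimate_gguf_size_gb(params_billion: float, quant: str) -> float:
    """Approximate on-disk size in GB for a model at the given quant level."""
    total_bytes = params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return total_bytes / 1e9

print(f"{estimate_gguf_size_gb(7, 'Q4_K_M'):.1f} GB")  # ~4.2 GB for a 7B model
```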

Model Size Selection by VRAM

VRAM     Recommended Models
8 GB     7B-8B models at Q4_K_M
12 GB    7B-8B at Q8_0, or 13B at Q4_K_M
16 GB    13B at Q6_K, or 30B at Q4_K_M
24 GB    30B at Q6_K, or 70B at Q3_K_M
48 GB    70B at Q6_K
80 GB+   70B at Q8_0, or larger models
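
A minimal sketch of the same table as a lookup helper, assuming you only know the card's VRAM. The thresholds and strings restate the rows above, and the function name is made up for illustration; keep in mind that context length and the KV cache consume VRAM on top of the weights.

```python
# VRAM-to-recommendation lookup encoding the table above.
VRAM_RECOMMENDATIONS = [
    (8,  "7B-8B at Q4_K_M"),
    (12, "7B-8B at Q8_0, or 13B at Q4_K_M"),
    (16, "13B at Q6_K, or 30B at Q4_K_M"),
    (24, "30B at Q6_K, or 70B at Q3_K_M"),
    (48, "70B at Q6_K"),
    (80, "70B at Q8_0, or larger models"),
]

def recommend(vram_gb: float) -> str:
    """Return the recommendation for the largest VRAM tier that fits."""
    best = "CPU offload or a smaller model"
    for threshold, rec in VRAM_RECOMMENDATIONS:
        if vram_gb >= threshold:
            best = rec
    return best

print(recommend(24))  # -> "30B at Q6_K, or 70B at Q3_K_M"
```
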
Key Insight

Blind testing shows that Q6 sometimes beats Q8 for instruction following. Models with higher benchmark scores also tend to hold up better at lower quantization levels.

4.2 EXL2 vs GGUF Decision Matrix

Source: llama.cpp GitHub Discussions

Use EXL2 When:

Use EXL2 when:
├── GPU-only inference (no CPU offload)
├── Need 4-bit KV cache (roughly a quarter of the context memory; see the sketch after this list)
├── Long context (16K+)
├── Nvidia GPU with CUDA
└── Speed is priority over compatibility
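
To see why the 4-bit KV cache matters at long context, the sketch below estimates cache memory for a Llama-2-7B-style shape (32 layers, 32 KV heads, head dimension 128). Those shape numbers are assumptions for illustration only; the source does not specify a model.

```python
# KV cache memory = keys + values for every layer, KV head, and position.
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bits_per_element: int) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bits_per_element / 8

# Assumed Llama-2-7B-style shape at 16K context.
fp16 = kv_cache_bytes(32, 32, 128, 16_384, 16)
q4 = kv_cache_bytes(32, 32, 128, 16_384, 4)
print(f"FP16: {fp16 / 2**30:.1f} GiB, 4-bit: {q4 / 2**30:.1f} GiB")
# -> roughly 8 GiB vs 2 GiB: the "quarter memory for context" point above.
```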

Use GGUF When:

Use GGUF when:
├── Mixed CPU+GPU inference (GPU-poor; see the example after this list)
├── Apple Silicon / M-series
├── Need broad compatibility
├── Prefer simpler tooling (llama.cpp, Ollama)
└── Want to leverage both CPU and GPU
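
For the mixed CPU+GPU case, here is a hedged example using llama-cpp-python (a common Python binding for llama.cpp). The model path is a placeholder, and n_gpu_layers controls how many transformer layers are offloaded to the GPU while the rest run on the CPU.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="./models/llama-2-13b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=20,  # offload what fits in VRAM; -1 offloads every layer
    n_ctx=4096,       # context window
)

out = llm("Explain GGUF quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```
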
Performance Note

EXL2 can be roughly 2x faster than GGUF for GPU-only inference, but GGUF offers much broader hardware and tooling compatibility.