1.1 Optimized llama.cpp Server Configuration

Source: llama.cpp GitHub Discussions

Production Configuration

# Production-optimized llama-server launch
./llama-server \
  --model /path/to/model.gguf \
  --n-gpu-layers 99 \
  --ctx-size 32768 \
  --threads 2 \
  --batch-size 1024 \
  --ubatch-size 512 \
  --parallel 4 \
  --cont-batching \
  --mlock \
  --flash-attn \
  --host 0.0.0.0 \
  --port 8080
Key Insights
  • Counter-intuitively, reducing --threads from 32 to 2 can improve performance for fully GPU-offloaded models, where extra CPU threads mostly add contention
  • --flash-attn enables the Flash Attention optimization (supported by most recent models)
  • --mlock prevents the OS from swapping the model out to disk (use only if you have sufficient free RAM)
  • --cont-batching enables continuous batching for better multi-request throughput
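With the server running, a quick smoke test against its HTTP API confirms the configuration before clients connect. This is a minimal sketch using llama-server's health and OpenAI-compatible chat endpoints; host, port, and prompt are placeholders matching the launch command above.

# Health check (reports an "ok" status once the model has finished loading)
curl http://localhost:8080/health

# OpenAI-compatible chat completion request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'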
Pitfall

Flash attention with context >8K tokens can produce gibberish in some Qwen2 models.

Workaround: enable flash attention (-fa) and fully offload the model to the GPU (-ngl 80), as in the example below.
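A minimal sketch of the workaround, with a placeholder Qwen2 GGUF path; the key is combining -fa with full GPU offload, not the exact layer count.

# Qwen2 workaround: flash attention plus full GPU offload
./llama-server \
  --model /path/to/qwen2-7b-instruct.gguf \
  -fa \
  -ngl 80 \
  --ctx-size 16384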

1.2 Ollama Modelfile for Performance

Source: Ollama Documentation + Ollama Tuning Guide

Modelfile Configuration

# Modelfile for optimized inference
FROM llama3.1:8b

# Context window (default 4096)
PARAMETER num_ctx 32768

# Batch size for prompt processing
PARAMETER num_batch 256

# GPU layers to offload (999 = all)
PARAMETER num_gpu 999

# Thread count for CPU operations
PARAMETER num_thread 6

# Temperature for generation
PARAMETER temperature 0.7

# System prompt
SYSTEM You are a helpful AI assistant.
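To apply the Modelfile, build a named model from it and then run that model. The name my-llama3-tuned below is arbitrary.

# Build a model from the Modelfile and run it
ollama create my-llama3-tuned -f ./Modelfile
ollama run my-llama3-tuned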

Runtime Override

# Temporarily increase the context window from inside an interactive session
ollama run llama3.1
>>> /set parameter num_ctx 8192
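The same override can be applied per request through Ollama's REST API via the options field; this sketch assumes the default endpoint on localhost:11434.

# Per-request override via the REST API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Summarize the benefits of continuous batching.",
  "options": {"num_ctx": 8192}
}'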

Monitor Performance

# View current context length in use
ollama ps

# Get token generation metrics
ollama run gemma3:12b --verbose
Key Insight

Reducing the context window significantly improves performance on consumer GPUs. Use 4K-8K contexts for daily tasks and reserve larger contexts for document processing.

1.3 vLLM Production Deployment

Source: vLLM Documentation + Google Cloud vLLM Guide

Single GPU Deployment

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768 \
  --enable-prefix-caching
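vLLM exposes an OpenAI-compatible API, so the deployment can be verified with a plain curl request; the model name must match the one passed to vllm serve.

# Verify the server with an OpenAI-compatible chat request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'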

Multi-GPU with Tensor Parallelism

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768
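Before launching with tensor parallelism, confirm that the expected GPUs are actually visible; --tensor-parallel-size should match the number of GPUs you intend to use. A sketch assuming NVIDIA hardware:

# Confirm the expected GPUs are visible before setting --tensor-parallel-size
nvidia-smi --query-gpu=index,name,memory.total --format=csv

# Restrict vLLM to specific GPUs if the node has more than you want to use
export CUDA_VISIBLE_DEVICES=0,1,2,3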

With Speculative Decoding (up to 2.8x speedup)

vllm serve facebook/opt-6.7b \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.8 \
  --speculative-config '{"model": "facebook/opt-125m", "num_speculative_tokens": 5}'

Memory Optimization

# Start with gpu_memory_utilization=0.98
# If OOM, decrease by 1% until stable
# Enable prefix caching for shared system prompts
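As a concrete sketch of that tuning loop (model name reused from the single-GPU example above):

# Start near the top of GPU memory and back off on OOM
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.98 \
  --enable-prefix-caching
# If startup fails with an out-of-memory error, retry with 0.97, 0.96, ... until stable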
Key Insight

vLLM excels at high-throughput, multi-user scenarios. For single-user workloads, llama.cpp may be faster.

1.4 DeepSeek R1 Inference Settings

Source: Unsloth DeepSeek Guide

Recommended Settings

# Recommended settings for DeepSeek R1
llama-server \
  --model deepseek-r1.gguf \
  --temp 0.6 \
  --top-p 0.95 \
  --min-p 0.01 \
  --ctx-size 16384 \
  --n-gpu-layers 99
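The same sampling values can also be supplied per request through llama-server's native /completion endpoint; this sketch assumes the default port 8080.

# Per-request sampling override matching the recommended settings
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Why is the sky blue?",
    "temperature": 0.6,
    "top_p": 0.95,
    "min_p": 0.01,
    "n_predict": 1024
  }'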

Hardware Requirements

Configuration                                                 Performance
Minimum: 64 GB RAM                                            ~1 token/s (CPU only)
Optimal: 180 GB unified memory or 180 GB combined RAM+VRAM    5+ tokens/s
8B distilled version: 20 GB RAM is sufficient                 -

Quick Reference: Essential Flags

Flag               Description                     Default
-m MODEL_PATH      Model file path                 -
-c CONTEXT_SIZE    Context size (KV cache length)  512
-ngl N             GPU layers (99 = all)           0
-b BATCH_SIZE      Batch size for prompt           512
--mlock            Lock model in RAM               off
-fa                Enable flash attention          off
--cont-batching    Continuous batching             off
-t THREADS         CPU threads                     auto
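As a minimal illustration of the short-form flags above (model path is a placeholder):

# Compact llama-server launch using the short flags from the table
./llama-server -m /path/to/model.gguf -c 8192 -ngl 99 -b 1024 -t 4 -fa --mlock --cont-batching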