1.1 Optimized llama.cpp Server Configuration

Source: llama.cpp GitHub Discussions

Production Configuration

# Production-optimized llama-server launch
./llama-server \
  --model /path/to/model.gguf \
  --n-gpu-layers 99 \
  --ctx-size 32768 \
  --threads 2 \
  --batch-size 1024 \
  --ubatch-size 512 \
  --parallel 4 \
  --cont-batching \
  --mlock \
  --flash-attn \
  --host 0.0.0.0 \
  --port 8080
Key Insights
  • Counter-intuitively, reducing --threads from 32 to 2 can improve performance for fully GPU-offloaded models, where extra CPU threads mostly add contention
  • --flash-attn enables the Flash Attention optimization (supported by most recent models)
  • --mlock prevents the OS from swapping the model out to disk (use only if you have sufficient free RAM)
  • --cont-batching enables continuous batching for better multi-request throughput
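With the server running, a quick smoke test against its HTTP API confirms the configuration before clients connect. This is a minimal sketch using llama-server's health and OpenAI-compatible chat endpoints; host, port, and prompt are placeholders matching the launch command above.

# Health check (reports an "ok" status once the model has finished loading)
curl http://localhost:8080/health

# OpenAI-compatible chat completion request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'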
Pitfall

Flash attention with context >8K tokens can produce gibberish in some Qwen2 models.

Workaround: enable flash attention (-fa) and fully offload the model to the GPU (-ngl 80), as in the example below.
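A minimal sketch of the workaround, with a placeholder Qwen2 GGUF path; the key is combining -fa with full GPU offload, not the exact layer count.

# Qwen2 workaround: flash attention plus full GPU offload
./llama-server \
  --model /path/to/qwen2-7b-instruct.gguf \
  -fa \
  -ngl 80 \
  --ctx-size 16384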

1.2 Ollama Modelfile for Performance

Source: Ollama Documentation + Ollama Tuning Guide

Modelfile Configuration

# Modelfile for optimized inference
FROM llama3.1:8b

# Context window (default 4096)
PARAMETER num_ctx 32768

# Batch size for prompt processing
PARAMETER num_batch 256

# GPU layers to offload (999 = all)
PARAMETER num_gpu 999

# Thread count for CPU operations
PARAMETER num_thread 6

# Temperature for generation
PARAMETER temperature 0.7

# System prompt
SYSTEM You are a helpful AI assistant.
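To apply the Modelfile, build a named model from it and then run that model. The name my-llama3-tuned below is arbitrary.

# Build a model from the Modelfile and run it
ollama create my-llama3-tuned -f ./Modelfile
ollama run my-llama3-tuned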

Runtime Override

# Temporarily increase the context window from inside an interactive session
ollama run llama3.1
>>> /set parameter num_ctx 8192
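The same override can be applied per request through Ollama's REST API via the options field; this sketch assumes the default endpoint on localhost:11434.

# Per-request override via the REST API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Summarize the benefits of continuous batching.",
  "options": {"num_ctx": 8192}
}'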

Monitor Performance

# View current context length in use
ollama ps

# Get token generation metrics
ollama run gemma3:12b --verbose
Key Insight

Reducing the context window significantly improves performance on consumer GPUs. Use 4K-8K contexts for daily tasks and reserve larger contexts for document processing.

1.3 vLLM Production Deployment

Source: vLLM Documentation + Google Cloud vLLM Guide

Single GPU Deployment

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768 \
  --enable-prefix-caching
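vLLM exposes an OpenAI-compatible API, so the deployment can be verified with a plain curl request; the model name must match the one passed to vllm serve.

# Verify the server with an OpenAI-compatible chat request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'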

Multi-GPU with Tensor Parallelism

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768
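Before launching with tensor parallelism, confirm that the expected GPUs are actually visible; --tensor-parallel-size should match the number of GPUs you intend to use. A sketch assuming NVIDIA hardware:

# Confirm the expected GPUs are visible before setting --tensor-parallel-size
nvidia-smi --query-gpu=index,name,memory.total --format=csv

# Restrict vLLM to specific GPUs if the node has more than you want to use
export CUDA_VISIBLE_DEVICES=0,1,2,3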

With Speculative Decoding (up to 2.8x speedup)

vllm serve facebook/opt-6.7b \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.8 \
  --speculative-config '{"model": "facebook/opt-125m", "num_speculative_tokens": 5}'

Memory Optimization

# Start with gpu_memory_utilization=0.98
# If OOM, decrease by 1% until stable
# Enable prefix caching for shared system prompts
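As a concrete sketch of that tuning loop (model name reused from the single-GPU example above):

# Start near the top of GPU memory and back off on OOM
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.98 \
  --enable-prefix-caching
# If startup fails with an out-of-memory error, retry with 0.97, 0.96, ... until stable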
Key Insight

vLLM excels at high-throughput, multi-user scenarios. For single-user workloads, llama.cpp may be faster.

1.4 DeepSeek R1 Inference Settings

Source: Unsloth DeepSeek Guide

Recommended Settings

# Recommended settings for DeepSeek R1
llama-server \
  --model deepseek-r1.gguf \
  --temp 0.6 \
  --top-p 0.95 \
  --min-p 0.01 \
  --ctx-size 16384 \
  --n-gpu-layers 99
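The same sampling values can also be supplied per request through llama-server's native /completion endpoint; this sketch assumes the default port 8080.

# Per-request sampling override matching the recommended settings
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Why is the sky blue?",
    "temperature": 0.6,
    "top_p": 0.95,
    "min_p": 0.01,
    "n_predict": 1024
  }'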

Hardware Requirements

Configuration                                                 Performance
Minimum: 64 GB RAM                                            ~1 token/s (CPU only)
Optimal: 180 GB unified memory or 180 GB combined RAM+VRAM    5+ tokens/s
8B distilled version: 20 GB RAM is sufficient                 -

Quick Reference: Essential Flags

Flag               Description                     Default
-m MODEL_PATH      Model file path                 -
-c CONTEXT_SIZE    Context size (KV cache length)  512
-ngl N             GPU layers (99 = all)           0
-b BATCH_SIZE      Batch size for prompt           512
--mlock            Lock model in RAM               off
-fa                Enable flash attention          off
--cont-batching    Continuous batching             off
-t THREADS         CPU threads                     auto
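As a minimal illustration of the short-form flags above (model path is a placeholder):

# Compact llama-server launch using the short flags from the table
./llama-server -m /path/to/model.gguf -c 8192 -ngl 99 -b 1024 -t 4 -fa --mlock --cont-batching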