1.1 Optimized llama.cpp Server Configuration
Source: llama.cpp GitHub Discussions
Production Configuration
```bash
# Production-optimized llama-server launch
./llama-server \
  --model /path/to/model.gguf \
  --n-gpu-layers 99 \
  --ctx-size 32768 \
  --threads 2 \
  --batch-size 1024 \
  --ubatch-size 512 \
  --parallel 4 \
  --cont-batching \
  --mlock \
  --flash-attn \
  --host 0.0.0.0 \
  --port 8080
```
- Counter-intuitively, reducing threads from 32 to 2 can improve performance for GPU-accelerated models
- `--flash-attn` enables Flash Attention optimization (supported by most recent models)
- `--mlock` prevents the OS from swapping the model to disk (use only if you have sufficient RAM/VRAM)
- `--cont-batching` enables continuous batching for better throughput
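Once the server is running, a quick request against its OpenAI-compatible endpoint confirms the configuration is serving correctly; this is a minimal sketch assuming the host/port from the launch command above:

```bash
# Smoke test: llama-server exposes an OpenAI-compatible chat endpoint.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Reply with one short sentence."}],
        "max_tokens": 64
      }'
```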
Known issue: some Qwen2 models can produce gibberish output at context lengths above 8K tokens.
Workaround: enable flash attention (`-fa`) and fully offload the model to the GPU (e.g. `-ngl 80`), as sketched below.
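A minimal sketch of that workaround, assuming a Qwen2 GGUF at a placeholder path and a GPU with enough VRAM for full offload:

```bash
# Workaround sketch: flash attention plus full GPU offload for a Qwen2 model
# (model path is a placeholder; adjust -ngl to cover all layers of your model).
./llama-server \
  --model /path/to/qwen2-7b-instruct.gguf \
  -fa \
  -ngl 80 \
  --ctx-size 16384 \
  --host 0.0.0.0 \
  --port 8080
```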
1.2 Ollama Modelfile for Performance
Source: Ollama Documentation + Ollama Tuning Guide
Modelfile Configuration
```
# Modelfile for optimized inference
FROM llama3.1:8b

# Context window (default 4096)
PARAMETER num_ctx 32768

# Batch size for prompt processing
PARAMETER num_batch 256

# GPU layers to offload (999 = all)
PARAMETER num_gpu 999

# Thread count for CPU operations
PARAMETER num_thread 6

# Temperature for generation
PARAMETER temperature 0.7

# System prompt
SYSTEM You are a helpful AI assistant.
```
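To apply the Modelfile, build a named model from it and run that name (the name `llama3.1-tuned` below is arbitrary):

```bash
# Build a local model from the Modelfile above and start it
ollama create llama3.1-tuned -f Modelfile
ollama run llama3.1-tuned
```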
Runtime Override
```
# Temporarily increase the context window from inside an interactive session
ollama run llama3.1
>>> /set parameter num_ctx 8192
```
Monitor Performance
```bash
# View current context length in use
ollama ps

# Get token generation metrics
ollama run gemma3:12b --verbose
```
Reducing the context window significantly improves performance on consumer GPUs. Use 4K-8K contexts for daily tasks and reserve larger contexts for document processing.
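For one-off large-context jobs, the context size can also be raised per request through Ollama's HTTP API instead of the Modelfile; a sketch assuming the default local endpoint:

```bash
# Raise num_ctx for a single request only (default local Ollama endpoint).
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Summarize the following document: ...",
  "options": { "num_ctx": 16384 },
  "stream": false
}'
```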
1.3 vLLM Production Deployment
Source: vLLM Documentation + Google Cloud vLLM Guide
Single GPU Deployment
```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768 \
  --enable-prefix-caching
```
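To verify the deployment, send a request to the OpenAI-compatible API that vLLM serves; a minimal sketch using the host/port above:

```bash
# Smoke test against vLLM's OpenAI-compatible endpoint.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Reply with one short sentence."}],
        "max_tokens": 64
      }'
```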
Multi-GPU with Tensor Parallelism
```bash
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 32768
```
With Speculative Decoding (up to 2.8x speedup)
```bash
vllm serve facebook/opt-6.7b \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.8 \
  --speculative-config '{"model": "facebook/opt-125m", "num_speculative_tokens": 5}'
```
Memory Optimization
- Start with `--gpu-memory-utilization 0.98`; if the server hits out-of-memory errors, decrease the value in steps of 0.01 until it runs stably.
- Enable prefix caching (`--enable-prefix-caching`) when many requests share a system prompt.
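The loop below is only a sketch of that tuning procedure: it assumes the model and port used earlier and vLLM's `/health` endpoint, and simply retries at progressively lower utilization values until the server starts and reports healthy.

```bash
#!/usr/bin/env bash
# Sketch: probe for the highest stable --gpu-memory-utilization value.
# Assumes vLLM is installed, the model below is available, and port 8000 is free.
MODEL="meta-llama/Llama-3.1-8B-Instruct"
PORT=8000

for util in 0.98 0.97 0.96 0.95 0.94 0.93 0.92 0.91 0.90; do
  echo "Trying --gpu-memory-utilization ${util}"
  vllm serve "$MODEL" --port "$PORT" --gpu-memory-utilization "$util" &
  pid=$!
  # Give the server up to ~3 minutes to load weights and report healthy.
  for _ in $(seq 1 36); do
    sleep 5
    if curl -sf "http://localhost:${PORT}/health" > /dev/null; then
      echo "Stable at ${util}; server left running with PID ${pid}."
      exit 0
    fi
    # If the server process died (e.g. CUDA OOM), try the next lower value.
    kill -0 "$pid" 2>/dev/null || break
  done
  kill "$pid" 2>/dev/null
  wait "$pid" 2>/dev/null
done
echo "No stable value found in the probed range." >&2
exit 1
```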
vLLM excels at high-throughput multi-user scenarios. For single-user, llama.cpp may be faster.
1.4 DeepSeek R1 Inference Settings
Source: Unsloth DeepSeek Guide
Recommended Settings
```bash
# Recommended settings for DeepSeek R1
llama-server \
  --model deepseek-r1.gguf \
  --temp 0.6 \
  --top-p 0.95 \
  --min-p 0.01 \
  --ctx-size 16384 \
  --n-gpu-layers 99
```
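These sampling values can also be overridden per request; the sketch below uses llama-server's native `/completion` endpoint, which accepts `temperature`, `top_p`, and `min_p` fields (the prompt is a placeholder and the port is the server default).

```bash
# Per-request use of the recommended DeepSeek R1 sampling settings
# (assumes llama-server on its default port 8080; prompt is a placeholder).
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "Explain the Pythagorean theorem step by step.",
        "temperature": 0.6,
        "top_p": 0.95,
        "min_p": 0.01,
        "n_predict": 512
      }'
```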
Hardware Requirements
| Configuration | Performance / Notes |
|---|---|
| Minimum: 64GB RAM, no GPU | ~1 token/s |
| Optimal: 180GB unified memory, or 180GB combined RAM+VRAM | 5+ tokens/s |
| 8B distilled version | 20GB RAM is sufficient |
Quick Reference: Essential Flags
| Flag | Description | Default |
|---|---|---|
| `-m MODEL_PATH` | Model file path | - |
| `-c CONTEXT_SIZE` | Context (KV cache) size in tokens | 512 |
| `-ngl N` | GPU layers to offload (99 = all) | 0 |
| `-b BATCH_SIZE` | Batch size for prompt processing | 512 |
| `--mlock` | Lock model in RAM | off |
| `-fa` | Enable flash attention | off |
| `--cont-batching` | Continuous batching | off |
| `-t THREADS` | CPU threads | auto |