Problem: Response becomes gibberish with context >8K tokens
Enable flash attention (-fa) and fully offload the model to the GPU (e.g., -ngl 80)
Source: llama.cpp Discussions
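If you drive llama.cpp from Python instead of the CLI, the llama-cpp-python bindings expose the same switches. A minimal sketch, assuming those bindings are installed; the model path and context size are placeholders for your own setup:

```python
from llama_cpp import Llama

# Flash attention plus full GPU offload, the Python-side equivalent of
# passing -fa and a high -ngl on the command line.
llm = Llama(
    model_path="models/my-model.gguf",  # placeholder path
    n_ctx=16384,        # context window large enough for long prompts
    n_gpu_layers=-1,    # -1 offloads every layer to the GPU
    flash_attn=True,    # same effect as the -fa flag
)

out = llm("Summarize the following document ...", max_tokens=256)
print(out["choices"][0]["text"])
```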
Problem: Model seems to forget earlier conversation
Ollama's default context window is only 2048-4096 tokens, so earlier turns fall out of the window and are silently truncated. Increase it with the num_ctx parameter in your Modelfile or at runtime.
Source: Ollama Context Documentation
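A minimal sketch of raising the window at runtime through the Ollama Python client (the 8192 value and model name are just examples); the same setting can be baked in with `PARAMETER num_ctx 8192` in a Modelfile:

```python
import ollama

# Request a larger context window per call; otherwise Ollama falls back
# to its small default and drops earlier turns.
response = ollama.chat(
    model="llama3",                      # example model name
    messages=[{"role": "user", "content": "Recap our discussion so far."}],
    options={"num_ctx": 8192},           # raise the context window
)
print(response["message"]["content"])
```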
Problem: Retrieved chunks are irrelevant
- Use hybrid search (BM25 + dense)
- Add reranking step
- Reduce chunk size (512 tokens)
- Increase chunk overlap (10-20%)
Source: Anthropic Contextual Retrieval
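The first two fixes combine in a few dozen lines. A minimal sketch of hybrid BM25 + dense scoring followed by a cross-encoder rerank, assuming rank_bm25 and sentence-transformers are installed; the model names, weights, and toy corpus are illustrative, not values from the source:

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder

docs = [
    "vLLM serves LLMs with paged attention.",
    "LoRA fine-tuning adds low-rank adapters.",
    "Chunk overlap helps preserve context across splits.",
]
query = "How do I fine-tune with adapters?"

# Sparse leg: BM25 over whitespace-tokenized chunks.
bm25 = BM25Okapi([d.lower().split() for d in docs])
sparse = np.array(bm25.get_scores(query.lower().split()))

# Dense leg: cosine similarity of normalized sentence embeddings.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example model
doc_emb = encoder.encode(docs, normalize_embeddings=True)
q_emb = encoder.encode([query], normalize_embeddings=True)[0]
dense = doc_emb @ q_emb

# Hybrid score: min-max normalize each leg, then take a weighted sum.
def norm(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

hybrid = 0.5 * norm(sparse) + 0.5 * norm(dense)
candidates = np.argsort(hybrid)[::-1][:3]

# Rerank the short candidate list with a cross-encoder.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model
scores = reranker.predict([(query, docs[i]) for i in candidates])
best = candidates[int(np.argmax(scores))]
print(docs[best])
```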
Problem: Out of memory during serving
- Start at gpu_memory_utilization=0.98
- Decrease by 1% until stable
- Reduce max_model_len
- Enable chunked prefill (V1 default)
Source: vLLM Optimization Guide
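A minimal sketch of those knobs through vLLM's offline Python API; the model name and values are placeholders to tune against your GPU:

```python
from vllm import LLM, SamplingParams

# Trade KV-cache headroom against stability: start high on
# gpu_memory_utilization and step it down if the engine still runs out
# of memory, and cap max_model_len to what the workload actually needs.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    gpu_memory_utilization=0.98,   # step down gradually until stable
    max_model_len=8192,            # smaller than the model max to save memory
    enable_chunked_prefill=True,   # default in the V1 engine
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```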
Problem: Fine-tuned model performs worse on general tasks
- Reduce the LoRA rank (r) - try 8 or 16
- Increase training data diversity
- Use a smaller learning rate
- Apply early stopping with a validation set
Source: Unsloth LoRA Guide
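A minimal sketch of those settings using the Hugging Face peft/transformers stack rather than Unsloth's own API; the hyperparameters are illustrative starting points, not values from the guide:

```python
from peft import LoraConfig
from transformers import TrainingArguments, EarlyStoppingCallback

# Lower-rank adapter: r=8 or 16 keeps the update small and less likely
# to overwrite general capabilities.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical attention projections
    task_type="CAUSAL_LM",
)

# Smaller learning rate plus validation-based checkpoint selection.
training_args = TrainingArguments(
    output_dir="lora-out",
    learning_rate=1e-4,              # conservative vs. common 2e-4 defaults
    num_train_epochs=3,
    eval_strategy="steps",
    eval_steps=50,
    save_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

# Pass this callback to your Trainer / SFTTrainer along with a held-out
# validation split to stop once eval loss stops improving.
early_stop = EarlyStoppingCallback(early_stopping_patience=3)
```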
Problem: AI modifies code you didn't ask to change
Add an explicit constraint to your prompt: "(Do not change anything I did not ask for)"
Source: Cursor Tips