Flash Attention Gibberish

Problem: Responses become gibberish when the context exceeds ~8K tokens

Solution

Enable flash attention (-fa) AND fully offload the model to the GPU (-ngl 80, i.e. enough layers to cover the whole model)

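For example, a llama-server invocation (the model file, context size, and layer count are illustrative; set -ngl at least as high as the model's layer count so every layer sits on the GPU):

    ./llama-server -m models/llama-3.1-70b-q4_k_m.gguf -c 16384 -fa -ngl 80
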
Source: llama.cpp Discussions

Ollama Context Window Confusion

Problem: Model seems to forget earlier conversation

Solution

Ollama's default context window is only 2048-4096 tokens, so earlier turns silently fall out of the window once the conversation grows past it. Increase it with the num_ctx parameter in your Modelfile or at runtime.

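A minimal Modelfile sketch (the base model name and context size are illustrative), plus the command to build it:

    FROM llama3.1
    PARAMETER num_ctx 16384

    ollama create llama3.1-16k -f Modelfile

At runtime, the same parameter can instead be sent per request in the API's "options" field, e.g. {"num_ctx": 16384}.
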
Source: Ollama Context Documentation

RAG Retrieval Failures

Problem: Retrieved chunks are irrelevant

Solutions
  1. Use hybrid search (BM25 + dense embeddings); see the sketch after this list
  2. Add a reranking step
  3. Reduce the chunk size (to roughly 512 tokens)
  4. Increase chunk overlap (to 10-20%)

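A minimal Python sketch of hybrid retrieval (BM25 + dense) fused with reciprocal rank fusion and then reranked by a cross-encoder. It assumes the rank_bm25 and sentence-transformers packages are installed; the documents and model names are illustrative:

    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer, CrossEncoder
    import numpy as np

    docs = [
        "Flash attention reduces memory use for long contexts.",
        "Ollama defaults to a small context window.",
        "Chunk overlap helps preserve context across boundaries.",
    ]
    query = "How do I keep context across chunk boundaries?"

    # Sparse scores: BM25 over whitespace-tokenized text.
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    sparse = bm25.get_scores(query.lower().split())

    # Dense scores: cosine similarity of normalized embeddings.
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    doc_emb = embedder.encode(docs, normalize_embeddings=True)
    q_emb = embedder.encode(query, normalize_embeddings=True)
    dense = doc_emb @ q_emb

    # Reciprocal rank fusion: combine the two rankings without tuning score scales.
    def rrf(scores, k=60):
        out = np.zeros(len(scores))
        out[np.argsort(-scores)] = 1.0 / (k + np.arange(1, len(scores) + 1))
        return out

    fused = rrf(sparse) + rrf(dense)
    candidates = [docs[i] for i in np.argsort(-fused)]

    # Rerank the fused candidates with a cross-encoder before sending them to the LLM.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, d) for d in candidates])
    for s, d in sorted(zip(scores, candidates), reverse=True):
        print(f"{s:.3f}  {d}")
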
Source: Anthropic Contextual Retrieval

vLLM OOM Errors

Problem: Out of memory during serving

Solutions
  1. Start at gpu_memory_utilization=0.98 (see the sketch after this list)
  2. Decrease it in steps of 0.01 until serving is stable
  3. Reduce max_model_len
  4. Enable chunked prefill (on by default in the V1 engine)

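A hedged sketch of these settings with vLLM's offline Python API; the model name and exact values are illustrative, and the same knobs exist as vllm serve flags (--gpu-memory-utilization, --max-model-len):

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen2.5-7B-Instruct",  # example model
        gpu_memory_utilization=0.97,       # started at 0.98, backed off until stable
        max_model_len=8192,                # cap sequence length to cut memory pressure
        enable_chunked_prefill=True,       # already the default on the V1 engine
    )
    out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
    print(out[0].outputs[0].text)
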
Source: vLLM Optimization Guide

LoRA Overfitting

Problem: Fine-tuned model performs worse on general tasks

Solutions
  1. Reduce the rank (r) - try 8 or 16 (see the sketch after this list)
  2. Increase the diversity of the training data
  3. Use a smaller learning rate
  4. Use early stopping with a held-out validation set

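A hedged Python sketch of a lower-capacity LoRA setup using the Hugging Face peft and transformers libraries; the base model, target modules, and hyperparameters are illustrative starting points rather than values from the source:

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM, TrainingArguments

    base = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    lora_cfg = LoraConfig(
        r=8,                                  # smaller rank limits adapter capacity
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # common choice for Llama-style models
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, lora_cfg)
    model.print_trainable_parameters()        # sanity-check how few weights will train

    args = TrainingArguments(
        output_dir="lora-out",
        learning_rate=5e-5,                   # lower than the common 2e-4 starting point
        eval_strategy="steps",                # "evaluation_strategy" on older transformers
        eval_steps=100,
        save_steps=100,
        load_best_model_at_end=True,          # needed for early stopping
        metric_for_best_model="eval_loss",
    )

Pass these arguments to a Trainer together with a held-out eval_dataset and callbacks=[EarlyStoppingCallback(early_stopping_patience=3)] so training stops once validation loss stops improving.
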
Source: Unsloth LoRA Guide

Cursor AI Ignoring Instructions

Problem: The AI modifies code you didn't ask it to change

Solution

Add an explicit constraint to your prompt, e.g. "(Do not change anything I did not ask for)"

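To apply the constraint to every request rather than retyping it, one option (an assumption about your setup, not something stated in the source) is a project rules file such as .cursorrules at the repository root, containing for example:

    Do not change anything I did not ask for.
    Only modify the files and lines the current request explicitly covers.
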
Source: Cursor Tips