Problem: Response becomes gibberish with context >8K tokens
Enable flash attention (-fa) and fully offload the model to the GPU (e.g., -ngl 80)
Source: llama.cpp Discussions
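If you drive llama.cpp from Python instead of the CLI, the llama-cpp-python bindings expose the same switches. A minimal sketch, assuming those bindings are installed; the model path and context size are placeholders for your own setup:

```python
from llama_cpp import Llama

# Flash attention plus full GPU offload, the Python-side equivalent of
# passing -fa and a high -ngl on the command line.
llm = Llama(
    model_path="models/my-model.gguf",  # placeholder path
    n_ctx=16384,        # context window large enough for long prompts
    n_gpu_layers=-1,    # -1 offloads every layer to the GPU
    flash_attn=True,    # same effect as the -fa flag
)

out = llm("Summarize the following document ...", max_tokens=256)
print(out["choices"][0]["text"])
```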
Problem: Model seems to forget earlier conversation
Ollama's default context window is only 2048-4096 tokens, so earlier turns fall out of the window and are silently truncated. Increase it with the num_ctx parameter in your Modelfile or at runtime.
Source: Ollama Context Documentation
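A minimal sketch of raising the window at runtime through the Ollama Python client (the 8192 value and model name are just examples); the same setting can be baked in with `PARAMETER num_ctx 8192` in a Modelfile:

```python
import ollama

# Request a larger context window per call; otherwise Ollama falls back
# to its small default and drops earlier turns.
response = ollama.chat(
    model="llama3",                      # example model name
    messages=[{"role": "user", "content": "Recap our discussion so far."}],
    options={"num_ctx": 8192},           # raise the context window
)
print(response["message"]["content"])
```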
Problem: Retrieved chunks are irrelevant
- Use hybrid search (BM25 + dense)
- Add reranking step
- Reduce chunk size (512 tokens)
- Increase chunk overlap (10-20%)
Source: Anthropic Contextual Retrieval
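The first two fixes combine in a few dozen lines. A minimal sketch of hybrid BM25 + dense scoring followed by a cross-encoder rerank, assuming rank_bm25 and sentence-transformers are installed; the model names, weights, and toy corpus are illustrative, not values from the source:

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder

docs = [
    "vLLM serves LLMs with paged attention.",
    "LoRA fine-tuning adds low-rank adapters.",
    "Chunk overlap helps preserve context across splits.",
]
query = "How do I fine-tune with adapters?"

# Sparse leg: BM25 over whitespace-tokenized chunks.
bm25 = BM25Okapi([d.lower().split() for d in docs])
sparse = np.array(bm25.get_scores(query.lower().split()))

# Dense leg: cosine similarity of normalized sentence embeddings.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example model
doc_emb = encoder.encode(docs, normalize_embeddings=True)
q_emb = encoder.encode([query], normalize_embeddings=True)[0]
dense = doc_emb @ q_emb

# Hybrid score: min-max normalize each leg, then take a weighted sum.
def norm(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

hybrid = 0.5 * norm(sparse) + 0.5 * norm(dense)
candidates = np.argsort(hybrid)[::-1][:3]

# Rerank the short candidate list with a cross-encoder.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model
scores = reranker.predict([(query, docs[i]) for i in candidates])
best = candidates[int(np.argmax(scores))]
print(docs[best])
```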
Problem: Out of memory during serving
- Start at gpu_memory_utilization=0.98
- Decrease by 1% until stable
- Reduce max_model_len
- Enable chunked prefill (V1 default)
Source: vLLM Optimization Guide
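A minimal sketch of those knobs through vLLM's offline Python API; the model name and values are placeholders to tune against your GPU:

```python
from vllm import LLM, SamplingParams

# Trade KV-cache headroom against stability: start high on
# gpu_memory_utilization and step it down if the engine still runs out
# of memory, and cap max_model_len to what the workload actually needs.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    gpu_memory_utilization=0.98,   # step down gradually until stable
    max_model_len=8192,            # smaller than the model max to save memory
    enable_chunked_prefill=True,   # default in the V1 engine
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```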
Problem: Fine-tuned model performs worse on general tasks
- Reduce the LoRA rank (r) - try 8 or 16
- Increase training data diversity
- Use a smaller learning rate
- Apply early stopping with a validation set
Source: Unsloth LoRA Guide
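A minimal sketch of those settings using the Hugging Face peft/transformers stack rather than Unsloth's own API; the hyperparameters are illustrative starting points, not values from the guide:

```python
from peft import LoraConfig
from transformers import TrainingArguments, EarlyStoppingCallback

# Lower-rank adapter: r=8 or 16 keeps the update small and less likely
# to overwrite general capabilities.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical attention projections
    task_type="CAUSAL_LM",
)

# Smaller learning rate plus validation-based checkpoint selection.
training_args = TrainingArguments(
    output_dir="lora-out",
    learning_rate=1e-4,              # conservative vs. common 2e-4 defaults
    num_train_epochs=3,
    eval_strategy="steps",
    eval_steps=50,
    save_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

# Pass this callback to your Trainer / SFTTrainer along with a held-out
# validation split to stop once eval loss stops improving.
early_stop = EarlyStoppingCallback(early_stopping_patience=3)
```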
Problem: AI modifies code you didn't ask to change
Add an explicit constraint to your prompt: "(Do not change anything I did not ask for)"
Source: Cursor Tips