1 Self-RAG Implementation
Architecture
```
Query → Retrieve → Grade Documents → [Relevant?]
                                          │
                      ┌───────────────────┴───────────────────┐
                      ▼                                       ▼
               [Yes: Generate]                    [No: Web Search or Retry]
                      │                                       │
                      └───────────────────┬───────────────────┘
                                          ▼
                                  Grade Generation
                                          │
                      ┌───────────────────┴───────────────────┐
                      ▼                                       ▼
                 [Supported]                           [Hallucination]
                      │                                       │
                      ▼                                       ▼
                   Return                                Regenerate
```
Key Components
- Retrieval evaluator - grades chunk relevance
- Generation grader - checks for hallucinations
- Knowledge refinement - filters irrelevant content
- Fallback mechanisms - web search, regeneration (see the control-loop sketch below)
Source: LangChain Self-RAG Blog
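
The loop above can be sketched in a few dozen lines. This is a minimal illustration, not the blog post's code: it assumes an OpenAI-compatible client, a retriever exposing `get_relevant_documents`, and a hypothetical `web_search_fallback` helper for the no-relevant-documents branch.

```python
# Minimal Self-RAG control loop (illustrative sketch, not the LangChain
# implementation). Model name, prompts, and helpers are placeholders.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder grader/generator model


def llm_yes_no(prompt: str) -> bool:
    """Binary grade from the LLM; expects an answer starting with yes/no."""
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")


def web_search_fallback(query: str) -> str:
    # Placeholder for the "No: Web Search" branch; wire up a real search tool.
    raise NotImplementedError


def self_rag(query: str, retriever, max_retries: int = 2) -> str:
    docs = retriever.get_relevant_documents(query)  # assumed interface
    # Retrieval evaluator: keep only chunks graded relevant
    # (this filtering is the knowledge-refinement step).
    relevant = [
        d for d in docs
        if llm_yes_no(
            f"Question: {query}\nDocument: {d.page_content}\n"
            "Is this document relevant to the question? Answer yes or no."
        )
    ]
    if not relevant:
        return web_search_fallback(query)
    context = "\n\n".join(d.page_content for d in relevant)
    answer = ""
    for _ in range(max_retries + 1):
        answer = client.chat.completions.create(
            model=MODEL,
            messages=[{
                "role": "user",
                "content": f"Answer using only this context:\n{context}\n\nQuestion: {query}",
            }],
        ).choices[0].message.content
        # Generation grader: is the answer supported by the retrieved context?
        if llm_yes_no(
            f"Context: {context}\nAnswer: {answer}\n"
            "Is the answer fully supported by the context? Answer yes or no."
        ):
            return answer  # supported -> return
        # otherwise fall through and regenerate
    return answer  # retries exhausted; flag for human review in production
```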
2 Production LLM Deployment Checklist
Pre-Deployment
- Benchmark model on target hardware
- Test with expected concurrent load (see the load-probe sketch after this list)
- Verify context window requirements
- Set appropriate quantization level
- Configure health check endpoints
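
As a concrete starting point for the first two items, here is a rough load probe. It is a sketch only: the endpoint URL, model name, prompt, and concurrency level are placeholders for your own environment, and it assumes an OpenAI-compatible completions API.

```python
# Rough pre-deployment load probe (sketch): fire N concurrent requests at an
# OpenAI-compatible endpoint and report aggregate tokens/second.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v1/completions"  # assumed local test server
BODY = {"model": "my-model", "prompt": "Explain KV caching.", "max_tokens": 128}


def one_request() -> int:
    r = requests.post(URL, json=BODY, timeout=120)
    r.raise_for_status()
    return r.json()["usage"]["completion_tokens"]


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=16) as pool:  # expected concurrency
    tokens = sum(pool.map(lambda _: one_request(), range(64)))
elapsed = time.perf_counter() - start
print(f"{tokens / elapsed:.1f} tokens/s at 16-way concurrency")
```

Run it at several concurrency levels and watch tail latency alongside throughput before committing to hardware.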
Runtime Configuration
- Set gpu_memory_utilization high enough for KV cache while leaving OOM headroom (see the vLLM sketch after this list)
- Configure batch size based on latency requirements
- Enable continuous batching for throughput
- Set up prompt/prefix caching for common prompts
- Configure rate limiting
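
If vLLM is the serving stack (suggested by the gpu_memory_utilization knob above), these settings map onto engine arguments roughly as below. Every value is a placeholder to tune for your own hardware and latency targets; continuous batching is vLLM's default scheduler, so it needs no flag.

```python
# Runtime-configuration sketch for vLLM; values are placeholders, not
# recommendations. Rate limiting belongs in the gateway, not the engine.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    gpu_memory_utilization=0.90,  # headroom against OOM spikes
    max_model_len=8192,           # cap context to the verified requirement
    max_num_seqs=64,              # concurrency ceiling; lower for tighter latency
    enable_prefix_caching=True,   # reuse KV cache for shared prompt prefixes
)
```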
Monitoring
- Track tokens/second
- Monitor GPU memory usage
- Log request latencies at p50, p95, and p99 (see the tracker sketch below)
- Set up alerts for OOM conditions
- Track queue depth for concurrent requests
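
A minimal in-process tracker for the latency percentiles above (a sketch; production services usually export histograms to Prometheus or similar rather than buffering raw samples like this):

```python
# Minimal latency-percentile tracker (sketch).
import numpy as np


class LatencyTracker:
    def __init__(self) -> None:
        self.samples: list[float] = []

    def record(self, seconds: float) -> None:
        self.samples.append(seconds)

    def report(self) -> dict[str, float]:
        # Compute p50/p95/p99 over all recorded request latencies.
        p50, p95, p99 = np.percentile(self.samples, [50, 95, 99])
        return {"p50_ms": p50 * 1e3, "p95_ms": p95 * 1e3, "p99_ms": p99 * 1e3}


tracker = LatencyTracker()
for latency in (0.12, 0.15, 0.40, 0.95):  # illustrative request latencies
    tracker.record(latency)
print(tracker.report())
```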
Security
- Input validation and sanitization
- Output filtering for sensitive data
- Rate limiting per user/API key (see the token-bucket sketch below)
- Logging for audit trails
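
For per-key rate limiting, a token bucket is the usual primitive. The sketch below keeps buckets in process memory for clarity; real deployments typically enforce this in the API gateway or in Redis so limits hold across replicas.

```python
# Per-API-key token-bucket rate limiter (in-memory sketch).
import time
from collections import defaultdict


class TokenBucket:
    def __init__(self, rate: float, capacity: float) -> None:
        self.rate, self.capacity = rate, capacity  # refill tokens/s, burst size
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


buckets: dict[str, TokenBucket] = defaultdict(
    lambda: TokenBucket(rate=5, capacity=10)  # placeholder limits
)


def check_request(api_key: str) -> bool:
    return buckets[api_key].allow()  # False -> respond with HTTP 429
```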
3 Embedding Model Selection
Decision Matrix
| Use Case | Recommended Model |
|---|---|
| English general purpose | BGE-base-en-v1.5 or E5-base-v2 |
| Multilingual | BGE-M3 or Jina-embeddings-v3 |
| Code search | CodeBERT or StarEncoder |
| Long documents (8K+) | Jina-embeddings-v3 (8192 tokens) |
| Low latency required | all-MiniLM-L6-v2 (fastest) |
| Maximum accuracy | BGE-large-en-v1.5 |
Configuration Tips
- Normalize embeddings so dot product equals cosine similarity (see the encoding sketch below)
- BGE v1.5 models expect the query prefix "Represent this sentence for searching relevant passages: "
- E5 models require "query: " and "passage: " prefixes on every input
- Match embedding dimension to vector DB configuration
Source: Modal Embedding Guide
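
Tying the matrix and tips together, a minimal sentence-transformers sketch; the model choice follows the table above, and the query and passage texts are illustrative.

```python
# Embedding sketch with sentence-transformers: normalized vectors plus the
# BGE v1.5 query instruction.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
BGE_QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

# BGE: prefix queries only; passages are embedded as-is.
# (E5 instead wants "query: " / "passage: " prefixes on both sides.)
query_emb = model.encode(BGE_QUERY_PREFIX + "how do I tune GPU memory?",
                         normalize_embeddings=True)
doc_embs = model.encode(["Set gpu_memory_utilization to leave headroom."],
                        normalize_embeddings=True)

# With normalized embeddings, dot product equals cosine similarity.
print(doc_embs @ query_emb)
```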