1 Self-RAG Implementation

Architecture

Query → Retrieve → Grade Documents → [Relevant?]
                                          │
                      ┌───────────────────┴───────────────────┐
                      ▼                                       ▼
               [Yes: Generate]                    [No: Web Search or Retry]
                      │                                       │
                      └───────────────────┬───────────────────┘
                                          ▼
                                  Grade Generation
                                          │
                      ┌───────────────────┴───────────────────┐
                      ▼                                       ▼
                 [Supported]                           [Hallucination]
                      │                                       │
                      ▼                                       ▼
                   Return                                Regenerate

Key Components

  1. Retrieval evaluator - grades chunk relevance
  2. Generation grader - checks for hallucinations
  3. Knowledge refinement - filters irrelevant content
  4. Fallback mechanisms - web search, regeneration
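
The flow above can be sketched as a small control loop. This is a minimal illustration, not the LangChain implementation: the retriever, graders, generator, and web search are passed in as plain callables, and every name here (self_rag, SelfRAGResult, max_attempts) is a hypothetical chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SelfRAGResult:
    answer: str
    documents: List[str]
    attempts: int
    supported: bool


def self_rag(
    query: str,
    retrieve: Callable[[str], List[str]],
    grade_document: Callable[[str, str], bool],
    generate: Callable[[str, List[str]], str],
    grade_generation: Callable[[str, List[str]], bool],
    web_search: Callable[[str], List[str]],
    max_attempts: int = 3,  # assumed cap so regeneration cannot loop forever
) -> SelfRAGResult:
    # Retrieval evaluator + knowledge refinement: keep only relevant chunks.
    docs = [d for d in retrieve(query) if grade_document(query, d)]

    # Fallback: nothing relevant survived grading, so try web search.
    if not docs:
        docs = [d for d in web_search(query) if grade_document(query, d)]

    answer = ""
    for attempt in range(1, max_attempts + 1):
        answer = generate(query, docs)
        # Generation grader: is the answer supported by the documents?
        if grade_generation(answer, docs):
            return SelfRAGResult(answer, docs, attempt, True)
        # Unsupported answer: treat as hallucination and regenerate.

    # Attempts exhausted: return the last answer, flagged as unsupported.
    return SelfRAGResult(answer, docs, max_attempts, False)
```

In practice each callable would wrap an LLM or search call; keeping them as parameters makes the loop easy to test with stubs and lets the caller decide what to do with an answer flagged as unsupported.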

Source: LangChain Self-RAG Blog

2 Production LLM Deployment Checklist

Pre-Deployment

  • Benchmark model on target hardware
  • Test with expected concurrent load
  • Verify context window requirements
  • Set appropriate quantization level
  • Configure health check endpoints

Runtime Configuration

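  • Set gpu_memory_utilization (in vLLM, the fraction of GPU memory the engine may use for weights and KV cache) to leave headroom for other processes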
  • Configure batch size based on latency requirements
  • Enable continuous batching for throughput
  • Set up caching for common prompts (for example, prefix caching of shared system prompts)
  • Configure rate limiting
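
To make the memory and batch-size items concrete, the arithmetic below estimates how many tokens of KV cache fit on a GPU. It is a rough sketch under stated assumptions: gpu_memory_utilization is treated as a simple cap on total GPU memory (as in vLLM), the per-token formula is the standard one for transformer KV caches, and the model shape and sizes in the example are hypothetical rather than measured.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # Each layer stores one key and one value vector per KV head per token.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_token_capacity(gpu_total_bytes, gpu_memory_utilization,
                            weight_bytes, bytes_per_token, overhead_bytes=0):
    # Memory the engine may use, minus weights and any other fixed overhead.
    budget = gpu_total_bytes * gpu_memory_utilization
    free_for_kv = budget - weight_bytes - overhead_bytes
    if free_for_kv <= 0:
        return 0  # the model does not fit at this setting
    return int(free_for_kv // bytes_per_token)


# Hypothetical example: 24 GiB GPU, 13 GiB of fp16 weights,
# 32 layers x 32 KV heads x head_dim 128.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
capacity = kv_cache_token_capacity(24 * GIB, 0.90, 13 * GIB, per_token)
print(per_token)  # 524288 bytes, i.e. 0.5 MiB per token
print(capacity)   # 17612 tokens shared across all concurrent requests
```

Dividing the capacity by the expected tokens per request (prompt plus output) gives a first estimate of how many requests can be batched concurrently before the cache fills.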

Monitoring

  • Track tokens/second
  • Monitor GPU memory usage
  • Log request latencies (p50, p95, p99)
  • Set up alerts for OOM conditions
  • Track queue depth for concurrent requests
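
The latency and throughput metrics above can be computed from raw request logs roughly as follows. This is a minimal sketch: the sample format (latency in seconds, output tokens) and the function names are assumptions for illustration, and a production setup would normally rely on a metrics library's histograms instead.

```python
import math


def percentile(sorted_values, p):
    # Nearest-rank percentile: the smallest sample such that at least
    # p percent of all samples are less than or equal to it.
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(samples, window_seconds):
    # samples: list of (latency_seconds, output_tokens) for requests that
    # completed within a wall-clock window of window_seconds.
    latencies = sorted(latency for latency, _ in samples)
    total_tokens = sum(tokens for _, tokens in samples)
    return {
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
        "tokens_per_second": total_tokens / window_seconds,
    }
```

Dividing tokens by the wall-clock window, not by summed request latencies, gives aggregate throughput that stays meaningful when requests overlap under continuous batching.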

Security

  • Input validation and sanitization
  • Output filtering for sensitive data
  • Rate limiting per user/API key
  • Logging for audit trails
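
Rate limiting per user or API key is often done with a token bucket. The sketch below is one possible in-memory version, assuming a single process and no thread safety; the class name and parameters are hypothetical, and a multi-instance deployment would need shared state (for example in a cache or gateway).

```python
import time


class TokenBucketLimiter:
    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec   # tokens refilled per second
        self.burst = burst         # maximum tokens a key can accumulate
        self.clock = clock         # injectable clock, useful for tests
        self._state = {}           # api_key -> (tokens, last_timestamp)

    def allow(self, api_key, cost=1.0):
        now = self.clock()
        # A key seen for the first time starts with a full bucket.
        tokens, last = self._state.get(api_key, (self.burst, now))
        # Refill for the elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= cost
        if allowed:
            tokens -= cost
        self._state[api_key] = (tokens, now)
        return allowed
```

A caller would check limiter.allow(api_key) before queuing a request and return an HTTP 429 response when it is False; passing a larger cost for long prompts is one way to meter by tokens instead of by request.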

3 Embedding Model Selection

Decision Matrix

Use Case                            Recommended Model
English general purpose             BGE-base-en-v1.5 or E5-base-v2
Multilingual                        BGE-M3 or Jina-embeddings-v3
Code search                         CodeBERT or StarEncoder
Long documents (up to 8K tokens)    Jina-embeddings-v3 (8192 tokens)
Low latency required                all-MiniLM-L6-v2 (fastest of these)
Maximum accuracy                    BGE-large-en-v1.5

Configuration Tips

  • Normalize embeddings for cosine similarity
  • BGE English v1.5 models expect retrieval queries (not passages) to start with "Represent this sentence for searching relevant passages: "
  • E5 models expect a "query: " or "passage: " prefix on every input
  • Match embedding dimension to vector DB configuration
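
The normalization and dimension tips can be illustrated without any embedding library. In this sketch the vectors are plain Python lists and the helper names are hypothetical; the point is that once vectors are L2-normalized, cosine similarity reduces to a dot product, which is why many vector databases can use inner-product indexes for it.

```python
import math


def l2_normalize(vec):
    # Scale the vector to unit length.
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def check_dimension(vec, index_dim):
    # Fail early if the model's output size does not match the index.
    if len(vec) != index_dim:
        raise ValueError(
            f"embedding has {len(vec)} dims, index expects {index_dim}"
        )
    return vec


def cosine_similarity(a, b):
    # On unit vectors the dot product equals the cosine similarity.
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))
```

Normalizing once at indexing time and once per query avoids recomputing norms on every comparison, and running check_dimension before each insert surfaces a model or configuration mismatch immediately.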

Source: Modal Embedding Guide