2.1 Hybrid Search with BM25 + Dense Retrieval
Source: Anthropic Contextual Retrieval + VectorHub
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
# "documents" is assumed to be a list of LangChain Document objects from your loader/splitter
# Initialize BM25 for keyword (sparse) search
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10
# Initialize dense retriever
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
faiss_vectorstore = FAISS.from_documents(documents, embeddings)
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 10})
# Combine with Reciprocal Rank Fusion
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever],
    weights=[0.4, 0.6],  # adjust weighting based on use case
)
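A minimal usage sketch (the query string is a placeholder): the ensemble retriever runs both retrievers and fuses their result lists with weighted Reciprocal Rank Fusion into a single ranking.
query = "What is the refund policy for enterprise customers?"
fused_docs = ensemble_retriever.get_relevant_documents(query)
for doc in fused_docs[:5]:
    print(doc.metadata.get("source"), doc.page_content[:80])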
Per the Anthropic Contextual Retrieval results cited above, combining BM25 with dense retrieval can reduce failed retrievals by 49%; adding a reranking stage on top increases the reduction to 67% (see 2.3).
2.2 Optimal Chunking Configuration
Source: Stack Overflow Chunking Guide + Weaviate Chunking Strategies
Standard Recursive Chunking
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Recommended production settings
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,        # sweet spot for most use cases (note: with length_function=len this counts characters, not tokens)
    chunk_overlap=50,      # roughly 10-20% of chunk_size
    length_function=len,
    separators=["\n\n", "\n", " ", ""],  # hierarchical fallbacks: paragraphs, lines, words, characters
    is_separator_regex=False,
)
# For code/APIs: chunk by function boundaries
# For policy docs: chunk by section
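A quick usage sketch ("documents" is the same list of Document objects used above):
chunks = text_splitter.split_documents(documents)
print(f"Split {len(documents)} documents into {len(chunks)} chunks")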
Semantic Chunking (Advanced)
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",  # split where the embedding distance between sentences exceeds the chosen percentile
)
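Usage mirrors the standard splitter; a brief sketch:
semantic_chunks = semantic_chunker.split_documents(documents)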
Chunk size should balance context preservation with retrieval precision. 400-800 tokens is generally optimal. Include 10-20% overlap to preserve context at boundaries.
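The splitter above measures chunk_size in characters (length_function=len). To make chunk_size a true token budget, LangChain's from_tiktoken_encoder constructor can be used instead; a minimal sketch, assuming tiktoken is installed (the cl100k_base encoding is an assumption, pick one that matches your embedding model):
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Lengths are computed with the cl100k_base tokenizer, so chunk_size/chunk_overlap are token counts
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""],
)
token_chunks = token_splitter.split_documents(documents)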
2.3 Reranking with Cross-Encoders
Source: Analytics Vidhya
from sentence_transformers import CrossEncoder
# Initialize reranker
reranker = CrossEncoder('BAAI/bge-reranker-base')
def rerank_documents(query, documents, top_k=5):
    """Rerank retrieved documents for precision."""
    pairs = [[query, doc.page_content] for doc in documents]
    scores = reranker.predict(pairs)
    # Sort by cross-encoder score (descending) and keep the top_k
    ranked_docs = sorted(
        zip(documents, scores),
        key=lambda x: x[1],
        reverse=True,
    )
    return [doc for doc, score in ranked_docs[:top_k]]
# Usage: first retrieve a broad candidate set (e.g., top 50), then rerank down to the top 5-10
# (make sure the upstream retriever is configured to return at least that many candidates, e.g., search_kwargs={"k": 50})
initial_docs = retriever.get_relevant_documents(query)[:50]
final_docs = rerank_documents(query, initial_docs, top_k=5)
Cross-encoder reranking is compute-intensive, so a GPU is typically needed for acceptable latency. For an API-based alternative, consider Cohere Rerank.
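A minimal sketch of the hosted alternative using the Cohere Python SDK; the model name and the COHERE_API_KEY environment variable are assumptions, so check Cohere's current documentation before relying on them:
import os
import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

def rerank_with_cohere(query, documents, top_k=5):
    """Same interface as rerank_documents above, but scoring runs server-side."""
    response = co.rerank(
        model="rerank-english-v3.0",  # assumed model name
        query=query,
        documents=[doc.page_content for doc in documents],
        top_n=top_k,
    )
    return [documents[r.index] for r in response.results]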
2.4 ColBERT Late Interaction with RAGatouille
Source: RAGatouille GitHub + Jina ColBERT
Basic ColBERT Setup
from ragatouille import RAGPretrainedModel
# Initialize ColBERT model
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
# Index documents
RAG.index(
    collection=[doc.page_content for doc in documents],
    index_name="my_index",
    max_document_length=256,
    split_documents=True,
)
# Search with late interaction
results = RAG.search(query="your query here", k=5)
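RAGatouille can also expose the ColBERT index as a LangChain-compatible retriever (as_langchain_retriever), which drops into the pipelines from 2.1-2.3; a short sketch with a placeholder query:
colbert_retriever = RAG.as_langchain_retriever(k=5)
colbert_docs = colbert_retriever.get_relevant_documents("your query here")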
Two-Stage Pipeline
# Stage 1: Fast initial retrieval (hundreds to thousands of candidates)
# (raise bm25_retriever.k accordingly; the default of 10 set earlier would cap the candidate pool)
initial_results = bm25_retriever.get_relevant_documents(query)[:1000]
# Stage 2: Precise reranking with ColBERT late interaction (top-K)
final_results = RAG.rerank(
    query=query,
    documents=[doc.page_content for doc in initial_results],
    k=10,
)
ColBERT's late-interaction scoring tends to generalize better to new domains than single-vector dense embeddings, and the models are notably data-efficient to fine-tune.