2.1 Hybrid Search with BM25 + Dense Retrieval
Source: Anthropic Contextual Retrieval + VectorHub
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
# "documents" is assumed to be a list of LangChain Document objects from your loader/splitter
# Initialize BM25 for keyword (sparse) search
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10
# Initialize dense retriever
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
faiss_vectorstore = FAISS.from_documents(documents, embeddings)
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 10})
# Combine with Reciprocal Rank Fusion
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever],
    weights=[0.4, 0.6],  # adjust weighting based on use case
)
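A minimal usage sketch (the query string is a placeholder): the ensemble retriever runs both retrievers and fuses their result lists with weighted Reciprocal Rank Fusion into a single ranking.
query = "What is the refund policy for enterprise customers?"
fused_docs = ensemble_retriever.get_relevant_documents(query)
for doc in fused_docs[:5]:
    print(doc.metadata.get("source"), doc.page_content[:80])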
Per the Anthropic Contextual Retrieval results cited above, combining BM25 with dense retrieval can reduce failed retrievals by 49%; adding a reranking stage on top increases the reduction to 67% (see 2.3).
2.2 Optimal Chunking Configuration
Source: Stack Overflow Chunking Guide + Weaviate Chunking Strategies
Standard Recursive Chunking
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Recommended production settings
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,        # sweet spot for most use cases (note: with length_function=len this counts characters, not tokens)
    chunk_overlap=50,      # roughly 10-20% of chunk_size
    length_function=len,
    separators=["\n\n", "\n", " ", ""],  # hierarchical fallbacks: paragraphs, lines, words, characters
    is_separator_regex=False,
)
# For code/APIs: chunk by function boundaries
# For policy docs: chunk by section
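A quick usage sketch ("documents" is the same list of Document objects used above):
chunks = text_splitter.split_documents(documents)
print(f"Split {len(documents)} documents into {len(chunks)} chunks")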
Semantic Chunking (Advanced)
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",  # split where the embedding distance between sentences exceeds the chosen percentile
)
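Usage mirrors the standard splitter; a brief sketch:
semantic_chunks = semantic_chunker.split_documents(documents)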
Chunk size should balance context preservation with retrieval precision. 400-800 tokens is generally optimal. Include 10-20% overlap to preserve context at boundaries.
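The splitter above measures chunk_size in characters (length_function=len). To make chunk_size a true token budget, LangChain's from_tiktoken_encoder constructor can be used instead; a minimal sketch, assuming tiktoken is installed (the cl100k_base encoding is an assumption, pick one that matches your embedding model):
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Lengths are computed with the cl100k_base tokenizer, so chunk_size/chunk_overlap are token counts
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""],
)
token_chunks = token_splitter.split_documents(documents)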
2.3 Reranking with Cross-Encoders
Source: Analytics Vidhya
from sentence_transformers import CrossEncoder
# Initialize reranker
reranker = CrossEncoder('BAAI/bge-reranker-base')
def rerank_documents(query, documents, top_k=5):
    """Rerank retrieved documents for precision."""
    pairs = [[query, doc.page_content] for doc in documents]
    scores = reranker.predict(pairs)
    # Sort by cross-encoder score (descending) and keep the top_k
    ranked_docs = sorted(
        zip(documents, scores),
        key=lambda x: x[1],
        reverse=True,
    )
    return [doc for doc, score in ranked_docs[:top_k]]
# Usage: first retrieve a broad candidate set (e.g., top 50), then rerank down to the top 5-10
# (make sure the upstream retriever is configured to return at least that many candidates, e.g., search_kwargs={"k": 50})
initial_docs = retriever.get_relevant_documents(query)[:50]
final_docs = rerank_documents(query, initial_docs, top_k=5)
Cross-encoder reranking is compute-intensive, so a GPU is typically needed for acceptable latency. For an API-based alternative, consider Cohere Rerank.
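A minimal sketch of the hosted alternative using the Cohere Python SDK; the model name and the COHERE_API_KEY environment variable are assumptions, so check Cohere's current documentation before relying on them:
import os
import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

def rerank_with_cohere(query, documents, top_k=5):
    """Same interface as rerank_documents above, but scoring runs server-side."""
    response = co.rerank(
        model="rerank-english-v3.0",  # assumed model name
        query=query,
        documents=[doc.page_content for doc in documents],
        top_n=top_k,
    )
    return [documents[r.index] for r in response.results]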
2.4 ColBERT Late Interaction with RAGatouille
Source: RAGatouille GitHub + Jina ColBERT
Basic ColBERT Setup
from ragatouille import RAGPretrainedModel
# Initialize ColBERT model
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
# Index documents
RAG.index(
    collection=[doc.page_content for doc in documents],
    index_name="my_index",
    max_document_length=256,
    split_documents=True,
)
# Search with late interaction
results = RAG.search(query="your query here", k=5)
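RAGatouille can also expose the ColBERT index as a LangChain-compatible retriever (as_langchain_retriever), which drops into the pipelines from 2.1-2.3; a short sketch with a placeholder query:
colbert_retriever = RAG.as_langchain_retriever(k=5)
colbert_docs = colbert_retriever.get_relevant_documents("your query here")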
Two-Stage Pipeline
# Stage 1: Fast initial retrieval (hundreds to thousands of candidates)
# (raise bm25_retriever.k accordingly; the default of 10 set earlier would cap the candidate pool)
initial_results = bm25_retriever.get_relevant_documents(query)[:1000]
# Stage 2: Precise reranking with ColBERT late interaction (top-K)
final_results = RAG.rerank(
    query=query,
    documents=[doc.page_content for doc in initial_results],
    k=10,
)
ColBERT's late-interaction scoring tends to generalize better to new domains than single-vector dense embeddings, and the models are notably data-efficient to fine-tune.