Philip Jama


Embeddings, Retrieval, and RAG

Vector representations, semantic search, and grounding language models in external knowledge.

NLP · RAG · Embeddings · Vector Search · Python

Language models encode knowledge in their parameters during pre-training. This parametric memory is powerful but has fundamental limits: it is frozen at training time, difficult to update, and prone to confident fabrication when the answer is not in the weights. Retrieval-augmented generation (RAG) addresses these limits by separating knowledge storage from reasoning: the model retrieves relevant documents at inference time and reasons over them in context.

From Word to Document Embeddings

Word2Vec and GloVe produced word-level embeddings. Modern embedding models (trained with contrastive learning objectives) produce dense vector representations for sentences and documents. These embeddings map semantic similarity to geometric proximity: texts about similar topics cluster together in high-dimensional space, and cosine similarity between vectors serves as a practical measure of semantic relatedness.
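As a minimal illustration of "similarity as geometry", the sketch below computes cosine similarity between hand-made toy vectors. The values are invented for illustration, not real model outputs; actual embeddings typically have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings": the first two point in similar directions.
doc_cats = np.array([0.9, 0.1, 0.0, 0.2])
doc_felines = np.array([0.8, 0.2, 0.1, 0.3])
doc_finance = np.array([0.0, 0.9, 0.8, 0.1])

print(cosine_similarity(doc_cats, doc_felines))  # high: related topics
print(cosine_similarity(doc_cats, doc_finance))  # low: unrelated topics
```

Because cosine similarity ignores vector length, it compares direction only, which is why most embedding pipelines normalize vectors and then treat the dot product as the similarity score.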

The scatter plot below shows synthetic document embeddings projected into two dimensions, colored by topic. The query point (star) connects to its nearest neighbors, illustrating how semantic search retrieves related documents.

Figure: Document embeddings colored by topic, with the query point (star) and its nearest neighbors.

Exact nearest-neighbor search over millions of vectors is expensive. Approximate nearest-neighbor (ANN) algorithms such as HNSW, IVF, and product quantization trade a small accuracy loss for orders-of-magnitude speedup. Vector databases (Pinecone, Weaviate, Qdrant, pgvector) wrap these algorithms with indexing, filtering, and persistence, making semantic search a commodity infrastructure component.
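To make the cost concrete, the sketch below implements the exact brute-force baseline over a synthetic corpus: one O(N·d) scan per query. This is the computation that ANN indexes such as HNSW and IVF approximate at a fraction of the cost; the corpus here is random data, purely for illustration.

```python
import numpy as np

def top_k_cosine(query: np.ndarray, corpus: np.ndarray, k: int = 3) -> np.ndarray:
    """Exact top-k retrieval by cosine similarity: one full O(N * d) scan per query."""
    # Normalize so that a plain dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = c @ q                  # similarity of the query to every document
    return np.argsort(-sims)[:k]  # indices of the k most similar rows

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 64))           # 10k synthetic 64-d embeddings
query = corpus[42] + 0.01 * rng.normal(size=64)  # a query close to document 42
hits = top_k_cosine(query, corpus, k=3)
print(hits)  # document 42 comes back first
```

At 10,000 vectors this scan is instant; at hundreds of millions it is not, which is the gap ANN indexes and vector databases exist to close.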

The RAG Architecture

The RAG pipeline has three stages: encode the query into an embedding, retrieve the top-k most similar chunks from a vector store, and assemble the retrieved text into the model's context window for generation. This separation of concerns (the retriever handles recall, the generator handles synthesis) means knowledge can be updated by re-indexing documents without retraining the model.
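The three stages can be sketched end to end. The code below is a toy: a bag-of-words vector stands in for the embedding model, the chunks are invented, and the "prompt" is simply assembled text rather than a call to a real LLM; a production system would swap in an embedding model, a vector store, and a generator.

```python
import re
import numpy as np

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

class TinyRetriever:
    """Toy retriever: normalized term-count vectors stand in for learned embeddings."""

    def __init__(self, chunks: list[str]):
        self.chunks = chunks
        vocab = sorted({t for c in chunks for t in tokenize(c)})
        self.vocab = {t: i for i, t in enumerate(vocab)}
        self.matrix = np.stack([self._embed(c) for c in chunks])

    def _embed(self, text: str) -> np.ndarray:
        """Stage 1: encode text as a normalized term-count vector."""
        v = np.zeros(len(self.vocab))
        for t in tokenize(text):
            if t in self.vocab:
                v[self.vocab[t]] += 1.0
        n = np.linalg.norm(v)
        return v / n if n else v

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        """Stage 2: score every chunk against the query, return the top-k."""
        sims = self.matrix @ self._embed(query)
        return [self.chunks[i] for i in np.argsort(-sims)[:k]]

def build_prompt(retriever: TinyRetriever, query: str, k: int = 2) -> str:
    """Stage 3: assemble the retrieved chunks into the generator's context."""
    context = "\n\n".join(retriever.retrieve(query, k))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

chunks = [
    "Refund requests must be filed within 30 days of purchase.",
    "The engineering team deploys to production on Fridays.",
    "Passwords must be rotated every 90 days.",
]
retriever = TinyRetriever(chunks)
prompt = build_prompt(retriever, "How do I request a refund after my purchase?", k=1)
print(prompt)
```

Note that the refund chunk wins on the shared terms "refund" and "purchase" alone; everything downstream of retrieval, including the prompt template, is independent of how the vectors were produced, which is what makes the retriever swappable.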

Chunking and Retrieval Quality

How documents are split into chunks determines retrieval quality. Small chunks (100–200 tokens) enable precise retrieval but lose surrounding context. Large chunks (500–1000 tokens) preserve context but dilute relevance. Overlapping windows, hierarchical chunking, and parent-document retrieval are strategies for managing this trade-off. The right chunking strategy depends on the document type and the kinds of questions the system needs to answer.
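The simplest of these strategies, overlapping windows, fits in a few lines. The sizes below are placeholders for illustration, not a recommendation; the overlap exists so a sentence straddling a chunk boundary appears whole in at least one chunk, at the cost of some index redundancy.

```python
def chunk_tokens(tokens: list[str], size: int = 200, overlap: int = 50) -> list[list[str]]:
    """Split a token sequence into fixed-size sliding windows with overlap."""
    if not 0 <= overlap < size:
        raise ValueError("need 0 <= overlap < size")
    step = size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # last window reached the end
            break
    return chunks

doc = [f"tok{i}" for i in range(500)]          # stand-in for a tokenized document
windows = chunk_tokens(doc, size=200, overlap=50)
print(len(windows))  # 3 windows: [0:200], [150:350], [300:500]
```

Real pipelines usually chunk on sentence or section boundaries rather than raw token counts, but the window/stride trade-off is the same.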

The following chart illustrates the tension between retrieval precision and answer quality as chunk size varies, with a sweet spot in the 150–300 token range.

Figure: Retrieval precision and answer quality vs. chunk size.

For a deeper treatment of how information degrades across processing stages (summarization, chunking, and multi-hop retrieval), see Fidelity in LLM Information Processing. For graph-structured retrieval as an alternative to flat chunking, see GraphRAG.

Enterprise Knowledge Bases

RAG made it practical to build Q&A systems over proprietary documents without fine-tuning a model on internal data. Customer support teams, legal departments, and engineering organizations use RAG to surface answers from policy manuals, case law, and internal documentation. The production challenges are less about the model and more about retrieval: ensuring the right chunks surface for a given question, handling documents that update frequently, and managing the quality feedback loop between retrieval precision and answer accuracy.

RAG improves factual grounding, but the model's behavior itself matters. The next article examines how fine-tuning and alignment shape what language models do with the knowledge they have.

