RAG Easily Explained

Cos · 4 min read · Nov 11, 2023


Retrieval-Augmented Generation (RAG) is a technique in natural language processing that combines the capabilities of pre-trained language models like GPT with a retrieval-based system. Using RAG, we can enhance a model’s performance by retrieving relevant information from a large external database of text and integrating that information into the model’s response generation process.

Imagine GPT as a smart student and RAG as its helpful friend who has access to a big library of books (which in this case, is a database of information snippets). Now, GPT is smart, but it doesn’t know everything off the top of its head — stuff like very niche topics or internal documentation has to be added in or ‘augmented’. This is where RAG comes in. RAG quickly searches through a database we set up (like flipping through a lot of books) to find relevant information about the specific themes or topics in GPT’s prompt.

How RAG works

1. Document Embeddings:

Documents are split into snippets and transformed into embeddings. These are high-dimensional vectors representing the semantic content of the text. Each dimension in the vector captures certain aspects of the text’s meaning.
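As a rough sketch of this step, here is a minimal chunker plus a toy embedding function. The hash-based `embed` is only a stand-in so the example runs on its own; a real system would call an embedding model (for example OpenAI’s embedding API or a sentence-transformers model) to get semantically meaningful vectors:

```python
import hashlib
import math

def chunk_document(text, chunk_size=200):
    """Split a document into snippets of roughly chunk_size characters,
    breaking on whitespace so words stay intact."""
    words, chunks, current = text.split(), [], ""
    for word in words:
        if current and len(current) + len(word) + 1 > chunk_size:
            chunks.append(current)
            current = word
        else:
            current = f"{current} {word}".strip()
    if current:
        chunks.append(current)
    return chunks

def embed(text, dims=8):
    """Toy stand-in for a real embedding model: hashes each word into one
    of `dims` buckets and L2-normalizes the counts. Real embeddings are
    learned by a neural model and capture semantics, not word identity."""
    vec = [0.0] * dims
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]
```

Swapping `embed` for a real model leaves the rest of the pipeline unchanged, which is why the chunk-then-embed structure is worth getting right early.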

2. Vector Database:

These embeddings are stored in a vector database, which allows for efficient similarity searches. The database maintains a link between each embedding and the original raw-text document it represents.
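As a toy illustration, here is a minimal in-memory vector store that keeps each embedding paired with the raw snippet it came from. It stands in for a real vector database such as FAISS, Pinecone, or pgvector, minus the approximate-nearest-neighbor indexing that makes those systems fast at scale:

```python
import math

class VectorStore:
    """Minimal in-memory vector store: keeps each embedding alongside the
    raw text it represents, and supports brute-force similarity search."""

    def __init__(self):
        self._vectors = []  # list of (embedding, raw_text) pairs

    def add(self, embedding, raw_text):
        self._vectors.append((list(embedding), raw_text))

    def search(self, query_embedding, k=3):
        """Return the top-k (score, raw_text) pairs by cosine similarity."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(y * y for y in b))
            return dot / (na * nb) if na and nb else 0.0

        scored = [(cosine(query_embedding, e), t) for e, t in self._vectors]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

The key design point survives even in this sketch: the store must keep the link from vector back to raw text, because the raw text (not the vector) is what gets fed to GPT later.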

3. Prompt Processing:

When a user inputs a prompt, it is also converted into an embedding using the same embedding model applied to the document snippets (the vectors are only comparable if they come from the same model). This produces a vector that represents the semantic meaning of the user’s query.

4. Embedding Comparison:

The user’s prompt embedding is then compared with the embeddings in the vector database. This comparison is typically done using similarity metrics that can determine which document embeddings are most similar to the query embedding.

5. Selection of Similar Embeddings:

The embeddings that are most similar to the prompt’s embedding are identified. The corresponding raw-text documents linked to these embeddings are retrieved.

6. Integration with GPT:

The selected raw-text documents are then used to augment the original prompt. However, this integration isn’t as simple as just adding raw text to the prompt. The system might process and condense the information or use specific parts of the text that are most relevant to the query.

This augmented input (original prompt plus context from retrieved documents) is then fed into GPT, which generates a response based on both its pre-trained knowledge and the additional context provided.
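Here is a minimal sketch of that augmentation step. The prompt template and the character limit are illustrative assumptions, not a fixed RAG convention; a production system might re-rank, summarize, or otherwise condense the retrieved text instead of simply truncating it:

```python
def build_augmented_prompt(user_prompt, retrieved_snippets, max_chars=1500):
    """Assemble the augmented input: retrieved context first, then the
    original question. Truncation is a crude stand-in for the condensing
    a production system might do."""
    context = "\n\n".join(retrieved_snippets)[:max_chars]
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_prompt}\n"
        "Answer:"
    )
```

The resulting string is what actually gets sent to GPT, so the model sees both the retrieved facts and the original question in a single input.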

It’s important to note the complexity and sophistication involved in each step, especially in terms of how text is transformed into embeddings and how these embeddings are used to augment GPT’s capabilities.

In the context of Retrieval-Augmented Generation (RAG) for models like GPT, the selection of similarity metrics and the number of “most similar” embeddings chosen are crucial for the system’s effectiveness. Here’s a breakdown:

Similarity Metrics

The similarity metrics used in RAG are designed to measure how close or relevant the embeddings of the document snippets are to the embedding of the user’s prompt. Commonly used metrics include:

Cosine Similarity

This is perhaps the most popular metric in text embedding similarity tasks. Cosine similarity measures the cosine of the angle between two vectors in a multi-dimensional space. It essentially assesses how similar the directions of two vectors are. The value ranges from -1 to 1, where 1 means exactly the same direction (highly similar), 0 indicates orthogonality (no similarity), and -1 indicates completely opposite directions.

Euclidean Distance

Less common for text embeddings but still used, Euclidean distance measures the “straight-line” distance between two points (or vectors) in multi-dimensional space. Unlike cosine similarity, where higher values mean more similar, a smaller Euclidean distance indicates higher similarity.

Dot Product

The dot product of two vectors can also serve as a similarity measure. On its own it is sensitive to vector magnitude, so longer vectors score higher regardless of direction; once the vectors are normalized to unit length, the dot product becomes exactly equivalent to cosine similarity.
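All three metrics are small enough to write out directly, which also makes the relationship between them concrete (pure Python, no libraries):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction,
    0 = orthogonal, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def euclidean_distance(a, b):
    """Straight-line distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot_product(a, b):
    """Raw dot product: equals cosine similarity when both vectors
    have unit length, but is magnitude-sensitive otherwise."""
    return sum(x * y for x, y in zip(a, b))
```

On unit vectors the dot product and cosine similarity return identical scores, which is why many vector databases normalize embeddings at insert time and then use the cheaper dot product for search.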

Number of “Most Similar” Embeddings

The number of embeddings selected as “most similar” depends on the specific implementation and requirements of the RAG system. Factors influencing this number include:

  1. Performance Considerations: Selecting more embeddings can provide more information but at the cost of increased computational resources and processing time.
  2. Quality of Results: More embeddings might improve the quality of the generated response up to a point, but beyond that, it may introduce noise or irrelevant information.
  3. Typical Range: In practice, systems may retrieve a handful to several dozen of the most similar embeddings. The exact number can be tuned based on experimental results to optimize the balance between response quality and computational efficiency.
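The tunable `k` described above can be sketched as a simple top-k retrieval, here using cosine similarity and Python’s `heapq.nlargest` so the whole database never needs a full sort (the data layout, a list of `(embedding, raw_text)` pairs, is an illustrative assumption):

```python
import heapq

def top_k_snippets(query_embedding, indexed_snippets, k=5):
    """Return the k (embedding, raw_text) pairs most similar to the query
    by cosine similarity, ordered best-first."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    return heapq.nlargest(k, indexed_snippets,
                          key=lambda pair: cosine(query_embedding, pair[0]))
```

Tuning is then just a matter of sweeping `k` and measuring response quality against latency, as described above.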

In summary, the similarity metrics and the number of embeddings selected are key factors that influence how well a RAG system can augment a model like GPT.

RAG is useful because it allows models to produce responses that are not only based on their pre-trained knowledge but also informed by additional, potentially more recent or specific information, leading to more accurate, informative, and contextually relevant answers.

Written by Cos

Founder at Fundation. Machine Learning Nerd. Bitcoin Maxi. Excited about the way technology can improve the world.
