What is Retrieval-Augmented Generation (RAG)?


Retrieval-augmented generation (RAG) is a technique in which relevant content (typically text) is retrieved and pre-processed to augment the prompt passed to a large language model (LLM). This gives the model access to up-to-date knowledge, improves accuracy, and reduces the risk of hallucinations.


How does retrieval-augmented generation work?


Different projects might choose different architectures and setups for their "RAG pipelines," but at a high level they all follow the same basic steps, outlined below.

Storing information in a retrievable fashion


To retrieve relevant information for RAG, the stored information needs to be searchable. The most common approach is to use a vector database. To prepare for this, information that is too large is first split into smaller "chunks," as sketched below.
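As an illustration, here is a minimal sketch of fixed-size chunking with overlap. The chunk size and overlap values are arbitrary assumptions; production pipelines often split on sentence or section boundaries instead.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks.

    Overlap helps preserve context that would otherwise be cut
    at a chunk boundary. Assumes chunk_size > overlap.
    """
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks
```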

From there, each chunk is turned into an "embedding" using an embedding model. At a high level, this is a model that turns text (or other data) into a series of numbers (called a vector) that represents the content. These vectors can later be used to compare the similarity of different pieces of content.
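For example, using the open-source sentence-transformers library (one of many embedding options; the model name below is just a common default, not something this page prescribes):

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small, widely used embedding model

# Turn two pieces of text into vectors.
vec_a, vec_b = model.encode(["Twilio sends SMS.", "Twilio delivers text messages."])

# Cosine similarity: values close to 1.0 mean the texts are semantically similar.
similarity = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(similarity)
```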

Each vector, along with the raw information, is then stored in a database that is optimized for similarity search across vectors. This indexing process typically happens separately, on a regular schedule, and is not part of the actual RAG request flow.
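A real deployment would use a dedicated vector database, but as a sketch of what gets stored, here is a toy in-memory index that keeps each vector next to its raw text. The class and its methods are purely illustrative assumptions, not a real database API.

```python
import numpy as np

class ToyVectorStore:
    """Illustrative stand-in for a vector database: stores (vector, text) pairs."""

    def __init__(self):
        self.vectors: list[np.ndarray] = []
        self.texts: list[str] = []

    def add(self, vector: np.ndarray, text: str) -> None:
        # Normalize once so similarity search later reduces to a dot product.
        self.vectors.append(vector / np.linalg.norm(vector))
        self.texts.append(text)
```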

Retrieval of relevant documentation


Once the data is stored, when you want an LLM to answer a question you first retrieve relevant data from your vector database. This is done by turning the question into a vector using the same embedding model that processed the original information, then performing a "similarity search" in the vector database to find the most similar chunks of information.
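Continuing the toy example above, a brute-force similarity search over the stored vectors might look like the sketch below. Real vector databases use approximate nearest-neighbor indexes to make this fast at scale; the function and parameter names here are illustrative assumptions.

```python
import numpy as np

def similarity_search(query_vec: np.ndarray,
                      stored_vecs: list[np.ndarray],
                      stored_texts: list[str],
                      top_k: int = 3) -> list[str]:
    """Return the top_k stored texts most similar to the query vector.

    Assumes all vectors are already L2-normalized, so cosine
    similarity reduces to a dot product.
    """
    scores = [float(np.dot(query_vec, v)) for v in stored_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [stored_texts[i] for i in ranked[:top_k]]
```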

Augment the prompt with the data


After the data is retrieved, it is processed to augment the prompt you send to the LLM. This might include additional steps such as filtering the retrieved information further or summarizing it with another LLM.
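One common approach is simple string templating, sketched below. The exact wording of the instructions is an arbitrary choice, not a prescribed format.

```python
def build_augmented_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Combine retrieved context and the user's question into one prompt."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```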

Generation of the answer


Lastly, the augmented prompt, including the original question, is sent to the LLM to generate the final response. Including relevant information through RAG increases the likelihood of an accurate answer and gives the model access to timely data it might not have been trained on.
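As one concrete example, here is what that final call might look like using the OpenAI Python SDK. Any LLM provider works here, and the model name is just an illustration.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_answer(augmented_prompt: str) -> str:
    """Send the augmented prompt to an LLM and return its reply."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat-capable model works here
        messages=[{"role": "user", "content": augmented_prompt}],
    )
    return response.choices[0].message.content
```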


Example of how Twilio uses RAG internally


Twilio uses RAG for a multitude of applications. One is the RFP Genie tool that Twilio built to help our sales teams fill out RFPs. For this, relevant Twilio information, such as documentation on various topics, was placed into a vector database, and the team built a RAG system on top of it to generate answers to questions, ensuring Twilio's sales team always gets accurate and up-to-date information.
