Retrieval-augmented generation (RAG) is a technique in which relevant content (typically text) is retrieved and pre-processed to augment the prompt passed to a Large Language Model (LLM). This provides up-to-date knowledge to the model, improves accuracy, and reduces the risk of hallucinations.
Different projects might choose different architectures and setups for their "RAG pipelines", but at a high level they all follow similar steps, which are outlined below.
In order to retrieve relevant information for RAG, the stored information needs to be searchable. The most common approach for this is to use a vector database. To prepare for this, information that is too large to process as a whole is first split into "chunks".
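As a rough illustration, here is a minimal fixed-size chunking helper in Python. The chunk size, overlap value, and function name are arbitrary choices for this sketch; real pipelines often split on sentence or paragraph boundaries instead:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks.

    Overlap helps preserve context that would otherwise be
    cut in half at a chunk boundary.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```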
From there, each chunk is turned into an "embedding" using an embedding model. At a high level, this is a model that turns text (or other data) into a series of numbers (called a vector) that represents the content. These vectors can later be compared to measure how similar different pieces of content are.
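As a sketch of this step, the snippet below uses OpenAI's embeddings API as one example of an embedding model. The model name shown is just one option; any embedding model works, as long as the same one is used at both ingestion and query time:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    """Turn a piece of text into an embedding vector."""
    response = client.embeddings.create(
        model="text-embedding-3-small",  # example model; others work too
        input=text,
    )
    return response.data[0].embedding
```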
Each vector, along with the raw content it represents, is then stored in a database that is optimized for similarity searching across vectors. This entire ingestion process typically happens separately on a regular basis and is not part of the actual RAG request flow.
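To keep the example self-contained, the sketch below stands in for a vector database with a plain in-memory list and reuses the hypothetical chunk_text and embed helpers from the previous snippets; a real deployment would use a dedicated vector store instead:

```python
import numpy as np

# Stand-in for a vector database: each entry pairs a raw chunk of
# text with its embedding. A real system would use a dedicated store.
index: list[tuple[str, np.ndarray]] = []

def ingest(document: str) -> None:
    """Chunk a document, embed each chunk, and store both."""
    for chunk in chunk_text(document):
        index.append((chunk, np.array(embed(chunk))))
```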
Once the data is stored, generating an answer to a question with an LLM starts with retrieving relevant data from your vector database. This is done by turning the question into a vector using the same embedding model that was used to process the original information. The vector database then performs a "similarity search" to find the most similar chunks of information.
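Continuing the sketch, a similarity search over the in-memory index can be approximated with cosine similarity; a real vector database performs the equivalent lookup far more efficiently:

```python
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(question: str, top_k: int = 3) -> list[str]:
    """Embed the question and return the most similar stored chunks."""
    query = np.array(embed(question))
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:top_k]]
```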
After the data is retrieved, it is processed to augment the prompt you want to send to the LLM. This might include additional steps such as further filtering the retrieved information or summarizing it with another LLM.
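One simple way to do this, assuming the hypothetical helpers above, is to join the retrieved chunks into a context block ahead of the question. The template wording here is just an example:

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine the retrieved chunks and the question into one prompt."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```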
Lastly, the now-augmented prompt, including the original question, is sent to the LLM to generate the final response. Including the relevant information through RAG increases the likelihood of an accurate answer and gives the model access to timely data it might not have been trained on.
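Tying the sketch together, the final step sends the augmented prompt to a chat model (the model name below is just an example) and returns its answer:

```python
def answer(question: str) -> str:
    """Run the full RAG loop: retrieve, augment, generate."""
    prompt = build_prompt(question, search(question))
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model; any chat model works
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example usage, assuming ingest() has been run on some documents:
# print(answer("How does feature X work?"))
```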
Twilio uses RAG for a multitude of applications. One is the RFP Genie tool that Twilio built to help our sales teams fill out RFPs. For this, all relevant Twilio information, such as documentation on various topics, was placed into a vector database, and the team built a RAG system on top of it that generates responses to questions, ensuring Twilio's sales team always gets accurate and up-to-date information.