By Paulo Nunes

Chatting with Your Data: The Potential of Retrieval-Augmented Generation (RAG)

[Image: Retrieval-Augmented Generation]

Tags: Large Language Models, Information Retrieval, Retrieval-Augmented Generation

The rapid advancements in Generative AI, and specifically in Large Language Models, have given birth to numerous technologies, some of them promising to revolutionize the way we interact with data.

 

In this article, we look at the potential for a direct dialogue with data, in particular through Retrieval-Augmented Generation, commonly known as RAG.

 

 

What is Retrieval-Augmented Generation?

 

At this point you’re probably wondering: what is Retrieval-Augmented Generation (RAG)? In layman's terms, Retrieval-Augmented Generation is a method that lets applications "chat" with your data, almost like conversing with a well-informed librarian who's always up to date.

 

Retrieval-Augmented Generation gets its name from “augmenting” a Generative Model (such as a Large Language Model, or LLM) with information retrieved from specific databases, knowledge bases or other sources.

 

Let's break this down a bit more.

 

 

Information Retrieval

 

RAG relies on a technique called “Information Retrieval” or, more specifically, “Semantic Search”. Semantic search uses word embeddings to represent text. If you’re scratching your head about what word embeddings are, just know that they’re vector representations of words: words that are semantically close get similar vectors in a Euclidean space.

 

This means that semantically close words will be “close” to each other in that space, or have similar vectors. So, when a user enters a question, it can be converted into a vector, and a vector similarity search can then be run against a database or search engine to return semantically similar results.
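To make this concrete, here is a minimal sketch of semantic search in Python. It assumes the sentence-transformers library and the "all-MiniLM-L6-v2" model purely for illustration; any embeddings model works the same way: encode texts into vectors, then rank them by cosine similarity to the encoded question.

```python
# Minimal semantic search sketch: embed a few documents and a query,
# then rank the documents by cosine similarity to the query vector.
# Assumes the sentence-transformers library; any embeddings model works similarly.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Our refund policy allows returns within 30 days.",
    "The office is closed on public holidays.",
    "Shipping usually takes 3 to 5 business days.",
]
query = "How long do I have to return a product?"

# Encode documents and the query into vectors in the same embedding space.
doc_vectors = model.encode(documents)
query_vector = model.encode([query])[0]

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 means semantically similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically close texts get similar vectors, so the refund document
# should score highest for the refund question.
scores = [cosine_similarity(query_vector, v) for v in doc_vectors]
print(documents[int(np.argmax(scores))])
```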

[Image: Semantic search]

The Mechanics of RAG

 

How does RAG actually work?

 

In its simplest form, the RAG architecture uses two kinds of models: an Embeddings Model and a Language Generation Model (aka LLM). It also includes a Vector Database to store the vectors, and a mechanism to ingest the data from the Data Sources.

[Image: The RAG process]

Here's a step-by-step look at how it operates:

 

  • Data Aggregation: an ETL process or a similar mechanism consolidates various forms of data (structured databases, blogs, news feeds, or even chat transcripts) into a standardized format and uses the Embeddings Model to store their vector representations in the Vector Database.
  • Retrieval: when you pose a question to the system, it is converted into a vector with the same Embeddings Model and used to query the Vector Database, which is designed for rapid similarity searches, ensuring that the system can swiftly locate the most relevant and up-to-date information.
  • Generation: the retrieved information, combined with the foundational knowledge of the LLM, produces a detailed and accurate response. In other words, you build a prompt that includes the original question, (potentially) the history of the conversation, and the search results, and then ask the LLM to predict what comes next using only the results of the search (see the sketch after this list).
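To tie the three steps together, here is a hedged, in-memory sketch of the whole flow. The embeddings model, the plain Python list standing in for a Vector Database, and the example chunks are illustrative assumptions; a real system would use a dedicated vector store and send the final prompt to an LLM.

```python
# In-memory RAG sketch: ingest (embed and store), retrieve (vector similarity),
# generate (build a grounded prompt for the LLM).
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embeddings model

embedder = SentenceTransformer("all-MiniLM-L6-v2")
vector_db: list[tuple[str, np.ndarray]] = []  # stands in for a real Vector Database

def ingest(chunks: list[str]) -> None:
    """Data Aggregation: embed each chunk and store it with its vector."""
    for chunk, vector in zip(chunks, embedder.encode(chunks)):
        vector_db.append((chunk, vector))

def retrieve(question: str, k: int = 2) -> list[str]:
    """Retrieval: return the k chunks whose vectors are closest to the question."""
    q = embedder.encode([question])[0]
    def score(item: tuple[str, np.ndarray]) -> float:
        _, v = item
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    return [chunk for chunk, _ in sorted(vector_db, key=score, reverse=True)[:k]]

def build_prompt(question: str, context: list[str]) -> str:
    """Generation: ask the LLM to answer using only the retrieved context."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n"
        f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"
    )

ingest([
    "Invoices are processed within 5 business days.",     # example private data
    "Support is available Monday to Friday, 9am to 6pm.",
])
question = "When is support available?"
prompt = build_prompt(question, retrieve(question))
print(prompt)  # this prompt would then be sent to the LLM for the final answer
```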

 

A huge advantage of this process is that it limits LLM hallucinations and allows answers to be properly referenced, which makes the RAG architecture particularly useful for tackling some of LLMs’ well-known issues and limitations. Let’s take a closer look at what these limitations are.

 

 

The Limitations of Large Language Models (LLMs)

 

Most of us are already familiar with the capabilities of Large Language Models. They’re able to generate text based on text (input) and a context (short-term memory), using a gigantic neural network, often composed of thousands of millions of parameters (or billions, if you’re based in the USA). The most prominent example nowadays is GPT-4 from OpenAI (closed source, available through an API and the very well-known application ChatGPT).

 

Still, despite LLMs’ huge potential, there are a few limitations that should be kept in mind, and which Retrieval-Augmented Generation can help minimize.

 

 

Data might not be up-to-date

 

LLMs can be robust, can generate human-like text, and are remarkably efficient at doing it. However, they are trained on a finite set of data, which can become outdated. For example, at the time of writing, GPT-4 only provides results from data it was trained on up to September 2021. That’s more than two years ago, and it inevitably has an impact on some of the answers it provides.

 

Retrieval-Augmented Generation tackles this by making use of a Vector Database that can be continuously refreshed with up-to-date information.

 

 

LLMs are not able to reference their source

 

If you ask a model a question about something it was trained on (for example, content from Wikipedia), it will be able to answer, but it will not be able to tell you where the information came from. These kinds of models simply cannot provide the source of their information.

 

With RAG, knowledge articles are stored in the Vector Database together with references to their original sources, so answers can point back to where the information came from.
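As a small illustration, each record in the Vector Database can carry a source field alongside the text, so the retrieved context (and therefore the answer) can cite where it came from. The field names and URLs below are purely hypothetical.

```python
# Sketch of chunks stored with source metadata, so answers can cite their origin.
records = [
    {
        "text": "The new pricing tiers take effect on 1 March.",
        "source": "https://intranet.example.com/pricing-update",   # hypothetical URL
    },
    {
        "text": "Enterprise customers keep their current contract terms.",
        "source": "https://intranet.example.com/enterprise-faq",   # hypothetical URL
    },
]

# After retrieval, each source travels with its chunk into the prompt,
# and the model is asked to cite it for every fact it uses.
context = "\n".join(f"[{r['source']}] {r['text']}" for r in records)
prompt = (
    "Answer using only the context below and cite the source in brackets "
    "for every fact you use.\n"
    f"{context}\n\nQuestion: What changes on 1 March?"
)
print(prompt)
```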

 

 

Hallucinations

 

Models also tend to “hallucinate”, meaning that they may make things up as they go. There are parameters (e.g. temperature) that can be adjusted to minimize this, but there is no guarantee that an answer will be correct, or that the model hasn’t just “invented” a new fact. These hallucinations range from very plausible fiction to plain, obvious nonsense.

 

With RAG, hallucinations are limited because the LLM is explicitly prompted to answer using only the results of the search, as in the sketch below.
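Here is a hedged sketch of that grounding step, assuming the openai Python client (any chat-completion API looks similar): the system message restricts the model to the retrieved context, and a low temperature, as mentioned above, further reduces improvisation.

```python
# Grounded generation sketch, assuming the openai Python client.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

retrieved_context = "Invoices are processed within 5 business days."  # from the vector search
question = "How long does invoice processing take?"

response = client.chat.completions.create(
    model="gpt-4",
    temperature=0,  # lower temperature, less "creative" output
    messages=[
        {
            "role": "system",
            "content": (
                "Answer strictly from the provided context. "
                "If the context does not contain the answer, reply \"I don't know\"."
            ),
        },
        {"role": "user", "content": f"Context:\n{retrieved_context}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)
```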

 

 

No access to private data 

 

On top of this, LLMs don’t know anything about an organization’s private data, simply because they’ve never actually “seen” it. In other words, they were not trained on that data.

 

This can be mitigated with a technique called “fine-tuning”, which means training a base model on more specific data in order to “specialize” it. Codex, for example, is a model fine-tuned for understanding programming languages.

 

However, this is often not practical: it is too resource-intensive, it requires very specialized skills, or there is simply not enough data (in quality and quantity) to actually train a useful model.

 

As discussed above, RAG combines retrieval over private data with language generation, which gives the model access to that data and, in many cases, performs better than the fine-tuning approach.

[Image: Large Language Models are static]

Practical Applications of RAG

 

Now that you understand what RAG is, how it works and its relevance, imagine the potential of such a system in areas like:

  • Knowledge Worker Augmentation: workers can access real-time data, industry insights, and organizational knowledge at their fingertips, leading to improved decision-making and productivity.
  • Customer Service: chatbots powered by RAG can offer timely, precise, and contextually accurate responses, greatly enhancing user satisfaction.
  • Software Development: coding assistants or agents grounded in an organization's own codebase and documentation.

 

 

In Conclusion

 

What's particularly promising about the RAG architecture is its ability to tackle the limitations of LLMs. By constantly updating its knowledge repository, the RAG approach ensures that the data it draws upon is up to date. Moreover, by being able to reference the exact source of its information, it promotes transparency and trustworthiness, both of which are crucial in today's data-driven age.