RAG (Retrieval Augmented Generation) — a simple and clear explanation

People keep asking me what RAG is (in the context of large language models), and I keep wanting to share a link to an article on habr that would explain the concept of RAG (Retrieval Augmented Generation) in simple terms yet in detail, including instructions on how to implement this “beast.” But there is still no such article, so I decided not to wait and write one myself.

If something in the article is unclear or there are aspects of RAG that need to be elaborated on, don’t hesitate—leave comments; I’ll break it down and add more if necessary. Okay, let’s go…

RAG (Retrieval Augmented Generation) is a way of working with large language models where the user writes their question and you programmatically mix additional information from external sources into that question, then feed the whole thing to the language model. In other words, you add extra information to the model’s context, based on which the language model can give the user a more complete and accurate answer.

The simplest example:

  • The user asks the model: “What is the current dollar exchange rate?”

  • Obviously, the language model has no idea what the dollar rate is RIGHT NOW; it has to get that information from somewhere to answer the question.

  • What should you do? Exactly: open the first link in Google for the query “dollar to ruble exchange rate” and add the page content to the user’s question so the LLM can answer correctly using that information.
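A toy sketch of this first example, assuming the requests library and a hypothetical ask_llm() helper that calls whatever language model you use; a real implementation would go through a search API and strip the page down to readable text before adding it to the prompt:

import requests

def answer_with_web_context(question: str, url: str) -> str:
    # crude: grab the raw page and truncate it so it fits into the context
    page_text = requests.get(url, timeout=10).text[:4000]
    prompt = (
        "Answer the user's question using the page content below.\n\n"
        f"Page content:\n{page_text}\n\n"
        f"Question: {question}"
    )
    return ask_llm(prompt)  # hypothetical call to your LLM of choice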

A slightly more complex example:

  • Suppose you’re building an automated technical support agent based on an LLM. You have a knowledge base of questions and answers, or just a detailed description of functionality.

  • The worst idea you could have is to fine‑tune the model on your Q&A knowledge base. Why is that bad? Because the knowledge base will constantly change, and continuously fine‑tuning the model is expensive and inefficient.

  • RAG to the rescue. Depending on the user’s question, your RAG system should find the corresponding article in the knowledge base and feed the LLM not only the user’s question but also the part of the knowledge base relevant to the query, so the LLM can form the correct answer.

Thus, the phrase Retrieval Augmented Generation describes what’s happening as precisely as possible:

Retrieval – search and retrieval of relevant information. The part of the system responsible for search and retrieval is called a retriever.

Retrieval Augmented — augmenting the user’s query with the retrieved relevant information.

Retrieval Augmented Generation — generating the user’s answer with the additionally retrieved relevant information taken into account.

And now the main question I’m asked when I explain RAG at this primitive level: so what’s hard about it? You find a chunk of data and add it to the context — PROFIT! But the devil, as always, is in the details, and the very first, most obvious “details” encountered by someone who intends to “build RAG in a day” are the following:

  1. Fuzzy search — you can’t just take the user’s query and find all chunks in the knowledge base by exact match. What algorithm should you use to search so that it finds relevant and only relevant “chunks” of text?

  2. Size of knowledge base articles — what size “chunks” of text should you feed to the LLM for it to compose an answer from them?

  3. What if several articles are found in the knowledge base? And what if they’re large? How do you “trim” them, how do you combine them, maybe compress them?

These are the most basic questions anyone starting out with RAG runs into. Fortunately, at this point there is a commonly accepted approach you should start with to build RAG and then, like in the joke, file the resulting “locomotive” with a rasp until you get the “fighter jet” you need.

So what does the initial, basic “locomotive” called RAG consist of? Its core workflow is as follows:

  1. The entire knowledge base is “cut” into small pieces of text, the so‑called chunks. The size of these chunks can vary from a few lines to a few paragraphs, i.e., roughly from 100 to 1000 words.

  2. Next, these chunks are encoded using an embedder and turned into embeddings, in other words into vectors, some sets of numbers. It’s assumed that these numbers encode the latent meaning of the whole chunk, and you can search by that meaning.

  3. Then all these resulting vectors are stored in a special database, where they sit and wait until that very search operation is performed on them (for the most relevant chunks, i.e., those closest in meaning to the search query).

  4. When the user sends their question to the LLM, the text of the query is encoded by exactly the same algorithm (usually by the same embedder) into another embedding, and then a search is performed over the database containing our chunk embeddings for the embeddings (vectors) that are closest “in meaning.” In practice, as a rule, the cosine similarity between the query vector and each chunk’s vector is computed, and then the top N vectors closest to the query are selected.

  5. Next, the text of the chunks corresponding to those found vectors, together with the user’s query, is combined into a single context and fed into the language model. That is, the model “thinks” the user not only wrote a question but also provided data on which the answer should be based.

As you can see from the description, nothing complicated: split the text into pieces (chunks), encode each piece, find the pieces closest in meaning, and feed their text to the large language model together with the user’s query. Moreover, this whole process is typically already implemented in a so‑called pipeline, and all you need is to run the pipeline (from some ready‑made library).
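Here is a minimal sketch of that pipeline, assuming the sentence-transformers library for the embedder and a hypothetical ask_llm() helper for the generation step; a production setup would keep the vectors in a vector database rather than an in-memory array:

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any embedder will do

# 1-2. Cut the knowledge base into chunks and encode them into vectors.
chunks = ["chunk one ...", "chunk two ...", "chunk three ..."]
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

# 3-4. "Store" the vectors and search them by cosine similarity
# (a plain dot product, since the vectors are normalized).
def retrieve(query: str, top_n: int = 3) -> list[str]:
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q_vec
    best = np.argsort(scores)[::-1][:top_n]
    return [chunks[i] for i in best]

# 5. Glue the retrieved chunks and the user's question into one context.
def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    prompt = f"Use the information below to answer.\n\n{context}\n\nQuestion: {query}"
    return ask_llm(prompt)  # hypothetical call to your LLM of choice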

After writing the first version and feeding a few documents, you’ll find that in 100% of cases the resulting algorithm doesn’t work as expected, and then a long and difficult process begins of filing your RAG with a rasp until it works the way you need.

Below I’ll outline the main ideas and principles for filing RAG with a rasp, and in the second article I’ll cover each principle separately (this may become a series of articles so it doesn’t turn into one huge endless wall of text).

Chunk size and count. I strongly recommend starting your experiments by varying chunk size and count (a simple chunking sketch follows the list below). If you feed the LLM too much irrelevant information, or conversely too little, you leave the model no chance to answer correctly:

  • The smaller the chunk, the more precise literal (lexical) search will be; the larger the chunk, the more the search approximates semantic search.

  • Different user queries may require different numbers of chunks to be added to the context. You need to empirically find the threshold below which a chunk is pointless and will only clutter your context.

  • Chunks should overlap so that you have a chance to feed a sequence of consecutive chunks together, rather than just pieces torn out of context.

  • A chunk’s beginning and end should be meaningful; ideally they should coincide with the start and end of a sentence, or better yet a paragraph, so that the entire thought is contained wholly within the chunk.
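A minimal chunking sketch along these lines: split the text on sentence boundaries, pack sentences into chunks of roughly max_words words, and start the next chunk with the tail of the previous one as overlap. The sizes here are placeholders to be tuned experimentally:

import re

def chunk_text(text: str, max_words: int = 300, overlap_words: int = 50) -> list[str]:
    # naive sentence split; a real implementation would use a proper tokenizer
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sentence in sentences:
        current.append(sentence)
        if sum(len(s.split()) for s in current) >= max_words:
            chunks.append(" ".join(current))
            # carry the last overlap_words words over into the next chunk
            tail = " ".join(chunks[-1].split()[-overlap_words:])
            current = [tail]
    if current:
        chunks.append(" ".join(current))
    return chunks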

Adding other search methods. Very often, “semantic” search via embeddings doesn’t produce the desired result, especially when it comes to specific terms or definitions. As a rule, TF‑IDF search is added alongside embedding search, and the results are combined in a proportion determined experimentally. Reranking the found results, for example with the BM25 algorithm, also often helps.
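A hedged sketch of such a combination, assuming scikit-learn for TF-IDF and reusing the chunks, chunk_vecs and embedder objects from the pipeline sketch above; the mixing weight alpha is exactly the kind of proportion you determine experimentally:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(chunks)  # same chunks as before

def hybrid_retrieve(query: str, top_n: int = 3, alpha: float = 0.7) -> list[str]:
    # semantic score: cosine similarity of normalized embeddings
    sem = chunk_vecs @ embedder.encode([query], normalize_embeddings=True)[0]
    # lexical score: TF-IDF cosine similarity (sklearn TF-IDF rows are L2-normalized)
    lex = (tfidf_matrix @ vectorizer.transform([query]).T).toarray().ravel()
    # the two scales are only roughly comparable; normalize them if needed
    scores = alpha * sem + (1 - alpha) * lex
    best = np.argsort(scores)[::-1][:top_n]
    return [chunks[i] for i in best]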

Query multiplication. As a rule, it’s useful to paraphrase the user’s query several times (with an LLM) and search for chunks for all query variants. In practice, you’ll make 3 to 5 variations and then merge the search results into one.
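A short sketch of this step, assuming a hypothetical paraphrase_with_llm() helper and the retrieve() function from the pipeline sketch above; results from all variants are merged and de-duplicated while preserving order:

def retrieve_multi(query: str, n_variants: int = 3, top_n: int = 3) -> list[str]:
    variants = [query] + [paraphrase_with_llm(query) for _ in range(n_variants)]
    merged: list[str] = []
    for variant in variants:
        for chunk in retrieve(variant, top_n):
            if chunk not in merged:  # de-duplicate, keep first occurrence
                merged.append(chunk)
    return merged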

Chunk summarization. If a lot of information is found for the user’s query and it doesn’t all fit into the context, you can likewise “simplify” it with an LLM and feed, in addition to the user’s question, something like archived knowledge as context, so the LLM can use the essence (a distillation of the knowledge base) when forming the answer.
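One possible shape for this step, assuming a hypothetical summarize_with_llm() helper and a rough word budget instead of a proper token count:

def compress_context(found_chunks: list[str], max_words: int = 1500) -> str:
    context = "\n\n".join(found_chunks)
    if len(context.split()) <= max_words:
        return context  # everything fits, no compression needed
    return summarize_with_llm(
        "Condense the following reference material, keeping every fact "
        "and definition:\n\n" + context
    )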

System prompt and fine‑tuning the model for the RAG format. To help the model better understand what’s required, you can also fine‑tune the LLM on the correct interaction format. In the RAG approach, the context always consists of two parts (we’re not yet considering a dialog in RAG format): the user’s question and the retrieved context. Accordingly, you can fine‑tune the model to understand precisely this format: here is the question, here is the information for the answer, produce an answer to the question. At the initial stage, you can also try to solve this via a system prompt, explaining to the model that “the question is here, the info is here, don’t mix them up!”
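A sketch of that two-part format, assuming an OpenAI-style list of chat messages; the exact wording of the system prompt is something you iterate on:

def build_messages(question: str, context: str) -> list[dict]:
    system_prompt = (
        "You are a support assistant. Answer the user's question using ONLY "
        "the reference information provided below the question. Do not "
        "confuse the question with the reference information."
    )
    user_content = (
        f"Reference information:\n{context}\n\n"
        f"Question: {question}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]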

In conclusion, I want to highlight the most important thing you need to implement when working with RAG. From the very first minute you pick up the rasp and start filing RAG, the question will arise of how to evaluate the quality of what you’re getting. Filing blind is not an option, and checking quality by hand each time is not an option either, because you should be checking not one or two but dozens, preferably hundreds of questions/answers, even after a small single change. Not to mention that when you switch the embedder library or the generative LLM, you need to run at least several dozen tests and assess the quality of the result. As a rule, the following approaches are used to evaluate the quality of a RAG model:

  1. Evaluation questions — should be written by humans. There’s no getting around this: only you (or your client) know what questions will be asked of the system, and no one but you will write the test questions.

  2. Reference (gold) answers — ideally should also be written by humans, and preferably by different people and in two (or even three) versions, to reduce dependence on the human factor. But there are options here. For example, OpenAI Assistants on the GPT Turbo model handle this task quite well, especially if you don’t load them with large documents (over 100 pages). To obtain reference answers, you can write a script that calls the assistant (with the relevant documents loaded) and asks it all your test questions, even asking the same question several times.

  3. Closeness of LLM answers to the reference ones — measured by several metrics such as BERTScore, BLEURT, METEOR, and even simple ROUGE. Typically, a weighted composite metric is selected empirically as a sum of the above (with appropriate coefficients) that best reflects the actual closeness of your LLM’s answers to the gold answers.

  4. Reference chunks and closeness of the retrieved chunks to the reference ones. As a rule, a novice RAG builder, in their eagerness to quickly get good answers from the generator, completely forgets that the retrieved chunks fed into that generator account for 80% of the answer quality. Therefore, when developing RAG, primary attention must be paid to the data the retriever found. First, you need to create a database of questions and their corresponding chunks, and every time you file your RAG with a rasp, check how accurately the retriever found the chunks corresponding to the user’s query.
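Two simple building blocks for such an evaluation loop: recall of the reference chunks among the retrieved ones, and a weighted composite of precomputed answer-similarity metrics (the metric names and weights below are placeholders you tune on your own data):

def retrieval_recall(found_chunks: list[str], reference_chunks: list[str]) -> float:
    # share of reference chunks that made it into the retrieved set
    if not reference_chunks:
        return 1.0
    hits = sum(1 for chunk in reference_chunks if chunk in found_chunks)
    return hits / len(reference_chunks)

def composite_answer_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    # e.g. metrics = {"bertscore": 0.91, "meteor": 0.55, "rougeL": 0.48}
    return sum(weights[name] * value for name, value in metrics.items())

# Run this over the whole hand-written test set after every change:
# for question, reference_chunks in test_set:
#     print(question, retrieval_recall(retrieve(question), reference_chunks))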

I look forward to your questions and comments!

Author: habrconnect

Source [1]


URLs in this post:

[1] Source: https://habr.com/en/articles/990532/?utm_source=habrahabr&utm_medium=rss&utm_campaign=990532
