RAG Pipeline

Last updated: 2026-02-18

This document details the Retrieval-Augmented Generation (RAG) pipeline, focusing on how and why it is integrated into the agent’s decision-making process.

What is Retrieval-Augmented Generation (RAG)?

Retrieval-Augmented Generation (RAG) is a technique that enhances the capabilities of large language models (LLMs) by giving them access to an external knowledge base. Instead of relying solely on the model’s training data, a RAG system first retrieves relevant information and then uses it to generate a more accurate, evidence-based, and contextually relevant response.
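The retrieve-then-generate idea can be sketched without any libraries. This toy example is not the project’s implementation; the word-overlap scorer and both function names are illustrative stand-ins (a real system uses vector similarity and an LLM):

```python
# A minimal, library-free sketch of retrieve-then-generate:
# find the most relevant passages first, then condition the answer on them.
def retrieve(query: str, knowledge_base: list[str], k: int = 2) -> list[str]:
    """Rank passages by naive word overlap with the query (toy scorer)."""
    words = set(query.lower().split())
    scored = sorted(
        knowledge_base,
        key=lambda p: len(words & set(p.lower().split())),
        reverse=True,
    )
    return scored[:k]


def generate(query: str, passages: list[str]) -> str:
    # In a real system an LLM produces this; here we only show the grounding.
    return f"Answer to {query!r}, grounded in: " + " | ".join(passages)
```

In the real pipeline, retrieval is a vector-store similarity search and generation is an LLM call, but the two-phase shape is the same.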

RAG in the Monstermessenger Agent

In this project, RAG is used to ensure the agent’s advice is grounded in specific, trusted strategies for dealing with cyberviolence. It allows the agent to provide more than just generic support by referencing concrete information from our curated knowledge base.

The entire RAG workflow is orchestrated between the give_advice and research_strategies nodes in the agent graph.

1. Triggering the RAG Query

The RAG process is initiated within the give_advice node under specific conditions:

  • First Advice: When the agent is about to give its first piece of advice to the user after collecting context, it forces a RAG query to ensure the initial response is well-informed.
  • Explicit Need: The give_advice node can also decide to trigger a RAG query if it determines more information is needed to answer a user’s follow-up question.

When a RAG query is triggered, the give_advice node formulates a research_query based on the user’s situation and passes it to the research_strategies node via the agent’s state.

2. Executing the RAG Query

The research_strategies node executes the RAG query, using a multi-query approach for better coverage:

  1. Instead of a single query, it uses the LLM to generate multiple diverse research queries (based on the ResearchQueries model in api/agents/service1/nodes/advice.py) from the user’s situation and context.
  2. It calls the async RAGService to perform parallelized searches against the vector store using these queries.
  3. The service retrieves relevant text chunks from the knowledge base, deduplicating results across the different queries.
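The fan-out-and-deduplicate pattern in steps 2–3 can be sketched as follows. `FakeRAGService` is a stand-in for the real async RAGService, and only the `abatch_search` method name is taken from the document; its body here is an assumption:

```python
# Hedged sketch of the multi-query retrieval step. The ResearchQueries
# dataclass mirrors the Pydantic model mentioned above; FakeRAGService is
# a simplified stand-in for the real async RAGService.
import asyncio
from dataclasses import dataclass


@dataclass
class ResearchQueries:
    queries: list[str]  # diverse queries generated by the LLM


class FakeRAGService:
    """Stand-in vector store: maps each query to a list of chunks."""

    def __init__(self, index: dict[str, list[str]]):
        self.index = index

    async def search(self, query: str) -> list[str]:
        return self.index.get(query, [])

    async def abatch_search(self, queries: list[str]) -> list[str]:
        # Run all searches concurrently, then deduplicate while
        # preserving first-seen order across queries.
        results = await asyncio.gather(*(self.search(q) for q in queries))
        seen, merged = set(), []
        for chunks in results:
            for chunk in chunks:
                if chunk not in seen:
                    seen.add(chunk)
                    merged.append(chunk)
        return merged
```

Because the queries overlap in topic, deduplication matters: without it, the same chunk retrieved by two queries would be scored and synthesized twice.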

3. Relevance Assessment

Before synthesis, a dedicated relevance assessment step is performed:

  1. The retrieved documents are passed back to the LLM.
  2. The LLM evaluates each chunk against the situational context and the research queries.
  3. It assigns a relevance score (“LOW”, “MEDIUM”, or “HIGH”).
  4. Only chunks with “HIGH” relevance are passed to the final synthesis step, ensuring that the advice is grounded only in the most pertinent information.
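The relevance gate reduces to a simple filter once scores are assigned. In the pipeline the scores come from an LLM call; the `assess` heuristic below is a trivial stand-in so the sketch is runnable:

```python
# Illustrative sketch of the relevance gate. In the real pipeline the
# grades come from an LLM; assess() here is a toy word-overlap stand-in.
from typing import Literal

Relevance = Literal["LOW", "MEDIUM", "HIGH"]


def assess(chunk: str, context: str) -> Relevance:
    # Stand-in heuristic; the real step asks the LLM to grade each chunk
    # against the situational context and the research queries.
    words = context.lower().split()
    return "HIGH" if any(w in chunk.lower() for w in words) else "LOW"


def filter_relevant(chunks: list[str], context: str) -> list[str]:
    """Keep only chunks the assessor grades as HIGH."""
    return [c for c in chunks if assess(c, context) == "HIGH"]
```

Discarding MEDIUM as well as LOW is a deliberately strict policy: a shorter, highly pertinent context tends to produce better-grounded advice than a longer, noisier one.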

4. Using the Retrieved Context

The HIGH-relevance chunks are synthesized into a single research_result. This result is not sent directly to the user; instead, it is passed back to the give_advice node.

  1. The give_advice node runs for a second time, now aware that research results are ready.
  2. It enriches the main system prompt, instructing the LLM to use the provided research results in its final answer.
  3. The research_result is appended to the conversation history.
  4. The LLM generates the final, user-facing advice, which is now grounded in the information retrieved from the knowledge base.

This four-step process (Trigger -> Query -> Assess -> Advise) ensures that the RAG output is seamlessly integrated into the agent’s conversational flow and empathetic tone, rather than being presented as a raw data dump.

The Retrieval Process (High-Level)

  • async RAGService: The retrieval logic is encapsulated in api/services/rag.py. This service is fully asynchronous and supports parallelized document retrieval through its abatch_search method.
  • Variant-Specific Knowledge: The service is variant-aware. It automatically queries the correct knowledge base (docs_youth or docs_adult) based on the CHATBOT_VARIANT environment variable, ensuring the retrieved information is appropriate for the target audience.
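The variant-to-collection mapping can be sketched in a few lines. The environment variable and the docs_youth/docs_adult names come from the description above; the function name and the default are assumptions:

```python
# Minimal sketch of variant-aware collection selection, based on the
# CHATBOT_VARIANT environment variable and the docs_youth / docs_adult
# collection names described above. Falling back to "youth" is an
# assumption, not documented behaviour.
import os


def collection_for_variant() -> str:
    """Map CHATBOT_VARIANT to the matching knowledge-base collection."""
    variant = os.environ.get("CHATBOT_VARIANT", "youth").lower()
    if variant not in {"youth", "adult"}:
        raise ValueError(f"Unknown CHATBOT_VARIANT: {variant!r}")
    return f"docs_{variant}"
```

Failing fast on an unknown variant is safer than silently defaulting, since serving adult-oriented material to the youth audience (or vice versa) would defeat the purpose of the split.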

Configuration

The RAG pipeline behavior can be tuned using the following environment variables:

  • NO_RAG_QUERIES: Configures the number of diverse search queries generated by the LLM (default: 1, max: 5). Increasing this value improves coverage but adds latency.
  • MAX_OUTPUT_TOKENS: Limits the length of synthesized research results (configured via settings).
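Reading NO_RAG_QUERIES with the documented bounds might look like this; the default of 1 and the maximum of 5 come from the list above, while the helper name and the clamping-on-bad-input behaviour are assumptions:

```python
# Hedged sketch of reading the NO_RAG_QUERIES tunable. The clamp to
# [1, 5] mirrors the documented default and maximum; the parsing helper
# itself is an assumption, not the project's actual settings code.
import os


def get_no_rag_queries(default: int = 1, maximum: int = 5) -> int:
    """Read NO_RAG_QUERIES from the environment, clamped to [1, maximum]."""
    raw = os.environ.get("NO_RAG_QUERIES", str(default))
    try:
        value = int(raw)
    except ValueError:
        return default
    return max(1, min(value, maximum))
```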

Note on the Knowledge Base

The RAG system is fed by a knowledge base of .pdf and .docx documents. The process for indexing these documents (converting them into a searchable format) is currently under review and will be detailed in a future version of this documentation.