LLMOps & Observability

Last updated: 2026-01-30

This document covers the operational aspects of managing the Large Language Model (LLM) in this project. LLMOps here refers to the combination of tools and best practices used to monitor, analyze, and improve the performance and safety of our agent.

The LLMOps strategy is built on three main pillars:

  1. Model Configuration: Defining how the LLM is loaded and configured.
  2. Observability: Tracing and debugging every LLM call to understand its behavior.
  3. Analytics: Extracting structured insights from conversations to evaluate performance and user interactions.

Model Configuration

The LLM client is configured in api/agents/service1/core/llm_client.py.

LLM Client

  • Model: The project uses gemini-2.5-flash-lite-preview-09-2025 via the langchain_google_genai library.
  • Lazy Loading: To improve application startup time, the LLM client (along with other services such as RAG and Analytics) is lazy-loaded: it is initialized on first use, not when the application server starts (see the sketch after this list).
  • Temperature: The model is configured with a temperature of 0.7 to balance creativity and predictability in its responses.
  • Max Output Tokens: The maximum number of tokens the model can generate is controlled by the MAX_OUTPUT_TOKENS environment variable, defaulting to 2048.
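A minimal sketch of this configuration, assuming the client is built with ChatGoogleGenerativeAI from langchain_google_genai; the get_llm_client helper and the lru_cache-based lazy loading are illustrative, not the project's exact code:

import os
from functools import lru_cache

from langchain_google_genai import ChatGoogleGenerativeAI

@lru_cache(maxsize=1)
def get_llm_client() -> ChatGoogleGenerativeAI:
    # Built on the first call only, keeping server startup fast.
    # Requires GOOGLE_API_KEY in the environment.
    return ChatGoogleGenerativeAI(
        model="gemini-2.5-flash-lite-preview-09-2025",
        temperature=0.7,  # balance creativity and predictability
        max_output_tokens=int(os.getenv("MAX_OUTPUT_TOKENS", "2048")),
    )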

Safety Settings

A critical configuration choice is the deliberate disabling of all default safety filters:

from langchain_google_genai import HarmBlockThreshold, HarmCategory

safety_settings = {
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
    # ... and others
}

This is done to allow the chatbot to discuss sensitive topics related to cyberviolence without being blocked by overly cautious default filters. Content safety is instead managed through carefully crafted prompts and the observability system described below.

Observability with Langfuse

To monitor the LLM’s behavior and debug issues, the project is integrated with Langfuse, an open-source LLM engineering platform.

Integration

The integration is configured in api/agents/service1/utils/observability.py. It initializes a Langfuse client and attaches a custom callback handler to the LangChain runtime.

The stock LangChain callback handler provided by Langfuse is instantiated and passed to the LangGraph app during initialization of the ChatService in api/services/chat_service.py. LLM logging can be toggled with the USE_LANGFUSE=true/false environment variable.
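A minimal sketch of that wiring, assuming the Langfuse v2-style import path (newer SDKs expose the handler as langfuse.langchain.CallbackHandler instead); the build_callbacks helper is illustrative, not the project's actual code:

import os

from langfuse.callback import CallbackHandler

def build_callbacks() -> list:
    # Attach the handler only when logging is enabled; it reads the
    # LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY credentials from the environment.
    if os.getenv("USE_LANGFUSE", "false").lower() == "true":
        return [CallbackHandler()]
    return []

# Passed to the compiled LangGraph app at invocation time:
# result = app.invoke(state, config={"callbacks": build_callbacks()})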

The ErrorFlagger Callback

ErrorFlagger is a custom callback that runs after every LLM call. Its primary purpose is to inspect the response metadata for a block_reason. Even with all thresholds set to BLOCK_NONE, Google can still block a response and report why in the metadata; when that happens, the callback updates the corresponding trace in Langfuse (a simplified sketch follows the list below). This allows developers to:

  • Get immediate visibility into when and why the LLM refuses to respond.
  • Analyze patterns of blocked content.
  • Improve prompts to avoid triggering safety filters inappropriately while maintaining a safe user experience.
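A simplified sketch of such a callback, assuming LangChain's BaseCallbackHandler interface; the exact metadata key ("block_reason") and the trace-update mechanics are assumptions, not the project's actual code:

from langchain_core.callbacks import BaseCallbackHandler
from langchain_core.outputs import LLMResult

class ErrorFlagger(BaseCallbackHandler):
    # Runs after every LLM call and looks for a safety block.
    def on_llm_end(self, response: LLMResult, **kwargs) -> None:
        for generation_list in response.generations:
            for gen in generation_list:
                info = gen.generation_info or {}
                block_reason = info.get("block_reason")  # key name assumed
                if block_reason:
                    self._flag_trace(str(block_reason))

    def _flag_trace(self, reason: str) -> None:
        # Placeholder: the real callback would tag or score the
        # corresponding Langfuse trace with the block reason.
        print(f"LLM response blocked: {reason}")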

Conversation Analytics

To evaluate the effectiveness of the chatbot and understand user interactions, the project has a sophisticated analytics pipeline defined in api/services/analytics_service.py.

The AnalyticsService

This service is responsible for processing completed conversations and extracting structured, queryable data. It does not just store raw text; it uses the LLM itself to perform a detailed analysis.

The process involves two main steps:

  1. Data Extraction: The service first prompts the LLM with the full conversation text and asks it to extract high-level information into a structured ConversationAnalyticsData object (a schema sketch follows this list). This includes:
    • The original bullying message.
    • A brief summary of the conversation.
    • The user’s emotional state (mood).
    • The main strategy suggested by the bot.
  2. Data Categorization: After extracting the data, a second set of LLM calls and normalization functions are used to classify the free-text fields into predefined categories. For example, a user’s message might be categorized as “Direct insults” or “Threats”, and their mood might be categorized as “sad” or “anxious”.
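A sketch of what the extraction step might look like, assuming a Pydantic schema and LangChain's with_structured_output helper; the field names are inferred from the list above, not taken from the actual code:

from pydantic import BaseModel, Field

class ConversationAnalyticsData(BaseModel):
    # Field names inferred from the list above; the real schema may differ.
    bullying_message: str = Field(description="The original bullying message")
    summary: str = Field(description="Brief summary of the conversation")
    mood: str = Field(description="The user's emotional state")
    main_strategy: str = Field(description="Main strategy suggested by the bot")

# Extraction with LangChain's structured-output helper:
# extractor = llm.with_structured_output(ConversationAnalyticsData)
# data = extractor.invoke("Extract analytics from this conversation:\n" + text)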

This structured and categorized data is then stored in multiple tables in the Postgres database, allowing for powerful, aggregated analysis of the chatbot’s performance and the types of issues users are facing. The service also includes robust retry logic and fallbacks to ensure data is captured even if the LLM analysis fails.
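The retry-and-fallback pattern might look roughly like this, building on the extractor sketch above; the retry count and fallback values are illustrative assumptions:

def analyze_with_fallback(conversation_text: str, retries: int = 2):
    # Retry the LLM analysis a few times, then fall back to a minimal
    # record so the conversation is never lost to a transient failure.
    for attempt in range(retries + 1):
        try:
            return extractor.invoke(conversation_text)
        except Exception:
            if attempt == retries:
                return ConversationAnalyticsData(
                    bullying_message="",
                    summary="automatic analysis failed",
                    mood="unknown",
                    main_strategy="unknown",
                )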