Post

Building an Enterprise Knowledge Assistant with RAG

Building an Enterprise Knowledge Assistant with RAG

Why RAG?

Large Language Models (LLMs) are incredibly useful for answering questions about topics they were trained on. They make vast amounts of information easily accessible and excel at processing and formulating natural language. However, in many real-world scenarios, the knowledge a user needs is locked away in private, non-public documents and databases.

So, how can we make an LLM aware of this private data?

One option is to retrain or fine-tune the model. This is computationally expensive, offers little room for adapting to new features, and must be repeated every time new information becomes available. A better approach is to leverage the core strength of LLMs: their ability to understand and articulate information. If we present the right knowledge to an LLM at the right time, it does an amazing job of presenting that information in a user-friendly, well-structured, and easily understandable way.

The question then becomes: how do we find and present the right information to the LLM so it can answer a specific question? This is where Retrieval-Augmented Generation (RAG) comes in.


The RAG Architecture

RAG allows an LLM to retrieve precisely the information it needs from external knowledge sources before generating an answer. A RAG pipeline can consist of various components, tailored to the types of questions it needs to answer and the data available. But for this system to work, the LLM needs a standardized way to communicate with these diverse tools.

Tools and Protocols: The MCP Standard

The way an LLM interacts with its tools is via a standardized protocol. MCP (Model-Context Protocol) has become a widely used standard that enables this communication. Think of it as a common language that allows any compliant LLM to use any compliant tool. Other protocols also exist, like Google’s Agent-to-Agent (A2A) protocol, which aims to enable different AI agents to interact with each other.

Following this standard, thousands of MCP servers have been created, allowing LLMs to interact with everything from Google Maps to 3D modeling software like Blender with just a single prompt. Each server creates an interface between the LLM and a custom tool for a specific use case.

This approach offers several key advantages:

  • Modularity: In a field where LLM advancements happen at lightning speed, a modular architecture is crucial. MCP servers ensure you are not locked into one specific model; you can easily swap different LLMs.
  • Interoperability: You can expose your MCP tool server to various front-ends of your choice, like Claude Desktop or LM Studio.

In my project, I built a custom front-end to simplify the user experience, eliminating setup requirements and handling authentication automatically.

Vector Databases: Searching by Meaning

The first step in many RAG pipelines is embedding data for semantic search. To populate our vector database, we used the e5-large-v2 model, which transforms chunks of text into 1024-dimensional vectors. In this vector space, semantically similar sentences are positioned close to each other.

This allows us to retrieve information based on meaning, not just keywords. For example, if we embed documentation about factory components, components with similar functionalities or attributes will be close to each other in the vector space. When a user asks a question, we follow a simple process:

  1. Create a vector embedding of the user’s query.
  2. Retrieve the top k (e.g., 4) text chunks from the database that are closest to the query’s vector.
  3. Provide these chunks to the LLM as context to formulate its answer.

To improve retrieval quality, we can even use an intermediary LLM to reformulate the user’s initial question into a query optimized for finding the most relevant semantic information.

Graph Databases: Searching by Relationship 🔗

For our specific use case at CERN, where data involves many different entities connected in complex ways, a knowledge graph is the ideal data structure. We use a graph database to map out these relationships.

The central elements are nodes (entities) and relationships (directed edges). For instance, a Component is a node with attributes like its name, a link to its test report, or the date it was tested. Other nodes might be Facilities. These nodes are connected with relationships like TESTED_AT.

While this could be modeled in a relational database like PostgreSQL, graph databases offer significant advantages when querying complex, multi-hop relationships. It also makes it much easier for the LLM to trace and explain its reasoning path to the user.


What the Work Included 🛠️

This project involved building a full-stack solution, from data processing to a user-facing application:

  • Data Processing:

    • Scraped and parsed 245 PDF documents.
    • Extracted text, tables, and images.
    • Set up a complete data processing pipeline.
  • Infrastructure & Deployment (CI/CD):

    • Built deployment infrastructure using Platform-as-a-Service (OKD), OpenStack VMs, and CERN’s authorization proxy.
    • Configured firewalls with ufw.
    • Deployed a Dockerized Streamlit application as the user frontend.
    • Deployed and managed a Neo4j graph database.
    • Set up an Ollama instance to serve local LLMs.
  • Services & Experimentation:

    • Implemented a custom feedback service to collect user interactions (question, response, expected answer) to enable a human-in-the-loop development cycle, with a strong focus on user privacy.

    • Deployed a webserver to serve static documents to the MCP server.

    • Experimented with different LLMs, reasoning techniques (e.g., Chain of Thought), and architectural patterns.


Final Bot Capabilities 🚀

The result is a multi-step RAG bot that has evolved through multiple iterations. It now draws from several sources, including a vector database, a knowledge graph, document lookups, and custom search functions.

The bot can:

  • Answer questions about components and documents.
  • Retrieve specific identifiers like links to the database.
  • Search for components based on performance criteria.
  • Leverage expert knowledge embedded in its sources to judge the quality or suitability of a solution.
This post is licensed under CC BY 4.0 by the author.