RAG Distributed System

💡 For AI engineers and CTOs evaluating whether to run LLMs locally — and why distributed architecture is the answer to VRAM bottlenecks.


🎯 Why Running LLMs Locally Is a VRAM Nightmare (And How to Fix It)

When you try to run a 7B parameter model on your workstation, VRAM fills up instantly. Document indexing slows to a crawl. Vector searches become unusable. The solution isn't better hardware — it's distributed architecture that separates inference from indexing.

*Distributed RAG architecture with Qdrant vector engine, vLLM inference, Ollama embeddings, and Arize Phoenix observability across two machines*


📉 The RAG VRAM Problem

In the development of Retrieval-Augmented Generation (RAG) systems, one of the greatest challenges is managing computational resources efficiently. Running large language models (LLMs) locally often saturates VRAM, penalizing the speed of document indexing and retrieval.

```mermaid
flowchart TD
    subgraph Problem["Single Machine — VRAM Competition"]
        A[GPU VRAM: 24GB] --> B[vLLM LLM: 16GB]
        A --> C[Ollama Embeddings: 4GB]
        A --> D[Qdrant Vector DB: 3GB]
        A --> E[System/Other: 1GB]
        B --> F[⚠️ VRAM Exhausted]
        C --> F
        D --> F
        F --> G[❌ Slow indexing]
        F --> H[❌ Slow retrieval]
    end
```

To solve this, I designed a distributed architecture that separates responsibilities between two distinct machines, optimizing both inference and vector processing.
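The trade-off can be made concrete with a little budget arithmetic. The figures below are the illustrative numbers from the diagram above, not measurements:

```python
# Illustrative VRAM budgets in GB, taken from the single-machine diagram above.
GPU_VRAM = 24

single_machine = {"vLLM LLM": 16, "Ollama embeddings": 4, "Qdrant": 3, "System/other": 1}
distributed_local = {"Ollama embeddings": 4, "Qdrant": 3, "System/other": 2}

def headroom(budget: dict[str, int], total: int = GPU_VRAM) -> int:
    """VRAM left over after every component claims its share."""
    return total - sum(budget.values())

print(headroom(single_machine))     # 0 GB -- any spike exhausts VRAM
print(headroom(distributed_local))  # 15 GB free once the LLM moves to the server
```

Zero headroom on a single machine means any allocation spike stalls indexing; moving the 16GB LLM to the server is what frees the local GPU.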


💡 System Architecture: Two Machines, One System

The system is divided into two main components: the Client Machine (Local) and the Server Machine.

```mermaid
flowchart LR
    subgraph Client["Client Machine (Local)"]
        GPU[GPU 0 - Shared]
        Ollama[Ollama - Embeddings]
        Qdrant[Qdrant - Vector DB]
        Phoenix[Arize Phoenix - Observability]
        MCP[MCP Bridge & Indexer]
        UI[User Interface]
        GPU --> Ollama
        GPU --> Qdrant
        MCP --> Phoenix
        MCP --> UI
    end

    subgraph Server["Server Machine (192.168.0.50)"]
        vLLM[vLLM - LLM Inference]
        LLM[Large Language Model]
        vLLM --> LLM
    end

    Client -->|"HTTP API"| Server
```

🏗️ Client Machine (Local)

This machine handles orchestration, indexing, and the user interface.

GPU Sharing Strategy

| Component | VRAM Usage | Purpose | Why It Matters |
|---|---|---|---|
| Ollama | ~4GB | Generate 768-dimension embeddings | Fast document processing |
| Qdrant | ~3GB | GPU-accelerated HNSW indexing | Millisecond vector searches |
| System/Other | ~2GB | OS, MCP Bridge, UI | Development environment |
| Headroom | ~15GB | Available for operations | No VRAM competition |
```mermaid
flowchart TD
    subgraph LocalGPU["GPU 0 — Efficiently Shared"]
        O1[Ollama: Embeddings]
        Q1[Qdrant: Vector Search]
        S[System: MCP + UI]
        H[Headroom: Operations]
    end

    O1 -->|"768-dim vectors"| QdrantDB[(Qdrant Index)]
    QdrantDB -->|"HNSW search"| Q1
```
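HNSW is an approximate nearest-neighbor index; the operation it accelerates is plain similarity ranking over embedding vectors. A stdlib-only sketch of that underlying computation, with tiny 3-dimensional vectors standing in for the 768-dimensional Ollama embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity -- the distance metric used for ranking here."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy "index": 3-dim vectors stand in for the real 768-dim embeddings.
index = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
query = [1.0, 0.1, 0.0]

# Exhaustive scan is O(n) per query; HNSW approximates this ranking
# with a logarithmic number of graph hops, which is why searches stay
# in the millisecond range as the collection grows.
ranked = sorted(index, key=lambda k: cosine(index[k], query), reverse=True)
print(ranked[0])  # doc_a
```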

Local Components

| Component | Role | Impact |
|---|---|---|
| Ollama | Embedding generation (768-dim) | Fast document processing |
| Qdrant | Vector database with GPU acceleration | HNSW indexing for fast retrieval |
| Arize Phoenix | Observability and trace collection | Real-time auditing of context retrieval |
| MCP Bridge & Indexer | Central business logic | File watching, data flow orchestration |
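The indexer's call into Ollama is a single HTTP POST per document chunk. A minimal sketch against Ollama's `/api/embeddings` endpoint; the `nomic-embed-text` model name is an assumption (any 768-dimension embedding model works), and the request is built but not sent here:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # Ollama's default port

def embed_request(text: str, model: str = "nomic-embed-text") -> urllib.request.Request:
    """Build (but don't send) the embedding request issued per document chunk."""
    payload = json.dumps({"model": model, "prompt": text}).encode()
    return urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )

req = embed_request("What is paged attention?")
# Sending it would return {"embedding": [...]} -- the vector that gets
# upserted into Qdrant:
# vector = json.load(urllib.request.urlopen(req))["embedding"]
print(json.loads(req.data)["model"])
```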

🖥️ Server Machine (192.168.0.50)

This machine is dedicated to the heavy lifting of model inference.

Why vLLM?

| Feature | vLLM | Traditional Inference |
|---|---|---|
| Paged Attention | Memory-efficient KV-cache paging | Fixed memory allocation |
| Continuous Batching | Dynamic request scheduling | Batch-by-batch processing |
| VRAM Isolation | Complete separation from the local GPU | Competes with local workloads |
| Model Swapping | 8B ↔ 70B without local changes | Reinstall on local machine |

vLLM runs the Large Language Model (LLM). Keeping this process on a separate server, behind a highly optimized inference engine such as vLLM or TensorRT-LLM, avoids saturating the workstation's VRAM. Heavy inference therefore never affects the performance of the local indexer or development tools.
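vLLM exposes an OpenAI-compatible HTTP API, so the client machine only needs to point a standard chat-completions request at the server; swapping an 8B model for a 70B one changes nothing but the `model` string. A hedged sketch (port 8000 is vLLM's default, the model name is illustrative, and the request is built rather than sent):

```python
import json
import urllib.request

VLLM_URL = "http://192.168.0.50:8000/v1/chat/completions"  # vLLM's default port

def chat_request(question: str, context: str,
                 model: str = "meta-llama/Llama-3.1-8B-Instruct") -> urllib.request.Request:
    """Build the RAG prompt: retrieved context + user question, OpenAI chat format."""
    payload = {
        "model": model,  # swap 8B <-> 70B here; nothing else on the client changes
        "messages": [
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    }
    return urllib.request.Request(
        VLLM_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("What is paged attention?", context="<retrieved chunks>")
print(json.loads(req.data)["model"])
```

Because the endpoint speaks the OpenAI wire format, any off-the-shelf OpenAI client pointed at this base URL works unchanged.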


📈 Benefits of the Distributed Approach

| Benefit | Single Machine | Distributed Architecture | Improvement |
|---|---|---|---|
| VRAM Efficiency | Competing for memory | Dedicated resources | 100% isolation |
| Scalability | Upgrade entire machine | Upgrade server only | Modular growth |
| Model Flexibility | Fixed to local GPU | Swap 8B ↔ 70B freely | Free model choice |
| Observability | Limited tracing | Arize Phoenix full visibility | Complete auditing |
  1. VRAM Efficiency: By isolating the large LLM on the server, the local GPU can be fully dedicated to generating embeddings and performing fast vector searches, without competing for memory.
  2. Scalability: The LLM on the server can be upgraded or swapped (e.g., from an 8B to a 70B model) without altering the local indexing logic.
  3. Observability: The integration of Arize Phoenix in the local node ensures that each retrieval can be traced and evaluated to mitigate hallucinations.

🤔 Why This Matters for AI Production

This distributed pattern extends beyond RAG systems:

| System Type | Similar Challenge | Distributed Solution |
|---|---|---|
| Chat Applications | VRAM limits response speed | Separate inference from UI |
| Document Processing | Indexing competes with generation | GPU sharing vs dedicated inference |
| Enterprise AI | Model size vs hardware cost | Swap models without downtime |
| Debugging | Hallucinations hard to trace | Observability at every layer |

✅ Key Takeaways

  1. VRAM is the bottleneck — running LLM + embeddings + vector DB on one GPU causes competition
  2. Distributed architecture solves this — separate inference (server) from indexing (local)
  3. vLLM enables flexible inference — paged attention, continuous batching, model swapping
  4. Arize Phoenix provides observability — trace every retrieval to mitigate hallucinations
  5. Modular scalability — upgrade server independently of local development environment

🔗 Explore the Code

Want to see the full implementation? Check out the repository: 87maxi/rag_distributed_system on GitHub.


💬 Want to Contribute?

This architecture is open-source and community-driven. There are plenty of ways to contribute:

  • Add support for additional vector databases (Pinecone, Weaviate, Milvus)
  • Implement multi-server load balancing with vLLM
  • Create a Kubernetes deployment configuration
  • Improve the Arize Phoenix dashboards

Fork the repository and submit a Pull Request: 87maxi/rag_distributed_system on GitHub


🔗 Continuous Learning

Stay updated with the latest in RAG architecture and distributed AI systems:

| Resource | Description | Link |
|---|---|---|
| vLLM Documentation | Official docs for paged attention and continuous batching | vllm.ai |
| Qdrant Guides | Vector search best practices and payload filtering | qdrant.tech |
| Arize Phoenix | LLM observability and evaluation frameworks | arize.com |
| Ollama Models | Library of optimized embedding models | ollama.com |
| Docker Compose | Multi-container orchestration patterns | docs.docker.com |

Explore more deep-dive technical content from this blog:

| Article | Category | Description |
|---|---|---|
| AI-Assisted Learning: eBPF & Blockchain | Deep Tech | How I learned kernel-level programming with AI |
| Rust Rocket REST API | Systems Programming | Full-stack API with Rocket framework and SQLx |
| Hardware Interrupts & XDP | Linux Internals | Understanding interrupt storms and context switching |
| Blindando Validadores XDP | Blockchain Security | Kernel-level DoS mitigation with Rust |