RAG Distributed System

💡 For AI engineers and CTOs evaluating whether to run LLMs locally — and why distributed architecture is the answer to VRAM bottlenecks.


🎯 Why Running LLMs Locally Is a VRAM Nightmare (And How to Fix It)

When you try to run a 7B parameter model on your workstation, VRAM fills up instantly. Document indexing slows to a crawl. Vector searches become unusable. The solution isn't better hardware — it's distributed architecture that separates inference from indexing.

*Distributed RAG architecture with Qdrant vector engine, vLLM inference, Ollama embeddings, and Arize Phoenix observability across two machines*


📉 The RAG VRAM Problem

In the development of Retrieval-Augmented Generation (RAG) systems, one of the greatest challenges is managing computational resources efficiently. Running large language models (LLMs) locally often saturates VRAM, penalizing the speed of document indexing and retrieval.

```mermaid
flowchart TD
    subgraph Problem["Single Machine — VRAM Competition"]
        A[GPU VRAM: 24GB] --> B[vLLM LLM: 16GB]
        A --> C[Ollama Embeddings: 4GB]
        A --> D[Qdrant Vector DB: 3GB]
        A --> E[System/Other: 1GB]
        B --> F[⚠️ VRAM Exhausted]
        C --> F
        D --> F
        F --> G[❌ Slow indexing]
        F --> H[❌ Slow retrieval]
    end
```

To solve this, I designed a distributed architecture that separates responsibilities between two distinct machines, optimizing both inference and vector processing.
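The trade-off can be made concrete with a little budget arithmetic. The figures below are the illustrative numbers from the diagram above, not measurements:

```python
# Illustrative VRAM budgets in GB, taken from the single-machine diagram above.
GPU_VRAM = 24

single_machine = {"vLLM LLM": 16, "Ollama embeddings": 4, "Qdrant": 3, "System/other": 1}
distributed_local = {"Ollama embeddings": 4, "Qdrant": 3, "System/other": 2}

def headroom(budget: dict[str, int], total: int = GPU_VRAM) -> int:
    """VRAM left over after every component claims its share."""
    return total - sum(budget.values())

print(headroom(single_machine))     # 0 GB -- any spike exhausts VRAM
print(headroom(distributed_local))  # 15 GB free once the LLM moves to the server
```

Zero headroom on a single machine means any allocation spike stalls indexing; moving the 16GB LLM to the server is what frees the local GPU.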


💡 System Architecture: Two Machines, One System

The system is divided into two main components: the Client Machine (Local) and the Server Machine.

```mermaid
flowchart LR
    subgraph Client["Client Machine (Local)"]
        GPU[GPU 0 - Shared]
        Ollama[Ollama - Embeddings]
        Qdrant[Qdrant - Vector DB]
        Phoenix[Arize Phoenix - Observability]
        MCP[MCP Bridge & Indexer]
        UI[User Interface]
        GPU --> Ollama
        GPU --> Qdrant
        MCP --> Phoenix
        MCP --> UI
    end

    subgraph Server["Server Machine (192.168.0.50)"]
        vLLM[vLLM - LLM Inference]
        LLM[Large Language Model]
        vLLM --> LLM
    end

    Client -->|"HTTP API"| Server
```

🏗️ Client Machine (Local)

This machine handles orchestration, indexing, and the user interface.

GPU Sharing Strategy

| Component | VRAM Usage | Purpose | Why It Matters |
|---|---|---|---|
| Ollama | ~4GB | Generate 768-dimension embeddings | Fast document processing |
| Qdrant | ~3GB | GPU-accelerated HNSW indexing | Millisecond vector searches |
| System/Other | ~2GB | OS, MCP Bridge, UI | Development environment |
| Headroom | ~15GB | Available for operations | No VRAM competition |
```mermaid
flowchart TD
    subgraph LocalGPU["GPU 0 — Efficiently Shared"]
        O1[Ollama: Embeddings]
        Q1[Qdrant: Vector Search]
        S[System: MCP + UI]
        H[Headroom: Operations]
    end

    O1 -->|"768-dim vectors"| QdrantDB[(Qdrant Index)]
    QdrantDB -->|"HNSW search"| Q1
```
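HNSW is an approximate nearest-neighbor index; the operation it accelerates is plain similarity ranking over embedding vectors. A stdlib-only sketch of that underlying computation, with tiny 3-dimensional vectors standing in for the 768-dimensional Ollama embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity -- the distance metric used for ranking here."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy "index": 3-dim vectors stand in for the real 768-dim embeddings.
index = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
query = [1.0, 0.1, 0.0]

# Exhaustive scan is O(n) per query; HNSW approximates this ranking
# with a logarithmic number of graph hops, which is why searches stay
# in the millisecond range as the collection grows.
ranked = sorted(index, key=lambda k: cosine(index[k], query), reverse=True)
print(ranked[0])  # doc_a
```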

Local Components

| Component | Role | Impact |
|---|---|---|
| Ollama | Embedding generation (768-dim) | Fast document processing |
| Qdrant | Vector database with GPU acceleration | HNSW indexing for fast retrieval |
| Arize Phoenix | Observability and trace collection | Real-time auditing of context retrieval |
| MCP Bridge & Indexer | Central business logic | File watching, data flow orchestration |
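The indexer's call into Ollama is a single HTTP POST per document chunk. A minimal sketch against Ollama's `/api/embeddings` endpoint; the `nomic-embed-text` model name is an assumption (any 768-dimension embedding model works), and the request is built but not sent here:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # Ollama's default port

def embed_request(text: str, model: str = "nomic-embed-text") -> urllib.request.Request:
    """Build (but don't send) the embedding request issued per document chunk."""
    payload = json.dumps({"model": model, "prompt": text}).encode()
    return urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )

req = embed_request("What is paged attention?")
# Sending it would return {"embedding": [...]} -- the vector that gets
# upserted into Qdrant:
# vector = json.load(urllib.request.urlopen(req))["embedding"]
print(json.loads(req.data)["model"])
```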

🖥️ Server Machine (192.168.0.50)

This machine is dedicated to the heavy lifting of model inference.

Why vLLM?

| Feature | vLLM | Traditional Inference |
|---|---|---|
| Paged Attention | Memory-efficient KV-cache paging | Fixed memory allocation |
| Continuous Batching | Dynamic request scheduling | Batch-by-batch processing |
| VRAM Isolation | Complete separation from the local GPU | Competes with local workloads |
| Model Swapping | 8B ↔ 70B without local changes | Reinstall on local machine |

vLLM runs the Large Language Model (LLM). Keeping this process on a separate server, behind a highly optimized inference engine such as vLLM or TensorRT-LLM, avoids saturating the workstation's VRAM. Heavy inference therefore never affects the performance of the local indexer or development tools.
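vLLM exposes an OpenAI-compatible HTTP API, so the client machine only needs to point a standard chat-completions request at the server; swapping an 8B model for a 70B one changes nothing but the `model` string. A hedged sketch (port 8000 is vLLM's default, the model name is illustrative, and the request is built rather than sent):

```python
import json
import urllib.request

VLLM_URL = "http://192.168.0.50:8000/v1/chat/completions"  # vLLM's default port

def chat_request(question: str, context: str,
                 model: str = "meta-llama/Llama-3.1-8B-Instruct") -> urllib.request.Request:
    """Build the RAG prompt: retrieved context + user question, OpenAI chat format."""
    payload = {
        "model": model,  # swap 8B <-> 70B here; nothing else on the client changes
        "messages": [
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    }
    return urllib.request.Request(
        VLLM_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("What is paged attention?", context="<retrieved chunks>")
print(json.loads(req.data)["model"])
```

Because the endpoint speaks the OpenAI wire format, any off-the-shelf OpenAI client pointed at this base URL works unchanged.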


📈 Benefits of the Distributed Approach

| Benefit | Single Machine | Distributed Architecture | Improvement |
|---|---|---|---|
| VRAM Efficiency | Competing for memory | Dedicated resources | 100% isolation |
| Scalability | Upgrade entire machine | Upgrade server only | Modular growth |
| Model Flexibility | Fixed to local GPU | Swap 8B ↔ 70B freely | Free model choice |
| Observability | Limited tracing | Arize Phoenix full visibility | Complete auditing |
  1. VRAM Efficiency: By isolating the large LLM on the server, the local GPU can be fully dedicated to generating embeddings and performing fast vector searches, without competing for memory.
  2. Scalability: The LLM on the server can be upgraded or swapped (e.g., from an 8B to a 70B model) without altering the local indexing logic.
  3. Observability: The integration of Arize Phoenix in the local node ensures that each retrieval can be traced and evaluated to mitigate hallucinations.

🤔 Why This Matters for AI Production

This distributed pattern extends beyond RAG systems:

| System Type | Similar Challenge | Distributed Solution |
|---|---|---|
| Chat Applications | VRAM limits response speed | Separate inference from UI |
| Document Processing | Indexing competes with generation | GPU sharing vs dedicated inference |
| Enterprise AI | Model size vs hardware cost | Swap models without downtime |
| Debugging | Hallucinations hard to trace | Observability at every layer |

✅ Key Takeaways

  1. VRAM is the bottleneck — running LLM + embeddings + vector DB on one GPU causes competition
  2. Distributed architecture solves this — separate inference (server) from indexing (local)
  3. vLLM enables flexible inference — paged attention, continuous batching, model swapping
  4. Arize Phoenix provides observability — trace every retrieval to mitigate hallucinations
  5. Modular scalability — upgrade server independently of local development environment

🔗 Explore the Code

Want to see the full implementation? Check out the repository: 87maxi/rag_distributed_system on GitHub.


💬 Want to Contribute?

This architecture is open-source and community-driven. There are plenty of ways to contribute:

  • Add support for additional vector databases (Pinecone, Weaviate, Milvus)
  • Implement multi-server load balancing with vLLM
  • Create a Kubernetes deployment configuration
  • Improve the Arize Phoenix dashboards

Fork the repository and submit a Pull Request: 87maxi/rag_distributed_system on GitHub


🔗 Continuous Learning

Stay updated with the latest in RAG architecture and distributed AI systems:

| Resource | Description | Link |
|---|---|---|
| vLLM Documentation | Official docs for paged attention and continuous batching | vllm.ai |
| Qdrant Guides | Vector search best practices and payload filtering | qdrant.tech |
| Arize Phoenix | LLM observability and evaluation frameworks | arize.com |
| Ollama Models | Library of optimized embedding models | ollama.com |
| Docker Compose | Multi-container orchestration patterns | docs.docker.com |

Explore more deep-dive technical content from this blog:

| Article | Category | Description |
|---|---|---|
| AI-Assisted Learning: eBPF & Blockchain | Deep Tech | How I learned kernel-level programming with AI |
| Rust Rocket REST API | Systems Programming | Full-stack API with Rocket framework and SQLx |
| Hardware Interrupts & XDP | Linux Internals | Understanding interrupt storms and context switching |
| Blindando Validadores XDP | Blockchain Security | Kernel-level DoS mitigation with Rust |