💡 For AI engineers and CTOs evaluating whether to run LLMs locally — and why distributed architecture is the answer to VRAM bottlenecks.
🎯 Why Running LLMs Locally Is a VRAM Nightmare (And How to Fix It)
When you try to run a 7B parameter model on your workstation, VRAM fills up instantly. Document indexing slows to a crawl. Vector searches become unusable. The solution isn't better hardware — it's distributed architecture that separates inference from indexing.
Technical subtitle: Distributed RAG architecture with Qdrant vector engine, vLLM inference, Ollama embeddings, and Arize Phoenix observability across two machines
📋 Table of Contents
- The RAG VRAM Problem
- System Architecture: Two Machines, One System
- Client Machine (Local)
- Server Machine (192.168.0.50)
- Benefits of the Distributed Approach
- Why This Matters for AI Production
- Key Takeaways
- Explore the Code
- Contribute
- Continuous Learning
- Related Articles
⚠️ The RAG VRAM Problem
When developing Retrieval-Augmented Generation (RAG) systems, one of the biggest challenges is managing computational resources efficiently. Running large language models (LLMs) locally often saturates VRAM, which slows down both document indexing and retrieval.
```mermaid
flowchart TD
    subgraph Problem["Single Machine — VRAM Competition"]
        A[GPU VRAM: 24GB] --> B[vLLM LLM: 16GB]
        A --> C[Ollama Embeddings: 4GB]
        A --> D[Qdrant Vector DB: 3GB]
        A --> E[System/Other: 1GB]
        B --> F[⚠️ VRAM Exhausted]
        C --> F
        D --> F
        F --> G[❌ Slow indexing]
        F --> H[❌ Slow retrieval]
    end
```

To solve this, I designed a distributed architecture that separates responsibilities between two distinct machines, optimizing both inference and vector processing.
💡 System Architecture: Two Machines, One System
The system is divided into two main components: the Client Machine (Local) and the Server Machine.
```mermaid
flowchart LR
    subgraph Client["Client Machine (Local)"]
        GPU[GPU 0 - Shared]
        Ollama[Ollama - Embeddings]
        Qdrant[Qdrant - Vector DB]
        Phoenix[Arize Phoenix - Observability]
        MCP[MCP Bridge & Indexer]
        UI[User Interface]
        GPU --> Ollama
        GPU --> Qdrant
        MCP --> Phoenix
        MCP --> UI
    end
    subgraph Server["Server Machine (192.168.0.50)"]
        vLLM[vLLM - LLM Inference]
        LLM[Large Language Model]
        vLLM --> LLM
    end
    Client -->|"HTTP API"| Server
```

🏗️ Client Machine (Local)
This machine handles orchestration, indexing, and the user interface.
GPU Sharing Strategy
| Component | VRAM Usage | Purpose | Why It Matters |
|---|---|---|---|
| Ollama | ~4GB | Generate 768-dimension embeddings | Fast document processing |
| Qdrant | ~3GB | GPU-accelerated HNSW indexing | Millisecond vector searches |
| System/Other | ~2GB | OS, MCP Bridge, UI | Development environment |
| Headroom | ~15GB | Available for operations | No VRAM competition |
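To make the HNSW side of this concrete, here is a minimal sketch (in Python, using the `qdrant-client` package) of how a 768-dimension collection could be created with an explicit HNSW configuration. The collection name, local URL, and HNSW parameters are illustrative assumptions, not values taken from the repository.

```python
# Sketch: create a Qdrant collection tuned for 768-dim embeddings with HNSW indexing.
# Collection name, URL, and HNSW parameters are illustrative assumptions.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumed local Qdrant endpoint

client.create_collection(
    collection_name="documents",          # hypothetical collection name
    vectors_config=models.VectorParams(
        size=768,                         # matches the 768-dim Ollama embeddings
        distance=models.Distance.COSINE,
    ),
    hnsw_config=models.HnswConfigDiff(
        m=16,              # graph connectivity: higher = better recall, more memory
        ef_construct=128,  # build-time accuracy/speed trade-off
    ),
)
```

Higher `m` and `ef_construct` values trade extra build time and memory for better recall — exactly the budget the table above reserves for Qdrant.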
```mermaid
flowchart TD
    subgraph LocalGPU["GPU 0 — Efficiently Shared"]
        O1[Ollama: Embeddings]
        Q1[Qdrant: Vector Search]
        S[System: MCP + UI]
        H[Headroom: Operations]
    end
    O1 -->|"768-dim vectors"| QdrantDB[(Qdrant Index)]
    QdrantDB -->|"HNSW search"| Q1
```

Local Components
| Component | Role | Impact |
|---|---|---|
| Ollama | Embedding generation (768-dim) | Agile document processing |
| Qdrant | Vector database with GPU acceleration | HNSW indexing for fast retrieval |
| Arize Phoenix | Observability and trace collection | Real-time auditing of context retrieval |
| MCP Bridge & Indexer | Central business logic | File watching, data flow orchestration |
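Putting these components together, the indexer's data flow boils down to embed → upsert → search. The snippet below is a minimal sketch of that loop using Ollama's REST embedding endpoint and the Qdrant client; the model name (`nomic-embed-text`), endpoints, and collection name are assumptions for illustration, not the repository's actual code.

```python
# Sketch of the indexer's embed -> upsert -> search loop.
# Model name, endpoints, and collection name are illustrative assumptions.
import requests
from qdrant_client import QdrantClient, models

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # Ollama's embedding REST endpoint
qdrant = QdrantClient(url="http://localhost:6333")

def embed(text: str) -> list[float]:
    """Generate a 768-dim embedding with a local Ollama model."""
    resp = requests.post(OLLAMA_URL, json={"model": "nomic-embed-text", "prompt": text})
    resp.raise_for_status()
    return resp.json()["embedding"]

def index_document(doc_id: int, text: str) -> None:
    """Upsert one document chunk into the local Qdrant collection."""
    qdrant.upsert(
        collection_name="documents",
        points=[models.PointStruct(id=doc_id, vector=embed(text), payload={"text": text})],
    )

def retrieve(query: str, top_k: int = 5) -> list[str]:
    """HNSW search over the indexed chunks; returns the closest payloads."""
    hits = qdrant.search(collection_name="documents", query_vector=embed(query), limit=top_k)
    return [hit.payload["text"] for hit in hits]
```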
🖥️ Server Machine (192.168.0.50)
This machine is dedicated to the heavy lifting of model inference.
Why vLLM?
| Feature | vLLM | Traditional Inference |
|---|---|---|
| Paged Attention | Memory-efficient tensor storage | Fixed memory allocation |
| Continuous Batching | Dynamic request scheduling | Batch-by-batch processing |
| VRAM Isolation | Doesn't affect local GPU | Competes for local VRAM |
| Model Swapping | 8B ↔ 70B without local changes | Reinstall on local machine |
vLLM hosts the large language model (LLM). Keeping this process on a separate server, behind a highly optimized inference engine such as vLLM or TensorRT-LLM, avoids saturating the main workstation's VRAM, so heavy inference never degrades the performance of the local indexer or development tools.
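From the client's perspective, the server is just an HTTP endpoint. Here is a minimal sketch of calling vLLM's OpenAI-compatible chat API on the server machine; the port (8000) and model name are assumptions for illustration.

```python
# Sketch: query the remote vLLM server (OpenAI-compatible API) with retrieved context.
# Port and model name are illustrative assumptions.
import requests

VLLM_URL = "http://192.168.0.50:8000/v1/chat/completions"  # vLLM's OpenAI-style endpoint (port assumed)

def generate_answer(question: str, context_chunks: list[str]) -> str:
    """Send the retrieved context plus the user question to the remote LLM."""
    system_prompt = "Answer using only this context:\n" + "\n---\n".join(context_chunks)
    resp = requests.post(
        VLLM_URL,
        json={
            "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # hypothetical model; swappable 8B <-> 70B
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": question},
            ],
            "max_tokens": 512,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```

Because the inference endpoint is just a URL, swapping the 8B model for a 70B model on the server requires no change on the client side.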
📈 Benefits of the Distributed Approach
| Benefit | Single Machine | Distributed Architecture | Improvement |
|---|---|---|---|
| VRAM Efficiency | Competing for memory | Dedicated resources | 100% isolation |
| Scalability | Upgrade entire machine | Upgrade server only | Modular growth |
| Model Flexibility | Fixed to local GPU | Swap 8B ↔ 70B freely | Unlimited options |
| Observability | Limited tracing | Arize Phoenix full visibility | Complete auditing |
- VRAM Efficiency: By isolating the large LLM on the server, the local GPU can be fully dedicated to generating embeddings and performing fast vector searches, without competing for memory.
- Scalability: The LLM on the server can be upgraded or swapped (e.g., from an 8B to a 70B model) without altering the local indexing logic.
- Observability: Integrating Arize Phoenix on the local node ensures that every retrieval can be traced and evaluated to mitigate hallucinations (see the sketch below).
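To illustrate the observability point, the sketch below wraps a retrieval in an OpenTelemetry span and exports it to Phoenix's OTLP endpoint. The endpoint URL and attribute names are assumptions, and the actual project may rely on Phoenix's own instrumentation helpers instead.

```python
# Sketch: trace each retrieval as an OpenTelemetry span exported to Arize Phoenix.
# The OTLP endpoint and attribute names are illustrative assumptions.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:6006/v1/traces"))  # assumed Phoenix endpoint
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("rag.indexer")

def traced_retrieve(query: str, top_k: int = 5) -> list[str]:
    """Run a retrieval and record its inputs/outputs so Phoenix can audit it."""
    with tracer.start_as_current_span("retrieval") as span:
        span.set_attribute("retrieval.query", query)
        span.set_attribute("retrieval.top_k", top_k)
        chunks = retrieve(query, top_k)  # retrieve() from the indexer sketch above
        span.set_attribute("retrieval.num_chunks", len(chunks))
        return chunks
```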
🤔 Why This Matters for AI Production
This distributed pattern extends beyond RAG systems:
| System Type | Similar Challenge | Distributed Solution |
|---|---|---|
| Chat Applications | VRAM limits response speed | Separate inference from UI |
| Document Processing | Indexing competes with generation | GPU sharing vs dedicated inference |
| Enterprise AI | Model size vs hardware cost | Swap models without downtime |
| Debugging | Hallucinations hard to trace | Observability at every layer |
✅ Key Takeaways
- VRAM is the bottleneck — running LLM + embeddings + vector DB on one GPU causes competition
- Distributed architecture solves this — separate inference (server) from indexing (local)
- vLLM enables flexible inference — paged attention, continuous batching, model swapping
- Arize Phoenix provides observability — trace every retrieval to mitigate hallucinations
- Modular scalability — upgrade server independently of local development environment
🔗 Explore the Code
Want to see the full implementation? Check out the repository: 87maxi/rag_distributed_system on GitHub:
- `docker-compose.yml` — Deployment configuration for both machines
- `client/` — Local orchestration, MCP Bridge, and Indexer
- `server/` — vLLM inference server configuration
- `observability/` — Arize Phoenix tracing setup
💬 Want to Contribute?
This architecture is open-source and community-driven. Whether you want to:
- Add support for additional vector databases (Pinecone, Weaviate, Milvus)
- Implement multi-server load balancing with vLLM
- Create a Kubernetes deployment configuration
- Improve the Arize Phoenix dashboards
Fork the repository and submit a Pull Request: 87maxi/rag_distributed_system on GitHub
🔗 Continuous Learning
Stay updated with the latest in RAG architecture and distributed AI systems:
| Resource | Description | Link |
|---|---|---|
| vLLM Documentation | Official docs for paged attention and continuous batching | vllm.ai |
| Qdrant Guides | Vector search and filtering best practices | qdrant.tech |
| Arize Phoenix | LLM observability and evaluation frameworks | arize.com |
| Ollama Models | Library of optimized embedding models | ollama.com |
| Docker Compose | Multi-container orchestration patterns | docs.docker.com |
🔗 Related Articles
Explore more deep-dive technical content from this blog:
| Article | Category | Description |
|---|---|---|
| AI-Assisted Learning: eBPF & Blockchain | Deep Tech | How I learned kernel-level programming with AI |
| Rust Rocket REST API | Systems Programming | Full-stack API with Rocket framework and SQLx |
| Hardware Interrupts & XDP | Linux Internals | Understanding interrupt storms and context switching |
| Blindando Validadores XDP | Blockchain Security | Kernel-level DoS mitigation with Rust |