System Design: AI-Powered Legal Research & Drafting Platform
The Problem With Legal Research Today
A junior solicitor in the City spends 6 hours trawling through BAILII and Westlaw trying to locate the right Court of Appeal precedent for a contractual dispute. A law student at UCL drafts a witness statement from scratch because no template exists that matches their facts. A corporate lawyer at a Magic Circle firm cross-references the Companies Act, the FCA Handbook, and case law manually to answer a client question before the morning meeting.
The legal profession is knowledge-intensive, citation-dependent, and massively under-served by modern AI tooling. The opportunity for a UK-context legal AI—one fluent in English contract law, the Proceeds of Crime Act, the Human Rights Act, and decades of Supreme Court and Court of Appeal judgments—is enormous.
This post walks through the complete system design for such a platform: an AI-powered assistant for legal research, document analysis, and draft generation at scale.
Phase 1: Requirements Clarification
Functional Requirements
AI Legal Research Assistant
- Answer natural-language legal questions with cited case law and statute references
- Example queries: “Bail provisions under the Bail Act 1976” or “Supreme Court judgments on Article 8 privacy rights under the Human Rights Act”
- Use RAG (Retrieval-Augmented Generation) to ground answers in verified legal sources
Document Upload & Analysis
- Accept PDF, DOCX, and scanned court files
- Extract text (including OCR for scanned documents)
- Summarize documents, identify legal issues, and surface related precedents
Legal Draft Generator
- Generate structured legal drafts: notices, bail applications, contracts, affidavits, and petitions
- Combine AI generation with curated templates for jurisdiction-specific formatting
Case Law Search Engine
- Keyword search and semantic (vector) search
- Filters: court, year, judge, legal provision
- Coverage: UK Supreme Court, Court of Appeal, and High Court judgments, plus UK Acts and their sections
AI Chat Interface
- Persistent conversation context
- File references inside chat
- Legal citation display with source links
Non-Functional Requirements
| Dimension | Target |
|---|---|
| Users | 100,000+ active |
| Response time | < 5 seconds end-to-end |
| Uptime | 99.9% |
| Document security | Encryption at rest + in transit |
| Access control | Role-based (lawyer, student, admin) |
| Compliance | UK GDPR + Data Protection Act 2018 |
Back-of-Envelope Estimation
| Metric | Value |
|---|---|
| Active users | 100,000 |
| Queries/user/day | 20 |
| Total daily queries | 2,000,000 |
| Avg document upload size | 2 MB |
| Documents uploaded/day | 50,000 |
| Daily document storage delta | ~100 GB |
| Vector DB entries (legal corpus) | ~50 million chunks |
| Avg embedding size (1536-dim float32) | ~6 KB |
| Total vector storage | ~300 GB |
Read-heavy workload with expensive LLM inference at the core. Caching and smart retrieval are the two biggest cost levers.
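These numbers are quick to sanity-check. A minimal sketch of the arithmetic (plain Python, no dependencies):

```python
# Sanity-check the estimation table above.
users, queries_per_user_per_day = 100_000, 20
print(users * queries_per_user_per_day)        # 2,000,000 queries/day

docs_per_day, avg_doc_mb = 50_000, 2
print(docs_per_day * avg_doc_mb / 1_000)       # ~100 GB/day of new documents

chunks, dims, bytes_per_float32 = 50_000_000, 1536, 4
embedding_kb = dims * bytes_per_float32 / 1024  # ~6 KB per embedding
print(chunks * embedding_kb / 1024 ** 2)        # ~286 GiB, i.e. ~300 GB of vectors
```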
Phase 2: High-Level Architecture
The platform is organized into six layers:
[User Clients: Web / Mobile]
│
▼
[API Gateway + Load Balancer]
│
┌─────┴──────────────────────────┐
▼ ▼
[Auth Service] [WebSocket Gateway]
│
┌──────────────────────────┼──────────────────────┐
▼ ▼ ▼
[AI Query Service] [Document Processing Service] [Draft Generation Service]
│ │ │
└──────────────────────────┴───────────────────────┘
│
┌────────────────┼────────────────┐
▼ ▼ ▼
[PostgreSQL] [Vector DB] [Object Store]
(structured) (Pinecone/Milvus) (S3 / GCS)
│
┌────────────────┘
▼
[LLM Layer: GPT-4o / Claude / Llama 3]
│
▼
[Response + Citations → Client]
Phase 3: Component Deep Dives
Frontend — React + Next.js
The client is a Next.js application with three primary surfaces:
- Chat Interface: A ChatGPT-style conversation UI with legal citation cards rendered inline. WebSocket connection maintains real-time streaming responses.
- Document Workspace: Drag-and-drop upload, processing status tracker, and an annotation panel for AI-identified legal issues.
- Draft Studio: A form-driven interface where users select draft type, fill in facts, and receive a structured AI-generated document with inline editing.
Tailwind CSS handles styling. The frontend connects to the backend exclusively through API Gateway—never directly to microservices.
API Gateway + Load Balancer
AWS API Gateway (or Kong on Kubernetes) sits in front of all services. Responsibilities:
- JWT validation on every request
- Rate limiting: 100 requests/minute for free tier, 1,000 for pro (see the Redis sketch below)
- Request routing to the correct microservice
- WebSocket protocol upgrade for the chat interface
An Application Load Balancer distributes traffic across service replicas. All services are stateless containers—session state lives in Redis.
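The tiered rate limit can be a fixed-window counter in Redis, which also keeps the gateway stateless. A minimal sketch assuming redis-py; the key naming and window size are illustrative:

```python
import time
import redis

r = redis.Redis(decode_responses=True)
TIER_LIMITS = {"free": 100, "pro": 1_000}  # requests per minute, as above

def allow_request(user_id: str, tier: str) -> bool:
    """Fixed-window rate limit: one counter per user per minute."""
    window = int(time.time() // 60)
    key = f"ratelimit:{user_id}:{window}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, 60)  # let stale windows expire on their own
    return count <= TIER_LIMITS.get(tier, TIER_LIMITS["free"])
```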
Authentication Service
Auth stack: AWS Cognito (or Auth0) for identity management, JWT for session tokens.
- Role-based access: `student`, `lawyer`, `admin`. Lawyers get access to draft generation and document uploads; students are restricted to research queries
- OAuth2 integration for Google/Microsoft login (common in legal enterprise environments)
- Document-level authorization: users can only access their own uploaded files (ownership check sketched below)
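Document-level authorization fits naturally into a FastAPI dependency. A sketch under assumptions: `current_user` and `get_document` are hypothetical helpers standing in for JWT decoding and a `documents` table lookup:

```python
from fastapi import Depends, FastAPI, HTTPException

app = FastAPI()

def current_user() -> dict:
    """Hypothetical: decodes the JWT already validated at the gateway."""
    ...

def get_document(document_id: str) -> dict | None:
    """Hypothetical: fetches a row from the PostgreSQL documents table."""
    ...

def require_owned_document(document_id: str, user: dict = Depends(current_user)) -> dict:
    doc = get_document(document_id)
    if doc is None or doc["user_id"] != user["id"]:
        # 404 rather than 403, so users cannot probe for other users' document IDs.
        raise HTTPException(status_code=404, detail="Document not found")
    return doc

@app.get("/documents/{document_id}")
def read_document(doc: dict = Depends(require_owned_document)):
    return {"id": doc["id"], "summary": doc["summary"]}
```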
AI Query Service — The RAG Pipeline
This is the heart of the platform. Every legal research query flows through a Retrieval-Augmented Generation pipeline:
User Query
│
▼
[Query Understanding] ← Intent classification + entity extraction
│ (Is this a research question? Draft request? Case search?)
▼
[Embedding Generation] ← text-embedding-3-large (OpenAI) or BGE-M3 (open source)
│
▼
[Vector Search] ← Top-K retrieval from Pinecone/Milvus
│ Namespace: supreme_court | court_of_appeal | high_court | acts | sections
▼
[Re-Ranking] ← Cohere Rerank or cross-encoder to filter top-20 → top-5
│
▼
[Context Assembly] ← Inject retrieved chunks into LLM prompt with source metadata
│
▼
[LLM Generation] ← GPT-4o / Claude 3.5 Sonnet / Llama 3 70B
│
▼
[Citation Validation] ← Verify citations exist in DB before returning to user
│
▼
[Response + Source Cards] → Client
Critical design decision: citation validation. Legal AI hallucination is a professional liability risk. Before returning any answer, a post-processing step verifies that every cited case or statute reference (e.g., a neutral citation such as “[2019] UKSC 41”) actually exists in the structured PostgreSQL database. Unverified citations are stripped and flagged. A confidence score is attached to each answer.
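A minimal sketch of that validation step, assuming a psycopg-style cursor over the cases table; the regex covers only a few neutral-citation formats and would need the full set of court codes in production:

```python
import re

# Matches neutral citations like "[2019] UKSC 41" or "[2017] EWCA Civ 123".
CITATION_RE = re.compile(r"\[\d{4}\]\s+(?:UKSC|UKHL|UKPC|EWCA\s+(?:Civ|Crim)|EWHC)\s+\d+")

def validate_citations(answer: str, cur) -> tuple[str, float]:
    """Strip citations absent from the cases table; return answer + confidence."""
    cited = set(CITATION_RE.findall(answer))
    verified = 0
    for citation in cited:
        cur.execute("SELECT 1 FROM cases WHERE citation = %s", (citation,))
        if cur.fetchone():
            verified += 1
        else:
            answer = answer.replace(citation, "[citation removed: unverified]")
    confidence = verified / len(cited) if cited else 1.0
    return answer, confidence
```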
Prompt engineering: System prompts enforce that the model only uses provided context, never fabricates citations, and always qualifies answers with a “consult a qualified solicitor or barrister” disclaimer.
Document Processing Service
When a user uploads a file, an asynchronous pipeline runs:
File Upload (S3)
│
▼
[Format Detection] ← PDF / DOCX / image
│
├── PDF with text → PDFMiner / pdfplumber
├── DOCX → python-docx
└── Scanned / image PDF → AWS Textract (OCR)
│
▼
[Text Cleaning] ← Remove headers/footers, normalize whitespace, fix OCR artifacts
│
▼
[Chunking] ← Recursive character chunking, 512 tokens, 50-token overlap
│ Legal-aware: respect paragraph boundaries in judgments
▼
[Embedding Generation]
│
▼
[Vector DB Upsert] ← Namespace: user_{user_id}_docs (isolated per user)
│
▼
[Structured Extraction] ← Identify: parties, case numbers, dates, sections cited
│ Store in PostgreSQL documents table
▼
[Summary Generation] ← LLM call with full document context for <2000-word summary
│
▼
[WebSocket Notification] ← Notify client processing is complete
AWS Textract handles scanned court files with high accuracy for printed text. For handwritten notes, accuracy degrades—surface a confidence warning to users.
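A sketch of the chunking step using LangChain's RecursiveCharacterTextSplitter; paragraph-first separators keep judgment paragraphs intact, and the character sizes approximate the 512-token / 50-token-overlap targets at roughly four characters per token:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split on paragraph boundaries first, falling back to sentences and words.
splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " "],
    chunk_size=2048,    # ~512 tokens at ~4 chars/token
    chunk_overlap=200,  # ~50 tokens of overlap for context continuity
)

def chunk_judgment(cleaned_text: str) -> list[str]:
    return splitter.split_text(cleaned_text)
```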
Draft Generation Service
Legal drafts require more structure than open-ended Q&A. The service uses a hybrid template + AI approach:
- User selects draft type and jurisdiction (e.g., “Bail Application — Crown Court, England & Wales”)
- System loads a curated structural template (headings, required sections, standard clauses for that court)
- User fills a facts form (client name, case reference, charges, key arguments)
- Service assembles a structured prompt injecting both the template skeleton and user facts (prompt assembly is sketched below)
- LLM generates the narrative body within the structure
- Post-processing applies legal formatting: correct case heading and title of proceedings, paragraph numbering conventions, and the structure of the relief sought
- Output rendered as editable DOCX for download
Templates are stored in PostgreSQL and versioned. Senior lawyers on the platform can contribute and review templates—a community quality layer.
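A sketch of the prompt-assembly step (step 4 above); the shape of `template_structure` and the field names are hypothetical:

```python
import json

def build_draft_prompt(template: dict, facts: dict) -> str:
    """Merge a versioned template skeleton with the user's facts form.

    `template` mirrors the draft_templates.template_structure JSONB column;
    its exact shape here is an assumption for illustration.
    """
    skeleton = "\n".join(
        f"{i + 1}. {section['heading']}: {section['instructions']}"
        for i, section in enumerate(template["sections"])
    )
    return (
        f"You are drafting a {template['draft_type']} for the {template['court']}.\n"
        "Follow this structure exactly. Do not invent facts or citations.\n\n"
        f"{skeleton}\n\n"
        "Facts supplied by the user:\n"
        f"{json.dumps(facts, indent=2)}"
    )
```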
Case Law Search Engine
Two search modes run in parallel and results are merged:
- Keyword Search: Elasticsearch index over case metadata (title, citation, judge, year, court, full text). Supports Boolean operators and field-specific filters.
- Semantic Search: Vector similarity search in Pinecone against pre-embedded judgment chunks. Captures conceptual matches even when exact keywords differ.
A Reciprocal Rank Fusion algorithm merges results from both modes, weighted 40% keyword / 60% semantic for legal queries (semantic captures legal reasoning patterns better than exact match).
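RRF itself is a few lines: each result contributes weight / (k + rank) to its document's fused score, and documents are re-sorted by the sum. A sketch with the 40/60 weighting above (k = 60 is the conventional smoothing constant):

```python
def reciprocal_rank_fusion(
    keyword_results: list[str],
    semantic_results: list[str],
    k: int = 60,
    weights: tuple[float, float] = (0.4, 0.6),
) -> list[str]:
    """Merge two ranked lists of case IDs into one fused ranking."""
    scores: dict[str, float] = {}
    for weight, ranking in zip(weights, (keyword_results, semantic_results)):
        for rank, case_id in enumerate(ranking, start=1):
            scores[case_id] = scores.get(case_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A case ranked highly in both lists accumulates score from both terms.
merged = reciprocal_rank_fusion(["caseA", "caseB"], ["caseB", "caseC"])
```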
Legal Dataset — Sources and Ingestion
The corpus is the platform’s competitive moat. Ingestion pipeline:
[Sources]
├── BAILII (British and Irish Legal Information Institute)
├── UK Supreme Court website
├── Court of Appeal & High Court judgment portals
├── legislation.gov.uk (primary + secondary legislation)
└── FCA Handbook, ICO guidance, SRA standards
[Ingestion Pipeline]
Crawler → HTML Cleaner → Metadata Extractor → Chunker → Embedder → Vector DB
│
└──→ Structured metadata → PostgreSQL
New judgments are ingested within 24 hours via scheduled crawlers. A deduplication step using MinHash LSH prevents re-processing of already-indexed documents.
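A sketch of the dedup check using the datasketch library; the 0.9 threshold, 128 permutations, and word 5-shingles are assumptions to tune against the corpus:

```python
from datasketch import MinHash, MinHashLSH

lsh = MinHashLSH(threshold=0.9, num_perm=128)  # ~0.9 Jaccard = near-duplicate

def minhash_of(text: str) -> MinHash:
    """MinHash over word 5-shingles of a judgment's text."""
    words = text.split()
    m = MinHash(num_perm=128)
    for i in range(max(len(words) - 4, 1)):
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

def is_new_judgment(doc_id: str, text: str) -> bool:
    """True if no near-duplicate is already indexed; registers the doc if new."""
    m = minhash_of(text)
    if lsh.query(m):
        return False
    lsh.insert(doc_id, m)
    return True
```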
Phase 4: Database Design
PostgreSQL — Structured Data
-- Core tables (simplified)
users (id, email, role, subscription_tier, created_at)
cases (
id, citation, title, court, year, judge,
full_text_s3_key, summary, date_decided,
acts_cited[], sections_cited[]
)
acts (id, title, year, chapter, full_text_s3_key)
sections (id, act_id, number, title, text)
documents (
id, user_id, filename, s3_key,
processing_status, summary,
extracted_parties, extracted_sections[],
created_at
)
queries (
id, user_id, session_id, query_text,
response_text, citations_used[],
confidence_score, created_at
)
draft_templates (
id, draft_type, court, jurisdiction,
template_structure JSONB, version, is_active
)
sessions (id, user_id, title, created_at, last_active_at)
Vector Database — Pinecone
Three namespaces with separate indexes:
- `legal-corpus`: ~50M chunks from cases, Acts, and sections. Metadata fields: `source_type`, `court`, `year`, `citation`, `act_id`, `section_number`
- `user-documents`: per-user document embeddings (isolation sketched below). Metadata: `user_id`, `document_id`, `page_number`
- `draft-templates`: embedded template descriptions for semantic template retrieval
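A sketch of the per-user isolation at upsert and query time, assuming the v3+ Pinecone Python client:

```python
from pinecone import Pinecone

pc = Pinecone(api_key="...")
index = pc.Index("user-documents")

def upsert_user_chunk(user_id: str, chunk_id: str, vector: list[float], meta: dict) -> None:
    # The per-user namespace guarantees one user's search never touches another's docs.
    index.upsert(
        vectors=[{"id": chunk_id, "values": vector, "metadata": meta}],
        namespace=f"user_{user_id}_docs",
    )

def search_user_docs(user_id: str, query_vector: list[float], top_k: int = 5):
    return index.query(
        vector=query_vector,
        top_k=top_k,
        namespace=f"user_{user_id}_docs",
        include_metadata=True,
    )
```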
Redis Cache
- Session state for WebSocket connections
- Query result cache: MD5 hash of query → cached response (TTL: 1 hour for common queries); a sketch follows this list
- Rate limiting counters
- Embedding cache for repeated queries (avoid re-embedding identical strings)
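A sketch of the query result cache, keyed by MD5 as described above (redis-py assumed; `answer_fn` stands in for the full RAG pipeline):

```python
import hashlib
import json
import redis

r = redis.Redis(decode_responses=True)
CACHE_TTL_SECONDS = 3600  # 1 hour, per the design above

def cached_answer(query: str, answer_fn) -> dict:
    """Return the cached response for this query, or compute and cache one."""
    normalized = query.strip().lower()
    key = "query:" + hashlib.md5(normalized.encode("utf-8")).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    answer = answer_fn(query)  # full RAG pipeline call
    r.setex(key, CACHE_TTL_SECONDS, json.dumps(answer))
    return answer
```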
Phase 5: Infrastructure & Deployment
Cloud Architecture (AWS)
[Route 53] → [CloudFront CDN] → [ALB]
│
[EKS Cluster]
│ │
[Services] [Workers]
│
┌────────────┼────────────────┐
▼ ▼ ▼
[RDS Postgres] [ElastiCache [S3 Buckets]
Multi-AZ Redis] (docs, models)
│
▼
[Pinecone] [OpenAI API / Bedrock]
Kubernetes on EKS: Each microservice is a separate Deployment with HPA (Horizontal Pod Autoscaler) based on CPU and custom metrics (queue depth for document processing).
Document Processing Workers: Run as Kubernetes Jobs triggered by SQS messages. Scale to zero when idle; burst to 50+ workers during peak upload times.
LLM Routing: A lightweight router selects the LLM based on task type and cost (a sketch follows the list):
- Research Q&A: GPT-4o or Claude 3.5 Sonnet (highest accuracy for legal reasoning)
- Summarization: GPT-4o-mini or Claude Haiku (cost-efficient for high volume)
- Draft generation: GPT-4o (structured output mode for reliable formatting)
- Fallback: Llama 3 70B on AWS Bedrock (no egress cost, data residency compliance)
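The router can start as a lookup table with a residency override. A minimal sketch; the model identifiers mirror the list above and the function name is hypothetical:

```python
ROUTES = {
    # task type -> (primary model, data-residency fallback)
    "research_qa":   ("gpt-4o", "llama-3-70b-bedrock"),
    "summarization": ("gpt-4o-mini", "llama-3-70b-bedrock"),
    "draft":         ("gpt-4o", "llama-3-70b-bedrock"),
}

def pick_model(task_type: str, require_data_residency: bool = False) -> str:
    """Choose an LLM by task; residency-constrained tenants stay on Bedrock."""
    primary, fallback = ROUTES.get(task_type, ROUTES["research_qa"])
    return fallback if require_data_residency else primary
```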
Security Architecture
- Encryption: AES-256 at rest (S3 SSE-KMS), TLS 1.3 in transit
- Document isolation: Each user’s uploaded documents stored in an S3 prefix with IAM conditions; vector DB namespaced per user
- PII handling: User documents are never sent to third-party LLMs without explicit consent. On-premise Llama deployment available for enterprise tier.
- Audit logging: All queries, document accesses, and draft generations logged to CloudWatch with user ID and timestamp (required for SRA and ICO compliance)
- UK GDPR & DPA 2018: Data residency in AWS eu-west-2 (London) region; data processing agreements in place with all third-party vendors
Phase 6: Key Design Decisions & Trade-offs
RAG vs. Fine-Tuned Model
Fine-tuning a model on UK legal data would improve domain fluency but creates a static knowledge base that goes stale as new judgments are issued daily. RAG keeps knowledge current without retraining. We use RAG with a well-prompted general LLM, and accept slightly lower stylistic fluency in exchange for always-current citations.
Pinecone vs. Self-Hosted Milvus
Pinecone is fully managed and requires zero operational overhead. At 50M vectors, monthly cost is ~$700. Milvus on Kubernetes is free but requires a dedicated ops team. For a startup phase, Pinecone; for scale (500M+ vectors), migrate to self-hosted Milvus or Weaviate.
Streaming Responses vs. Batch
Legal research answers can be long. Streaming tokens via WebSocket dramatically improves perceived performance—users see the answer forming in real time rather than waiting 8–12 seconds for a complete response. All LLM calls use server-sent streaming.
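A sketch of token streaming over the WebSocket, assuming the OpenAI Python SDK and a FastAPI endpoint; `build_prompt_with_context` is a hypothetical helper for the retrieval step:

```python
from fastapi import FastAPI, WebSocket
from openai import AsyncOpenAI

app = FastAPI()
llm = AsyncOpenAI()

def build_prompt_with_context(question: str) -> list[dict]:
    """Hypothetical: runs retrieval and assembles the grounded prompt."""
    ...

@app.websocket("/ws/chat")
async def chat(ws: WebSocket) -> None:
    await ws.accept()
    question = await ws.receive_text()
    stream = await llm.chat.completions.create(
        model="gpt-4o",
        messages=build_prompt_with_context(question),
        stream=True,
    )
    async for chunk in stream:
        token = chunk.choices[0].delta.content
        if token:
            await ws.send_text(token)  # client renders tokens as they arrive
    await ws.close()
```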
Monolith vs. Microservices
We adopt microservices from day one because the five core services have wildly different scaling profiles: document processing is bursty and CPU-intensive; the AI query service is memory-intensive and latency-sensitive; the search service scales with read volume. A monolith would force all services to scale together, wasting cost.
Measuring Platform Health
Track these metrics weekly:
- Retrieval Accuracy: % of citations retrieved that are relevant (target: >90%)
- Hallucination Rate: Citations returned that don’t exist in the corpus (target: <1%)
- P95 Query Latency: 95th percentile end-to-end response time (target: <5s)
- Document Processing SLA: % of documents fully processed within 2 minutes (target: >95%)
- User Retention (D30): % of users returning after 30 days (north star metric)
Cost Estimation (100k Users, Monthly)
| Component | Cost/Month |
|---|---|
| AWS EKS + EC2 (compute) | ~$4,500 |
| RDS PostgreSQL Multi-AZ | ~$800 |
| ElastiCache Redis | ~$300 |
| S3 storage (docs + corpus) | ~$500 |
| Pinecone (50M vectors) | ~$700 |
| OpenAI API (2M queries @ GPT-4o) | ~$12,000 |
| AWS Textract (OCR) | ~$1,500 |
| CloudFront + data transfer | ~$400 |
| Misc (monitoring, logging, CDN) | ~$600 |
| Total | ~$21,300/month |
At £9.99/month per pro user (about $12.50), roughly 1,700 paying subscribers cover the infrastructure bill alone. At 10,000 paying users, margin is healthy. The dominant cost is LLM inference; aggressive caching of common queries (20–30% of queries are near-duplicates) meaningfully reduces it.
Future Roadmap
The architecture is designed to accommodate these without re-platforming:
- Voice Legal Assistant: Add a speech-to-text layer (Whisper API) in front of the AI Query Service; spoken queries hit the same RAG pipeline
- Judgment Outcome Prediction: Train a classification model on historical case data; serve as a separate microservice with appropriate disclaimers
- Multi-Language Support: Add a translation layer (e.g., Welsh for bilingual legislation and proceedings) before embedding generation
- Lawyer Collaboration Tools: Shared document workspaces with comment threads; built on top of the existing document service with multi-user access controls
- Court Filing Integration: API integrations with HMCTS services (e.g., CE-File) and The National Archives' Find Case Law for direct case status and judgment lookups
Conclusion: Building the Legal Brain of the Future
An AI legal platform is fundamentally a knowledge retrieval and reasoning system with extreme accuracy requirements. The architecture choices here—RAG over fine-tuning, citation validation before every response, hybrid keyword+semantic search—all flow from one principle: in law, a wrong answer isn’t just unhelpful, it’s harmful.
The technology stack (Next.js → FastAPI → Kinesis/SQS → RAG pipeline → PostgreSQL + Pinecone) is proven at scale. The real moat is the legal corpus: the breadth, freshness, and quality of the indexed UK legal data—BAILII judgments, legislation.gov.uk statutes, FCA and SRA guidance—will determine whether this platform becomes essential to the legal profession.
Build the data pipeline right, validate every citation, and ship the simplest version that gives a junior solicitor their first correct answer in 10 seconds instead of 6 hours. The rest follows. ⚖️



