System Design: AI-Powered Legal Research & Drafting Platform
The Problem With Legal Research Today
A junior solicitor in the City spends 6 hours trawling through BAILII and Westlaw trying to locate the right Court of Appeal precedent for a contractual dispute. A law student at UCL drafts a witness statement from scratch because no template exists that matches their facts. A corporate lawyer at a Magic Circle firm cross-references the Companies Act, the FCA Handbook, and case law manually to answer a client question before the morning meeting.
The legal profession is knowledge-intensive, citation-dependent, and massively under-served by modern AI tooling. The opportunity for a UK-context legal AI—one fluent in English contract law, the Proceeds of Crime Act, the Human Rights Act, and decades of Supreme Court and Court of Appeal judgments—is enormous.
This post walks through the complete system design for such a platform: an AI-powered assistant for legal research, document analysis, and draft generation at scale.
Phase 1: Requirements Clarification
Functional Requirements
AI Legal Research Assistant
- Answer natural-language legal questions with cited case law and statute references
- Example queries: “Bail provisions under the Bail Act 1976” or “Supreme Court judgments on Article 8 privacy rights under the Human Rights Act”
- Use RAG (Retrieval-Augmented Generation) to ground answers in verified legal sources
Document Upload & Analysis
- Accept PDF, DOCX, and scanned court files
- Extract text (including OCR for scanned documents)
- Summarize documents, identify legal issues, and surface related precedents
Legal Draft Generator
- Generate structured legal drafts: notices, bail applications, contracts, affidavits, and petitions
- Combine AI generation with curated templates for jurisdiction-specific formatting
Case Law Search Engine
- Keyword search and semantic (vector) search
- Filters: court, year, judge, legal provision
- Coverage: UK Supreme Court, Court of Appeal, and High Court judgments, plus UK Acts and their sections
AI Chat Interface
- Persistent conversation context
- File references inside chat
- Legal citation display with source links
Non-Functional Requirements
| Dimension | Target |
|---|---|
| Users | 100,000+ active |
| Response time | < 5 seconds end-to-end |
| Uptime | 99.9% |
| Document security | Encryption at rest + in transit |
| Access control | Role-based (lawyer, student, admin) |
| Compliance | UK GDPR + Data Protection Act 2018 |
Back-of-Envelope Estimation
| Metric | Value |
|---|---|
| Active users | 100,000 |
| Queries/user/day | 20 |
| Total daily queries | 2,000,000 |
| Avg document upload size | 2 MB |
| Documents uploaded/day | 50,000 |
| Daily document storage delta | ~100 GB |
| Vector DB entries (legal corpus) | ~50 million chunks |
| Avg embedding size (1536-dim float32) | ~6 KB |
| Total vector storage | ~300 GB |
Read-heavy workload with expensive LLM inference at the core. Caching and smart retrieval are the two biggest cost levers.
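These numbers are quick to sanity-check. A minimal sketch of the arithmetic (plain Python, no dependencies):

```python
# Sanity-check the estimation table above.
users, queries_per_user_per_day = 100_000, 20
print(users * queries_per_user_per_day)        # 2,000,000 queries/day

docs_per_day, avg_doc_mb = 50_000, 2
print(docs_per_day * avg_doc_mb / 1_000)       # ~100 GB/day of new documents

chunks, dims, bytes_per_float32 = 50_000_000, 1536, 4
embedding_kb = dims * bytes_per_float32 / 1024  # ~6 KB per embedding
print(chunks * embedding_kb / 1024 ** 2)        # ~286 GiB, i.e. ~300 GB of vectors
```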
Phase 2: High-Level Architecture
The platform is organized into six layers:
[User Clients: Web / Mobile]
│
▼
[API Gateway + Load Balancer]
│
┌─────┴──────────────────────────┐
▼ ▼
[Auth Service] [WebSocket Gateway]
│
┌──────────────────────────┼──────────────────────┐
▼ ▼ ▼
[AI Query Service] [Document Processing Service] [Draft Generation Service]
│ │ │
└──────────────────────────┴───────────────────────┘
│
┌────────────────┼────────────────┐
▼ ▼ ▼
[PostgreSQL] [Vector DB] [Object Store]
(structured) (Pinecone/Milvus) (S3 / GCS)
│
┌────────────────┘
▼
[LLM Layer: GPT-4o / Claude / Llama 3]
│
▼
[Response + Citations → Client]
Phase 3: Component Deep Dives
Frontend — React + Next.js
The client is a Next.js application with three primary surfaces:
- Chat Interface: A ChatGPT-style conversation UI with legal citation cards rendered inline. WebSocket connection maintains real-time streaming responses.
- Document Workspace: Drag-and-drop upload, processing status tracker, and an annotation panel for AI-identified legal issues.
- Draft Studio: A form-driven interface where users select draft type, fill in facts, and receive a structured AI-generated document with inline editing.
Tailwind CSS handles styling. The frontend connects to the backend exclusively through API Gateway—never directly to microservices.
API Gateway + Load Balancer
AWS API Gateway (or Kong on Kubernetes) sits in front of all services. Responsibilities:
- JWT validation on every request
- Rate limiting: 100 requests/minute for free tier, 1,000 for pro (see the Redis sketch below)
- Request routing to the correct microservice
- WebSocket protocol upgrade for the chat interface
An Application Load Balancer distributes traffic across service replicas. All services are stateless containers—session state lives in Redis.
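The tiered rate limit can be a fixed-window counter in Redis, which also keeps the gateway stateless. A minimal sketch assuming redis-py; the key naming and window size are illustrative:

```python
import time
import redis

r = redis.Redis(decode_responses=True)
TIER_LIMITS = {"free": 100, "pro": 1_000}  # requests per minute, as above

def allow_request(user_id: str, tier: str) -> bool:
    """Fixed-window rate limit: one counter per user per minute."""
    window = int(time.time() // 60)
    key = f"ratelimit:{user_id}:{window}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, 60)  # let stale windows expire on their own
    return count <= TIER_LIMITS.get(tier, TIER_LIMITS["free"])
```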
Authentication Service
Auth stack: AWS Cognito (or Auth0) for identity management, JWT for session tokens.
- Role-based access: `student`, `lawyer`, `admin`. Lawyers get access to draft generation and document uploads; students are restricted to research queries
- OAuth2 integration for Google/Microsoft login (common in legal enterprise environments)
- Document-level authorization: users can only access their own uploaded files (ownership check sketched below)
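Document-level authorization fits naturally into a FastAPI dependency. A sketch under assumptions: `current_user` and `get_document` are hypothetical helpers standing in for JWT decoding and a `documents` table lookup:

```python
from fastapi import Depends, FastAPI, HTTPException

app = FastAPI()

def current_user() -> dict:
    """Hypothetical: decodes the JWT already validated at the gateway."""
    ...

def get_document(document_id: str) -> dict | None:
    """Hypothetical: fetches a row from the PostgreSQL documents table."""
    ...

def require_owned_document(document_id: str, user: dict = Depends(current_user)) -> dict:
    doc = get_document(document_id)
    if doc is None or doc["user_id"] != user["id"]:
        # 404 rather than 403, so users cannot probe for other users' document IDs.
        raise HTTPException(status_code=404, detail="Document not found")
    return doc

@app.get("/documents/{document_id}")
def read_document(doc: dict = Depends(require_owned_document)):
    return {"id": doc["id"], "summary": doc["summary"]}
```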
AI Query Service — The RAG Pipeline
This is the heart of the platform. Every legal research query flows through a Retrieval-Augmented Generation pipeline:
User Query
│
▼
[Query Understanding] ← Intent classification + entity extraction
│ (Is this a research question? Draft request? Case search?)
▼
[Embedding Generation] ← text-embedding-3-large (OpenAI) or BGE-M3 (open source)
│
▼
[Vector Search] ← Top-K retrieval from Pinecone/Milvus
│ Namespace: supreme_court | court_of_appeal | high_court | acts | sections
▼
[Re-Ranking] ← Cohere Rerank or cross-encoder to filter top-20 → top-5
│
▼
[Context Assembly] ← Inject retrieved chunks into LLM prompt with source metadata
│
▼
[LLM Generation] ← GPT-4o / Claude 3.5 Sonnet / Llama 3 70B
│
▼
[Citation Validation] ← Verify citations exist in DB before returning to user
│
▼
[Response + Source Cards] → Client
Critical design decision: citation validation. Legal AI hallucination is a professional liability risk. Before returning any answer, a post-processing step verifies that every cited case or statute reference (e.g., a neutral citation such as “[2019] UKSC 41”) actually exists in the structured PostgreSQL database. Unverified citations are stripped and flagged. A confidence score is attached to each answer.
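A minimal sketch of that validation step, assuming a psycopg-style cursor over the cases table; the regex covers only a few neutral-citation formats and would need the full set of court codes in production:

```python
import re

# Matches neutral citations like "[2019] UKSC 41" or "[2017] EWCA Civ 123".
CITATION_RE = re.compile(r"\[\d{4}\]\s+(?:UKSC|UKHL|UKPC|EWCA\s+(?:Civ|Crim)|EWHC)\s+\d+")

def validate_citations(answer: str, cur) -> tuple[str, float]:
    """Strip citations absent from the cases table; return answer + confidence."""
    cited = set(CITATION_RE.findall(answer))
    verified = 0
    for citation in cited:
        cur.execute("SELECT 1 FROM cases WHERE citation = %s", (citation,))
        if cur.fetchone():
            verified += 1
        else:
            answer = answer.replace(citation, "[citation removed: unverified]")
    confidence = verified / len(cited) if cited else 1.0
    return answer, confidence
```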
Prompt engineering: System prompts enforce that the model only uses provided context, never fabricates citations, and always qualifies answers with a “consult a qualified solicitor or barrister” disclaimer.
Document Processing Service
When a user uploads a file, an asynchronous pipeline runs:
File Upload (S3)
│
▼
[Format Detection] ← PDF / DOCX / image
│
├── PDF with text → PDFMiner / pdfplumber
├── DOCX → python-docx
└── Scanned / image PDF → AWS Textract (OCR)
│
▼
[Text Cleaning] ← Remove headers/footers, normalize whitespace, fix OCR artifacts
│
▼
[Chunking] ← Recursive character chunking, 512 tokens, 50-token overlap
│ Legal-aware: respect paragraph boundaries in judgments
▼
[Embedding Generation]
│
▼
[Vector DB Upsert] ← Namespace: user_{user_id}_docs (isolated per user)
│
▼
[Structured Extraction] ← Identify: parties, case numbers, dates, sections cited
│ Store in PostgreSQL documents table
▼
[Summary Generation] ← LLM call with full document context for <2000-word summary
│
▼
[WebSocket Notification] ← Notify client processing is complete
AWS Textract handles scanned court files with high accuracy for printed text. For handwritten notes, accuracy degrades—surface a confidence warning to users.
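A sketch of the chunking step using LangChain's RecursiveCharacterTextSplitter; paragraph-first separators keep judgment paragraphs intact, and the character sizes approximate the 512-token / 50-token-overlap targets at roughly four characters per token:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split on paragraph boundaries first, falling back to sentences and words.
splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " "],
    chunk_size=2048,    # ~512 tokens at ~4 chars/token
    chunk_overlap=200,  # ~50 tokens of overlap for context continuity
)

def chunk_judgment(cleaned_text: str) -> list[str]:
    return splitter.split_text(cleaned_text)
```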
Draft Generation Service
Legal drafts require more structure than open-ended Q&A. The service uses a hybrid template + AI approach:
- User selects draft type and jurisdiction (e.g., “Bail Application — Crown Court, England & Wales”)
- System loads a curated structural template (headings, required sections, standard clauses for that court)
- User fills a facts form (client name, case reference, charges, key arguments)
- Service assembles a structured prompt injecting both the template skeleton and user facts (prompt assembly is sketched below)
- LLM generates the narrative body within the structure
- Post-processing applies legal formatting: correct case heading and title of proceedings, paragraph numbering conventions, and the structure of the relief sought
- Output rendered as editable DOCX for download
Templates are stored in PostgreSQL and versioned. Senior lawyers on the platform can contribute and review templates—a community quality layer.
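A sketch of the prompt-assembly step (step 4 above); the shape of `template_structure` and the field names are hypothetical:

```python
import json

def build_draft_prompt(template: dict, facts: dict) -> str:
    """Merge a versioned template skeleton with the user's facts form.

    `template` mirrors the draft_templates.template_structure JSONB column;
    its exact shape here is an assumption for illustration.
    """
    skeleton = "\n".join(
        f"{i + 1}. {section['heading']}: {section['instructions']}"
        for i, section in enumerate(template["sections"])
    )
    return (
        f"You are drafting a {template['draft_type']} for the {template['court']}.\n"
        "Follow this structure exactly. Do not invent facts or citations.\n\n"
        f"{skeleton}\n\n"
        "Facts supplied by the user:\n"
        f"{json.dumps(facts, indent=2)}"
    )
```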
Case Law Search Engine
Two search modes run in parallel and results are merged:
- Keyword Search: Elasticsearch index over case metadata (title, citation, judge, year, court, full text). Supports Boolean operators and field-specific filters.
- Semantic Search: Vector similarity search in Pinecone against pre-embedded judgment chunks. Captures conceptual matches even when exact keywords differ.
A Reciprocal Rank Fusion algorithm merges results from both modes, weighted 40% keyword / 60% semantic for legal queries (semantic captures legal reasoning patterns better than exact match).
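RRF itself is a few lines: each result contributes weight / (k + rank) to its document's fused score, and documents are re-sorted by the sum. A sketch with the 40/60 weighting above (k = 60 is the conventional smoothing constant):

```python
def reciprocal_rank_fusion(
    keyword_results: list[str],
    semantic_results: list[str],
    k: int = 60,
    weights: tuple[float, float] = (0.4, 0.6),
) -> list[str]:
    """Merge two ranked lists of case IDs into one fused ranking."""
    scores: dict[str, float] = {}
    for weight, ranking in zip(weights, (keyword_results, semantic_results)):
        for rank, case_id in enumerate(ranking, start=1):
            scores[case_id] = scores.get(case_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A case ranked highly in both lists accumulates score from both terms.
merged = reciprocal_rank_fusion(["caseA", "caseB"], ["caseB", "caseC"])
```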
Legal Dataset — Sources and Ingestion
The corpus is the platform’s competitive moat. Ingestion pipeline:
[Sources]
├── BAILII (British and Irish Legal Information Institute)
├── UK Supreme Court website
├── Court of Appeal & High Court judgment portals
├── legislation.gov.uk (primary + secondary legislation)
└── FCA Handbook, ICO guidance, SRA standards
[Ingestion Pipeline]
Crawler → HTML Cleaner → Metadata Extractor → Chunker → Embedder → Vector DB
│
└──→ Structured metadata → PostgreSQL
New judgments are ingested within 24 hours via scheduled crawlers. A deduplication step using MinHash LSH prevents re-processing of already-indexed documents.
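A sketch of the dedup check using the datasketch library; the 0.9 threshold, 128 permutations, and word 5-shingles are assumptions to tune against the corpus:

```python
from datasketch import MinHash, MinHashLSH

lsh = MinHashLSH(threshold=0.9, num_perm=128)  # ~0.9 Jaccard = near-duplicate

def minhash_of(text: str) -> MinHash:
    """MinHash over word 5-shingles of a judgment's text."""
    words = text.split()
    m = MinHash(num_perm=128)
    for i in range(max(len(words) - 4, 1)):
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

def is_new_judgment(doc_id: str, text: str) -> bool:
    """True if no near-duplicate is already indexed; registers the doc if new."""
    m = minhash_of(text)
    if lsh.query(m):
        return False
    lsh.insert(doc_id, m)
    return True
```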
Phase 4: Database Design
PostgreSQL — Structured Data
-- Core tables (simplified)
users (id, email, role, subscription_tier, created_at)
cases (
id, citation, title, court, year, judge,
full_text_s3_key, summary, date_decided,
acts_cited[], sections_cited[]
)
acts (id, title, year, chapter, full_text_s3_key)
sections (id, act_id, number, title, text)
documents (
id, user_id, filename, s3_key,
processing_status, summary,
extracted_parties, extracted_sections[],
created_at
)
queries (
id, user_id, session_id, query_text,
response_text, citations_used[],
confidence_score, created_at
)
draft_templates (
id, draft_type, court, jurisdiction,
template_structure JSONB, version, is_active
)
sessions (id, user_id, title, created_at, last_active_at)
Vector Database — Pinecone
Three namespaces with separate indexes:
- `legal-corpus`: ~50M chunks from cases, Acts, and sections. Metadata fields: `source_type`, `court`, `year`, `citation`, `act_id`, `section_number`
- `user-documents`: per-user document embeddings (isolation sketched below). Metadata: `user_id`, `document_id`, `page_number`
- `draft-templates`: embedded template descriptions for semantic template retrieval
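A sketch of the per-user isolation at upsert and query time, assuming the v3+ Pinecone Python client:

```python
from pinecone import Pinecone

pc = Pinecone(api_key="...")
index = pc.Index("user-documents")

def upsert_user_chunk(user_id: str, chunk_id: str, vector: list[float], meta: dict) -> None:
    # The per-user namespace guarantees one user's search never touches another's docs.
    index.upsert(
        vectors=[{"id": chunk_id, "values": vector, "metadata": meta}],
        namespace=f"user_{user_id}_docs",
    )

def search_user_docs(user_id: str, query_vector: list[float], top_k: int = 5):
    return index.query(
        vector=query_vector,
        top_k=top_k,
        namespace=f"user_{user_id}_docs",
        include_metadata=True,
    )
```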
Redis Cache
- Session state for WebSocket connections
- Query result cache: MD5 hash of query → cached response (TTL: 1 hour for common queries); a sketch follows this list
- Rate limiting counters
- Embedding cache for repeated queries (avoid re-embedding identical strings)
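A sketch of the query result cache, keyed by MD5 as described above (redis-py assumed; `answer_fn` stands in for the full RAG pipeline):

```python
import hashlib
import json
import redis

r = redis.Redis(decode_responses=True)
CACHE_TTL_SECONDS = 3600  # 1 hour, per the design above

def cached_answer(query: str, answer_fn) -> dict:
    """Return the cached response for this query, or compute and cache one."""
    normalized = query.strip().lower()
    key = "query:" + hashlib.md5(normalized.encode("utf-8")).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    answer = answer_fn(query)  # full RAG pipeline call
    r.setex(key, CACHE_TTL_SECONDS, json.dumps(answer))
    return answer
```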
Phase 5: Infrastructure & Deployment
Cloud Architecture (AWS)
[Route 53] → [CloudFront CDN] → [ALB]
│
[EKS Cluster]
│ │
[Services] [Workers]
│
┌────────────┼────────────────┐
▼ ▼ ▼
[RDS Postgres] [ElastiCache [S3 Buckets]
Multi-AZ Redis] (docs, models)
│
▼
[Pinecone] [OpenAI API / Bedrock]
Kubernetes on EKS: Each microservice is a separate Deployment with HPA (Horizontal Pod Autoscaler) based on CPU and custom metrics (queue depth for document processing).
Document Processing Workers: Run as Kubernetes Jobs triggered by SQS messages. Scale to zero when idle; burst to 50+ workers during peak upload times.
LLM Routing: A lightweight router selects the LLM based on task type and cost (a sketch follows the list):
- Research Q&A: GPT-4o or Claude 3.5 Sonnet (highest accuracy for legal reasoning)
- Summarization: GPT-4o-mini or Claude Haiku (cost-efficient for high volume)
- Draft generation: GPT-4o (structured output mode for reliable formatting)
- Fallback: Llama 3 70B on AWS Bedrock (no egress cost, data residency compliance)
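The router can start as a lookup table with a residency override. A minimal sketch; the model identifiers mirror the list above and the function name is hypothetical:

```python
ROUTES = {
    # task type -> (primary model, data-residency fallback)
    "research_qa":   ("gpt-4o", "llama-3-70b-bedrock"),
    "summarization": ("gpt-4o-mini", "llama-3-70b-bedrock"),
    "draft":         ("gpt-4o", "llama-3-70b-bedrock"),
}

def pick_model(task_type: str, require_data_residency: bool = False) -> str:
    """Choose an LLM by task; residency-constrained tenants stay on Bedrock."""
    primary, fallback = ROUTES.get(task_type, ROUTES["research_qa"])
    return fallback if require_data_residency else primary
```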
Security Architecture
- Encryption: AES-256 at rest (S3 SSE-KMS), TLS 1.3 in transit
- Document isolation: Each user’s uploaded documents stored in an S3 prefix with IAM conditions; vector DB namespaced per user
- PII handling: User documents are never sent to third-party LLMs without explicit consent. On-premise Llama deployment available for enterprise tier.
- Audit logging: All queries, document accesses, and draft generations logged to CloudWatch with user ID and timestamp (required for SRA and ICO compliance)
- UK GDPR & DPA 2018: Data residency in AWS eu-west-2 (London) region; data processing agreements in place with all third-party vendors
Phase 6: Key Design Decisions & Trade-offs
RAG vs. Fine-Tuned Model
Fine-tuning a model on UK legal data would improve domain fluency but creates a static knowledge base that goes stale as new judgments are issued daily. RAG keeps knowledge current without retraining. We use RAG with a well-prompted general LLM, and accept slightly lower stylistic fluency in exchange for always-current citations.
Pinecone vs. Self-Hosted Milvus
Pinecone is fully managed and requires zero operational overhead. At 50M vectors, monthly cost is ~$700. Milvus on Kubernetes is free but requires a dedicated ops team. For a startup phase, Pinecone; for scale (500M+ vectors), migrate to self-hosted Milvus or Weaviate.
Streaming Responses vs. Batch
Legal research answers can be long. Streaming tokens via WebSocket dramatically improves perceived performance—users see the answer forming in real time rather than waiting 8–12 seconds for a complete response. All LLM calls use server-sent streaming.
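A sketch of token streaming over the WebSocket, assuming the OpenAI Python SDK and a FastAPI endpoint; `build_prompt_with_context` is a hypothetical helper for the retrieval step:

```python
from fastapi import FastAPI, WebSocket
from openai import AsyncOpenAI

app = FastAPI()
llm = AsyncOpenAI()

def build_prompt_with_context(question: str) -> list[dict]:
    """Hypothetical: runs retrieval and assembles the grounded prompt."""
    ...

@app.websocket("/ws/chat")
async def chat(ws: WebSocket) -> None:
    await ws.accept()
    question = await ws.receive_text()
    stream = await llm.chat.completions.create(
        model="gpt-4o",
        messages=build_prompt_with_context(question),
        stream=True,
    )
    async for chunk in stream:
        token = chunk.choices[0].delta.content
        if token:
            await ws.send_text(token)  # client renders tokens as they arrive
    await ws.close()
```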
Monolith vs. Microservices
We adopt microservices from day one because the five core services have wildly different scaling profiles: document processing is bursty and CPU-intensive; the AI query service is memory-intensive and latency-sensitive; the search service scales with read volume. A monolith would force all services to scale together, wasting cost.
Measuring Platform Health
Track these metrics weekly:
- Retrieval Accuracy: % of citations retrieved that are relevant (target: >90%)
- Hallucination Rate: Citations returned that don’t exist in the corpus (target: <1%)
- P95 Query Latency: 95th percentile end-to-end response time (target: <5s)
- Document Processing SLA: % of documents fully processed within 2 minutes (target: >95%)
- User Retention (D30): % of users returning after 30 days (north star metric)
Cost Estimation (100k Users, Monthly)
| Component | Cost/Month |
|---|---|
| AWS EKS + EC2 (compute) | ~$4,500 |
| RDS PostgreSQL Multi-AZ | ~$800 |
| ElastiCache Redis | ~$300 |
| S3 storage (docs + corpus) | ~$500 |
| Pinecone (50M vectors) | ~$700 |
| OpenAI API (2M queries @ GPT-4o) | ~$12,000 |
| AWS Textract (OCR) | ~$1,500 |
| CloudFront + data transfer | ~$400 |
| Misc (monitoring, logging, CDN) | ~$600 |
| Total | ~$21,300/month |
At £9.99/month per pro user (about $12.50), roughly 1,700 paying subscribers cover the infrastructure bill alone. At 10,000 paying users, margin is healthy. The dominant cost is LLM inference; aggressive caching of common queries (20–30% of queries are near-duplicates) meaningfully reduces it.
Future Roadmap
The architecture is designed to accommodate these without re-platforming:
- Voice Legal Assistant: Add a speech-to-text layer (Whisper API) in front of the AI Query Service; spoken queries hit the same RAG pipeline
- Judgment Outcome Prediction: Train a classification model on historical case data; serve as a separate microservice with appropriate disclaimers
- Multi-Language Support: Add a translation layer (e.g., Welsh for bilingual legislation and proceedings) before embedding generation
- Lawyer Collaboration Tools: Shared document workspaces with comment threads; built on top of the existing document service with multi-user access controls
- Court Filing Integration: API integrations with HMCTS services (e.g., CE-File) and The National Archives' Find Case Law for direct case status and judgment lookups
Conclusion: Building the Legal Brain of the Future
An AI legal platform is fundamentally a knowledge retrieval and reasoning system with extreme accuracy requirements. The architecture choices here—RAG over fine-tuning, citation validation before every response, hybrid keyword+semantic search—all flow from one principle: in law, a wrong answer isn’t just unhelpful, it’s harmful.
The technology stack (Next.js → FastAPI → Kinesis/SQS → RAG pipeline → PostgreSQL + Pinecone) is proven at scale. The real moat is the legal corpus: the breadth, freshness, and quality of the indexed UK legal data—BAILII judgments, legislation.gov.uk statutes, FCA and SRA guidance—will determine whether this platform becomes essential to the legal profession.
Build the data pipeline right, validate every citation, and ship the simplest version that gives a junior solicitor their first correct answer in 10 seconds instead of 6 hours. The rest follows. ⚖️



