FINOS Legend: The Model-Driven Platform Reshaping Financial Data
How Goldman Sachs’ internal data infrastructure became the fintech industry’s most ambitious open-source experiment in model-driven governance.
01 - The Data Problem No One Has Solved
Every engineer inside a large financial institution knows the feeling. You need an authoritative view of a single equity swap. To get it, you query a front-office booking system, reconcile against a middle-office risk engine, cross-reference a regulatory database, then ask a senior analyst which one to trust.
This is not an edge case. It is daily life at most global banks.
Data silos, inconsistent schemas, and brittle ETL pipelines are not new problems - but in financial services they are uniquely painful. Regulatory pressure (MiFID II, CCAR, EMIR, CFTC swap reporting), real-time pricing demands, and complex instrument lifecycles mean a single currency option may live in hundreds of records across front, middle, and back office, with no shared semantic definition.
The modern data stack - dbt, Databricks, Snowflake - improves analytical throughput but papers over the semantic problem. These tools tell you how to move data, not what it means. Schemas drift. Business logic accumulates in undocumented SQL CTEs. Lineage gets reconstructed after the fact.
FINOS Legend attacks the problem at a different layer. It starts from data meaning - treating the logical model as the first-class artifact from which physical access, APIs, and governance all derive.
02 - Origin: From Goldman Sachs to Open Source
Legend began inside Goldman Sachs around 2013 under the names PURE (the modeling language) and Alloy (the visual tooling). By 2019, Goldman had over 1,000 data modelers encoding the firm’s entire information landscape in PURE.
“We believe this new data platform is so powerful and important that we are making it available to the world fully open and free of charge.”
- Neema Raphael, Chief Data Officer, Goldman Sachs (October 2020)
In October 2020, five modules were contributed to FINOS (Fintech Open Source Foundation) under Apache 2.0. The platform was renamed Legend.
Why open source it? FINOS provides neutral ground where competing institutions - Goldman, Morgan Stanley, Deutsche Bank, JPMorgan - can collaborate on shared infrastructure without antitrust concern. Regulatory frameworks are nearly identical across institutions. Building them privately is expensive duplication.
Timeline
2013 -------> Internal development begins (PURE + Alloy)
2019 -------> 1,000+ modelers; open-source intent announced at FINOS forum
2020 -------> Five modules contributed to FINOS under Apache 2.0
2022 -------> ISDA DRR goes live in production (CFTC swap reporting)
2023 -------> ISDA CDM hosted by FINOS; JPMorgan production deployment
The pre-launch pilot included Deutsche Bank, Morgan Stanley, RBC Capital Markets, Wells Fargo, and Itaú Unibanco. Their collaborative FX option extensions to the ISDA CDM were accepted into the official release - validating the core thesis.
03 - Core Philosophy: The Model IS the System
The ETL paradigm scatters transformation logic across Spark jobs, SQL views, dbt models, and Python scripts. A single regulatory change requires hunting every encoded semantic and updating it. At scale in a financial institution, this is a change-management nightmare.
Legend centralizes semantic definition in one executable artifact: the data model.
ETL Paradigm (fragmented semantics)
Source --[SQL]--> Stage --[Python]--> Report
logic here logic here too and here...
Legend Paradigm (unified semantics)
+------------------+
| Pure Model | <-- single source of truth
+--------+---------+
/ | \
Query Services Lineage
(auto) (auto) (auto)
Key Insight: The data model is not a description of the system - it is the system. APIs, queries, governance rules, and lineage all derive from the same source artifact, eliminating drift between documentation and implementation.
04 - Platform Architecture
Legend is composed of six purpose-built components that together cover the full data lifecycle.
+------------------------------------+
| Legend Studio (Model Authoring) |
+------------------------------------+
|
v
+------------------------------------+
| Legend SDLC (Version Control) |
+------------------------------------+
|
v
+------------------------------------+
| Legend Depot (Artifact Registry) |
+------------------------------------+
|
v
+------------------------------------+
| Legend Engine (Execution Layer) |
+------------------------------------+
/ | \
/ | \
v v v
+----------+ +----------+ +--------------------+
| Legend | | Legend | | Physical Stores |
| Query | | Services | | (Oracle, Postgres, |
| (UI) | | (API) | | Kafka, Files) |
+----------+ +----------+ +--------------------+
Component Role
--------- ----
Studio Visual editor for Pure classes, mappings, runtimes, and relationship diagrams
SDLC Model-centric version control (backed by GitLab); branching, review, release for metadata entities
Depot Artifact registry enabling cross-team and cross-institution model reuse
Engine Resolves queries against mappings and runtimes; generates optimized SQL or equivalent
Query Drag-and-drop self-service UI; navigates the semantic graph without SQL
Services Promotes any query to a production-grade, SDLC-backed REST API with one click
The Request Flow
Architect SDLC + Engine Analyst Services
(Studio) Depot (Query) (API)
| | | | |
|--Author---->| | | |
| (commit) | | | |
| |--Publish--->| | |
| | |<--Build-----| |
| | | query | |
| | |--Execute--->| (internal) |
| | | | |
| | | |--Promote--->|
| | |<------------|--Versioned--|
| | | | REST API |
Throughout this flow, lineage is captured automatically - every field traces back through every transformation to its physical source. Regulatory lineage becomes a byproduct of modeling, not a separate documentation effort.
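In Pure's text grammar, the promotion step at the end of this flow corresponds to a Service definition. The sketch below is illustrative rather than authoritative - the model, mapping, and runtime paths (finance::trade::EquityTrade, finance::mapping::TradeMapping, finance::runtime::ProdRuntime) are hypothetical names, and the exact service grammar varies across Legend versions:

```
###Service
Service finance::service::TradesByTicker
{
  pattern: '/trades/{ticker}';
  documentation: 'All equity trades for a given ticker.';
  execution: Single
  {
    query: {ticker: String[1] |
      finance::trade::EquityTrade.all()
        ->filter(t | $t.product.ticker == $ticker)
        ->project([x | $x.tradeId, x | $x.quantity], ['Trade Id', 'Quantity'])};
    mapping: finance::mapping::TradeMapping;
    runtime: finance::runtime::ProdRuntime;
  }
}
```

Because the service is itself a model entity, it goes through the same SDLC review and versioning as the classes it queries.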
05 - The Pure Language
Pure is the intellectual core of Legend: an immutable, functional language grounded in UML class modeling and inspired by the Object Constraint Language (OCL).
Type System and Expressiveness
Pure is class-based with UML concepts - classes, properties, associations, multiplicity, inheritance, stereotypes - capturing not just structure but business rules as formal constraints.
Class finance::trade::EquityTrade
{
  tradeId   : String[1];
  product   : finance::trade::EquityProduct[1];
  quantity  : Float[1];
  tradeDate : StrictDate[1];
  notional  : Float[0..1];   // optional (multiplicity 0..1)

  // Derived property - computed on access, not stored.
  // Pure's DurationUnit enum has no business-day value, so T+2
  // is approximated with calendar days here.
  settlementDate() {
    $this.tradeDate->adjust(2, DurationUnit.DAYS)
  } : StrictDate[1];
}

Class finance::trade::EquityProduct
{
  ticker   : String[1];
  exchange : Exchange[1];   // Exchange and Currency are domain enums, elided here
  currency : Currency[1];
}
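The formal constraints mentioned above attach to a class in square brackets, ahead of the property body. A minimal sketch extending the class above (the constraint names and rules are illustrative):

```
Class finance::trade::ValidatedEquityTrade extends finance::trade::EquityTrade
[
  positiveQuantity         : $this.quantity > 0,
  settlementNotBeforeTrade : $this.settlementDate() >= $this.tradeDate
]
{
}
```

Instances violating either rule fail validation wherever the class is used - in queries, services, or model-to-model transformations - rather than in downstream consumer code.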
SQL vs. Pure: Where Each Excels
SQL strengths Pure strengths
-------------------- ------------------------------
Set operations Deep object graph traversal
Flat relational data Derived / computed properties
Ad-hoc queries Cross-store federated queries
Broad tooling ecosystem Formal constraints + validation
Model-to-model transformations
Executable lineage
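The "deep object graph traversal" row is where the contrast is starkest: what would be a multi-table join in SQL is a property path in Pure. A sketch against the classes above, assuming an Exchange enum and a suitable mapping in scope:

```
// Trade IDs for LSE-listed products settling within two calendar days.
// $t.product.exchange walks the association - no explicit join.
finance::trade::EquityTrade.all()
  ->filter(t | $t.product.exchange == Exchange.LSE)
  ->filter(t | $t.settlementDate() <= today()->adjust(2, DurationUnit.DAYS))
  ->map(t | $t.tradeId)
```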
Pure’s immutability is not cosmetic - referentially transparent expressions let the Engine safely reorder and push down operations without side effects, which is critical for auditability in regulatory contexts.
Crucially, mappings between logical model and physical store are first-class Pure constructs. Business semantics and physical binding live in the same artifact, under the same version control, subject to the same governance review.
06 - Four Layers of Data Modeling
[Business Model] ---> [Mapping] ---> [Runtime] ---> [Execution Plan]
(Pure classes (Logical-to- (Environment (Inspectable query
+ constraints) physical connections) tree + lineage)
bindings)
- Business Model: Pure class hierarchy encoding domain concepts with cardinalities and constraints. Speaks the domain language, readable by business stakeholders.
- Mapping: Binds logical properties to physical columns or JSON fields. One class can map to multiple stores simultaneously - production Oracle and test PostgreSQL from the same model.
- Runtime: Database credentials and service URLs, separated from mappings. Same mapping deploys across dev/test/prod by switching runtimes. Model and mapping are reviewed once; runtime is managed operationally.
- Execution Plan: Inspectable tree of operations generated per query. Basis for automated lineage, performance analysis, and audit serialization.
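The Mapping layer also has a text grammar. The relational sketch below binds the EquityTrade class from section 05 to a hypothetical store - the database, table, and column names are invented for illustration, and the grammar is abbreviated:

```
###Relational
Database finance::store::TradeDB
(
  Table TRADES
  (
    TRADE_ID   VARCHAR(32) PRIMARY KEY,
    QUANTITY   DOUBLE,
    TRADE_DATE DATE
  )
)

###Mapping
Mapping finance::mapping::TradeMapping
(
  finance::trade::EquityTrade : Relational
  {
    ~mainTable [finance::store::TradeDB] TRADES
    tradeId   : [finance::store::TradeDB] TRADES.TRADE_ID,
    quantity  : [finance::store::TradeDB] TRADES.QUANTITY,
    tradeDate : [finance::store::TradeDB] TRADES.TRADE_DATE
  }
)
```

A Runtime then points this mapping at concrete connections (JDBC URL, credentials); swapping the runtime is all it takes to move the same mapping from test PostgreSQL to production Oracle.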
07 - End-to-End Data Lifecycle
[Model] ---> [Map] ---> [Query] ---> [API] ---> [Govern] ---> [Analyze]
(Studio) (Physical (Drag & (Services) (SDLC + (Downstream
Bind) Drop) Lineage) Apps)
Each stage is fully traceable. Governance is structural - constraints enforced at model definition, lineage automatic, every change peer-reviewed through SDLC. This is categorically different from BI tools or ETL frameworks where governance is retrofitted after the fact.
vs dbt: dbt version-controls SQL transformation logic and does it well. But a dbt model is a SQL SELECT statement - it describes transformation, not meaning. Legend’s models describe meaning; transformations derive from it. The two operate at different abstraction layers and are complementary, not competitive.
08 - Real-World Use Cases
Derivatives Lifecycle at Goldman Sachs
Goldman’s Global Markets division encodes full trade lifecycles in Legend - from pre-trade pricing through execution, risk, clearing, and regulatory reporting. The platform underpins CCAR (Comprehensive Capital Analysis and Review) and enabled Goldman to serve BlackRock’s equity swap business via Axoni’s Veris network using ISDA CDM-based model mappings.
ISDA Digital Regulatory Reporting
The ISDA Common Domain Model (CDM) - an open standard for financial products and lifecycle events, hosted by FINOS since 2023 - is the most significant industry-scale Legend application. ISDA’s Digital Regulatory Reporting (DRR) initiative expresses global trade reporting rules as machine-executable CDM code.
In 2022 this went live: a bank used CDM/DRR to comply with CFTC swap reporting requirements. JPMorgan Chase followed with a production deployment. Competing institutions collaborated on open-source regulatory logic that each deployed independently - mutualizing the cost of regulatory change.
Cross-Institution Standardization
Institution A FINOS Legend (neutral ground) Institution B
+-----------+ +---------------------------+ +-----------+
| Model v1 | ------> | Shared CDM Extension | <------ | Model v1 |
+-----------+ +---------------------------+ +-----------+
|
Proposed to ISDA --> Accepted into CDM release
Deutsche Bank, Morgan Stanley, and RBC used a shared Legend instance to build FX option CDM extensions that were accepted into the official CDM release - a replicable template for collaborative, fast-moving industry standards.
09 - Advantages
Advantage What It Means in Practice
--------------------------------- ---------------------------------------------------------------------------------------------------------
Governance by design Constraints in model; lineage automatic; SDLC review mandatory. Cannot accidentally ship ungoverned data.
Business-engineering alignment Studio diagrams are readable by domain experts and engineers alike. Eliminates requirements-to-schema translation loss.
One-click API generation Query to production REST API is a modeling step, not an infrastructure project. Consistently versioned and SDLC-backed.
Cross-institution interoperability Models are SDLC-backed code - shareable across institutional boundaries via FINOS. ISDA CDM is proof.
Store agnosticism Same model executes against Oracle, PostgreSQL, Spark, Kafka, or flat files. Critical in legacy-heavy environments.
10 - Limitations: An Honest Assessment
Limitation Severity Notes
---------------------- -------- -----------------------------------------------------------------------------------------------
Pure learning curve High Bespoke language; weeks of ramp-up even for senior engineers
Operational complexity High Requires Studio, Engine, SDLC (GitLab), Depot, Query, Services - large infrastructure footprint
Narrow ecosystem fit Medium Designed for large financial institutions; poor fit outside finance
DSL monoculture risk Medium Breaking Pure changes or weak community tooling creates migration risk
Modern stack integration Medium No native connectors to Airflow, dbt, or Spark; custom work required
11 - How Legend Fits the Modern Data Stack
Abstraction Layer Map
Business Semantics --> Legend (Pure models, CDM)
Metric Definitions --> dbt Semantic Layer, AtScale, Cube
SQL Transformations --> dbt core
Compute --> Spark, Databricks
Storage --> Snowflake, BigQuery, Postgres, Oracle
Platform Primary Abstraction Governance Best Fit
-------------------- -------------------------- ---------------------------- ------------------------------------------------------------
FINOS Legend Semantic data model (Pure) Structural, model-encoded Complex financial domains; cross-institution standardization
dbt SQL transformation model Version-controlled SQL Analytics engineering on cloud warehouses
Apache Spark Distributed processing External; no semantic layer Large-scale batch and streaming
Snowflake / BigQuery SQL storage + compute Access control; external catalog Analytical workloads; fast SQL
AtScale / Cube Metric and dimension model Metric consistency Consistent BI metrics across tools
Legend is not a replacement for the modern data stack. It operates above it at the semantic layer, delegating physical execution to Spark, PostgreSQL, or a cloud warehouse. The architectures are complementary.
12 - The Future: Semantic Infrastructure at Scale
[Data Contracts] [Semantic Data Layers] [AI-Ready Data]
(Formal producer-consumer interfaces) (Single governed meaning layer) (Structured semantics for LLMs)
\ | /
\ | /
v v v
+-------------------------------------------------+
| Legend's Model-Driven Architecture |
+-------------------------------------------------+
Legend sits at the convergence of three accelerating trends:
Data contracts - the idea that producer-consumer interfaces should be formally specified and enforced - are exactly what Legend already does: models are executable contracts between semantic definition and physical reality.
The ISDA CDM flywheel is the most compelling near-term trajectory. As more institutions publish models to shared FINOS repositories, network value compounds. A new entrant doesn’t model derivatives from scratch - they extend the CDM.
AI amplification is the longer bet. A Pure model that formally captures a financial instrument’s constraints, lifecycle events, and regulatory classifications is dramatically more useful as an AI inference substrate than loosely documented SQL tables. Legend may be building the semantic infrastructure that matters most in an AI-accelerated financial system.
13 - Conclusion
FINOS Legend is not the most accessible data platform available today. Pure demands real investment. The operational footprint is substantial. Adoption outside large financial institutions remains modest.
But the underlying thesis - that enterprises need a semantic layer that is executable, governed, and shared - is almost certainly correct. Legend is the most rigorous, production-proven implementation of that thesis in open source. It was built by an institution that genuinely had to solve the problem at scale, validated by a multi-institution pilot that produced real regulatory code, and is now steered by a neutral foundation.
For data engineers in financial services: Legend deserves serious evaluation as the semantic governance layer above your existing infrastructure. For engineers outside finance grappling with model-driven architecture or data contracts: Legend is a rich source of architectural thinking.
The hardest problem in enterprise data is not storage, compute, or tooling. It is meaning. Legend is one of the most serious attempts to build that foundation in the open.
Further Reading
- Official Legend Documentation
- FINOS GitHub
- Goldman Sachs Engineering Blog
- ISDA Common Domain Model
Published on LAKSHYATANGRI.COM - Engineering and Architecture deep-dives for technical professionals.



