FINOS Legend: The Model-Driven Platform Reshaping Financial Data
How Goldman Sachs’ internal data infrastructure became the fintech industry’s most ambitious open-source experiment in model-driven governance.
01 - The Data Problem No One Has Solved
Every engineer inside a large financial institution knows the feeling. You need an authoritative view of a single equity swap. To get it, you query a front-office booking system, reconcile against a middle-office risk engine, cross-reference a regulatory database, then ask a senior analyst which one to trust.
This is not an edge case. It is daily life at most global banks.
Data silos, inconsistent schemas, and brittle ETL pipelines are not new problems - but in financial services they are uniquely painful. Regulatory pressure (MiFID II, CCAR, EMIR, CFTC swap reporting), real-time pricing demands, and complex instrument lifecycles mean a single currency option may live in hundreds of records across front, middle, and back office, with no shared semantic definition.
The modern data stack - dbt, Databricks, Snowflake - improves analytical throughput but papers over the semantic problem. These tools tell you how to move data, not what it means. Schemas drift. Business logic accumulates in undocumented SQL CTEs. Lineage gets reconstructed after the fact.
FINOS Legend attacks the problem at a different layer. It starts from data meaning - treating the logical model as the first-class artifact from which physical access, APIs, and governance all derive.
02 - Origin: From Goldman Sachs to Open Source
Legend began inside Goldman Sachs around 2013 under the names PURE (the modeling language) and Alloy (the visual tooling). By 2019, Goldman had over 1,000 data modelers encoding the firm’s entire information landscape in PURE.
“We believe this new data platform is so powerful and important that we are making it available to the world fully open and free of charge.”
- Neema Raphael, Chief Data Officer, Goldman Sachs (October 2020)
In October 2020, five modules were contributed to FINOS (Fintech Open Source Foundation) under Apache 2.0. The platform was renamed Legend.
Why open source it? FINOS provides neutral ground where competing institutions - Goldman, Morgan Stanley, Deutsche Bank, JPMorgan - can collaborate on shared infrastructure without antitrust concern. Regulatory frameworks are nearly identical across institutions. Building them privately is expensive duplication.
Timeline
2013 -------> Internal development begins (PURE + Alloy)
2019 -------> 1,000+ modelers; open-source intent announced at FINOS forum
2020 -------> Five modules contributed to FINOS under Apache 2.0
2022 -------> ISDA DRR goes live in production (CFTC swap reporting)
2023 -------> ISDA CDM hosted by FINOS; JPMorgan production deployment
The pre-launch pilot included Deutsche Bank, Morgan Stanley, RBC Capital Markets, Wells Fargo, and Itaú Unibanco. Their collaborative FX option extensions to the ISDA CDM were accepted into the official release - validating the core thesis.
03 - Core Philosophy: The Model IS the System
The ETL paradigm scatters transformation logic across Spark jobs, SQL views, dbt models, and Python scripts. A single regulatory change requires hunting every encoded semantic and updating it. At scale in a financial institution, this is a change-management nightmare.
Legend centralizes semantic definition in one executable artifact: the data model.
ETL Paradigm (fragmented semantics)
Source --[SQL]--> Stage --[Python]--> Report
logic here logic here too and here...
Legend Paradigm (unified semantics)
+------------------+
| Pure Model | <-- single source of truth
+--------+---------+
/ | \
Query Services Lineage
(auto) (auto) (auto)
Key Insight: The data model is not a description of the system - it is the system. APIs, queries, governance rules, and lineage all derive from the same source artifact, eliminating drift between documentation and implementation.
04 - Platform Architecture
Legend is composed of six purpose-built components that together cover the full data lifecycle.
+------------------------------------+
| Legend Studio (Model Authoring) |
+------------------------------------+
|
v
+------------------------------------+
| Legend SDLC (Version Control) |
+------------------------------------+
|
v
+------------------------------------+
| Legend Depot (Artifact Registry) |
+------------------------------------+
|
v
+------------------------------------+
| Legend Engine (Execution Layer) |
+------------------------------------+
/ | \
/ | \
v v v
+----------+ +----------+ +--------------------+
| Legend | | Legend | | Physical Stores |
| Query | | Services | | (Oracle, Postgres, |
| (UI) | | (API) | | Kafka, Files) |
+----------+ +----------+ +--------------------+
Component Role
--------- ----
Studio Visual editor for Pure classes, mappings, runtimes, and relationship diagrams
SDLC Model-centric version control (backed by GitLab); branching, review, release for metadata entities
Depot Artifact registry enabling cross-team and cross-institution model reuse
Engine Resolves queries against mappings and runtimes; generates optimized SQL or equivalent
Query Drag-and-drop self-service UI; navigates the semantic graph without SQL
Services Promotes any query to a production-grade, SDLC-backed REST API with one click
The Request Flow
Architect SDLC + Engine Analyst Services
(Studio) Depot (Query) (API)
| | | | |
|--Author---->| | | |
| (commit) | | | |
| |--Publish--->| | |
| | |<--Build-----| |
| | | query | |
| | |--Execute--->| (internal) |
| | | | |
| | | |--Promote--->|
| | |<------------|--Versioned--|
| | | | REST API |
Throughout this flow, lineage is captured automatically - every field traces back through every transformation to its physical source. Regulatory lineage becomes a byproduct of modeling, not a separate documentation effort.
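In Pure's text grammar, the promotion step at the end of this flow corresponds to a Service definition. The sketch below is illustrative rather than authoritative - the model, mapping, and runtime paths (finance::trade::EquityTrade, finance::mapping::TradeMapping, finance::runtime::ProdRuntime) are hypothetical names, and the exact service grammar varies across Legend versions:

```
###Service
Service finance::service::TradesByTicker
{
  pattern: '/trades/{ticker}';
  documentation: 'All equity trades for a given ticker.';
  execution: Single
  {
    query: {ticker: String[1] |
      finance::trade::EquityTrade.all()
        ->filter(t | $t.product.ticker == $ticker)
        ->project([x | $x.tradeId, x | $x.quantity], ['Trade Id', 'Quantity'])};
    mapping: finance::mapping::TradeMapping;
    runtime: finance::runtime::ProdRuntime;
  }
}
```

Because the service is itself a model entity, it goes through the same SDLC review and versioning as the classes it queries.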
05 - The Pure Language
Pure is the intellectual core of Legend: an immutable, functional language grounded in UML class modeling and inspired by the Object Constraint Language (OCL).
Type System and Expressiveness
Pure is class-based with UML concepts - classes, properties, associations, multiplicity, inheritance, stereotypes - capturing not just structure but business rules as formal constraints.
Class finance::trade::EquityTrade
{
  tradeId   : String[1];
  product   : finance::trade::EquityProduct[1];
  quantity  : Float[1];
  tradeDate : StrictDate[1];
  notional  : Float[0..1];   // optional (multiplicity 0..1)

  // Derived property - computed on access, not stored.
  // Pure's DurationUnit enum has no business-day value, so T+2
  // is approximated with calendar days here.
  settlementDate() {
    $this.tradeDate->adjust(2, DurationUnit.DAYS)
  } : StrictDate[1];
}

Class finance::trade::EquityProduct
{
  ticker   : String[1];
  exchange : Exchange[1];   // Exchange and Currency are domain enums, elided here
  currency : Currency[1];
}
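The formal constraints mentioned above attach to a class in square brackets, ahead of the property body. A minimal sketch extending the class above (the constraint names and rules are illustrative):

```
Class finance::trade::ValidatedEquityTrade extends finance::trade::EquityTrade
[
  positiveQuantity         : $this.quantity > 0,
  settlementNotBeforeTrade : $this.settlementDate() >= $this.tradeDate
]
{
}
```

Instances violating either rule fail validation wherever the class is used - in queries, services, or model-to-model transformations - rather than in downstream consumer code.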
SQL vs. Pure: Where Each Excels
SQL strengths Pure strengths
-------------------- ------------------------------
Set operations Deep object graph traversal
Flat relational data Derived / computed properties
Ad-hoc queries Cross-store federated queries
Broad tooling ecosystem Formal constraints + validation
Model-to-model transformations
Executable lineage
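The "deep object graph traversal" row is where the contrast is starkest: what would be a multi-table join in SQL is a property path in Pure. A sketch against the classes above, assuming an Exchange enum and a suitable mapping in scope:

```
// Trade IDs for LSE-listed products settling within two calendar days.
// $t.product.exchange walks the association - no explicit join.
finance::trade::EquityTrade.all()
  ->filter(t | $t.product.exchange == Exchange.LSE)
  ->filter(t | $t.settlementDate() <= today()->adjust(2, DurationUnit.DAYS))
  ->map(t | $t.tradeId)
```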
Pure’s immutability is not cosmetic - referentially transparent expressions let the Engine safely reorder and push down operations without side effects, which is critical for auditability in regulatory contexts.
Crucially, mappings between logical model and physical store are first-class Pure constructs. Business semantics and physical binding live in the same artifact, under the same version control, subject to the same governance review.
06 - Four Layers of Data Modeling
[Business Model] ---> [Mapping] ---> [Runtime] ---> [Execution Plan]
(Pure classes (Logical-to- (Environment (Inspectable query
+ constraints) physical connections) tree + lineage)
bindings)
- Business Model: Pure class hierarchy encoding domain concepts with cardinalities and constraints. Speaks the domain language, readable by business stakeholders.
- Mapping: Binds logical properties to physical columns or JSON fields. One class can map to multiple stores simultaneously - production Oracle and test PostgreSQL from the same model.
- Runtime: Database credentials and service URLs, separated from mappings. Same mapping deploys across dev/test/prod by switching runtimes. Model and mapping are reviewed once; runtime is managed operationally.
- Execution Plan: Inspectable tree of operations generated per query. Basis for automated lineage, performance analysis, and audit serialization.
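The Mapping layer also has a text grammar. The relational sketch below binds the EquityTrade class from section 05 to a hypothetical store - the database, table, and column names are invented for illustration, and the grammar is abbreviated:

```
###Relational
Database finance::store::TradeDB
(
  Table TRADES
  (
    TRADE_ID   VARCHAR(32) PRIMARY KEY,
    QUANTITY   DOUBLE,
    TRADE_DATE DATE
  )
)

###Mapping
Mapping finance::mapping::TradeMapping
(
  finance::trade::EquityTrade : Relational
  {
    ~mainTable [finance::store::TradeDB] TRADES
    tradeId   : [finance::store::TradeDB] TRADES.TRADE_ID,
    quantity  : [finance::store::TradeDB] TRADES.QUANTITY,
    tradeDate : [finance::store::TradeDB] TRADES.TRADE_DATE
  }
)
```

A Runtime then points this mapping at concrete connections (JDBC URL, credentials); swapping the runtime is all it takes to move the same mapping from test PostgreSQL to production Oracle.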
07 - End-to-End Data Lifecycle
[Model] ---> [Map] ---> [Query] ---> [API] ---> [Govern] ---> [Analyze]
(Studio) (Physical (Drag & (Services) (SDLC + (Downstream
Bind) Drop) Lineage) Apps)
Each stage is fully traceable. Governance is structural - constraints enforced at model definition, lineage automatic, every change peer-reviewed through SDLC. This is categorically different from BI tools or ETL frameworks where governance is retrofitted after the fact.
vs dbt: dbt version-controls SQL transformation logic and does it well. But a dbt model is a SQL SELECT statement - it describes transformation, not meaning. Legend’s models describe meaning; transformations derive from it. The two operate at different abstraction layers and are complementary, not competitive.
08 - Real-World Use Cases
Derivatives Lifecycle at Goldman Sachs
Goldman’s Global Markets division encodes full trade lifecycles in Legend - from pre-trade pricing through execution, risk, clearing, and regulatory reporting. The platform underpins CCAR (Comprehensive Capital Analysis and Review) and enabled Goldman to serve BlackRock’s equity swap business via Axoni’s Veris network using ISDA CDM-based model mappings.
ISDA Digital Regulatory Reporting
The ISDA Common Domain Model (CDM) - an open standard for financial products and lifecycle events, hosted by FINOS since 2023 - is the most significant industry-scale Legend application. ISDA’s Digital Regulatory Reporting (DRR) initiative expresses global trade reporting rules as machine-executable CDM code.
In 2022 this went live: a bank used CDM/DRR to comply with CFTC swap reporting requirements. JPMorgan Chase followed with a production deployment. Competing institutions collaborated on open-source regulatory logic that each deployed independently - mutualizing the cost of regulatory change.
Cross-Institution Standardization
Institution A FINOS Legend (neutral ground) Institution B
+-----------+ +---------------------------+ +-----------+
| Model v1 | ------> | Shared CDM Extension | <------ | Model v1 |
+-----------+ +---------------------------+ +-----------+
|
Proposed to ISDA --> Accepted into CDM release
Deutsche Bank, Morgan Stanley, and RBC used a shared Legend instance to build FX option CDM extensions that were accepted into the official CDM release - a replicable template for collaborative, fast-moving industry standards.
09 - Advantages
Advantage What It Means in Practice
--------------------------------- ---------------------------------------------------------------------------------------------------------
Governance by design Constraints in model; lineage automatic; SDLC review mandatory. Cannot accidentally ship ungoverned data.
Business-engineering alignment Studio diagrams are readable by domain experts and engineers alike. Eliminates requirements-to-schema translation loss.
One-click API generation Query to production REST API is a modeling step, not an infrastructure project. Consistently versioned and SDLC-backed.
Cross-institution interoperability Models are SDLC-backed code - shareable across institutional boundaries via FINOS. ISDA CDM is proof.
Store agnosticism Same model executes against Oracle, PostgreSQL, Spark, Kafka, or flat files. Critical in legacy-heavy environments.
10 - Limitations: An Honest Assessment
Limitation Severity Notes
---------------------- -------- -----------------------------------------------------------------------------------------------
Pure learning curve High Bespoke language; weeks of ramp-up even for senior engineers
Operational complexity High Requires Studio, Engine, SDLC (GitLab), Depot, Query, Services - large infrastructure footprint
Narrow ecosystem fit Medium Designed for large financial institutions; poor fit outside finance
DSL monoculture risk Medium Breaking Pure changes or weak community tooling creates migration risk
Modern stack integration Medium No native connectors to Airflow, dbt, or Spark; custom work required
11 - How Legend Fits the Modern Data Stack
Abstraction Layer Map
Business Semantics --> Legend (Pure models, CDM)
Metric Definitions --> dbt Semantic Layer, AtScale, Cube
SQL Transformations --> dbt core
Compute --> Spark, Databricks
Storage --> Snowflake, BigQuery, Postgres, Oracle
Platform Primary Abstraction Governance Best Fit
-------------------- -------------------------- ---------------------------- ------------------------------------------------------------
FINOS Legend Semantic data model (Pure) Structural, model-encoded Complex financial domains; cross-institution standardization
dbt SQL transformation model Version-controlled SQL Analytics engineering on cloud warehouses
Apache Spark Distributed processing External; no semantic layer Large-scale batch and streaming
Snowflake / BigQuery SQL storage + compute Access control; external catalog Analytical workloads; fast SQL
AtScale / Cube Metric and dimension model Metric consistency Consistent BI metrics across tools
Legend is not a replacement for the modern data stack. It operates above it at the semantic layer, delegating physical execution to Spark, PostgreSQL, or a cloud warehouse. The architectures are complementary.
12 - The Future: Semantic Infrastructure at Scale
[Data Contracts] [Semantic Data Layers] [AI-Ready Data]
(Formal producer-consumer interfaces) (Single governed meaning layer) (Structured semantics for LLMs)
\ | /
\ | /
v v v
+-------------------------------------------------+
| Legend's Model-Driven Architecture |
+-------------------------------------------------+
Legend sits at the convergence of three accelerating trends:
Data contracts - the idea that producer-consumer interfaces should be formally specified and enforced - are exactly what Legend already does: models are executable contracts between semantic definition and physical reality.
The ISDA CDM flywheel is the most compelling near-term trajectory. As more institutions publish models to shared FINOS repositories, network value compounds. A new entrant doesn’t model derivatives from scratch - they extend the CDM.
AI amplification is the longer bet. A Pure model that formally captures a financial instrument’s constraints, lifecycle events, and regulatory classifications is dramatically more useful as an AI inference substrate than loosely documented SQL tables. Legend may be building the semantic infrastructure that matters most in an AI-accelerated financial system.
13 - Conclusion
FINOS Legend is not the most accessible data platform available today. Pure demands real investment. The operational footprint is substantial. Adoption outside large financial institutions remains modest.
But the underlying thesis - that enterprises need a semantic layer that is executable, governed, and shared - is almost certainly correct. Legend is the most rigorous, production-proven implementation of that thesis in open source. It was built by an institution that genuinely had to solve the problem at scale, validated by a multi-institution pilot that produced real regulatory code, and is now steered by a neutral foundation.
For data engineers in financial services: Legend deserves serious evaluation as the semantic governance layer above your existing infrastructure. For engineers outside finance grappling with model-driven architecture or data contracts: Legend is a rich source of architectural thinking.
The hardest problem in enterprise data is not storage, compute, or tooling. It is meaning. Legend is one of the most serious attempts to build that foundation in the open.
Further Reading
- Official Legend Documentation
- FINOS GitHub
- Goldman Sachs Engineering Blog
- ISDA Common Domain Model
Published on LAKSHYATANGRI.COM - Engineering and Architecture deep-dives for technical professionals.



