
Data Contracts in Action

15 min read · Jun 18, 2025



In the world of data, we often talk about “data products” and “data as a service.” Sounds great, right? But here’s the dirty little secret: too often, data isn’t reliable. It’s like building a beautiful house on a shaky foundation.

Imagine you’re a data consumer — perhaps a business analyst creating a critical report, or a machine learning engineer building a predictive model. You pull data from a source, only for the schema to unexpectedly change, a column to go missing, or the data quality to plummet without warning. Suddenly:

  • Your dashboards break.
  • Your reports show incorrect numbers.
  • Your machine learning models start making bad predictions.
  • You spend hours, days, or even weeks debugging data issues, not building value.

This isn’t just annoying; it erodes trust, slows down innovation, and costs organizations significant time and money. The root cause? A lack of clear, enforced agreements between data producers (those generating the data) and data consumers (those using it). This is where data contracts come in.

Why datacontract CLI?

While the concept of data contracts is powerful, implementing them shouldn't be a bureaucratic nightmare. This is precisely where the datacontract CLI (https://cli.datacontract.com/) shines.

It’s an open-source, platform-agnostic command-line tool that empowers data teams to:

  • Define data contracts in a simple YAML format.
  • Lint contracts to catch syntax errors and best practice violations.
  • Test actual data against the contract for schema and quality compliance.
  • Detect breaking changes before they wreak havoc.
  • Export contracts into various formats (like SQL DDL).

In short, the datacontract CLI makes data contracts not just a concept, but a tangible, automatable part of your data workflow. It's built for engineers, by engineers, focusing on practical implementation and seamless integration into your existing CI/CD pipelines.

Getting Started: Installation & First Command

Now that we understand why data contracts are crucial and what the datacontract CLI offers, let's get our hands dirty and set it up. This section will guide you through the quick installation process and how to verify everything is working as expected.

Prerequisites

Before you begin, make sure you have one of the following:

  • Python 3.10+: The datacontract-cli is a Python application. We recommend using Python 3.11 for optimal compatibility.
  • Docker: If you prefer a containerized approach or don’t want to manage Python environments directly, Docker is an excellent alternative.

The datacontract CLI is straightforward to install. Choose the method that best fits your workflow:

Using pip (Recommended for Python users): This method installs the datacontract-cli along with all necessary database connectors (like those for S3, BigQuery, Snowflake, etc.).

pip install 'datacontract-cli[all]'

If you prefer to use a virtual environment (which is good practice for Python projects), you can do the following:

python -m venv venv
source venv/bin/activate # On Windows, use `venv\Scripts\activate`
pip install 'datacontract-cli[all]'

Using Docker (Alternative): If you prefer to run datacontract CLI in an isolated container, you can pull the official Docker image:

docker pull datacontract/cli
You’ll then run commands by prefixing them with docker run --rm -v ${PWD}:/home/datacontract datacontract/cli. We'll see an example of this in the verification step.

Verifying Installation

After installation, it’s always a good idea to confirm that the datacontract CLI is correctly installed and accessible in your environment.

If installed with pip:

datacontract --version

If using Docker:

docker run --rm datacontract/cli --version

Congratulations! You’ve successfully installed the datacontract CLI. You're now ready to start crafting your first data contract.

Crafting Your First Data Contract (datacontract.yaml)

With the datacontract CLI installed, it's time to define your data contract. Data contracts are typically written in YAML, a human-friendly data serialization standard. The datacontract CLI provides an excellent starting point with its init command.

The datacontract init Command: Quick Start with a Template

The easiest way to begin is by using the init command. This command generates a basic datacontract.yaml file with a pre-filled template, saving you from starting from scratch.

Let’s create a new file named my_first_contract.yaml:

datacontract init my_first_contract.yaml

This command will create my_first_contract.yaml in your current directory. Open it with your favorite text editor, and you'll see a structure similar to this (comments and some details omitted for brevity, but they'll be in your generated file):

dataContractSpecification: 1.1.0
id: urn:datacontract:your-org:my-first-contract
info:
  title: My First Data Contract
  version: 0.1.0
  owner: Data Engineering Team
servers: {}
models: {}

This boilerplate provides the foundational sections for your data contract. Let’s break down each key part.

Understanding the Structure

A data contract YAML file is structured logically to define everything about your data product.

info: Basic Metadata

This section holds general information about your data contract. It’s crucial for documentation and understanding the purpose and ownership of the data.

info:
  title: Orders Data Product
  version: 1.0.0
  owner: Sales Analytics Team <sales-analytics@example.com>
  description: |
    This contract defines the structure and quality expectations for the 'orders'
    data product, capturing key transaction details.
  • title: A human-readable name for your data contract.
  • version: The version of this specific data contract. Follow semantic versioning (e.g., 1.0.0, 1.1.0, 2.0.0).
  • owner: The team or individual responsible for this data product.
  • description: A more detailed explanation of what this data contract covers.
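Because version follows semantic versioning, downstream tooling can classify a contract change from the version number alone. As a minimal illustrative sketch (not part of the datacontract CLI):

```python
# Classify a version change under semantic-versioning conventions:
# a major bump signals potentially breaking changes, a minor bump
# backward-compatible additions, and a patch bump fixes only.

def bump_kind(old: str, new: str) -> str:
    """Return 'major', 'minor', or 'patch' for an old -> new version change."""
    o = tuple(int(part) for part in old.split("."))
    n = tuple(int(part) for part in new.split("."))
    if n[0] != o[0]:
        return "major"
    if n[1] != o[1]:
        return "minor"
    return "patch"

print(bump_kind("1.0.0", "2.0.0"))  # major
print(bump_kind("1.0.0", "1.1.0"))  # minor
```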

servers: Defining Your Data Sources

The servers section specifies where your data actually lives. This is how the datacontract CLI knows how to connect and test your data. You can define multiple servers (e.g., production, staging, development).

Let’s add a simple S3 CSV server configuration for demonstration. This tells the CLI that our orders data is a CSV file in a specific S3 bucket.

servers:
  production:
    type: s3
    location: s3://your-company-data-lake/sales/orders/
    format: csv # Specify the format if applicable
    delimiter: ","
    header: true
  • production: This is a logical name for your environment.
  • type: The type of data source (e.g., s3, bigquery, snowflake, postgresql).
  • location: The path to your data.
  • format: (Optional) The file format (e.g., csv, parquet, json).
  • delimiter, header: (Optional) Format-specific options for CSV.

Remember that for datacontract CLI to connect to services like S3, you'll need to provide credentials, typically via environment variables (e.g., DATACONTRACT_S3_ACCESS_KEY_ID, DATACONTRACT_S3_SECRET_ACCESS_KEY). We'll cover this more in the "Testing" section.

models: Describing Your Data Schema

The models section is where you define the expected schema of your data product. Each model represents a dataset or a table.

Let’s define a simple orders model with a few key fields:

models:
  orders:
    type: table # Or 'object' for other data structures
    description: Details of customer orders.
    fields:
      order_id:
        type: string
        primaryKey: true
        description: Unique identifier for the order.
      customer_id:
        type: string
        description: Unique identifier for the customer.
        examples:
          - "CUST001"
          - "CUST002"
      order_timestamp:
        type: timestamp
        required: true
        description: The UTC timestamp when the order was placed.
      order_total_usd:
        type: decimal
        required: true
        description: Total amount of the order in USD.
        examples:
          - 123.45
          - 99.99
  • orders: The name of your data model (e.g., a table name).
  • type: The type of model (e.g., table, object).
  • fields: A list of expected columns/attributes in your dataset.
  • field_name: The name of the column (e.g., order_id).
  • type: The data type (e.g., string, integer, decimal, timestamp, boolean).
  • primaryKey: (Optional) Set to true if this field is part of the primary key.
  • required: (Optional) Set to true if this field cannot be null.
  • description: A clear explanation of what the field represents.
  • examples: (Optional) Illustrative examples of valid values.

quality: Adding Data Quality Expectations

This is where you define the actual data quality rules that the datacontract CLI will verify. You can specify various types of quality checks.

Let’s add some basic quality checks for our orders model:

models:
  orders:
    # ... (previous model definition)
    quality:
      - type: row_count
        description: Ensure there are at least 1000 orders in the dataset.
        mustBe:
          min: 1000
      - type: not_null
        field: order_id
        description: order_id must never be null.
      - type: not_null
        field: order_timestamp
        description: order_timestamp must never be null.
      - type: unique
        field: order_id
        description: order_id must be unique across all records.
      - type: custom_sql
        name: order_total_range_check
        description: 95% of all order total values are expected to be between 10 and 1000 USD.
        query: |
          SELECT SUM(CASE WHEN order_total_usd BETWEEN 10 AND 1000 THEN 1 ELSE 0 END) * 1.0 / COUNT(*) AS percentage_in_range
          FROM orders_table; -- 'orders_table' will be replaced by the actual table name by the CLI
        mustBe:
          min: 0.95 # At least 95% of values should be in range
  • type: The type of quality check (e.g., row_count, not_null, unique, custom_sql).
  • field: (For field-specific checks) The name of the field to check.
  • description: A human-readable explanation of the quality rule.
  • mustBe: The conditions that must be met (e.g., min, max, equals).
  • custom_sql: For advanced checks, you can provide a SQL query. The datacontract CLI will execute this against your data source. Note that orders_table is a placeholder that will be replaced by the actual table name derived from your server and model configuration.
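To make the mechanics concrete, here is a toy sketch of how declarative rules like these can be compiled into SQL probes. This is illustrative only and does not reproduce the CLI's internals; the rule dicts mirror the YAML entries above.

```python
# Each quality rule becomes a SQL query whose result is compared
# against the rule's expectation (e.g., zero nulls, zero duplicates).

def quality_rule_to_sql(rule: dict, table: str) -> str:
    """Translate one quality-rule dict into a SQL probe query."""
    kind = rule["type"]
    if kind == "row_count":
        return f"SELECT COUNT(*) FROM {table};"
    if kind == "not_null":
        return f"SELECT COUNT(*) FROM {table} WHERE {rule['field']} IS NULL;"
    if kind == "unique":
        # duplicates exist iff total count exceeds distinct count
        return f"SELECT COUNT(*) - COUNT(DISTINCT {rule['field']}) FROM {table};"
    if kind == "custom_sql":
        # substitute the placeholder table name used in the contract
        return rule["query"].replace("orders_table", table)
    raise ValueError(f"unsupported rule type: {kind}")

print(quality_rule_to_sql({"type": "not_null", "field": "order_id"}, "orders"))
# SELECT COUNT(*) FROM orders WHERE order_id IS NULL;
```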

By filling out these sections in your my_first_contract.yaml file, you're building a comprehensive agreement about your data. In the next section, we'll see how to validate this contract using the lint command.

Linting Your Data Contract: Catching Issues Early

You’ve just crafted your first datacontract.yaml file, defining the blueprint for your data product. Before you even think about connecting to a data source and running tests, it's critical to ensure that your data contract itself is well-formed and adheres to the specification. This is where linting comes in.

Linting is like having a diligent editor review your YAML file for syntax errors, structural inconsistencies, and potential issues that could lead to problems down the line. It’s a quick, local check that saves you headaches later.

The datacontract lint Command

To lint your data contract, simply use the datacontract lint command, pointing it to your YAML file:

datacontract lint my_first_contract.yaml

When you run this command, the datacontract CLI will parse your my_first_contract.yaml file and report any findings.

  • On Success: If your contract is perfectly valid, you’ll see a message indicating success, typically something like:
✅ Data Contract is valid.
  • On Failure: If there are issues, the CLI will provide detailed error messages, indicating the line number and the nature of the problem.

Interpreting Linting Results: Common Errors and How to Fix Them

Linting errors are your friends! They help you catch mistakes early. Here are some common issues you might encounter and how to address them:

YAML Syntax Errors:

  • Problem: Incorrect indentation, missing colons, invalid characters. YAML is very sensitive to whitespace.
  • Example Error: yaml.scanner.ScannerError: mapping values are not allowed here or expected <indicator>.
  • Fix: Double-check your indentation. Ensure consistent use of spaces (not tabs). Use a YAML linter in your text editor (like VS Code’s YAML extension) for real-time feedback.

Missing Required Fields:

  • Problem: The dataContractSpecification might require certain fields (e.g., title, version under info) that are missing.
  • Example Error: ValidationError: 'title' is a required property.
  • Fix: Add the missing required field as specified in the error message. Refer to the datacontract.com documentation for the full specification if unsure.

Invalid Data Types or Values:

  • Problem: You’ve used an unrecognized type for a field (e.g., str instead of string) or an invalid value for a property (e.g., a string where a number is expected).
  • Example Error: ValidationError: 'str' is not a valid 'type' for field 'my_field'.
  • Fix: Correct the data type to one supported by the datacontract specification (e.g., string, integer, decimal, timestamp, boolean).

Incorrect Structure:

  • Problem: Misplacing a section (e.g., quality block outside of models).
  • Example Error: ValidationError: Additional properties are not allowed ('quality' was unexpected).
  • Fix: Review the expected structure of the datacontract.yaml file. The datacontract init template is a great reference for correct placement.
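Under the hood, this kind of validation is structural checking of the parsed YAML. A toy sketch of the idea, operating on an already-parsed contract dict (the real CLI validates against the full Data Contract Specification, not this short list):

```python
# Minimal lint-style check: verify a few required properties exist,
# returning error strings similar in spirit to the messages above.

REQUIRED_TOP_LEVEL = ["dataContractSpecification", "id", "info"]
REQUIRED_INFO = ["title", "version"]

def lint(contract: dict) -> list[str]:
    """Return a list of validation errors (empty list means valid)."""
    errors = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in contract:
            errors.append(f"'{key}' is a required property")
    for key in REQUIRED_INFO:
        if key not in contract.get("info", {}):
            errors.append(f"'info.{key}' is a required property")
    return errors

contract = {"dataContractSpecification": "1.1.0", "id": "urn:x", "info": {"title": "T"}}
print(lint(contract))  # ["'info.version' is a required property"]
```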

By making datacontract lint a standard part of your development workflow, especially before committing changes, you ensure that your data contracts are always syntactically correct and follow the specification. This sets the stage for reliable data testing in the next step!

Testing Your Data Against the Contract

Linting ensures your data contract is well-formed, but the real power of datacontract CLI comes from its ability to test your actual data against the definitions in your contract. This is your real-world validation step, ensuring that the data flowing through your pipelines lives up to its promise.

For datacontract CLI to connect to your data source (like S3, BigQuery, Snowflake, etc.), it needs credentials. The safest and most common way to provide these is through environment variables. The specific variables will depend on your data source type defined in the servers section of your datacontract.yaml.

For our S3 example, you’ll typically need:

export DATACONTRACT_S3_ACCESS_KEY_ID="YOUR_S3_ACCESS_KEY_ID"
export DATACONTRACT_S3_SECRET_ACCESS_KEY="YOUR_S3_SECRET_ACCESS_KEY"
# Optionally, if your bucket is in a specific region:
export DATACONTRACT_S3_REGION="your-aws-region"

Important Notes:

  • Security: Never hardcode credentials directly into your datacontract.yaml or version control them. Use environment variables or a secure secret management system.
  • Other Data Sources: For other data sources (e.g., BigQuery, Snowflake, PostgreSQL), the environment variables will differ. Refer to the datacontract.com documentation for the exact variable names (e.g., DATACONTRACT_BIGQUERY_PROJECT_ID, DATACONTRACT_SNOWFLAKE_USERNAME).

The datacontract test Command

Once your environment variables are set, you can execute the test command. You’ll specify your data contract file and the server environment you want to test against (as defined in your servers block).

datacontract test my_first_contract.yaml --server production

This command will:

  1. Parse your my_first_contract.yaml.
  2. Connect to the data source specified under the production server using the provided credentials.
  3. Execute all schema and quality checks defined in your models section against the actual data.

Analyzing Test Results: Schema Validation, Data Quality Checks, Success/Failure Scenarios

The output of the datacontract test command is designed to be clear and actionable.

  • Success: If all checks pass, you’ll see a clear success message:
✅ Data Contract tests passed.

This means your actual data adheres to the schema and quality expectations defined in your contract.

  • Failure: If any tests fail, the CLI will provide detailed information about which checks failed and why. This is incredibly valuable for debugging data issues.

Example of a potential failure for a not_null check:

❌ Data Contract tests failed.
- Model: orders
  - Field: order_id
    - Check: not_null (Failed)
      Description: order_id must never be null.
      Details: Found 5 null values in 'order_id'.

Example of a potential failure for a row_count check:

❌ Data Contract tests failed.
- Model: orders
  - Check: row_count (Failed)
    Description: Ensure there are at least 1000 orders in the dataset.
    Details: Actual row count was 850, expected min 1000.

You’ll see similar detailed output for schema mismatches (e.g., incorrect data types, missing fields) and any custom_sql checks that fail.
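Conceptually, each check is just a predicate over the data. The following toy sketch runs not_null and row_count checks over in-memory CSV rows; the real CLI instead pushes these checks down to your data source.

```python
# Apply two contract-style quality checks to a small CSV sample.
import csv
import io

CSV_DATA = """order_id,order_total_usd
ORD1,120.50
,99.99
ORD3,45.00
"""

rows = list(csv.DictReader(io.StringIO(CSV_DATA)))

def check_not_null(rows: list[dict], field: str) -> tuple[bool, int]:
    """Return (passed, null_count) for a not_null check on `field`."""
    nulls = sum(1 for r in rows if not r[field])
    return nulls == 0, nulls

def check_row_count(rows: list[dict], minimum: int) -> tuple[bool, int]:
    """Return (passed, actual_count) for a minimum row-count check."""
    return len(rows) >= minimum, len(rows)

print(check_not_null(rows, "order_id"))   # (False, 1) -- one null order_id
print(check_row_count(rows, 1000))        # (False, 3) -- only 3 rows present
```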

What to do when tests fail:

  • Investigate the Data: The error messages will guide you. Check your data source to understand why the data isn’t conforming to the contract. Is there an upstream data pipeline issue? Has a schema changed unexpectedly?
  • Adjust the Contract (if necessary): Sometimes, the data contract might be too strict, or the data has legitimately changed. In such cases, you might need to update your datacontract.yaml to reflect the new reality (and ideally, communicate these changes to consumers!).
  • Fix the Data Pipeline: More often, a test failure indicates a problem with the data production process. The contract acts as an early warning system, helping you catch and fix issues before they impact downstream consumers.

By consistently running datacontract test as part of your data pipeline's CI/CD, you establish a powerful guardrail, ensuring data quality and preventing unexpected breakage.

Detecting Breaking Changes

Data contracts are not static; they evolve as your data products change. However, modifying a data contract without careful consideration can lead to breaking changes — alterations that negatively impact existing data consumers. The datacontract CLI provides a crucial command to proactively identify such changes, helping you maintain backward compatibility and prevent downstream failures.

The datacontract breaking-changes Command

This command compares two versions of a data contract (an “old” and a “new” version) and reports any breaking changes. This is invaluable in a CI/CD pipeline, where you can automatically check proposed contract changes before they are merged.

datacontract breaking-changes old_contract.yaml new_contract.yaml

Here:

  • old_contract.yaml: Represents the current, deployed version of your data contract.
  • new_contract.yaml: Represents the proposed, modified version of your data contract.

The CLI will analyze these two files and report any changes that could break existing consumers, such as:

  • Removing a required field.
  • Changing a field's data type to an incompatible one (e.g., integer to string is usually okay, string to integer is not).
  • Making an optional field required.
  • Renaming a field without providing an alias/migration strategy.
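Conceptually, the check diffs the field definitions of the two contract versions. A simplified sketch of that comparison (illustrative only, not the CLI's implementation; note that a naive diff reports a rename as a removal plus an addition):

```python
# Compare two versions of a model's field map and flag consumer-breaking edits.

def find_breaking_changes(old_fields: dict, new_fields: dict) -> list[str]:
    """Return human-readable descriptions of breaking field changes."""
    changes = []
    for name, spec in old_fields.items():
        if name not in new_fields:
            changes.append(f"Removed field '{name}'")
        elif new_fields[name].get("type") != spec.get("type"):
            changes.append(f"Changed type of '{name}'")
        elif new_fields[name].get("required", False) and not spec.get("required", False):
            changes.append(f"Made optional field '{name}' required")
    return changes

old_fields = {
    "order_id": {"type": "string", "primaryKey": True},
    "customer_id": {"type": "string"},
    "order_total_usd": {"type": "decimal", "required": True},
}
new_fields = {
    "order_id": {"type": "string", "primaryKey": True},
    "total_amount_usd": {"type": "decimal", "required": True},
}
print(find_breaking_changes(old_fields, new_fields))
# ["Removed field 'customer_id'", "Removed field 'order_total_usd'"]
```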

Scenario: Make a Breaking Change and See the CLI in Action

Let's illustrate this with a quick example.

Step 1: Create an old_contract.yaml Copy your my_first_contract.yaml to old_contract.yaml. This will be our baseline.

cp my_first_contract.yaml old_contract.yaml

Step 2: Introduce a Breaking Change in my_first_contract.yaml Open my_first_contract.yaml (which now acts as new_contract.yaml for this test) and make a breaking change. For instance, let's rename order_total_usd to total_amount_usd and remove the customer_id field, making it a breaking change for any consumer relying on it.

# my_first_contract.yaml (excerpt with breaking change)
models:
  orders:
    fields:
      order_id:
        type: string
        primaryKey: true
        description: Unique identifier for the order.
      # customer_id: Removed this field - BREAKING CHANGE!
      order_timestamp:
        type: timestamp
        required: true
        description: The UTC timestamp when the order was placed.
      total_amount_usd: # Renamed from order_total_usd - BREAKING CHANGE!
        type: decimal
        required: true
        description: Total amount of the order in USD.

Step 3: Run the breaking-changes Command

Now, execute the command:

datacontract breaking-changes old_contract.yaml my_first_contract.yaml

You will see output similar to this, highlighting the breaking changes:

❌ Breaking changes detected!
- Removed field 'customer_id' from model 'orders'. This is a breaking change.
- Renamed field 'order_total_usd' to 'total_amount_usd' in model 'orders'. This is a breaking change.

This immediate feedback is invaluable. It tells you exactly what changes are breaking and helps you decide how to proceed.

Integrating into CI/CD for Automated Breaking Change Detection

The true power of the breaking-changes command is realized when integrated into your Continuous Integration/Continuous Delivery (CI/CD) pipeline.

How to integrate:

  1. Version Control: Ensure your datacontract.yaml files are under version control (e.g., Git).
  2. Pull Request Checks: In your CI/CD pipeline (e.g., GitHub Actions, GitLab CI, Jenkins), set up a job that runs the datacontract breaking-changes command whenever a pull request (PR) is opened or updated for your data contract.
  3. Comparison Logic:
  • Fetch the datacontract.yaml from the main (or production) branch. This is your old_contract.yaml.
  • Use the datacontract.yaml from the current feature branch (the one in the PR). This is your new_contract.yaml.
  • Run datacontract breaking-changes old_contract.yaml new_contract.yaml.

  4. Fail the Build: Configure the CI/CD job to fail if the breaking-changes command detects any breaking changes. This prevents problematic contract changes from being merged into your main branch without explicit approval or a migration plan.

By automating this check, you create a robust safety net. Data producers are immediately notified of changes that could break consumers, fostering better communication, planned deprecations, and a more stable data ecosystem.
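Putting the steps above together, a pull-request check might look like the following. This is a hypothetical GitHub Actions workflow; the file paths, branch names, and job layout are illustrative assumptions, not an official recipe.

```yaml
# .github/workflows/datacontract.yml (hypothetical example)
name: data-contract-checks
on:
  pull_request:
    paths: ["datacontract.yaml"]
jobs:
  breaking-changes:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history, so the main-branch version is available
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install 'datacontract-cli[all]'
      - name: Compare contract against main
        run: |
          git show origin/main:datacontract.yaml > old_contract.yaml
          datacontract lint datacontract.yaml
          datacontract breaking-changes old_contract.yaml datacontract.yaml
```

If the last step exits non-zero, the job fails and the PR cannot be merged without review.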

Exporting Your Contract

Beyond validation and breaking change detection, the datacontract CLI offers another powerful capability: exporting your data contracts into various useful formats. This is incredibly handy for documentation, code generation, and integrating with other data tools.

The datacontract export Command

The export command allows you to transform your datacontract.yaml into different outputs. The most common use case for data engineers is generating SQL Data Definition Language (DDL).

datacontract export my_first_contract.yaml --format sql

This command will read your my_first_contract.yaml file and, based on the models defined, generate SQL CREATE TABLE statements that reflect your schema. The output will be printed directly to your console, which you can then redirect to a file:

datacontract export my_first_contract.yaml --format sql > orders_ddl.sql

The generated SQL will typically look something like this (depending on the database dialect and your contract’s definitions):

-- Data Contract: urn:datacontract:your-org:my-first-contract
-- SQL Dialect: <inferred_dialect_or_default>

CREATE TABLE orders (
  order_id TEXT NOT NULL PRIMARY KEY,
  customer_id TEXT,
  order_timestamp TIMESTAMP WITH TIME ZONE NOT NULL,
  order_total_usd NUMERIC NOT NULL
);

The CLI intelligently maps the data contract types (e.g., string, timestamp, decimal) to the appropriate SQL types for a generic dialect.
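The mapping idea can be sketched in a few lines. This is a toy illustration of type mapping and DDL assembly, not the CLI's actual mapping table or dialect handling.

```python
# Map contract field types to generic SQL types and assemble a CREATE TABLE.

SQL_TYPES = {
    "string": "TEXT",
    "integer": "INTEGER",
    "decimal": "NUMERIC",
    "timestamp": "TIMESTAMP WITH TIME ZONE",
    "boolean": "BOOLEAN",
}

def to_ddl(table: str, fields: dict) -> str:
    """Render a contract's field map as a generic CREATE TABLE statement."""
    cols = []
    for name, spec in fields.items():
        col = f"  {name} {SQL_TYPES[spec['type']]}"
        if spec.get("required") or spec.get("primaryKey"):
            col += " NOT NULL"          # primary keys are implicitly required
        if spec.get("primaryKey"):
            col += " PRIMARY KEY"
        cols.append(col)
    return f"CREATE TABLE {table} (\n" + ",\n".join(cols) + "\n);"

print(to_ddl("orders", {
    "order_id": {"type": "string", "primaryKey": True},
    "order_total_usd": {"type": "decimal", "required": True},
}))
```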

Use Cases: Generating DDL for Databases, Documentation, etc.

The datacontract export command has several practical applications:

  1. Automated DDL Generation for Databases:
  • You can use this to automatically generate CREATE TABLE scripts that align perfectly with your data contracts. This ensures that your actual database schemas are always in sync with your defined contracts, reducing manual errors and discrepancies.
  • This is especially useful in schema migration pipelines where new tables or schema versions need to be deployed.

  2. Living Documentation:

  • Exporting your contracts to human-readable formats (like Markdown, though SQL DDL itself is quite readable for schema) can serve as living documentation for your data products.
  • Consumers can easily understand the schema and expected data types without needing access to the database or complex data catalogs.

  3. Integration with Other Tools:

  • While sql is a primary format, the datacontract CLI might support or expand to other formats in the future (check the official documentation). This allows for broader interoperability with other data governance, quality, or cataloging tools that consume specific metadata formats.

By leveraging the export command, your data contracts become more than just validation rules; they become generative assets that streamline your data engineering workflows and improve data discoverability and understanding across your organization.

The Power of datacontract CLI in Your Data Stack

You’ve now walked through the essential steps of defining, linting, testing, detecting breaking changes, and exporting data contracts using the datacontract CLI. As an expert data engineer, I can confidently say that incorporating data contracts into your workflow isn't just a best practice; it's a strategic imperative for building resilient and trustworthy data platforms.

Recap of Benefits: Standardization, Quality, Trust, Automation

Let’s quickly recap the immense power that datacontract CLI brings to your data stack:

  • Standardization: Data contracts provide a consistent, machine-readable format to define expectations for all your data products, reducing ambiguity and fostering a common language across teams.
  • Quality: By enforcing schema and quality rules directly against your actual data, you catch issues early, before they escalate into costly downstream failures. The CLI acts as your automated data quality guardian.
  • Trust: When consumers know that data products are backed by explicit contracts and automated validation, their trust in the data increases dramatically. This leads to more confident decision-making and faster innovation.
  • Automation: The datacontract CLI is designed for automation. Its commands integrate seamlessly into CI/CD pipelines, allowing you to automatically validate contracts, detect breaking changes, and even generate DDL, reducing manual effort and improving efficiency.

Next Steps: Explore Advanced Features, Integrate into Your Workflow

This post has covered the hands-on basics, but the datacontract CLI offers more to explore:

  • Advanced Quality Checks: Dive deeper into more complex custom_sql rules or explore other built-in quality assertion types.
  • Different Server Types: Experiment with connecting to various data sources like BigQuery, Snowflake, or PostgreSQL by configuring their respective servers blocks and environment variables.
  • Full CI/CD Integration: Take the breaking-changes and test commands and embed them fully into your data pipelines' CI/CD workflows for continuous validation and governance.
  • Community and Documentation: Stay updated with the latest features and best practices by consulting the official cli.datacontract.com documentation and engaging with the community.

By embracing data contracts with the datacontract CLI, you're not just documenting your data; you're actively governing it, transforming it into a reliable asset that truly empowers your organization. Start small, integrate incrementally, and watch your data quality and team efficiency soar!

Written by Chris Shayan

I'm driven by a singular purpose: "Relentlessly elevating experience." Head of AI at Backbase | https://www.chrisshayan.com/