Data Contracts in Action
In the world of data, we often talk about “data products” and “data as a service.” Sounds great, right? But here’s the dirty little secret: too often, data isn’t reliable. It’s like building a beautiful house on a shaky foundation.
Imagine you’re a data consumer — perhaps a business analyst creating a critical report, or a machine learning engineer building a predictive model. You pull data from a source, only for the schema to unexpectedly change, a column to go missing, or the data quality to plummet without warning. Suddenly:
- Your dashboards break.
- Your reports show incorrect numbers.
- Your machine learning models start making bad predictions.
- You spend hours, days, or even weeks debugging data issues, not building value.
This isn’t just annoying; it erodes trust, slows down innovation, and costs organizations significant time and money. The root cause? A lack of clear, enforced agreements between data producers (those generating the data) and data consumers (those using it). This is where data contracts come in.
Why the datacontract CLI?
While the concept of data contracts is powerful, implementing them shouldn't be a bureaucratic nightmare. This is precisely where the datacontract CLI (https://cli.datacontract.com/) shines.
It’s an open-source, platform-agnostic command-line tool that empowers data teams to:
- Define data contracts in a simple YAML format.
- Lint contracts to catch syntax errors and best practice violations.
- Test actual data against the contract for schema and quality compliance.
- Detect breaking changes before they wreak havoc.
- Export contracts into various formats (like SQL DDL).
In short, the datacontract CLI makes data contracts not just a concept, but a tangible, automatable part of your data workflow. It's built for engineers, by engineers, focusing on practical implementation and seamless integration into your existing CI/CD pipelines.
Getting Started: Installation & First Command
Now that we understand why data contracts are crucial and what the datacontract CLI offers, let's get our hands dirty and set it up. This section will guide you through the quick installation process and how to verify everything is working as expected.
Prerequisites
Before you begin, make sure you have one of the following:
- Python 3.10+: The datacontract-cli is a Python application. We recommend Python 3.11 for optimal compatibility.
- Docker: If you prefer a containerized approach or don't want to manage Python environments directly, Docker is an excellent alternative.
The datacontract CLI is straightforward to install. Choose the method that best fits your workflow:
Using pip (Recommended for Python users): This method installs datacontract-cli along with all necessary database connectors (like those for S3, BigQuery, Snowflake, etc.):
pip install 'datacontract-cli[all]'
If you prefer to use a virtual environment (which is good practice for Python projects), you can do the following:
python -m venv venv
source venv/bin/activate # On Windows, use `venv\Scripts\activate`
pip install 'datacontract-cli[all]'
Using Docker (Alternative): If you prefer to run the datacontract CLI in an isolated container, you can pull the official Docker image:
docker pull datacontract/cli
You'll then run commands by prefixing them with docker run --rm -v ${PWD}:/home/datacontract datacontract/cli. We'll see an example of this in the verification step.
Verifying Installation
After installation, it's always a good idea to confirm that the datacontract CLI is correctly installed and accessible in your environment.
If installed with pip:
datacontract --version
If using Docker:
docker run --rm datacontract/cli --version
Congratulations! You've successfully installed the datacontract CLI. You're now ready to start crafting your first data contract.
Crafting Your First Data Contract (datacontract.yaml)
With the datacontract CLI installed, it's time to define your data contract. Data contracts are typically written in YAML, a human-friendly data serialization standard. The datacontract CLI provides an excellent starting point with its init command.
The datacontract init Command: Quick Start with a Template
The easiest way to begin is by using the init command. It generates a basic datacontract.yaml file with a pre-filled template, saving you from starting from scratch.
Let's create a new file named my_first_contract.yaml:
datacontract init my_first_contract.yaml
This command creates my_first_contract.yaml in your current directory. Open it with your favorite text editor, and you'll see a structure similar to this (comments and some details omitted for brevity, but they'll be in your generated file):
dataContractSpecification: 1.1.0
id: urn:datacontract:your-org:my-first-contract
info:
  title: My First Data Contract
  version: 0.1.0
  owner: Data Engineering Team
servers: {}
models: {}
This boilerplate provides the foundational sections for your data contract. Let’s break down each key part.
Understanding the Structure
A data contract YAML file is structured logically to define everything about your data product.
info: Basic Metadata
This section holds general information about your data contract. It’s crucial for documentation and understanding the purpose and ownership of the data.
info:
  title: Orders Data Product
  version: 1.0.0
  owner: Sales Analytics Team <sales-analytics@example.com>
  description: |
    This contract defines the structure and quality expectations for the 'orders'
    data product, capturing key transaction details.
- title: A human-readable name for your data contract.
- version: The version of this specific data contract. Follow semantic versioning (e.g., 1.0.0, 1.1.0, 2.0.0).
- owner: The team or individual responsible for this data product.
- description: A more detailed explanation of what this data contract covers.
servers: Defining Your Data Sources
The servers section specifies where your data actually lives. This is how the datacontract CLI knows how to connect to and test your data. You can define multiple servers (e.g., production, staging, development).
Let's add a simple S3 CSV server configuration for demonstration. This tells the CLI that our orders data is a CSV file in a specific S3 bucket.
servers:
  production:
    type: s3
    location: s3://your-company-data-lake/sales/orders/
    format: csv # Specify the format if applicable
    delimiter: ","
    header: true
- production: This is a logical name for your environment.
- type: The type of data source (e.g., s3, bigquery, snowflake, postgresql).
- location: The path to your data.
- format: (Optional) The file format (e.g., csv, parquet, json).
- delimiter, header: (Optional) Format-specific options for CSV.
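Because the servers block can hold several environments, a second entry can sit alongside production. A minimal sketch, assuming a separate (hypothetical) staging bucket:
servers:
  production:
    type: s3
    location: s3://your-company-data-lake/sales/orders/
    format: csv
  staging: # hypothetical pre-production environment
    type: s3
    location: s3://your-company-data-lake-staging/sales/orders/ # illustrative bucket name
    format: csv
You would then choose the environment at test time with --server staging or --server production.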
Remember that for the datacontract CLI to connect to services like S3, you'll need to provide credentials, typically via environment variables (e.g., DATACONTRACT_S3_ACCESS_KEY_ID, DATACONTRACT_S3_SECRET_ACCESS_KEY). We'll cover this more in the "Testing" section.
models: Describing Your Data Schema
The models section is where you define the expected schema of your data product. Each model represents a dataset or a table.
Let's define a simple orders model with a few key fields:
models:
  orders:
    type: table # Or 'object' for other data structures
    description: Details of customer orders.
    fields:
      order_id:
        type: string
        primaryKey: true
        description: Unique identifier for the order.
      customer_id:
        type: string
        description: Unique identifier for the customer.
        examples:
          - "CUST001"
          - "CUST002"
      order_timestamp:
        type: timestamp
        required: true
        description: The UTC timestamp when the order was placed.
      order_total_usd:
        type: decimal
        required: true
        description: Total amount of the order in USD.
        examples:
          - 123.45
          - 99.99
- orders: The name of your data model (e.g., a table name).
- type: The type of model (e.g., table, object).
- fields: A map of the expected columns/attributes in your dataset. For each field:
  - field_name: The name of the column (e.g., order_id).
  - type: The data type (e.g., string, integer, decimal, timestamp, boolean).
  - primaryKey: (Optional) Set to true if this field is part of the primary key.
  - required: (Optional) Set to true if this field cannot be null.
  - description: A clear explanation of what the field represents.
  - examples: (Optional) Illustrative examples of valid values.
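The specification also supports richer field constraints than shown above. As a hedged sketch, a hypothetical order_status field could be restricted to a fixed set of values with enum (check the specification for the full set of supported constraints):
fields:
  order_status: # hypothetical field, shown only to illustrate constraints
    type: string
    required: true
    enum: # value must be one of these
      - PLACED
      - SHIPPED
      - DELIVERED
    description: Lifecycle state of the order.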
quality: Adding Data Quality Expectations
This is where you define the actual data quality rules that the datacontract CLI will verify. You can specify various types of quality checks.
Let's add some basic quality checks for our orders model:
models:
  orders:
    # ... (previous model definition)
    quality:
      - type: row_count
        description: Ensure there are at least 1000 orders in the dataset.
        mustBe:
          min: 1000
      - type: not_null
        field: order_id
        description: order_id must never be null.
      - type: not_null
        field: order_timestamp
        description: order_timestamp must never be null.
      - type: unique
        field: order_id
        description: order_id must be unique across all records.
      - type: custom_sql
        name: order_total_range_check
        description: 95% of all order total values are expected to be between 10 and 1000 USD.
        query: |
          SELECT SUM(CASE WHEN order_total_usd BETWEEN 10 AND 1000 THEN 1 ELSE 0 END) * 1.0 / COUNT(*) AS percentage_in_range
          FROM orders_table -- 'orders_table' will be replaced by the actual table name by the CLI
        mustBe:
          min: 0.95 # At least 95% of values should be in range
- type: The type of quality check (e.g., row_count, not_null, unique, custom_sql).
- field: (For field-specific checks) The name of the field to check.
- description: A human-readable explanation of the quality rule.
- mustBe: The conditions that must be met (e.g., min, max, equals).
- custom_sql: For advanced checks, you can provide a SQL query. The datacontract CLI will execute it against your data source. Note that orders_table is a placeholder that will be replaced by the actual table name derived from your server and model configuration.
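The exact quality syntax depends on the specification version you target. Recent versions of the Data Contract Specification express SQL checks as a query plus a threshold key; a minimal sketch of that style (verify against the documentation for your version):
quality:
  - type: sql
    description: The dataset should contain at least 1000 orders.
    query: |
      SELECT COUNT(*) FROM orders
    mustBeGreaterThanOrEqualTo: 1000 # check fails below this threshold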
By filling out these sections in your my_first_contract.yaml file, you're building a comprehensive agreement about your data. In the next section, we'll see how to validate this contract using the lint command.
Linting Your Data Contract: Catching Issues Early
You've just crafted your first datacontract.yaml file, defining the blueprint for your data product. Before you even think about connecting to a data source and running tests, it's critical to ensure that the contract itself is well-formed and adheres to the specification. This is where linting comes in.
Linting is like having a diligent editor review your YAML file for syntax errors, structural inconsistencies, and potential issues that could lead to problems down the line. It’s a quick, local check that saves you headaches later.
The datacontract lint Command
To lint your data contract, simply use the datacontract lint command, pointing it at your YAML file:
datacontract lint my_first_contract.yaml
When you run this command, the datacontract CLI will parse your my_first_contract.yaml file and report any findings.
- On Success: If your contract is perfectly valid, you’ll see a message indicating success, typically something like:
✅ Data Contract is valid.
- On Failure: If there are issues, the CLI will provide detailed error messages, indicating the line number and the nature of the problem.
Interpreting Linting Results: Common Errors and How to Fix Them
Linting errors are your friends! They help you catch mistakes early. Here are some common issues you might encounter and how to address them:
YAML Syntax Errors:
- Problem: Incorrect indentation, missing colons, invalid characters. YAML is very sensitive to whitespace.
- Example Error: yaml.scanner.ScannerError: mapping values are not allowed here or expected <indicator>.
- Fix: Double-check your indentation. Ensure consistent use of spaces (not tabs). Use a YAML linter in your text editor (like VS Code's YAML extension) for real-time feedback. See the sketch below.
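For example, a single mis-indented key is enough to trip the parser; the broken and fixed variants are shown together for comparison:
# Broken: 'version' is nested under 'title', so the parser
# reports "mapping values are not allowed here"
info:
  title: My First Data Contract
    version: 0.1.0

# Fixed: sibling keys align at the same indentation level
info:
  title: My First Data Contract
  version: 0.1.0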
Missing Required Fields:
- Problem: The dataContractSpecification requires certain fields (e.g., title and version under info) that may be missing.
- Example Error: ValidationError: 'title' is a required property.
- Fix: Add the missing required field as indicated in the error message. Refer to the datacontract.com documentation for the full specification if unsure.
Invalid Data Types or Values:
- Problem: You've used an unrecognized type for a field (e.g., str instead of string) or an invalid value for a property (e.g., a string where a number is expected).
- Example Error: ValidationError: 'str' is not a valid 'type' for field 'my_field'.
- Fix: Correct the data type to one supported by the datacontract specification (e.g., string, integer, decimal, timestamp, boolean).
Incorrect Structure:
- Problem: Misplacing a section (e.g., a quality block outside of models).
- Example Error: ValidationError: Additional properties are not allowed ('quality' was unexpected).
- Fix: Review the expected structure of the datacontract.yaml file. The datacontract init template is a great reference for correct placement.
By making datacontract lint a standard part of your development workflow, especially before committing changes, you ensure that your data contracts are always syntactically correct and follow the specification. This sets the stage for reliable data testing in the next step!
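If you use the pre-commit framework, one way to automate this is a local hook. A sketch, assuming datacontract-cli is installed in your environment and your contracts live under a datacontracts/ directory (adjust the files pattern to your repo layout):
# .pre-commit-config.yaml (illustrative)
repos:
  - repo: local
    hooks:
      - id: datacontract-lint
        name: datacontract lint
        entry: datacontract lint # matched file paths are passed as arguments
        language: system # assumes the datacontract binary is on PATH
        files: ^datacontracts/.*\.ya?ml$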
Testing Your Data Against the Contract
Linting ensures your data contract is well-formed, but the real power of the datacontract CLI comes from its ability to test your actual data against the definitions in your contract. This is your real-world validation step, ensuring that the data flowing through your pipelines lives up to its promise.
For the datacontract CLI to connect to your data source (like S3, BigQuery, Snowflake, etc.), it needs credentials. The safest and most common way to provide these is through environment variables. The specific variables depend on the data source type defined in the servers section of your datacontract.yaml.
For our S3 example, you’ll typically need:
export DATACONTRACT_S3_ACCESS_KEY_ID="YOUR_S3_ACCESS_KEY_ID"
export DATACONTRACT_S3_SECRET_ACCESS_KEY="YOUR_S3_SECRET_ACCESS_KEY"
# Optionally, if your bucket is in a specific region:
export DATACONTRACT_S3_REGION="your-aws-region"
Important Notes:
- Security: Never hardcode credentials directly into your datacontract.yaml or commit them to version control. Use environment variables or a secure secret management system.
- Other Data Sources: For other data sources (e.g., BigQuery, Snowflake, PostgreSQL), the environment variables will differ. Refer to the datacontract.com documentation for the exact variable names (e.g., DATACONTRACT_BIGQUERY_PROJECT_ID, DATACONTRACT_SNOWFLAKE_USERNAME).
The datacontract test Command
Once your environment variables are set, you can execute the test command. You'll specify your data contract file and the server environment you want to test against (as defined in your servers block):
datacontract test my_first_contract.yaml --server production
This command will:
- Parse your my_first_contract.yaml.
- Connect to the data source specified under the production server using the provided credentials.
- Execute all schema and quality checks defined in your models section against the actual data.
Analyzing Test Results: Schema Validation, Data Quality Checks, Success/Failure Scenarios
The output of the datacontract test command is designed to be clear and actionable.
- Success: If all checks pass, you’ll see a clear success message:
✅ Data Contract tests passed.
This means your actual data adheres to the schema and quality expectations defined in your contract.
- Failure: If any tests fail, the CLI will provide detailed information about which checks failed and why. This is incredibly valuable for debugging data issues.
Example of a potential failure for a not_null check:
❌ Data Contract tests failed.
- Model: orders
- Field: order_id
- Check: not_null (Failed)
Description: order_id must never be null.
Details: Found 5 null values in 'order_id'.
Example of a potential failure for a row_count check:
❌ Data Contract tests failed.
- Model: orders
- Check: row_count (Failed)
Description: Ensure there are at least 1000 orders in the dataset.
Details: Actual row count was 850, expected min 1000.
You'll see similar detailed output for schema mismatches (e.g., incorrect data types, missing fields) and for any custom_sql checks that fail.
What to do when tests fail:
- Investigate the Data: The error messages will guide you. Check your data source to understand why the data isn’t conforming to the contract. Is there an upstream data pipeline issue? Has a schema changed unexpectedly?
- Adjust the Contract (if necessary): Sometimes the data contract is too strict, or the data has legitimately changed. In such cases, you may need to update your datacontract.yaml to reflect the new reality (and ideally, communicate these changes to consumers!).
- Fix the Data Pipeline: More often, a test failure indicates a problem with the data production process. The contract acts as an early warning system, helping you catch and fix issues before they impact downstream consumers.
By consistently running datacontract test as part of your data pipeline's CI/CD, you establish a powerful guardrail, ensuring data quality and preventing unexpected breakage.
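As a concrete starting point, here is a hypothetical GitHub Actions workflow; the file name, schedule, and secret names are assumptions, not part of the CLI:
# .github/workflows/datacontract-test.yml (illustrative)
name: Data Contract Tests
on:
  schedule:
    - cron: "0 6 * * *" # daily, after overnight loads finish
  workflow_dispatch: {} # allow manual runs
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install 'datacontract-cli[all]'
      - run: datacontract test my_first_contract.yaml --server production
        env:
          # Secrets assumed to be configured in the repository settings
          DATACONTRACT_S3_ACCESS_KEY_ID: ${{ secrets.S3_ACCESS_KEY_ID }}
          DATACONTRACT_S3_SECRET_ACCESS_KEY: ${{ secrets.S3_SECRET_ACCESS_KEY }}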
Detecting Breaking Changes
Data contracts are not static; they evolve as your data products change. However, modifying a data contract without careful consideration can lead to breaking changes: alterations that negatively impact existing data consumers. The datacontract CLI provides a crucial command to proactively identify such changes, helping you maintain backward compatibility and prevent downstream failures.
The datacontract breaking Command
This command compares two versions of a data contract (an “old” and a “new” version) and reports any breaking changes. This is invaluable in a CI/CD pipeline, where you can automatically check proposed contract changes before they are merged.
datacontract breaking old_contract.yaml new_contract.yaml
Here:
- old_contract.yaml: Represents the current, deployed version of your data contract.
- new_contract.yaml: Represents the proposed, modified version of your data contract.
The CLI will analyze these two files and report any changes that could break existing consumers, such as:
- Removing a required field.
- Changing a field's data type to an incompatible one (e.g., widening integer to string is usually okay; narrowing string to integer is not).
- Making an optional field required.
- Renaming a field without providing an alias/migration strategy.
Scenario: Make a Breaking Change and See the CLI in Action
Let's illustrate this with a quick example.
Step 1: Create an old_contract.yaml
Copy your my_first_contract.yaml to old_contract.yaml. This will be our baseline.
cp my_first_contract.yaml old_contract.yaml
Step 2: Introduce a Breaking Change in my_first_contract.yaml
Open my_first_contract.yaml (which now acts as new_contract.yaml for this test) and make a couple of breaking changes. For instance, let's rename order_total_usd to total_amount_usd and remove the customer_id field entirely, both breaking changes for any consumer relying on them.
# my_first_contract.yaml (excerpt with breaking changes)
models:
  orders:
    fields:
      order_id:
        type: string
        primaryKey: true
        description: Unique identifier for the order.
      # customer_id: removed - BREAKING CHANGE!
      order_timestamp:
        type: timestamp
        required: true
        description: The UTC timestamp when the order was placed.
      total_amount_usd: # Renamed from order_total_usd - BREAKING CHANGE!
        type: decimal
        required: true
        description: Total amount of the order in USD.
Step 3: Run the breaking Command
Now, execute the command:
datacontract breaking old_contract.yaml my_first_contract.yaml
You will see output similar to this, highlighting the breaking changes:
❌ Breaking changes detected!
- Removed field 'customer_id' from model 'orders'. This is a breaking change.
- Renamed field 'order_total_usd' to 'total_amount_usd' in model 'orders'. This is a breaking change.
This immediate feedback is invaluable. It tells you exactly what changes are breaking and helps you decide how to proceed.
Integrating into CI/CD for Automated Breaking Change Detection
The true power of the breaking command is realized when it is integrated into your Continuous Integration/Continuous Delivery (CI/CD) pipeline.
How to integrate:
1. Version Control: Ensure your datacontract.yaml files are under version control (e.g., Git).
2. Pull Request Checks: In your CI/CD pipeline (e.g., GitHub Actions, GitLab CI, Jenkins), set up a job that runs the datacontract breaking command whenever a pull request (PR) is opened or updated for your data contract.
3. Comparison Logic:
- Fetch the datacontract.yaml from the main (or production) branch. This is your old_contract.yaml.
- Use the datacontract.yaml from the current feature branch (the one in the PR). This is your new_contract.yaml.
- Run datacontract breaking old_contract.yaml new_contract.yaml.
4. Fail the Build: Configure the CI/CD job to fail if the breaking command detects any breaking changes. This prevents problematic contract changes from being merged into your main branch without explicit approval or a migration plan. A sketch of such a workflow follows this list.
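Putting these steps together, a hypothetical GitHub Actions workflow for pull requests might look like this (the contract path and branch name are assumptions):
# .github/workflows/datacontract-breaking.yml (illustrative)
name: Data Contract Breaking Change Check
on:
  pull_request:
    paths:
      - "my_first_contract.yaml"
jobs:
  breaking:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install 'datacontract-cli[all]'
      # The contract as it exists on main is the "old" version
      - run: |
          git fetch origin main
          git show origin/main:my_first_contract.yaml > old_contract.yaml
      # A non-zero exit code here fails the build on breaking changes
      - run: datacontract breaking old_contract.yaml my_first_contract.yaml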
By automating this check, you create a robust safety net. Data producers are immediately notified of changes that could break consumers, fostering better communication, planned deprecations, and a more stable data ecosystem.
Exporting Your Contract
Beyond validation and breaking change detection, the datacontract CLI offers another powerful capability: exporting your data contracts into various useful formats. This is incredibly handy for documentation, code generation, and integration with other data tools.
The datacontract export Command
The export command allows you to transform your datacontract.yaml into different outputs. The most common use case for data engineers is generating SQL Data Definition Language (DDL).
datacontract export my_first_contract.yaml --format sql
This command will read your my_first_contract.yaml file and, based on the models defined, generate SQL CREATE TABLE statements that reflect your schema. The output is printed directly to your console, and you can redirect it to a file:
datacontract export my_first_contract.yaml --format sql > orders_ddl.sql
The generated SQL will typically look something like this (depending on the database dialect and your contract’s definitions):
-- Data Contract: urn:datacontract:your-org:my-first-contract
-- SQL Dialect: <inferred_dialect_or_default>
CREATE TABLE orders (
  order_id TEXT NOT NULL PRIMARY KEY,
  customer_id TEXT,
  order_timestamp TIMESTAMP WITH TIME ZONE NOT NULL,
  order_total_usd NUMERIC NOT NULL
);
The CLI intelligently maps the data contract types (e.g., string, timestamp, decimal) to the appropriate SQL types for a generic dialect.
Use Cases: Generating DDL for Databases, Documentation, etc.
The datacontract export command has several practical applications:
1. Automated DDL Generation for Databases:
- You can automatically generate CREATE TABLE scripts that align perfectly with your data contracts. This ensures that your actual database schemas stay in sync with your defined contracts, reducing manual errors and discrepancies (see the CI sketch after this list).
- This is especially useful in schema migration pipelines where new tables or schema versions need to be deployed.
2. Living Documentation:
- Exporting your contracts to human-readable formats (like Markdown, though SQL DDL itself is quite readable for schema) can serve as living documentation for your data products.
- Consumers can easily understand the schema and expected data types without needing access to the database or complex data catalogs.
3. Integration with Other Tools:
- While sql is a primary format, the datacontract CLI supports a growing list of other export formats (check the official documentation for the current list). This allows for broader interoperability with other data governance, quality, or cataloging tools that consume specific metadata formats.
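As one illustration of the DDL use case mentioned above, a hypothetical CI workflow (paths and artifact names are assumptions) could regenerate the DDL whenever the contract changes on main and publish it as a build artifact:
# .github/workflows/datacontract-export.yml (illustrative)
name: Export Data Contract DDL
on:
  push:
    branches: [main]
    paths:
      - "my_first_contract.yaml"
jobs:
  export:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install 'datacontract-cli[all]'
      # Regenerate the DDL from the contract and keep it as an artifact
      - run: datacontract export my_first_contract.yaml --format sql > orders_ddl.sql
      - uses: actions/upload-artifact@v4
        with:
          name: orders-ddl
          path: orders_ddl.sql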
By leveraging the export command, your data contracts become more than just validation rules; they become generative assets that streamline your data engineering workflows and improve data discoverability and understanding across your organization.
The Power of the datacontract CLI in Your Data Stack
You've now walked through the essential steps of defining, linting, testing, detecting breaking changes, and exporting data contracts with the datacontract CLI. As an expert data engineer, I can confidently say that incorporating data contracts into your workflow isn't just a best practice; it's a strategic imperative for building resilient and trustworthy data platforms.
Recap of Benefits: Standardization, Quality, Trust, Automation
Let's quickly recap the immense power that the datacontract CLI brings to your data stack:
- Standardization: Data contracts provide a consistent, machine-readable format to define expectations for all your data products, reducing ambiguity and fostering a common language across teams.
- Quality: By enforcing schema and quality rules directly against your actual data, you catch issues early, before they escalate into costly downstream failures. The CLI acts as your automated data quality guardian.
- Trust: When consumers know that data products are backed by explicit contracts and automated validation, their trust in the data increases dramatically. This leads to more confident decision-making and faster innovation.
- Automation: The datacontract CLI is designed for automation. Its commands integrate seamlessly into CI/CD pipelines, allowing you to automatically validate contracts, detect breaking changes, and even generate DDL, reducing manual effort and improving efficiency.
Next Steps: Explore Advanced Features, Integrate into Your Workflow
This post has covered the hands-on basics, but the datacontract CLI offers more to explore:
- Advanced Quality Checks: Dive deeper into more complex custom_sql rules or explore other built-in quality assertion types.
- Different Server Types: Experiment with connecting to various data sources like BigQuery, Snowflake, or PostgreSQL by configuring their respective servers blocks and environment variables.
- Full CI/CD Integration: Take the breaking and test commands and embed them fully into your data pipelines' CI/CD workflows for continuous validation and governance.
- Community and Documentation: Stay updated with the latest features and best practices by consulting the official cli.datacontract.com documentation and engaging with the community.
By embracing data contracts with the datacontract CLI, you're not just documenting your data; you're actively governing it, transforming it into a reliable asset that truly empowers your organization. Start small, integrate incrementally, and watch your data quality and team efficiency soar!