By: ALEA on Tue Apr 15 2025

KL3M Data Project: Copyright-Clean AI Training Resources

Introducing the KL3M Data Project: a comprehensive collection of legally sound training resources for large language models spanning 132+ million documents.

KL3M Data Project Token Histogram

Introducing the KL3M Data Project

Today, we’re excited to announce the release of the KL3M Data Project, a comprehensive collection of copyright-clean training resources for large language models. Our research paper details how we’ve created legally sound resources spanning 132,349,390 documents across 16 different data sources.

Explore the Data

Visit our interactive data gallery to explore individual documents, or browse our public datasets on HuggingFace.

Recent litigation and regulatory scrutiny have highlighted significant legal uncertainties around AI training data:

  • Copyright concerns: Using copyrighted works without permission
  • Contract violations: Overlooking terms of service and license agreements
  • Lack of transparency: Failing to disclose data sources and provenance
  • Uncertain legal defenses: Relying on untested “fair use” arguments

These issues create significant risks for AI developers and users alike, potentially leading to legal liability, model removal, or retraining requirements.

The KL3M Data Project takes a fundamentally different approach:

  • Systematic legal assessment: Comprehensive review of legal risks for each data source
  • Focus on government documents: Prioritizing works expressly exempt from copyright
  • Public domain works: Utilizing content where copyright has expired or been waived
  • Complete pipelines: Releasing all code and processes for full transparency
  • CC-BY licensing: Making all resources available under permissive terms

By building on positive legal rights and consent rather than uncertain fair use arguments, we establish an alternative paradigm for ethical AI data collection.

Complete Three-Stage Data Pipeline

What sets the KL3M Data Project apart is our commitment to releasing the complete data pipeline at all stages. This public release represents a snapshot of our ongoing collection efforts, with new documents and resources being added daily:

  1. Original Documents (~28 TB compressed)

    • Raw HTML, PDF, XML, JSON files exactly as collected
    • Complete with metadata and original formatting
    • Archived for long-term research access
    • Compressed using industry-standard algorithms for efficient storage
  2. Extracted Content

    • Cleaned text, markdown, and structured content
    • Consistently processed using documented extraction methods
    • Preserves critical semantic information
    • Highly compressible structured formats for efficient distribution
  3. Pre-tokenized Resources

    • Contains over 1.35 trillion tokens (potentially 2-3x more with different tokenizers)
    • Tokenized with various vocabularies (32k, 64k, 128k)
    • Ready for immediate model training
    • Available in efficient formats on HuggingFace
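The note above that a different tokenizer could yield 2-3x more tokens reflects a general property of tokenization: the same text segments into very different counts under different vocabularies. A minimal, purely illustrative sketch (the real KL3M tokenizers are BPE vocabularies of 32k/64k/128k entries; this toy compares word-level and character-level segmentation only to demonstrate the counting effect):

```python
# Toy illustration: token counts depend on the tokenizer. The same text
# yields different counts under word-level vs character-level segmentation.
# This does NOT reproduce the KL3M BPE tokenizers; it only shows why a
# corpus's token total is tokenizer-relative.

def word_tokens(text: str) -> list[str]:
    """Coarse segmentation: split on whitespace."""
    return text.split()

def char_tokens(text: str) -> list[str]:
    """Fine segmentation: every non-space character is a token."""
    return [c for c in text if not c.isspace()]

text = "The Congress shall have Power to lay and collect Taxes"
n_word = len(word_tokens(text))
n_char = len(char_tokens(text))
print(n_word, n_char)  # the finer vocabulary produces several times more tokens
```

A smaller vocabulary must cover the same text with more, shorter pieces, which is why corpus-level token totals are only comparable under a fixed tokenizer.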

This approach enables unprecedented research transparency, reproducibility, and auditing capabilities. Researchers can trace any model output back to source documents, verify processing methods, and develop alternative extraction approaches.

Unprecedented Scale and Diversity

The KL3M Data Project encompasses materials at a scale suitable for serious AI development:

Total Documents: 132,349,390 (growing daily)
Storage Size: ~28 TB compressed (as of April 2025)
Token Count: 1.35 trillion across 16 sources

Our resources span diverse domains:

  • Federal and state court opinions and decisions
  • Regulatory filings and government publications
  • U.S. federal statutory code and regulations
  • Corporate agreements and financial disclosures
  • Patent applications and documents
  • Official publications from the EU and UK

The document collection includes texts of all lengths, with a mean of 6,237 tokens and median of 1,855 tokens per document. Over 200,000 documents exceed 100,000 tokens in length, making this dataset valuable for long-context model training.
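Summary statistics like those above are straightforward to recompute over any shard of the data. A minimal sketch using Python's standard library, with a hypothetical list of per-document token counts standing in for a real shard (the real corpus-wide figures are the mean 6,237 / median 1,855 reported above):

```python
import statistics

# Hypothetical per-document token counts standing in for one shard of the
# corpus; real values would come from the pre-tokenized parquet data.
token_counts = [412, 1855, 980, 6237, 152_000, 3400, 101_250, 760]

mean_len = statistics.mean(token_counts)
median_len = statistics.median(token_counts)

# Documents over 100k tokens are the long-context training candidates
# highlighted above.
long_context = [n for n in token_counts if n > 100_000]

print(round(mean_len), median_len, len(long_context))
```

As in the real corpus, a handful of very long documents pulls the mean well above the median.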

Here are some highlights from our collections currently available on HuggingFace:

KL3M Data Snapshot (March 2024)

A comprehensive snapshot of our data containing 57.8M rows of tokenized documents from various sources, ready for model training.

EDGAR Agreements

1.45M corporate agreements extracted from SEC filings, providing rich examples of legal and financial language.

KL3M SFT Hearings Sample

A sample dataset containing 3,491,369 rows from government records, demonstrating how the data can be structured for supervised fine-tuning tasks.
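Structuring documents for supervised fine-tuning, as the hearings sample does, amounts to mapping each source text into an instruction/response pair. A hypothetical sketch of that shaping step (the field names `instruction` and `response`, the prompt template, and the example hearing are all illustrative assumptions, not the actual schema of the KL3M SFT sample):

```python
# Hypothetical sketch: shape a source document into an SFT record.
# Field names and prompt wording are assumptions for illustration only.

def to_sft_record(title: str, body: str) -> dict[str, str]:
    """Pair an instruction derived from the document title with the
    document body as the target response."""
    return {
        "instruction": f"Summarize the following hearing record: {title}",
        "response": body,
    }

record = to_sft_record(
    "Hearing on Data Privacy, 118th Congress",
    "The committee convened to consider ...",
)
print(sorted(record.keys()))
```

The same mapping pattern applies whether the target task is summarization, extraction, or question answering; only the instruction template changes.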

Access to All Pipeline Stages

We’ve made the KL3M Data Project accessible at each processing stage through multiple channels:

  1. Amazon S3: s3://data.kl3m.ai/

    • Original documents: /documents/{source}/
    • Extracted content: /representations/{source}/
    • Tokenized data: /parquet/{source}/
  2. HuggingFace:

    • Pre-tokenized datasets optimized for immediate model training
    • Metadata linking back to original document sources
    • Sample datasets demonstrating various applications
  3. Interactive Gallery:

    • Web-based exploration tool at gallery.kl3m.ai
    • View original documents alongside their extracted content
    • Analyze token distributions and statistics
  4. GitHub:

    • Source code for the entire pipeline
    • Documentation for all processing steps
    • Tools to verify data integrity and provenance
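The S3 layout listed above follows one prefix per pipeline stage per source, so object prefixes can be composed programmatically. A small sketch built only from the bucket and stage prefixes given in the post (the example source names like "edgar" are illustrative):

```python
# Compose S3 prefixes for each pipeline stage, following the layout
# described above: s3://data.kl3m.ai/ with /documents/, /representations/,
# and /parquet/ prefixes per source. Source names here are illustrative.

BUCKET = "data.kl3m.ai"
STAGES = {
    "original": "documents",       # raw HTML/PDF/XML/JSON as collected
    "extracted": "representations",  # cleaned text and markdown
    "tokenized": "parquet",        # pre-tokenized training data
}

def stage_prefix(stage: str, source: str) -> str:
    """Return the S3 prefix for a given pipeline stage and data source."""
    return f"s3://{BUCKET}/{STAGES[stage]}/{source}/"

print(stage_prefix("tokenized", "edgar"))
# s3://data.kl3m.ai/parquet/edgar/
```

With prefixes in hand, any standard S3 client (the AWS CLI, boto3, etc.) can list and fetch the corresponding objects.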

Applications and Use Cases

The KL3M Data Project supports a wide range of AI development needs:

  • Pre-training foundational models with legally sound data
  • Fine-tuning specialized models for legal, financial, or government applications
  • Creating sophisticated retrieval systems for specific domains
  • Benchmarking model performance on real-world documents
  • Researching document characteristics and linguistic patterns

Get Started Today

Ready to explore legally and ethically sound AI training data?

We invite researchers, developers, and AI practitioners to build upon these resources as we work toward more ethical, legally sound AI development practices. The KL3M Data Project is a living dataset that continues to grow, with new documents, datasets, and tools being released regularly.