By: ALEA on Tue Apr 15 2025

KL3M Data Project: Copyright-Clean AI Training Resources

Introducing the KL3M Data Project: a comprehensive collection of legally sound training resources for large language models spanning 132+ million documents.

KL3M Data Project Token Histogram

Introducing the KL3M Data Project

Today, we’re excited to announce the release of the KL3M Data Project, a comprehensive collection of copyright-clean training resources for large language models. Our research paper details how we’ve created legally sound resources spanning 132,349,390 documents across 16 different data sources.

Explore the Data

Visit our interactive data gallery to explore individual documents, or browse our public datasets on HuggingFace.

Recent litigation and regulatory scrutiny have highlighted significant legal uncertainties around AI training data:

  • Copyright concerns: Using copyrighted works without permission
  • Contract violations: Overlooking terms of service and license agreements
  • Lack of transparency: Failing to disclose data sources and provenance
  • Uncertain legal defenses: Relying on untested “fair use” arguments

These issues create significant risks for AI developers and users alike, potentially leading to legal liability, model removal, or retraining requirements.

The KL3M Data Project takes a fundamentally different approach:

  • Systematic legal assessment: Comprehensive review of legal risks for each data source
  • Focus on government documents: Prioritizing works expressly exempt from copyright
  • Public domain works: Utilizing content where copyright has expired or been waived
  • Complete pipelines: Releasing all code and processes for full transparency
  • CC-BY licensing: Making all resources available under permissive terms

By building on positive legal rights and consent rather than uncertain fair use arguments, we establish an alternative paradigm for ethical AI data collection.

Complete Three-Stage Data Pipeline

What sets the KL3M Data Project apart is our commitment to releasing the complete data pipeline at all stages. This public release represents a snapshot of our ongoing collection efforts, with new documents and resources being added daily:

  1. Original Documents (~28 TB compressed)

    • Raw HTML, PDF, XML, JSON files exactly as collected
    • Complete with metadata and original formatting
    • Archived for long-term research access
    • Compressed using industry-standard algorithms for efficient storage
  2. Extracted Content

    • Cleaned text, markdown, and structured content
    • Consistently processed using documented extraction methods
    • Preserves critical semantic information
    • Highly compressible structured formats for efficient distribution
  3. Pre-tokenized Resources

    • Contains over 1.35 trillion tokens (potentially 2-3x more with different tokenizers)
    • Tokenized with various vocabularies (32k, 64k, 128k)
    • Ready for immediate model training
    • Available in efficient formats on HuggingFace
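The note above that a different tokenizer could yield 2-3x more tokens reflects a general property of tokenization: the same text segments into very different counts under different vocabularies. A minimal, purely illustrative sketch (the real KL3M tokenizers are BPE vocabularies of 32k/64k/128k entries; this toy compares word-level and character-level segmentation only to demonstrate the counting effect):

```python
# Toy illustration: token counts depend on the tokenizer. The same text
# yields different counts under word-level vs character-level segmentation.
# This does NOT reproduce the KL3M BPE tokenizers; it only shows why a
# corpus's token total is tokenizer-relative.

def word_tokens(text: str) -> list[str]:
    """Coarse segmentation: split on whitespace."""
    return text.split()

def char_tokens(text: str) -> list[str]:
    """Fine segmentation: every non-space character is a token."""
    return [c for c in text if not c.isspace()]

text = "The Congress shall have Power to lay and collect Taxes"
n_word = len(word_tokens(text))
n_char = len(char_tokens(text))
print(n_word, n_char)  # the finer vocabulary produces several times more tokens
```

A smaller vocabulary must cover the same text with more, shorter pieces, which is why corpus-level token totals are only comparable under a fixed tokenizer.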

This approach enables unprecedented research transparency, reproducibility, and auditing capabilities. Researchers can trace any model output back to source documents, verify processing methods, and develop alternative extraction approaches.

Unprecedented Scale and Diversity

The KL3M Data Project encompasses materials at a scale suitable for serious AI development:

Total Documents: 132,349,390 (growing daily)
Storage Size: ~28 TB compressed (as of April 2025)
Token Count: 1.35 trillion across 16 sources

Our resources span diverse domains:

  • Federal and state court opinions and decisions
  • Regulatory filings and government publications
  • U.S. federal statutory code and regulations
  • Corporate agreements and financial disclosures
  • Patent applications and documents
  • Official publications from the EU and UK

The document collection includes texts of all lengths, with a mean of 6,237 tokens and median of 1,855 tokens per document. Over 200,000 documents exceed 100,000 tokens in length, making this dataset valuable for long-context model training.
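Summary statistics like those above are straightforward to recompute over any shard of the data. A minimal sketch using Python's standard library, with a hypothetical list of per-document token counts standing in for a real shard (the real corpus-wide figures are the mean 6,237 / median 1,855 reported above):

```python
import statistics

# Hypothetical per-document token counts standing in for one shard of the
# corpus; real values would come from the pre-tokenized parquet data.
token_counts = [412, 1855, 980, 6237, 152_000, 3400, 101_250, 760]

mean_len = statistics.mean(token_counts)
median_len = statistics.median(token_counts)

# Documents over 100k tokens are the long-context training candidates
# highlighted above.
long_context = [n for n in token_counts if n > 100_000]

print(round(mean_len), median_len, len(long_context))
```

As in the real corpus, a handful of very long documents pulls the mean well above the median.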

Here are some highlights from our collections currently available on HuggingFace:

KL3M Data Snapshot (March 2024)

A comprehensive snapshot of our data containing 57.8M rows of tokenized documents from various sources, ready for model training.

EDGAR Agreements

1.45M corporate agreements extracted from SEC filings, providing rich examples of legal and financial language.

KL3M SFT Hearings Sample

A sample dataset containing 3,491,369 rows from government records, demonstrating how the data can be structured for supervised fine-tuning tasks.
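Structuring documents for supervised fine-tuning, as the hearings sample does, amounts to mapping each source text into an instruction/response pair. A hypothetical sketch of that shaping step (the field names `instruction` and `response`, the prompt template, and the example hearing are all illustrative assumptions, not the actual schema of the KL3M SFT sample):

```python
# Hypothetical sketch: shape a source document into an SFT record.
# Field names and prompt wording are assumptions for illustration only.

def to_sft_record(title: str, body: str) -> dict[str, str]:
    """Pair an instruction derived from the document title with the
    document body as the target response."""
    return {
        "instruction": f"Summarize the following hearing record: {title}",
        "response": body,
    }

record = to_sft_record(
    "Hearing on Data Privacy, 118th Congress",
    "The committee convened to consider ...",
)
print(sorted(record.keys()))
```

The same mapping pattern applies whether the target task is summarization, extraction, or question answering; only the instruction template changes.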

Access to All Pipeline Stages

We’ve made the KL3M Data Project accessible at each processing stage through multiple channels:

  1. Amazon S3: s3://data.kl3m.ai/

    • Original documents: /documents/{source}/
    • Extracted content: /representations/{source}/
    • Tokenized data: /parquet/{source}/
  2. HuggingFace:

    • Pre-tokenized datasets optimized for immediate model training
    • Metadata linking back to original document sources
    • Sample datasets demonstrating various applications
  3. Interactive Gallery:

    • Web-based exploration tool at gallery.kl3m.ai
    • View original documents alongside their extracted content
    • Analyze token distributions and statistics
  4. GitHub:

    • Source code for the entire pipeline
    • Documentation for all processing steps
    • Tools to verify data integrity and provenance
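The S3 layout listed above follows one prefix per pipeline stage per source, so object prefixes can be composed programmatically. A small sketch built only from the bucket and stage prefixes given in the post (the example source names like "edgar" are illustrative):

```python
# Compose S3 prefixes for each pipeline stage, following the layout
# described above: s3://data.kl3m.ai/ with /documents/, /representations/,
# and /parquet/ prefixes per source. Source names here are illustrative.

BUCKET = "data.kl3m.ai"
STAGES = {
    "original": "documents",       # raw HTML/PDF/XML/JSON as collected
    "extracted": "representations",  # cleaned text and markdown
    "tokenized": "parquet",        # pre-tokenized training data
}

def stage_prefix(stage: str, source: str) -> str:
    """Return the S3 prefix for a given pipeline stage and data source."""
    return f"s3://{BUCKET}/{STAGES[stage]}/{source}/"

print(stage_prefix("tokenized", "edgar"))
# s3://data.kl3m.ai/parquet/edgar/
```

With prefixes in hand, any standard S3 client (the AWS CLI, boto3, etc.) can list and fetch the corresponding objects.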

Applications and Use Cases

The KL3M Data Project supports a wide range of AI development needs:

  • Pre-training foundational models with legally sound data
  • Fine-tuning specialized models for legal, financial, or government applications
  • Creating sophisticated retrieval systems for specific domains
  • Benchmarking model performance on real-world documents
  • Researching document characteristics and linguistic patterns

Get Started Today

Ready to explore legally and ethically sound AI training data?

We invite researchers, developers, and AI practitioners to build upon these resources as we work toward more ethical, legally sound AI development practices. The KL3M Data Project is a living dataset that continues to grow, with new documents, datasets, and tools being released regularly.