KL3M Tokenizers

A Family of Domain-Specific and Character-Level Tokenizers

Specialized tokenizers achieving up to 83% efficiency gains on legal terminology and 39% on financial terminology

Research Overview

Our research introduces the KL3M Tokenizers family, a specialized collection of tokenizers designed to enhance efficiency in domain-specific contexts like legal and financial text processing.

Authored by Michael J. Bommarito II, Daniel Martin Katz, and Jillian Bommarito, this research demonstrates how domain-specific tokenizers can significantly improve token efficiency while preserving semantic coherence.

Key Contributions

Domain-Specific BPE Tokenizers

  • 9-17% fewer tokens than GPT-4o and Llama3 for specialized documents (see the comparison sketch after this list)
  • Up to 83% efficiency improvement for legal terminology
  • Up to 39% efficiency improvement for financial terminology
  • 128K vocabulary size with specialized tokens for domain-specific entities
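
As a rough illustration of the comparison, the sketch below counts tokens for a short legal sentence with a KL3M tokenizer and with GPT-4o's tokenizer via tiktoken. The Hugging Face repo id "alea-institute/kl3m-004-128k-cased" and the sample sentence are assumptions for illustration; substitute the tokenizer id you actually use.

```python
# Minimal sketch: compare token counts on a legal sentence.
# The KL3M repo id below is an assumed identifier; adjust as needed.
from tokenizers import Tokenizer
import tiktoken

text = "The purchaser shall indemnify and hold harmless the seller."

kl3m = Tokenizer.from_pretrained("alea-institute/kl3m-004-128k-cased")  # assumed id
gpt4o = tiktoken.encoding_for_model("gpt-4o")  # general-purpose o200k_base encoding

n_kl3m = len(kl3m.encode(text).ids)
n_gpt4o = len(gpt4o.encode(text))

print(f"KL3M:   {n_kl3m} tokens")
print(f"GPT-4o: {n_gpt4o} tokens")
print(f"Savings vs GPT-4o: {1 - n_kl3m / n_gpt4o:.1%}")
```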

Character-Level BPE Tokenizers

  • 4K, 8K, and 16K vocabulary size variants
  • Designed for text correction tasks like OCR post-processing
  • Maintain consistent token boundaries between error-containing and reference text (see the sketch after this list)
  • Optimized for preprocessing applications
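
A minimal sketch of the boundary-alignment property, assuming one of the character-level variants is published under a repo id like "alea-institute/kl3m-004-char-8k-cased" (an assumed identifier): because the vocabulary is small and near character-level, an OCR error should perturb only the tokens covering the corrupted span, leaving the rest of the noisy and reference sequences aligned.

```python
# Minimal sketch: a character-level tokenizer keeps noisy and clean text
# aligned except at the error site, which is what OCR correction models need.
# The repo id below is an assumed identifier; substitute the published one.
from tokenizers import Tokenizer

char_tok = Tokenizer.from_pretrained("alea-institute/kl3m-004-char-8k-cased")

reference = "the Court finds the motion well taken"
ocr_noisy = "thc Court finds the rnotion well taken"  # OCR confusions: "e"->"c", "m"->"rn"

ref_tokens = char_tok.encode(reference).tokens
err_tokens = char_tok.encode(ocr_noisy).tokens

# Inspect the two sequences: by design, divergence should stay confined to
# the tokens covering the corrupted characters rather than cascading onward.
print(ref_tokens)
print(err_tokens)
```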

Technical Benefits

  • Expanded effective context window utilization (worked example after this list)
  • Reduced computational requirements for inference and fine-tuning
  • Enhanced preservation of semantic coherence
  • Improved efficiency for specialized text
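
To make the context-window benefit concrete: if a tokenizer emits k% fewer tokens for the same text, a fixed window holds 1/(1-k) times as much text. The arithmetic below simply reuses the 9-17% range reported above; the 8,192-token window is an illustrative figure, not a measured result.

```python
# Back-of-the-envelope: fewer tokens per document means more text per window.
CONTEXT_WINDOW = 8192  # illustrative context size in tokens

for savings in (0.09, 0.17):  # the 9-17% range reported above
    multiplier = 1 / (1 - savings)
    print(f"{savings:.0%} fewer tokens -> {multiplier:.2f}x more text "
          f"in a {CONTEXT_WINDOW:,}-token window")
```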

Licensing

Source Code

MIT License

Data & Publications

Creative Commons Attribution 4.0 International (CC-BY 4.0)