KL3M Tokenizers

A Family of Domain-Specific and Character-Level Tokenizers

Specialized tokenizers achieving up to 83% efficiency gains on legal terminology and 39% on financial terminology

Research Overview

Our research introduces the KL3M Tokenizers family, a specialized collection of tokenizers designed to enhance efficiency in domain-specific contexts like legal and financial text processing.

Authored by Michael J. Bommarito II, Daniel Martin Katz, and Jillian Bommarito, this research demonstrates how domain-specific tokenizers can significantly improve token efficiency while preserving semantic coherence.

Key Contributions

Domain-Specific BPE Tokenizers

  • 9-17% fewer tokens than GPT-4o and Llama3 for specialized documents (see the comparison sketch after this list)
  • Up to 83% efficiency improvement for legal terminology
  • Up to 39% efficiency improvement for financial terminology
  • 128K vocabulary size with specialized tokens for domain-specific entities
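
As a rough illustration of the comparison, the sketch below counts tokens for a short legal sentence with a KL3M tokenizer and with GPT-4o's tokenizer via tiktoken. The Hugging Face repo id "alea-institute/kl3m-004-128k-cased" and the sample sentence are assumptions for illustration; substitute the tokenizer id you actually use.

```python
# Minimal sketch: compare token counts on a legal sentence.
# The KL3M repo id below is an assumed identifier; adjust as needed.
from tokenizers import Tokenizer
import tiktoken

text = "The purchaser shall indemnify and hold harmless the seller."

kl3m = Tokenizer.from_pretrained("alea-institute/kl3m-004-128k-cased")  # assumed id
gpt4o = tiktoken.encoding_for_model("gpt-4o")  # general-purpose o200k_base encoding

n_kl3m = len(kl3m.encode(text).ids)
n_gpt4o = len(gpt4o.encode(text))

print(f"KL3M:   {n_kl3m} tokens")
print(f"GPT-4o: {n_gpt4o} tokens")
print(f"Savings vs GPT-4o: {1 - n_kl3m / n_gpt4o:.1%}")
```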

Character-Level BPE Tokenizers

  • 4K, 8K, and 16K vocabulary size variants
  • Designed for text correction tasks like OCR post-processing
  • Maintain consistent token boundaries between error-containing and reference text (see the sketch after this list)
  • Optimized for preprocessing applications
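
A minimal sketch of the boundary-alignment property, assuming one of the character-level variants is published under a repo id like "alea-institute/kl3m-004-char-8k-cased" (an assumed identifier): because the vocabulary is small and near character-level, an OCR error should perturb only the tokens covering the corrupted span, leaving the rest of the noisy and reference sequences aligned.

```python
# Minimal sketch: a character-level tokenizer keeps noisy and clean text
# aligned except at the error site, which is what OCR correction models need.
# The repo id below is an assumed identifier; substitute the published one.
from tokenizers import Tokenizer

char_tok = Tokenizer.from_pretrained("alea-institute/kl3m-004-char-8k-cased")

reference = "the Court finds the motion well taken"
ocr_noisy = "thc Court finds the rnotion well taken"  # OCR confusions: "e"->"c", "m"->"rn"

ref_tokens = char_tok.encode(reference).tokens
err_tokens = char_tok.encode(ocr_noisy).tokens

# Inspect the two sequences: by design, divergence should stay confined to
# the tokens covering the corrupted characters rather than cascading onward.
print(ref_tokens)
print(err_tokens)
```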

Technical Benefits

  • Expanded effective context window utilization (worked example after this list)
  • Reduced computational requirements for inference and fine-tuning
  • Enhanced preservation of semantic coherence
  • Improved efficiency for specialized text
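
To make the context-window benefit concrete: if a tokenizer emits k% fewer tokens for the same text, a fixed window holds 1/(1-k) times as much text. The arithmetic below simply reuses the 9-17% range reported above; the 8,192-token window is an illustrative figure, not a measured result.

```python
# Back-of-the-envelope: fewer tokens per document means more text per window.
CONTEXT_WINDOW = 8192  # illustrative context size in tokens

for savings in (0.09, 0.17):  # the 9-17% range reported above
    multiplier = 1 / (1 - savings)
    print(f"{savings:.0%} fewer tokens -> {multiplier:.2f}x more text "
          f"in a {CONTEXT_WINDOW:,}-token window")
```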

Licensing

Source Code

MIT License

Data & Publications

Creative Commons Attribution 4.0 International (CC-BY 4.0)