A Family of Domain-Specific and Character-Level Tokenizers
Specialized tokenizers achieving up to 83% efficiency gains for legal and financial NLP

Our research introduces the KL3M Tokenizers family, a collection of domain-specific and character-level tokenizers designed to tokenize legal and financial text more efficiently.
Authored by Michael J. Bommarito II, Daniel Martin Katz, and Jillian Bommarito, the research demonstrates that domain-specific tokenizers can improve token efficiency by up to 83% while preserving semantic coherence.
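To make the notion of token efficiency concrete, the following is a minimal sketch that compares how many tokens two vocabularies need for the same legal sentence. It is an illustration under stated assumptions, not the KL3M implementation: both vocabularies are hypothetical toy sets, the greedy longest-match tokenizer is a stand-in for a trained tokenizer, and "efficiency gain" is assumed here to mean the relative reduction in token count.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization with a single-character fallback.

    Illustrative stand-in only; this is not how KL3M tokenizers are built.
    """
    max_len = max(len(entry) for entry in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in vocab:
                tokens.append(piece)
                i += n
                break
    return tokens


def relative_savings(baseline_count, candidate_count):
    """Fraction of tokens saved relative to the baseline tokenizer."""
    return 1 - candidate_count / baseline_count


# Hypothetical toy vocabularies (assumptions, not the real KL3M vocabularies).
GENERAL_VOCAB = {"the", " the", " shall", "ing", "or", "ee"}
DOMAIN_VOCAB = GENERAL_VOCAB | {" lessee", " lessor", " indemnify"}

text = "the lessee shall indemnify the lessor"

general_tokens = greedy_tokenize(text, GENERAL_VOCAB)
domain_tokens = greedy_tokenize(text, DOMAIN_VOCAB)
savings = relative_savings(len(general_tokens), len(domain_tokens))

print(f"general vocabulary: {len(general_tokens)} tokens")
print(f"domain vocabulary:  {len(domain_tokens)} tokens")
print(f"relative savings:   {savings:.0%}")
```

Because the domain vocabulary contains whole legal terms, the same sentence is covered by fewer, longer tokens, which is the effect a domain-specific tokenizer aims for on real legal and financial corpora.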
MIT License
Creative Commons Attribution 4.0 International (CC-BY 4.0)