Introducing NUPunkt and CharBoundary: two specialized libraries that dramatically improve sentence boundary detection in legal documents.
We’re excited to announce the release of two specialized libraries for legal sentence boundary detection: NUPunkt and CharBoundary, along with a comprehensive benchmark dataset.
Try It Yourself
Visit our interactive visualization tool to test these models with your own legal text and see how they compare to standard approaches.
Our research paper, “Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary”, documents the significant improvements these tools bring to legal text analysis.
Accurate sentence boundary detection is a fundamental challenge for natural language processing systems, particularly in specialized domains like law. Legal text presents unique challenges: dense citations (Smith v. Jones, 123 F.2d 456), abbreviations that end in periods (U.S.C., Corp., Inc., et seq.), section symbols, and parenthetical numbering like Rule 12(b)(6), all of which can masquerade as, or obscure, true sentence boundaries.
When legal text is incorrectly segmented at sentence boundaries, critical information is fragmented, context is lost, and downstream applications like retrieval-augmented generation (RAG) systems suffer dramatically reduced performance.
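To see why, consider what a naive period-based splitter does to the citation-heavy example used later in this post. This is a minimal illustration using only the Python standard library, not any of the approaches discussed here:

import re

text = "See Smith v. Jones, 123 F.2d 456, 789 (7th Cir. 2010). The court cited U.S.C. § 101 et seq. in its analysis."

# Naively split after any period followed by whitespace
fragments = re.split(r"(?<=\.)\s+", text)
print(fragments)
# ['See Smith v.', 'Jones, 123 F.2d 456, 789 (7th Cir.', '2010).', 'The court cited U.S.C.', '§ 101 et seq.', 'in its analysis.']

Two sentences become six fragments, with breaks inside the case name, the reporter citation, and the statutory reference.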
We’re addressing this challenge with two different approaches, each with unique advantages:
NUPunkt is a pure Python library with zero external dependencies, making it ideal for high-throughput production environments:
pip install nupunkt
from nupunkt import sent_tokenize
text = "See Smith v. Jones, 123 F.2d 456, 789 (7th Cir. 2010). The court cited U.S.C. § 101 et seq. in its analysis."
sentences = sent_tokenize(text)
print(sentences)
# ['See Smith v. Jones, 123 F.2d 456, 789 (7th Cir. 2010).', 'The court cited U.S.C. § 101 et seq. in its analysis.']
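Because there are no dependencies or model files to manage, NUPunkt drops easily into batch pipelines. Here is a minimal sketch of corpus-level segmentation; the segment_corpus helper and the in-memory corpus are illustrative, not part of the library:

from nupunkt import sent_tokenize

def segment_corpus(documents):
    # Yield (doc_id, sentences) for each (doc_id, text) pair
    for doc_id, text in documents:
        yield doc_id, sent_tokenize(text)

corpus = [("opinion-001", "The motion is denied. Costs are awarded to plaintiff.")]
for doc_id, sentences in segment_corpus(corpus):
    print(doc_id, sentences)
# opinion-001 ['The motion is denied.', 'Costs are awarded to plaintiff.']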
CharBoundary takes a character-level machine learning approach for even greater accuracy:
pip install "charboundary[onnx]"
from charboundary import get_large_onnx_segmenter
segmenter = get_large_onnx_segmenter()
text = "Pursuant to Rule 12(b)(6), defendant Corp. Inc. moves to dismiss. See Twombly, 550 U.S. at 555-56."
# Get sentence boundaries with character-level precision
spans = segmenter.get_sentence_spans(text)
print(spans)
# [(0, 65), (65, 98)]
# Get the actual sentences
sentences = segmenter.segment_to_sentences(text)
print(sentences)
# ['Pursuant to Rule 12(b)(6), defendant Corp. Inc. moves to dismiss.', 'See Twombly, 550 U.S. at 555-56.']
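Because the spans are plain character offsets into the original string, you can also slice sentences back out yourself, which matters when a downstream index needs to map each sentence to its exact position in the source document. A small follow-on to the example above, assuming the spans are half-open (start, end) offsets as the output suggests:

# Recover each sentence by slicing the original text;
# strip() removes the inter-sentence whitespace the spans carry
for start, end in spans:
    print(text[start:end].strip())
# Pursuant to Rule 12(b)(6), defendant Corp. Inc. moves to dismiss.
# See Twombly, 550 U.S. at 555-56.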
To enable further research in this area, we’ve released a high-quality benchmark dataset of legal text annotated with sentence boundaries. The dataset is available on HuggingFace.
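If you work with the HuggingFace datasets library, loading it looks like the following sketch; the dataset identifier below is a placeholder for illustration, so substitute the actual ID from the HuggingFace listing:

from datasets import load_dataset

# Placeholder ID shown for illustration; use the real identifier from HuggingFace
dataset = load_dataset("your-org/legal-sentence-boundaries")
print(dataset)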
Our research shows that improvements in sentence boundary detection have a multiplicative effect on downstream applications: when a citation or statutory reference is split mid-sentence, every retrieval and generation step built on those fragments inherits the error.
Both libraries are available now under the MIT license.
We invite researchers, legal technologists, and developers to try these tools, contribute to their development, and build upon this work to advance the state of legal natural language processing.