By: ALEA on Tue Apr 08 2025

Improving Legal Text Analysis with Precise Sentence Boundary Detection

Introducing NUPunkt and CharBoundary: two specialized libraries that dramatically improve sentence boundary detection in legal documents.

[Figure: Legal sentence boundary detection precision and recall chart]

Solving a Foundational Problem in Legal NLP

We’re excited to announce the release of two specialized libraries for legal sentence boundary detection: NUPunkt and CharBoundary, along with a comprehensive benchmark dataset.

Try It Yourself

Visit our interactive visualization tool to test these models with your own legal text and see how they compare to standard approaches.

Our research paper, “Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary”, documents the significant improvements these tools bring to legal text analysis.

Why This Matters

Accurate sentence boundary detection is a fundamental challenge for natural language processing systems, particularly in specialized domains like law. Legal text presents unique challenges with:

  • Complex citation formats (e.g., “See Smith v. Jones, 123 F.2d 456 (7th Cir. 2010).”)
  • Domain-specific abbreviations (e.g., “U.S.C.”, “Fed. R. Civ. P.”)
  • Intricate nested sentence structures

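To see why this is hard, here is a minimal sketch, illustrative only and not any particular library's behavior, of how a naive period-based splitter mangles a citation:

import re

# Naive approach: split whenever a period, question mark, or exclamation
# point is followed by whitespace.
text = "See Smith v. Jones, 123 F.2d 456 (7th Cir. 2010). The motion is denied."
naive = re.split(r"(?<=[.!?])\s+", text)
print(naive)
# ['See Smith v.', 'Jones, 123 F.2d 456 (7th Cir.', '2010).', 'The motion is denied.']
# The periods in "v." and "Cir." are mistaken for sentence boundaries,
# fragmenting the citation into meaningless pieces.
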
When legal text is incorrectly segmented at sentence boundaries, critical information is fragmented, context is lost, and downstream applications like retrieval-augmented generation (RAG) systems suffer dramatically reduced performance.

Two Complementary Solutions

We’re addressing this challenge with two different approaches, each with unique advantages:

NUPunkt

NUPunkt is a pure Python library with zero external dependencies, making it ideal for high-throughput production environments:

  • Blazing Fast: Processes millions of characters per second on CPU
  • High Precision: Achieves 91.1% precision, a 29-32% improvement over standard tools
  • Zero Dependencies: Just Python 3.11+, nothing else required
  • Easy Installation: pip install nupunkt
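
Usage is straightforward:
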
from nupunkt import sent_tokenize

text = "See Smith v. Jones, 123 F.2d 456, 789 (7th Cir. 2010). The court cited U.S.C. § 101 et seq. in its analysis."
sentences = sent_tokenize(text)
print(sentences)
# Both the citation and the "et seq." abbreviation survive intact:
# ['See Smith v. Jones, 123 F.2d 456, 789 (7th Cir. 2010).', 'The court cited U.S.C. § 101 et seq. in its analysis.']

CharBoundary

CharBoundary takes a character-level machine learning approach for even greater accuracy:

  • Multiple Models: Three model sizes (small, medium, large) to balance speed and accuracy
  • State-of-the-Art Performance: Large model achieves F1 score of 0.782
  • Respectable Speed: Hundreds of thousands of characters per second on CPU
  • ONNX Acceleration: Up to 2.1x faster inference with ONNX optimizations
  • Easy Installation: pip install charboundary[onnx]
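
Usage looks like this, using the largest ONNX-accelerated model:
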
from charboundary import get_large_onnx_segmenter

segmenter = get_large_onnx_segmenter()
text = "Pursuant to Rule 12(b)(6), defendant Corp. Inc. moves to dismiss. See Twombly, 550 U.S. at 555-56."

# Get sentence boundaries with character-level precision
spans = segmenter.get_sentence_spans(text)
print(spans)
# [(0, 65), (65, 98)]

# Get the actual sentences
sentences = segmenter.segment_to_sentences(text)
print(sentences)
# ['Pursuant to Rule 12(b)(6), defendant Corp. Inc. moves to dismiss.', 'See Twombly, 550 U.S. at 555-56.']
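
If throughput matters more than the last point of F1, the smaller models can be swapped in. A brief sketch, assuming the small and medium factory helpers follow the same naming pattern as get_large_onnx_segmenter (verify the exact names against the charboundary documentation):

from charboundary import get_small_onnx_segmenter

# ASSUMPTION: this helper name is inferred by analogy with
# get_large_onnx_segmenter; check the package docs for the exact name.
fast_segmenter = get_small_onnx_segmenter()
sentences = fast_segmenter.segment_to_sentences(
    "The court granted the motion. Costs were awarded to plaintiff."
)
print(sentences)
# Expected: ['The court granted the motion.', 'Costs were awarded to plaintiff.']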

Benchmark Dataset

To enable further research in this area, we’ve released a high-quality benchmark dataset:

  • 45,739 text examples
  • 107,346 sentence tags
  • 97,667 paragraph tags
  • Derived from the KL3M legal document corpus
  • Creative Commons licensed

This dataset is available on HuggingFace.
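
A hedged sketch of loading it with the Hugging Face datasets library; the repository ID below is a placeholder, not the real one, so substitute the actual ID from the HuggingFace page:

from datasets import load_dataset

# NOTE: "alea-institute/legal-sbd-benchmark" is a hypothetical placeholder;
# use the actual repository ID from the HuggingFace page.
dataset = load_dataset("alea-institute/legal-sbd-benchmark")
print(dataset)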

The Impact

Our research shows that improvements in sentence boundary detection have a multiplicative effect on downstream applications:

  • Enhanced RAG Performance: More coherent chunks lead to better retrieval (see the sketch after this list)
  • Improved Knowledge Extraction: Legal concepts remain intact across sentence boundaries
  • More Accurate Summarization: Complete sentences provide better input for summarization models
  • Faster Processing: Optimized algorithms reduce computational overhead
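
As one concrete illustration of the RAG point above, here is a minimal sketch, not from the paper, of sentence-aware chunking: whole sentences are packed into chunks up to a character budget, so no chunk ever splits a citation or sentence mid-boundary.

from nupunkt import sent_tokenize

def chunk_by_sentence(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for sentence in sent_tokenize(text):
        # Start a new chunk rather than splitting a sentence across chunks.
        if current and size + len(sentence) + 1 > max_chars:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(sentence)
        size += len(sentence) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks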

Get Started Today

Both libraries are available now under the MIT license:

  • NUPunkt: pip install nupunkt
  • CharBoundary: pip install charboundary[onnx]

We invite researchers, legal technologists, and developers to try these tools, contribute to their development, and build upon this work to advance the state of legal natural language processing.