Words To Data - Convert Legal Documents Into Diffable Data Structures
Overview
words_to_data parses US Code titles and Public Laws (bills) from USLM XML format, providing structured access to legislative text, the ability to track changes between document versions, and tools for annotating how bills amend existing law.
Available for both Rust and Python, with a high-performance Rust core and ergonomic Python bindings built with PyO3.
Features
- Dataset-centric workflow - Manage versioned legal documents, bills, and annotations in a single structure
- Parse USC and Public Law documents - Extract hierarchical structure from USLM XML files
- Rich text content - Capture heading, chapeau, proviso, content, and continuation fields
- Bill amendment extraction - Identify USC references and amending actions from bills
- Hierarchical diffing - Compute word-level differences between document versions
- Congress data integration - Fetch bill metadata and text from Congress.gov API
- Python bindings - Full API access from Python with PyO3
Installation
Rust
Add to your Cargo.toml:
[dependencies]
words_to_data = "0.3.0"
Python
Note: Pre-built wheels are available for Linux x86_64. Other platforms will build from source (requires Rust toolchain).
Getting Data
- Title data: https://uscode.house.gov/download/download.shtml
- Bill data: https://congress.gov
Quick Start
Dataset Workflow
The Dataset is the primary abstraction for working with versioned legal documents. It holds document versions, bills, and annotations together.
Rust:
use ;
use parse_bill_amendments;
Python:
=
=
# Add document versions
# Add bill
=
# Compute diff
=
# Navigate to specific section
=
# Save dataset
Download from Congress API
Bills can be fetched automatically, along with additional metadata, from the Congress.gov API.
Rust:
use CongressClient;
use ;
Python:
# Create client with API key
=
# Download bill data
=
# Create dataset and load bill
=
=
# Load bill into dataset
=
# Access bill data
=
Core Concepts
Dataset
A Dataset groups the following components:
- DatasetMetadata: Name, description, author, license, version
- VersionSnapshot: A document tree at a specific point in time
- Bills: Parsed bill data with extracted amendments
- Annotations: Links diff paths to bill amendments with verification status
Use Dataset to load documents, compute diffs, and build training data for ML models.
USLM Elements
Documents are represented as trees of USLMElement structures. Each element contains:
- ElementData: Metadata, text content, and identification
- Children: Nested child elements forming the document hierarchy
The library uses two types of paths:
- Structural Path: Full hierarchy including all elements. Example: uscode/title_26/subtitle_A/chapter_1/section_174
- USLM ID: Official USLM identifier (excludes structural-only elements). Example: /us/usc/t26/s174/a/1
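To make the USLM ID format concrete, the sketch below splits an identifier into its title, section, and lower-level parts. It is a standalone illustration and not part of the words_to_data API: the function name parse_uslm_id and the returned dictionary keys are hypothetical, and it assumes the t prefix marks a title and the s prefix marks a section, as in the example above.

```python
def parse_uslm_id(uslm_id: str) -> dict:
    """Split a US Code USLM identifier into title, section, and sub-levels.

    Hypothetical helper for illustration only; not part of words_to_data.
    """
    parts = uslm_id.strip("/").split("/")
    # Expect the /us/usc/ prefix followed by a title segment such as "t26".
    if len(parts) < 3 or parts[:2] != ["us", "usc"] or not parts[2].startswith("t"):
        raise ValueError(f"not a US Code identifier: {uslm_id!r}")

    title = parts[2][1:]
    rest = parts[3:]

    # A section segment looks like "s174" (assumed: "s" followed by a digit).
    section = None
    if rest and rest[0].startswith("s") and rest[0][1:2].isdigit():
        section = rest[0][1:]
        rest = rest[1:]

    # Whatever remains names subsections, paragraphs, and so on.
    return {"title": title, "section": section, "levels": rest}


print(parse_uslm_id("/us/usc/t26/s174/a/1"))
```

Here the example ID resolves to title 26, section 174, with the remaining levels a and 1.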
Text Content Fields
Each element can contain up to five distinct text fields:
- Heading: Section or subsection title
- Chapeau: Opening text before enumerated items
- Proviso: Conditional or qualifying clauses
- Content: Main body text
- Continuation: Text appearing after child elements
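One way to picture the five fields is as a record in which each field is optional. The sketch below is an illustration only: the class name TextContent and the populated helper are hypothetical and do not reflect the library's actual types, and the sample text is invented rather than quoted from the US Code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextContent:
    """Hypothetical container for the five text fields described above."""
    heading: Optional[str] = None       # section or subsection title
    chapeau: Optional[str] = None       # opening text before enumerated items
    proviso: Optional[str] = None       # conditional or qualifying clause
    content: Optional[str] = None       # main body text
    continuation: Optional[str] = None  # text appearing after child elements

    def populated(self) -> dict:
        """Return only the fields that are actually set."""
        return {name: value for name, value in vars(self).items() if value is not None}


# Invented sample: an element that introduces a list and closes it afterwards.
element_text = TextContent(
    heading="Definitions",
    chapeau="For purposes of this section, the term means",
    continuation="as determined by the Secretary.",
)
print(list(element_text.populated()))
```

An element that introduces enumerated items would typically fill the chapeau and perhaps the continuation, while a leaf element would mostly use content.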
Diffs
The TreeDiff structure mirrors the element hierarchy and tracks:
- Field changes: Word-level differences in text content fields
- Added elements: New child elements in the newer version
- Removed elements: Elements that existed in the older version
- Child diffs: Recursive diffs for matching child elements
Diffs are computed using word-level granularity via the similar crate.
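The shape of such a diff can be sketched in a few lines of Python. This is not the library's implementation: it uses the standard-library difflib in place of the similar crate, represents elements as plain dictionaries, and assumes children are matched by key. The names word_diff and tree_diff are hypothetical.

```python
from difflib import SequenceMatcher


def word_diff(old: str, new: str) -> list:
    """Return (operation, text) pairs describing a word-level diff."""
    a, b = old.split(), new.split()
    ops = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("equal", " ".join(a[i1:i2])))
            continue
        # "replace" is reported as a delete followed by an insert.
        if i1 < i2:
            ops.append(("delete", " ".join(a[i1:i2])))
        if j1 < j2:
            ops.append(("insert", " ".join(b[j1:j2])))
    return ops


def tree_diff(old: dict, new: dict) -> dict:
    """Recursively diff two elements shaped like {"fields": {...}, "children": {...}}."""
    field_changes = {}
    for name in sorted(set(old["fields"]) | set(new["fields"])):
        before = old["fields"].get(name, "")
        after = new["fields"].get(name, "")
        if before != after:
            field_changes[name] = word_diff(before, after)

    old_kids, new_kids = old["children"], new["children"]
    shared = old_kids.keys() & new_kids.keys()
    return {
        "field_changes": field_changes,
        "added": sorted(new_kids.keys() - old_kids.keys()),
        "removed": sorted(old_kids.keys() - new_kids.keys()),
        "children": {key: tree_diff(old_kids[key], new_kids[key]) for key in sorted(shared)},
    }


print(word_diff("shall be 5 years", "shall be 15 years"))
```

Keeping the diff in the same tree shape as the document means a change can be located by the same path used to navigate to the element.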
Amending Actions
Bills can perform these operations on existing code:
Amend, Add, Delete, Insert, Redesignate, Repeal, Move, Strike, StrikeAndInsert
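As a rough illustration, the actions can be modeled as an enumeration, with a naive keyword heuristic that guesses the action from amendment language. The heuristic is an assumption made for this sketch and says nothing about how the library actually extracts actions; the names AmendingAction and guess_action are hypothetical, and Delete and Move are deliberately left unmatched.

```python
from enum import Enum
from typing import Optional


class AmendingAction(Enum):
    """The amending actions listed above, modeled as a plain enum."""
    AMEND = "Amend"
    ADD = "Add"
    DELETE = "Delete"
    INSERT = "Insert"
    REDESIGNATE = "Redesignate"
    REPEAL = "Repeal"
    MOVE = "Move"
    STRIKE = "Strike"
    STRIKE_AND_INSERT = "StrikeAndInsert"


def guess_action(text: str) -> Optional[AmendingAction]:
    """Guess an action from keywords; a toy heuristic, not the library's logic."""
    lowered = text.lower()
    # Check the most specific pattern first so it is not shadowed.
    if "striking" in lowered and "inserting" in lowered:
        return AmendingAction.STRIKE_AND_INSERT
    if "striking" in lowered:
        return AmendingAction.STRIKE
    if "inserting" in lowered:
        return AmendingAction.INSERT
    if "redesignat" in lowered:
        return AmendingAction.REDESIGNATE
    if "repealed" in lowered:
        return AmendingAction.REPEAL
    if "adding" in lowered:
        return AmendingAction.ADD
    if "amended" in lowered:
        return AmendingAction.AMEND
    return None


print(guess_action('is amended by striking "5" and inserting "15"'))
```

Real bill text is far more varied than these keywords suggest, so a heuristic like this is only useful for seeing how the categories differ.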
API Documentation
Rust
Generate and view the full API documentation:
Development
# Run Rust tests
# Build and install Python bindings locally
# Run Python tests