json-extractor 0.1.0

High-performance two-stage JSON fragment scanner with SIMD acceleration

json-extractor

A high-performance two-stage JSON fragment scanner written in Rust. Extracts complete JSON objects and arrays from documents containing mixed content (log files, JSON Lines, etc.).

Features

  • Two-stage pipeline: SIMD character classification + fragment extraction
  • SIMD-accelerated: AVX2/SSE4.2 with automatic scalar fallback
  • Zero-copy API: Buffer reuse via StagedScanner eliminates repeated allocations
  • Fragment detection: Identifies JSON objects ({}) and arrays ([])
  • Error reporting: Detailed error information for incomplete/invalid fragments
  • Position tracking: Absolute byte offsets for each fragment
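
The two-stage split listed above can be illustrated with a scalar sketch that uses only the standard library. This is not the crate's implementation: `classify` and `extract` are hypothetical names, the first stage is a plain loop rather than SIMD, and malformed candidates are dropped instead of being reported as incomplete.

```rust
/// Stage 1 (illustrative): record the positions of structurally relevant bytes.
/// A SIMD version would classify many bytes per instruction; this is a scalar stand-in.
fn classify(input: &[u8]) -> Vec<usize> {
    input
        .iter()
        .enumerate()
        .filter(|(_, b)| matches!(**b, b'{' | b'}' | b'[' | b']' | b'"' | b'\\'))
        .map(|(i, _)| i)
        .collect()
}

/// Stage 2 (illustrative): walk the structural positions and return the
/// absolute `(start, end)` byte ranges of balanced top-level objects and
/// arrays. `end` is exclusive.
fn extract(input: &[u8], structural: &[usize]) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut stack: Vec<u8> = Vec::new(); // closers we still expect
    let mut start = 0;
    let mut in_string = false;
    let mut skip_pos: Option<usize> = None; // byte escaped by a preceding backslash

    for &pos in structural {
        if skip_pos == Some(pos) {
            skip_pos = None;
            continue;
        }
        let b = input[pos];
        if in_string {
            match b {
                b'\\' => skip_pos = Some(pos + 1),
                b'"' => in_string = false,
                _ => {} // brackets inside strings are ignored
            }
            continue;
        }
        match b {
            // Quotes only matter once we are inside a candidate fragment.
            b'"' if !stack.is_empty() => in_string = true,
            b'{' | b'[' => {
                if stack.is_empty() {
                    start = pos;
                }
                stack.push(if b == b'{' { b'}' } else { b']' });
            }
            b'}' | b']' => {
                if stack.last() == Some(&b) {
                    stack.pop();
                    if stack.is_empty() {
                        out.push((start, pos + 1));
                    }
                } else {
                    stack.clear(); // mismatched closer: drop the candidate
                }
            }
            _ => {}
        }
    }
    out
}
```

Calling `extract(input, &classify(input))` yields byte ranges that can be used to slice `input` directly. Keeping the stages separate means the first pass has no data-dependent state, which is what makes it a natural fit for vectorization.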

Installation

Add this to your Cargo.toml:

[dependencies]
json-extractor = "0.1.0"

Usage

Quick Start

Extract the first JSON fragment from a string:

use json_extractor::extract_first;

let input = r#"some log prefix {"name": "Alice"} tail"#;
assert_eq!(extract_first(input), Some(r#"{"name": "Alice"}"#));

Multiple Fragments

Use StagedScanner for full control and buffer reuse across repeated scans:

use json_extractor::StagedScanner;

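// Reuse one scanner (and its buffers) across calls.
let mut scanner = StagedScanner::new();
let data = br#"some prefix {"name": "Alice"} garbage {"age": 30} more text"#;
let fragments = scanner.scan_fragments(data);

// Both objects are found; the surrounding text is skipped.
assert_eq!(fragments.len(), 2);
assert!(fragments[0].is_complete());
// `start` and `end()` are absolute byte offsets into `data`.
assert_eq!(&data[fragments[0].start..fragments[0].end()], br#"{"name": "Alice"}"#);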

Error Handling

use json_extractor::{StagedScanner, FragmentStatus};

let mut scanner = StagedScanner::new();
// The string value is never closed, so the fragment cannot be complete.
let data = br#"{"unterminated": "value"#;
let fragments = scanner.scan_fragments(data);

match &fragments[0].status {
    // `err` is an ErrorKind describing why the fragment is incomplete.
    FragmentStatus::Incomplete(err) => {
        println!("Error: {err}");
    }
    FragmentStatus::Complete => {}
}

Performance

Benchmarked on x86_64 with AVX2:

Workload              Throughput
Long strings (1KB)    14.9 GiB/s
Large arrays (10k)    3.44 GiB/s
Mixed log files       1.63 GiB/s
Simple objects        1.21 GiB/s
Deep nesting (50)     1.10 GiB/s
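
For comparing your own measurements against the figures above, the unit can be made explicit: 1 GiB is 2^30 bytes, so throughput is bytes divided by 2^30, divided by elapsed seconds. The helper below is a hypothetical sketch and is not part of the crate or its benchmark harness.

```rust
use std::time::Duration;

/// Hypothetical helper: convert a byte count and elapsed time into GiB/s
/// (1 GiB = 2^30 bytes).
fn gib_per_sec(bytes: u64, elapsed: Duration) -> f64 {
    (bytes as f64 / (1u64 << 30) as f64) / elapsed.as_secs_f64()
}
```

For example, scanning 2^30 bytes in exactly one second corresponds to 1.0 GiB/s.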

Run benchmarks:

cargo bench --bench scanner_bench 2>/dev/null
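
Since the numbers above were taken with AVX2, it can help to check which instruction sets the benchmarking machine exposes before comparing results. The sketch below uses the standard library's runtime feature detection; `detect_simd_level` is a hypothetical helper and does not describe how the crate itself selects a code path.

```rust
/// Hypothetical helper: name the best instruction set detected at runtime,
/// falling back to "scalar" on CPUs or architectures without AVX2/SSE4.2.
fn detect_simd_level() -> &'static str {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            return "avx2";
        }
        if is_x86_feature_detected!("sse4.2") {
            return "sse4.2";
        }
    }
    "scalar"
}
```

Printing the result of `detect_simd_level()` alongside benchmark output makes it easier to tell whether a lower figure comes from a missing CPU feature.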

API

  • extract_first — Extract the first complete JSON fragment from a &str. Simplest entry point.
  • StagedScanner — Stateful scanner with buffer reuse. Best for repeated scans or when you need all fragments.
  • JsonFragmentScanner — Convenience stateless wrapper (allocates per call).
  • Fragment — Extracted fragment with start, length, status, end(), is_complete().
  • FragmentStatus — Complete or Incomplete(ErrorKind).
  • ErrorKind — Detailed error variants (unterminated strings, mismatched brackets, etc.).
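
To make the relationship between these types concrete, the stand-ins below mirror the shape described in the list. They are assumptions for illustration, not the crate's definitions: the field types, the two `ErrorKind` variant names, the error messages, and the rule `end() == start + length` are inferred from the descriptions and examples in this README.

```rust
use std::fmt;

// Stand-in types only; the real definitions live in json_extractor.
#[derive(Debug, Clone, PartialEq)]
enum ErrorKind {
    UnterminatedString, // hypothetical variant name
    MismatchedBracket,  // hypothetical variant name
}

impl fmt::Display for ErrorKind {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ErrorKind::UnterminatedString => write!(f, "unterminated string"),
            ErrorKind::MismatchedBracket => write!(f, "mismatched bracket"),
        }
    }
}

#[derive(Debug, Clone, PartialEq)]
enum FragmentStatus {
    Complete,
    Incomplete(ErrorKind),
}

#[derive(Debug, Clone, PartialEq)]
struct Fragment {
    start: usize,  // absolute byte offset of the first byte
    length: usize, // number of bytes in the fragment
    status: FragmentStatus,
}

impl Fragment {
    /// Exclusive end offset, assumed to be `start + length`.
    fn end(&self) -> usize {
        self.start + self.length
    }

    fn is_complete(&self) -> bool {
        matches!(self.status, FragmentStatus::Complete)
    }
}
```

With this shape, `&data[fragment.start..fragment.end()]` recovers the fragment bytes without copying, which matches the slicing used in the Multiple Fragments example.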

License

Licensed under either of

at your option.

Contributing

Contributions are welcome!

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.