# json-extractor
A high-performance two-stage JSON fragment scanner written in Rust. Extracts complete JSON objects and arrays from documents containing mixed content (log files, JSON Lines, etc.).
## Features
- **Two-stage pipeline**: SIMD character classification + fragment extraction
- **SIMD-accelerated**: AVX2/SSE4.2 with automatic scalar fallback
- **Zero-copy API**: Fragments are reported as byte offsets into the input, and `StagedScanner` reuses its buffers across scans to avoid repeated allocations
- **Fragment detection**: Identifies JSON objects (`{}`) and arrays (`[]`)
- **Error reporting**: Detailed error information for incomplete/invalid fragments
- **Position tracking**: Absolute byte offsets for each fragment
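
The two-stage idea can be illustrated without SIMD. The following is a minimal scalar sketch, not the crate's implementation: `classify`, `extract`, and `Status` are hypothetical names. Stage 1 collects the offsets of structural bytes, and stage 2 walks only those offsets while tracking string state and bracket nesting. It assumes quotes outside a fragment are ordinary text, which may differ from how the crate treats them.

```rust
/// Outcome for one fragment (illustrative, not the crate's type).
#[derive(Debug, PartialEq)]
enum Status {
    Complete,
    Incomplete,
}

/// Stage 1: record offsets of bytes that can change the scanner state.
/// A SIMD version would classify many bytes per step; this is the scalar equivalent.
fn classify(input: &[u8]) -> Vec<usize> {
    input
        .iter()
        .enumerate()
        .filter(|&(_, &b)| matches!(b, b'{' | b'}' | b'[' | b']' | b'"' | b'\\'))
        .map(|(i, _)| i)
        .collect()
}

/// Stage 2: walk the structural offsets and emit `(start, length, status)` tuples.
fn extract(input: &[u8], structural: &[usize]) -> Vec<(usize, usize, Status)> {
    let mut out = Vec::new();
    let mut stack: Vec<u8> = Vec::new(); // closers we still expect
    let mut start = 0;
    let mut in_string = false;
    let mut escaped_at: Option<usize> = None; // offset of a byte escaped by `\`

    for &i in structural {
        // A backslash inside a string escapes only the byte right after it.
        if escaped_at.take() == Some(i) {
            continue;
        }
        let b = input[i];
        if in_string {
            match b {
                b'"' => in_string = false,
                b'\\' => escaped_at = Some(i + 1),
                _ => {} // brackets inside strings are plain text
            }
            continue;
        }
        match b {
            b'{' | b'[' => {
                if stack.is_empty() {
                    start = i; // a new top-level fragment begins here
                }
                stack.push(if b == b'{' { b'}' } else { b']' });
            }
            b'"' if !stack.is_empty() => in_string = true,
            b'}' | b']' if !stack.is_empty() => {
                if stack.pop() == Some(b) {
                    if stack.is_empty() {
                        out.push((start, i + 1 - start, Status::Complete));
                    }
                } else {
                    // Closer does not match the innermost opener.
                    out.push((start, i + 1 - start, Status::Incomplete));
                    stack.clear();
                }
            }
            _ => {} // stray closers, quotes, or backslashes outside a fragment
        }
    }
    if !stack.is_empty() {
        // Input ended while a fragment was still open.
        out.push((start, input.len() - start, Status::Incomplete));
    }
    out
}
```

Calling `extract(input, &classify(input))` on a log line yields one tuple per fragment, and `&input[start..start + length]` recovers the fragment bytes without copying.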
## Installation
Add this to your `Cargo.toml`:
```toml
[dependencies]
json-extractor = "0.1.0"
```
## Usage
### Quick Start
Extract the first JSON fragment from a string:
```rust
use json_extractor::extract_first;
let input = r#"some log prefix {"name": "Alice"} tail"#;
assert_eq!(extract_first(input), Some(r#"{"name": "Alice"}"#));
```
### Multiple Fragments
Use `StagedScanner` for full control and buffer reuse across repeated scans:
```rust
use json_extractor::StagedScanner;
let mut scanner = StagedScanner::new();
let data = br#"some prefix {"name": "Alice"} garbage {"age": 30} more text"#;
let fragments = scanner.scan_fragments(data);
assert_eq!(fragments.len(), 2);
assert!(fragments[0].is_complete());
assert_eq!(&data[fragments[0].start..fragments[0].end()], br#"{"name": "Alice"}"#);
```
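
To show why reusing one scanner avoids repeated allocations, here is a small sketch of the general buffer-reuse pattern. `ReusableScanner` and its methods are hypothetical stand-ins, not part of this crate, and the scan itself is reduced to finding opening brackets. The point is that `Vec::clear` drops the elements but keeps the allocated capacity.

```rust
/// Hypothetical scanner that owns its scratch buffer (not the crate's type).
struct ReusableScanner {
    positions: Vec<usize>,
}

impl ReusableScanner {
    fn new() -> Self {
        Self { positions: Vec::new() }
    }

    /// Returns the offsets of every `{` or `[` in `input`.
    /// The result borrows the scanner's internal buffer.
    fn scan(&mut self, input: &[u8]) -> &[usize] {
        // `clear` keeps the existing allocation, so later scans that need
        // no more space than an earlier one do not allocate again.
        self.positions.clear();
        self.positions.extend(
            input
                .iter()
                .enumerate()
                .filter(|&(_, &b)| b == b'{' || b == b'[')
                .map(|(i, _)| i),
        );
        &self.positions
    }

    /// Current capacity of the scratch buffer, exposed for illustration.
    fn capacity(&self) -> usize {
        self.positions.capacity()
    }
}
```

Because the returned slice borrows the scanner, results from one scan have to be consumed or copied before the next call. That is the usual trade-off of this pattern.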
### Error Handling
```rust
use json_extractor::{FragmentStatus, StagedScanner};

let mut scanner = StagedScanner::new();
// The closing quote and brace are missing, so the fragment is incomplete.
let data = br#"{"unterminated": "value"#;
let fragments = scanner.scan_fragments(data);

match &fragments[0].status {
    FragmentStatus::Incomplete(err) => {
        println!("Error: {err}");
    }
    FragmentStatus::Complete => {}
}
```
## Performance
Benchmarked on x86_64 with AVX2:

| Workload | Throughput |
|---|---|
| Long strings (1KB) | 14.9 GiB/s |
| Large arrays (10k) | 3.44 GiB/s |
| Mixed log files | 1.63 GiB/s |
| Simple objects | 1.21 GiB/s |
| Deep nesting (50) | 1.10 GiB/s |

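
As a rough way to read these figures, throughput can be converted into an estimated scan time. The helper below is a hypothetical illustration, not part of the crate. It assumes the benchmark throughput carries over to your data and hardware, which it may not.

```rust
/// Estimated seconds to scan `bytes` at `gib_per_s` (1 GiB = 2^30 bytes).
/// Illustrative helper only; real throughput depends on input and hardware.
fn estimated_seconds(bytes: u64, gib_per_s: f64) -> f64 {
    bytes as f64 / (gib_per_s * (1u64 << 30) as f64)
}
```

For example, `estimated_seconds(10 * (1 << 30), 1.63)` gives roughly 6.1 seconds for a 10 GiB mixed log file.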
Run benchmarks:
```bash
cargo bench --bench scanner_bench 2>/dev/null
```
## API
- **`extract_first`** — Extract the first complete JSON fragment from a `&str`. Simplest entry point.
- **`StagedScanner`** — Stateful scanner with buffer reuse. Best for repeated scans or when you need all fragments.
- **`JsonFragmentScanner`** — Convenience stateless wrapper (allocates per call).
- **`Fragment`** — Extracted fragment with `start`, `length`, `status`, `end()`, `is_complete()`.
- **`FragmentStatus`** — `Complete` or `Incomplete(ErrorKind)`.
- **`ErrorKind`** — Detailed error variants (unterminated strings, mismatched brackets, etc.).
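
The relationship between these types can be modelled with local stand-ins. This sketch is inferred from the list above and is not the crate's source. The field types and the two `ErrorKind` variant names are assumptions chosen for illustration.

```rust
/// Placeholder error variants; the crate's actual variant names may differ.
#[derive(Debug, Clone, PartialEq)]
enum ErrorKind {
    UnterminatedString,
    MismatchedBracket,
}

#[derive(Debug, Clone, PartialEq)]
enum FragmentStatus {
    Complete,
    Incomplete(ErrorKind),
}

/// Local model of a fragment: a byte range into the scanned input plus a status.
#[derive(Debug, Clone, PartialEq)]
struct Fragment {
    start: usize,  // absolute byte offset of the opening `{` or `[`
    length: usize, // number of bytes covered by the fragment
    status: FragmentStatus,
}

impl Fragment {
    /// Exclusive end offset, so `&data[start..end()]` is the fragment.
    fn end(&self) -> usize {
        self.start + self.length
    }

    fn is_complete(&self) -> bool {
        matches!(self.status, FragmentStatus::Complete)
    }
}
```

Under this model, a fragment at `start = 12` with `length = 17` covers `data[12..29]`, which matches the slicing pattern `&data[fragments[0].start..fragments[0].end()]` used in the usage examples.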
## License
Licensed under either of
- Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or <http://www.apache.org/licenses/LICENSE-2.0>)
- MIT License ([LICENSE-MIT](LICENSE-MIT) or <http://opensource.org/licenses/MIT>)
at your option.
## Contributing
Contributions are welcome!
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.