# ip-extract
A fast IP address extraction library for Rust.
Extract IPv4 and IPv6 addresses from unstructured text with minimal overhead. This crate powers the core extraction engine for [`geoipsed`](https://github.com/erichutchins/geoipsed) and is designed for high-throughput scanning of large datasets.
## Features
- ⚡ **Performance Optimized**: Compile-time DFA with O(n) scanning, no runtime regex compilation
- 🎯 **Strict Validation**: Deep validation eliminates false positives (e.g., rejects `1.2.3.4.5`)
- 🛡️ **Defang Support**: Automatically matches defanged IPs (`192[.]168[.]1[.]1`, `2001[:]db8[:]...`) with negligible overhead on normal input
- ⚙️ **Configurable**: Fine-grained control over address types (private, loopback, broadcast)
- 🔢 **Byte-Oriented**: Zero-copy scanning directly on byte slices, no UTF-8 validation overhead
### Basic Example
Extract all IP addresses (default behavior):
```rust
use ip_extract::ExtractorBuilder;

fn main() -> anyhow::Result<()> {
    // Extracts all IPs: IPv4, IPv6, private, loopback, broadcast
    let extractor = ExtractorBuilder::new().build()?;
    let input = b"Connection from 192.168.1.1 and 8.8.8.8";

    // Each item is a byte range into `input`
    for range in extractor.find_iter(input) {
        let ip = std::str::from_utf8(&input[range]).unwrap();
        println!("Found IP: {}", ip);
    }
    Ok(())
}
```
### Configuration Examples
```rust
use ip_extract::ExtractorBuilder;

fn main() -> anyhow::Result<()> {
    // Extract only public IPs (recommended for most use cases)
    let extractor = ExtractorBuilder::new()
        .only_public()
        .build()?;

    // Extract only IPv4, ignoring loopback
    let extractor = ExtractorBuilder::new()
        .ipv6(false)
        .ignore_loopback()
        .build()?;

    // Fine-grained control
    let extractor = ExtractorBuilder::new()
        .ipv4(true)
        .ipv6(true)
        .ignore_private()
        .ignore_broadcast()
        .build()?;

    Ok(())
}
```
## Examples
Run the included example to see IP extraction in action:
```bash
cargo run --example simple_extraction
```
Output:
```
Scanning text...
Found: 10.0.0.5 at 14..23
Found: 8.8.8.8 at 27..35
Found: 2001:4860:4860::8888 at 45..63
--- Decorated Output ---
Traffic from [10.0.0.5] to [8.8.8.8] and also [2001:4860:4860::8888]
--- JSON Output ---
{"text":"Traffic from 10.0.0.5 to 8.8.8.8 and also 2001:4860:4860::8888","tags":[{"ip":"10.0.0.5","range":[14,23],"decoration":"[10.0.0.5]"},{"ip":"8.8.8.8","range":[27,35],"decoration":"[8.8.8.8]"},{"ip":"2001:4860:4860::8888","range":[45,63],"decoration":"[2001:4860:4860::8888]"}]}
```
The example demonstrates:
1. Building an extractor with default settings (all IPs included)
2. Scanning input text for IP addresses
3. Creating tagged output with decorations
4. Outputting results in both decorated text and JSON formats
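The decoration step can be illustrated without the crate. The following is a minimal sketch, assuming the match ranges are already known, sorted, and non-overlapping; `decorate` is a hypothetical helper written for this illustration, not part of the `ip-extract` API, and the input text and ranges are hand-picked rather than taken from the example program.

```rust
use std::ops::Range;

/// Hypothetical helper (not part of ip-extract): wrap each matched byte
/// range in square brackets. `ranges` must be sorted and non-overlapping.
fn decorate(input: &[u8], ranges: &[Range<usize>]) -> String {
    let mut out = Vec::with_capacity(input.len() + 2 * ranges.len());
    let mut last = 0;
    for r in ranges {
        // Copy the text between the previous match and this one
        out.extend_from_slice(&input[last..r.start]);
        out.push(b'[');
        out.extend_from_slice(&input[r.clone()]);
        out.push(b']');
        last = r.end;
    }
    // Copy whatever follows the final match
    out.extend_from_slice(&input[last..]);
    String::from_utf8_lossy(&out).into_owned()
}

fn main() {
    let input = b"from 10.0.0.5 to 8.8.8.8";
    // Hand-written ranges standing in for what an extractor would report
    let ranges = vec![5..13, 17..24];
    println!("{}", decorate(input, &ranges)); // from [10.0.0.5] to [8.8.8.8]
}
```

Working on byte ranges rather than copied strings keeps the original input untouched, which is the same shape of data that `find_iter` yields in the basic example.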
## Defanged IP Support
Defanged IPs are commonly used in threat intelligence reports and security logs to prevent accidental clicks or connections. `ip-extract` matches them automatically — no configuration needed.
```rust
use ip_extract::ExtractorBuilder;

fn main() -> anyhow::Result<()> {
    let extractor = ExtractorBuilder::new().build()?;
    let input = b"Attacker IP: 192[.]168[.]1[.]1 and 2001[:]db8[:]0[:]0[:]0[:]0[:]0[:]1";
    for m in extractor.match_iter(input) {
        println!("{}", m.as_str());         // e.g. "192.168.1.1" (brackets stripped)
        println!("{}", m.as_matched_str()); // e.g. "192[.]168[.]1[.]1" (raw match)
    }
    Ok(())
}
```
Supported notation:
- IPv4: `[.]` dot brackets (e.g., `192[.]168[.]1[.]1`)
- IPv6: `[:]` colon brackets (e.g., `2001[:]db8[:]...`), fully-expanded form only
Defang patterns are baked into the DFA at compile time. The expanded DFA adds ~3KB to the binary and has **no measurable regression on normal (fanged) input** — on defanged input it's 16% faster than a pre-processing normalization approach.
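For comparison, a pre-processing normalization approach might look like the sketch below. It is only an illustration of the alternative mentioned above, assuming the two bracket notations listed in this section; `refang` is a hypothetical helper and not something the crate exposes.

```rust
use std::net::IpAddr;

/// Hypothetical helper (not part of ip-extract): rewrite the two supported
/// defang notations back to plain separators before any scanning happens.
fn refang(raw: &str) -> String {
    raw.replace("[.]", ".").replace("[:]", ":")
}

fn main() {
    let v4 = refang("192[.]168[.]1[.]1");
    let v6 = refang("2001[:]db8[:]0[:]0[:]0[:]0[:]0[:]1");

    // After normalization the standard library parser accepts both forms
    assert!(v4.parse::<IpAddr>().is_ok());
    assert!(v6.parse::<IpAddr>().is_ok());
    println!("{v4} {v6}");
}
```

This approach allocates a new string and walks the input twice before extraction even starts, which is the kind of overhead that matching defanged forms directly in the DFA avoids.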
## Benchmarks
Typical throughput on modern hardware (see `benches/ip_benchmark.rs`):
| Scenario | Throughput |
|----------|------------|
| Dense IPs (mostly IP addresses) | **160+ MiB/s** |
| Sparse logs (mixed with text) | **360+ MiB/s** |
| Pure scanning (no IPs) | **620+ MiB/s** |
Run benchmarks locally:
```bash
cargo bench --bench ip_benchmark
```
## Performance Architecture
`ip-extract` achieves high throughput through three design elements:
1. **Compile-Time DFA** (Build Phase)
   - Regex patterns compiled into dense Forward DFAs during build
   - DFA serialized and embedded in binary (~600KB)
   - Eliminates all runtime regex compilation
2. **Zero-Cost Scanning** (Runtime)
   - O(n) byte scanning with lazy DFA initialization
   - Single forward pass, no backtracking
   - Validation only on candidates, not all scanned bytes
3. **Strict Validation**
   - Hand-optimized IPv4 parser (20-30% faster than `std::net`)
   - Boundary checking prevents false matches (e.g., `1.2.3.4.5` rejected)
   - Configurable filters for special ranges
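
The validation stage can be pictured with a small standalone sketch. This is not the crate's parser: `parse_ipv4_strict` and `is_bounded` are hypothetical names, and the boundary rule used here (no digit or dot immediately before the candidate, no digit or dot-plus-digit immediately after it) is an assumption chosen to reproduce the `1.2.3.4.5` rejection described above.

```rust
/// Hypothetical strict dotted-quad parser (not the crate's implementation).
/// Accepts exactly four decimal octets in 0..=255, each 1 to 3 digits long.
/// Leading zeros are accepted in this sketch.
fn parse_ipv4_strict(s: &[u8]) -> Option<[u8; 4]> {
    let mut octets = [0u8; 4];
    let mut idx = 0; // index of the octet being filled
    let mut digits = 0; // digits seen in the current octet
    let mut value: u16 = 0;
    for &b in s {
        match b {
            b'0'..=b'9' => {
                digits += 1;
                value = value * 10 + u16::from(b - b'0');
                if digits > 3 || value > 255 {
                    return None;
                }
            }
            b'.' => {
                // Reject empty octets and a fourth dot
                if digits == 0 || idx == 3 {
                    return None;
                }
                octets[idx] = value as u8;
                idx += 1;
                digits = 0;
                value = 0;
            }
            _ => return None,
        }
    }
    if idx != 3 || digits == 0 {
        return None;
    }
    octets[3] = value as u8;
    Some(octets)
}

/// Assumed boundary rule: the candidate at `start..end` must not be glued
/// to neighbouring digits or to a further `.digit` continuation.
fn is_bounded(input: &[u8], start: usize, end: usize) -> bool {
    let before_ok = start == 0 || !matches!(input[start - 1], b'0'..=b'9' | b'.');
    let after_ok = match &input[end..] {
        [b'0'..=b'9', ..] => false,
        [b'.', b'0'..=b'9', ..] => false,
        _ => true,
    };
    before_ok && after_ok
}

fn main() {
    assert_eq!(parse_ipv4_strict(b"192.168.1.1"), Some([192, 168, 1, 1]));
    assert_eq!(parse_ipv4_strict(b"256.1.1.1"), None);

    // The candidate `1.2.3.4` sits at bytes 2..9 but is followed by `.5`
    let input = b"x 1.2.3.4.5 y";
    assert!(!is_bounded(input, 2, 9));
    println!("ok");
}
```

Running the cheap boundary check only on DFA candidates, rather than on every byte, is what keeps validation off the hot path.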
## API Defaults
By default, `ExtractorBuilder::new()` extracts **all IP addresses**:
- ✅ **IPv4**: Enabled
- ✅ **IPv6**: Enabled
- ✅ **Private IPs**: Enabled (RFC 1918, IPv6 ULA)
- ✅ **Loopback**: Enabled (127.0.0.0/8, ::1)
- ✅ **Broadcast**: Enabled (255.255.255.255, link-local)
Use convenience methods to filter:
- `.only_public()` - Extract only publicly routable IPs
- `.ignore_private()` - Skip RFC 1918 and IPv6 ULA ranges
- `.ignore_loopback()` - Skip loopback addresses
- `.ignore_broadcast()` - Skip broadcast addresses
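
To make the categories concrete, here is a rough sketch that classifies addresses using only the standard library. It does not show how the crate filters internally; `Class` and `classify` are hypothetical names, and the exact ranges behind each filter (in particular grouping link-local with broadcast, as the defaults list above does) are assumptions.

```rust
use std::net::IpAddr;

/// Hypothetical categories mirroring the filter names above.
#[derive(Debug, PartialEq)]
enum Class {
    Private,
    Loopback,
    Broadcast,
    Other,
}

/// Rough classification with std only (not the crate's implementation).
fn classify(ip: IpAddr) -> Class {
    match ip {
        IpAddr::V4(v4) => {
            if v4.is_loopback() {
                Class::Loopback // 127.0.0.0/8
            } else if v4.is_private() {
                Class::Private // RFC 1918
            } else if v4.is_broadcast() || v4.is_link_local() {
                Class::Broadcast // 255.255.255.255 and 169.254.0.0/16
            } else {
                Class::Other
            }
        }
        IpAddr::V6(v6) => {
            let first = v6.segments()[0];
            if v6.is_loopback() {
                Class::Loopback // ::1
            } else if (first & 0xfe00) == 0xfc00 {
                Class::Private // ULA, fc00::/7
            } else if (first & 0xffc0) == 0xfe80 {
                Class::Broadcast // assumption: IPv6 link-local grouped here
            } else {
                Class::Other
            }
        }
    }
}

fn main() {
    for s in ["10.0.0.5", "127.0.0.1", "255.255.255.255", "8.8.8.8", "fd00::1", "::1"] {
        let ip: IpAddr = s.parse().unwrap();
        println!("{s}: {:?}", classify(ip));
    }
}
```

Under this reading, `.ignore_private()` would drop the `Private` bucket, `.ignore_loopback()` the `Loopback` bucket, and `.ignore_broadcast()` the `Broadcast` bucket.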
## Limitations
By design, this engine makes conservative choices for performance:
- **Strict Boundaries**: IPs must be separated by non-IP characters; concatenated IPs without separators may be skipped
- **Standard IPv4 Only**: Only four-octet dotted-decimal notation is matched (e.g., `192.168.0.1`); hexadecimal (`0xC0A80001`), integer (`3232235521`), and binary-octet (`11000000.10101000.00000000.00000001`) forms are not
- **IPv6 Scope IDs Not Captured**: Formats like `fe80::1%eth0` are extracted as `fe80::1` (the scope ID `%eth0` is treated as a boundary and dropped)
These constraints ensure minimal false positives and maximum scanning performance.
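
If a data source is known to contain the integer or hexadecimal IPv4 forms listed above, one option is to convert them separately before or alongside extraction. The sketch below uses only the standard library; `ipv4_from_integer_literal` is a hypothetical helper and is not provided by the crate.

```rust
use std::net::Ipv4Addr;

/// Hypothetical helper (not part of ip-extract): interpret a `0x`-prefixed
/// hexadecimal or plain decimal integer as a 32-bit IPv4 address.
fn ipv4_from_integer_literal(s: &str) -> Option<Ipv4Addr> {
    let n = match s.strip_prefix("0x").or_else(|| s.strip_prefix("0X")) {
        Some(hex) => u32::from_str_radix(hex, 16).ok()?,
        None => s.parse::<u32>().ok()?,
    };
    Some(Ipv4Addr::from(n))
}

fn main() {
    let expected = Some(Ipv4Addr::new(192, 168, 0, 1));
    assert_eq!(ipv4_from_integer_literal("0xC0A80001"), expected);
    assert_eq!(ipv4_from_integer_literal("3232235521"), expected);

    // Dotted notation is not an integer literal, so it is left to the extractor
    assert_eq!(ipv4_from_integer_literal("192.168.0.1"), None);
    println!("ok");
}
```

Keeping this conversion outside the scanner avoids treating every bare number in a log as a potential address, which is the false-positive risk the strict matching rules are meant to prevent.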