ip-extract
A fast IP address extraction library for Rust.
Extract IPv4 and IPv6 addresses from unstructured text with minimal overhead. This crate powers the core extraction engine for geoipsed and is designed for high-throughput scanning of large datasets.
Features
- ⚡ Performance Optimized: Compile-time DFA with O(n) scanning, no runtime regex compilation
- 🎯 Strict Validation: Deep validation eliminates false positives (e.g., rejects 1.2.3.4.5)
- ⚙️ Configurable: Fine-grained control over address types (private, loopback, broadcast)
- 🔢 Byte-Oriented: Zero-copy scanning directly on byte slices, no UTF-8 validation overhead
Basic Example
Extract all IP addresses (default behavior):
    use ip_extract::ExtractorBuilder;
    let extractor = ExtractorBuilder::new().build()?;
Configuration Examples
    use ip_extract::ExtractorBuilder;

    // Extract only public IPs (recommended for most use cases)
    let extractor = ExtractorBuilder::new()
        .only_public()
        .build()?;

    // Extract only IPv4, ignoring loopback
    let extractor = ExtractorBuilder::new()
        .ipv6(false)
        .ignore_loopback()
        .build()?;

    // Fine-grained control
    let extractor = ExtractorBuilder::new()
        .ipv4(true)
        .ipv6(true)
        .ignore_private()
        .ignore_broadcast()
        .build()?;
Examples
Run the included example to see IP extraction in action:
Output:
Scanning text...
Found: 10.0.0.5 at 14..23
Found: 8.8.8.8 at 27..35
Found: 2001:4860:4860::8888 at 45..63
Traffic from [10.0.0.5] to [8.8.8.8] and also [2001:4860:4860::8888]
{"text":"Traffic from 10.0.0.5 to 8.8.8.8 and also 2001:4860:4860::8888","tags":[{"ip":"10.0.0.5","range":[14,23],"decoration":"[10.0.0.5]"},{"ip":"8.8.8.8","range":[27,35],"decoration":"[8.8.8.8]"},{"ip":"2001:4860:4860::8888","range":[45,63],"decoration":"[2001:4860:4860::8888]"}]}
The example demonstrates:
- Building an extractor with default settings (all IPs included)
- Scanning input text for IP addresses
- Creating tagged output with decorations
- Outputting results in both decorated text and JSON formats
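The decoration step can be illustrated without the crate itself. The following is a minimal sketch using only the standard library; `decorate` is a hypothetical helper (not part of ip-extract), and it assumes the match ranges are sorted, non-overlapping byte offsets into the input.

```rust
use std::ops::Range;

/// Wrap each matched byte range of `text` in square brackets.
/// Hypothetical helper: assumes `ranges` are sorted, non-overlapping,
/// and fall on UTF-8 character boundaries.
fn decorate(text: &str, ranges: &[Range<usize>]) -> String {
    let mut out = String::with_capacity(text.len() + 2 * ranges.len());
    let mut cursor = 0;
    for r in ranges {
        // Copy the unmatched text before this match, then the bracketed match.
        out.push_str(&text[cursor..r.start]);
        out.push('[');
        out.push_str(&text[r.clone()]);
        out.push(']');
        cursor = r.end;
    }
    // Copy whatever follows the last match.
    out.push_str(&text[cursor..]);
    out
}

fn main() {
    let text = "from 10.0.0.5 to 8.8.8.8";
    // Offsets counted by hand for this sample string.
    let ranges = vec![5..13, 17..24];
    println!("{}", decorate(text, &ranges));
}
```

Building the output in a single pass over sorted ranges keeps the decoration step linear in the input length, in the same spirit as the scanner.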
Benchmarks
Typical throughput on modern hardware (see benches/ip_benchmark.rs):
| Scenario | Throughput |
|---|---|
| Dense IPs (mostly IP addresses) | 160+ MiB/s |
| Sparse logs (mixed with text) | 360+ MiB/s |
| Pure scanning (no IPs) | 620+ MiB/s |
Run benchmarks locally:
Performance Architecture
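ip-extract achieves high throughput through a two-stage design (build-time DFA compilation and runtime scanning), with strict validation applied only to candidate matches: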
- Compile-Time DFA (Build Phase)
  - Regex patterns compiled into dense forward DFAs during build
  - DFA serialized and embedded in binary (~600KB)
  - Eliminates all runtime regex compilation
- Zero-Cost Scanning (Runtime)
  - O(n) byte scanning with lazy DFA initialization
  - Single forward pass, no backtracking
  - Validation only on candidates, not all scanned bytes
- Strict Validation
  - Hand-optimized[^1] IPv4 parser (20-30% faster than std::net)
  - Boundary checking prevents false matches (e.g., 1.2.3.4.5 rejected)
  - Configurable filters for special ranges
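To make the validation stage concrete, here is a rough sketch of a strict dotted-quad parser plus a boundary check, using only the standard library. It is not the crate's actual implementation: `parse_ipv4` and `validate_candidate` are hypothetical names, and the rules chosen here (rejecting leading zeros, treating a neighbouring digit or digit-dot pair as part of the token) are assumptions made for illustration.

```rust
/// Parse a dotted-quad IPv4 address that occupies all of `bytes`.
/// Sketch-level rules: exactly four octets, each 0..=255, no leading zeros.
fn parse_ipv4(bytes: &[u8]) -> Option<[u8; 4]> {
    let mut octets = [0u8; 4];
    let mut idx = 0; // index of the octet being filled
    let mut value: u16 = 0; // running value of the current octet
    let mut digits = 0; // digits seen in the current octet
    for &b in bytes {
        match b {
            b'0'..=b'9' => {
                if digits == 1 && value == 0 {
                    return None; // leading zero such as "01"
                }
                value = value * 10 + u16::from(b - b'0');
                digits += 1;
                if digits > 3 || value > 255 {
                    return None;
                }
            }
            b'.' => {
                if digits == 0 || idx == 3 {
                    return None; // empty octet, or a fifth octet starting
                }
                octets[idx] = value as u8;
                idx += 1;
                value = 0;
                digits = 0;
            }
            _ => return None,
        }
    }
    if idx != 3 || digits == 0 {
        return None;
    }
    octets[3] = value as u8;
    Some(octets)
}

/// Accept `haystack[start..end]` only if it is not glued to more
/// digit-and-dot material on either side and parses strictly.
fn validate_candidate(haystack: &[u8], start: usize, end: usize) -> Option<[u8; 4]> {
    let digit_at = |i: usize| haystack.get(i).map_or(false, |b| b.is_ascii_digit());
    // Left side: a digit, or "<digit>." immediately before the candidate.
    let left_bad = start >= 1
        && (digit_at(start - 1)
            || (haystack.get(start - 1) == Some(&b'.') && start >= 2 && digit_at(start - 2)));
    // Right side: a digit, or ".<digit>" immediately after the candidate.
    let right_bad = digit_at(end) || (haystack.get(end) == Some(&b'.') && digit_at(end + 1));
    if left_bad || right_bad {
        return None;
    }
    parse_ipv4(haystack.get(start..end)?)
}

fn main() {
    let text = b"src=8.8.8.8. bad=1.2.3.4.5";
    println!("{:?}", validate_candidate(text, 4, 11)); // trailing sentence period is fine
    println!("{:?}", validate_candidate(text, 17, 24)); // part of 1.2.3.4.5
}
```

Running the expensive checks only on spans the DFA has already flagged is what keeps the per-byte cost of the main scan low.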
API Defaults
By default, ExtractorBuilder::new() extracts all IP addresses:
- ✅ IPv4: Enabled
- ✅ IPv6: Enabled
- ✅ Private IPs: Enabled (RFC 1918, IPv6 ULA)
- ✅ Loopback: Enabled (127.0.0.0/8, ::1)
- ✅ Broadcast: Enabled (255.255.255.255, link-local)
Use convenience methods to filter:
- .only_public() - Extract only publicly routable IPs
- .ignore_private() - Skip RFC 1918 and IPv6 ULA ranges
- .ignore_loopback() - Skip loopback addresses
- .ignore_broadcast() - Skip broadcast addresses
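The categories behind these filters can be approximated with `std::net`. The sketch below is only an illustration of the address classes named above, not the crate's filter logic: the `Class` enum and `classify` function are hypothetical, grouping link-local under broadcast follows the defaults list, and treating every remaining address as public is a simplification.

```rust
use std::net::IpAddr;

/// Hypothetical classification mirroring the filter categories above.
#[derive(Debug, PartialEq)]
enum Class {
    Private,
    Loopback,
    Broadcast,
    Public,
}

fn classify(ip: IpAddr) -> Class {
    match ip {
        IpAddr::V4(v4) => {
            if v4.is_loopback() {
                Class::Loopback // 127.0.0.0/8
            } else if v4.is_private() {
                Class::Private // RFC 1918 ranges
            } else if v4.is_broadcast() || v4.is_link_local() {
                Class::Broadcast // 255.255.255.255 and 169.254.0.0/16
            } else {
                Class::Public // simplification: everything else
            }
        }
        IpAddr::V6(v6) => {
            let first = v6.segments()[0];
            if v6.is_loopback() {
                Class::Loopback // ::1
            } else if (first & 0xfe00) == 0xfc00 {
                Class::Private // fc00::/7 unique local addresses
            } else if (first & 0xffc0) == 0xfe80 {
                Class::Broadcast // fe80::/10 link-local (assumed grouping)
            } else {
                Class::Public
            }
        }
    }
}

fn main() {
    for s in ["10.0.0.5", "127.0.0.1", "255.255.255.255", "8.8.8.8", "fd00::1", "::1"] {
        let ip: IpAddr = s.parse().expect("valid address literal");
        println!("{s} -> {:?}", classify(ip));
    }
}
```

Under this reading, `.only_public()` would keep only addresses for which such a classifier returns `Public`.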
Limitations
By design, this engine makes conservative choices for performance:
- Strict Boundaries: IPs must be separated by non-IP characters; concatenated IPs without separators may be skipped
- Standard IPv4 Only: Four-octet dotted notation only (e.g., 192.168.0.1 only, not 0xC0A80001, not 3232235521, and not 11000000.10101000.00000000.00000001)
- IPv6 Scope IDs Not Captured: Formats like fe80::1%eth0 will extract as fe80::1 (the scope ID %eth0 is treated as a boundary and dropped).
These constraints ensure minimal false positives and maximum scanning performance.
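The notations listed above all denote the same address, which the standard library can demonstrate. This sketch uses only `std::net`, whose strict parser happens to make the same choices; `split_scope` is a hypothetical helper showing one way a caller could recover a scope ID from the original text if it is needed.

```rust
use std::net::{Ipv4Addr, Ipv6Addr};

/// Hypothetical helper: split "addr%scope" into its two parts.
fn split_scope(s: &str) -> (&str, Option<&str>) {
    match s.split_once('%') {
        Some((addr, scope)) => (addr, Some(scope)),
        None => (s, None),
    }
}

fn main() {
    // Hex, decimal, and binary spellings are not dotted-quad, so a strict
    // parser rejects them as text.
    for s in ["0xC0A80001", "3232235521", "11000000.10101000.00000000.00000001"] {
        assert!(s.parse::<Ipv4Addr>().is_err());
    }
    // Numerically, the hex and decimal forms are the same address as 192.168.0.1.
    assert_eq!(Ipv4Addr::from(0xC0A8_0001u32), Ipv4Addr::new(192, 168, 0, 1));
    assert_eq!(Ipv4Addr::from(3_232_235_521u32), Ipv4Addr::new(192, 168, 0, 1));

    // A scoped IPv6 literal does not parse as a bare address; the part
    // before '%' does.
    assert!("fe80::1%eth0".parse::<Ipv6Addr>().is_err());
    let (addr, scope) = split_scope("fe80::1%eth0");
    assert_eq!(
        addr.parse::<Ipv6Addr>().unwrap(),
        Ipv6Addr::new(0xfe80, 0, 0, 0, 0, 0, 0, 1)
    );
    assert_eq!(scope, Some("eth0"));
    println!("address: {addr}, scope: {scope:?}");
}
```

Callers that need the alternative IPv4 notations would have to normalize them before scanning.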
[^1]: AI wrote all of this. It does not have hands.