geoipsed 0.3.0

Inline decoration of IPv4 and IPv6 address geolocations
Documentation
# Python: ipextract

A Python package for fast IP address extraction from text, powered by the `ip-extract` Rust crate via PyO3.

## Installation

```bash
pip install ipextract
```

Requires Python 3.10+. Pre-built wheels are available for Linux (x86_64, aarch64), macOS (x86_64, arm64), and Windows (x64). No Rust toolchain needed.

## Quick Start

```python
import ipextract

# Extract all IPs from a string
ips = ipextract.extract("Connection from 192.168.1.1 and 8.8.8.8")
# ["192.168.1.1", "8.8.8.8"]

# Deduplicate, preserving first-seen order
ips = ipextract.extract_unique("8.8.8.8 1.1.1.1 8.8.8.8")
# ["8.8.8.8", "1.1.1.1"]

# bytes input works too (useful when reading log files directly)
ips = ipextract.extract(b"host 10.0.0.1 connected")
# ["10.0.0.1"]
```

## Reusable Extractor

For processing many lines (log analysis, batch jobs), create an `Extractor` once and reuse it. This avoids redundant initialization on each call.

```python
extractor = ipextract.Extractor().only_public()

for line in log_file:
    ips = extractor.extract(line)
    if ips:
        process(ips)
```

## Filtering

By default, all IP addresses are extracted — IPv4, IPv6, private ranges, loopback, and broadcast. Use the builder methods to filter:

```python
# Only publicly routable IPs (excludes RFC 1918, loopback, broadcast)
e = ipextract.Extractor().only_public()

# Specific exclusions
e = ipextract.Extractor().ignore_private().ignore_loopback()

# Constructor kwargs for one-shot config
e = ipextract.Extractor(private=False, loopback=False, ipv6=False)
```

Fluent methods return a **new** `Extractor` — the original is not modified, making partial configs safe to reuse:

```python
base = ipextract.Extractor()
public = base.only_public()   # new object
ipv4_only = base.ipv6(False)  # new object, base unchanged
```

### Filter reference

| Method                 | Effect                                            |
| ---------------------- | ------------------------------------------------- |
| `.only_public()`       | Exclude private, loopback, and broadcast          |
| `.ignore_private()`    | Exclude RFC 1918 (IPv4) and ULA/link-local (IPv6) |
| `.ignore_loopback()`   | Exclude 127.0.0.0/8 and ::1                       |
| `.ignore_broadcast()`  | Exclude 255.255.255.255 and link-local ranges     |
| `.ipv4(False)`         | Skip IPv4 entirely                                |
| `.ipv6(False)`         | Skip IPv6 entirely                                |
| `.private_ips(bool)`   | Enable/disable private IPs                        |
| `.loopback_ips(bool)`  | Enable/disable loopback IPs                       |
| `.broadcast_ips(bool)` | Enable/disable broadcast IPs                      |

## Extraction Methods

### `extract(text)``list[str]`

Returns all IP addresses found, in order of appearance.

```python
ipextract.Extractor().extract("a 1.1.1.1 b 2.2.2.2")
# ["1.1.1.1", "2.2.2.2"]
```

### `extract_unique(text)``list[str]`

Returns unique IP addresses, preserving first-seen order.

```python
ipextract.Extractor().extract_unique("1.1.1.1 2.2.2.2 1.1.1.1")
# ["1.1.1.1", "2.2.2.2"]
```

### `extract_with_offsets(text)``list[tuple[str, int, int]]`

Returns `(ip, start, end)` tuples. The byte offsets index directly into the original input — useful for annotation, highlighting, or structured log parsing.

```python
text = "host 1.2.3.4 port 80"
for ip, start, end in ipextract.Extractor().extract_with_offsets(text):
    print(f"{ip} at [{start}:{end}]")
    assert text[start:end] == ip
# 1.2.3.4 at [5:12]
```

## Performance

`ipextract` is designed for high-throughput applications. It uses a compile-time DFA (Deterministic Finite Automaton) from the Rust `ip-extract` crate, which scans at O(n) without backtracking.

Typical benchmark results comparing `ipextract` to Python `re` + `ipaddress.ip_address()` validation (both sides extract and validate):

| Scenario | `re` + `ipaddress` (ms) | `ipextract` (ms) | Speedup |
| :--- | :---: | :---: | :---: |
| **Dense IPs** (1000 mixed v4+v6) | 2.3ms | 0.25ms | **9x** |
| **Sparse Logs** (1000 IPs in noise) | 7.4ms | 0.46ms | **16x** |
| **Pure Text** (100KB with zero IPs) | 4.0ms | 0.16ms | **25x** |
| **Defanged IPs** (1000 mixed) | 2.5ms | 0.35ms | **7x** |

The larger the input and the more non-IP text it contains, the greater the performance advantage. `ipextract` excels at high-speed log scanning because it can reject non-IP text much faster than a backtracking regex engine.

## Module-Level Convenience Functions

`ipextract.extract()` and `ipextract.extract_unique()` are shorthand for `Extractor().extract()` with default settings (all IPs included). For repeated calls, prefer creating an `Extractor` instance explicitly.

## Source

The `ipextract` Python package lives in [`crates/ipextract-py/`](https://github.com/erichutchins/geoipsed/tree/main/crates/ipextract-py) and wraps the [`ip-extract`](ip-extract.md) Rust crate.