ip2asn 0.1.2

A high-performance, memory-efficient Rust crate for mapping IP addresses to Autonomous System (AS) information.
# **Technical Specification: The `ip2asn` Crate**

This document outlines the complete technical specification for the `ip2asn`
Rust crate. It is intended to be a developer-ready guide for implementation.

## **1. Vision & Core Concepts**

  * **Vision**: To provide the Rust ecosystem with a high-performance,
	memory-efficient, and ergonomic library for mapping IP addresses to their
	corresponding Autonomous System (AS) information.
  * **Core Problem**: Application developers need to efficiently enrich IP
	addresses with ASN data from large text files. The library must perform lookups
	in under a microsecond without excessive memory overhead.
  * **Release Versioning**: The initial release will target version `0.1.0`,
	following standard Cargo conventions.

-----

## **2. Public API & Data Structures**

The public API will be designed to be ergonomic, robust, and compliant with the
[Rust API
Guidelines](https://rust-lang.github.io/api-guidelines/checklist.html). All
public items MUST be documented.

### **2.1. Main Structs & Enums**

```rust
// In lib.rs
use ip_network::IpNetwork;

/// A read-optimized, in-memory map for IP address to ASN lookups.
/// Construction is handled by the `Builder`.
pub struct IpAsnMap { /* private fields */ }

/// A builder for configuring and loading an `IpAsnMap`.
pub struct Builder { /* private fields */ }

/// A lightweight, read-only view into the ASN information for an IP address.
/// This struct is returned by the `lookup` method.
#[derive(Debug, PartialEq, Eq)]
pub struct AsnInfoView<'a> {
    pub network: IpNetwork,
    pub asn: u32,
    pub country_code: &'a str,
    pub organization: &'a str,
}

/// An owned, lifetime-free struct containing ASN information.
/// This struct is returned by the `lookup_owned` method and is useful
/// for async or multi-threaded applications.
#[derive(Debug, Clone, PartialEq, Eq)]
#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
pub struct AsnInfo {
    pub network: IpNetwork,
    pub asn: u32,
    pub country_code: String,
    pub organization: String,
}

/// The primary error type for the crate.
#[derive(Debug)]
pub enum Error { /* see section 4.2 */ }

/// A non-fatal warning for a skipped line during parsing.
#[derive(Debug)]
pub enum Warning { /* see section 4.3 */ }
```

### **2.2. Core Functionality**

```rust
// In impl IpAsnMap
impl IpAsnMap {
    /// Creates a new, empty `IpAsnMap`.
    /// This is useful for applications that may load data later or start empty.
    pub fn new() -> Self {
        Self::default()
    }

    /// Performs a longest-prefix match for the given IP address.
    ///
    /// Returns `Some(AsnInfoView)` if a matching network is found, otherwise `None`.
    /// The lookup is extremely fast, suitable for high-throughput pipelines.
    pub fn lookup(&self, ip: std::net::IpAddr) -> Option<AsnInfoView<'_>> {
        // ...
    }

    /// Performs a lookup and returns an owned `AsnInfo` struct.
    /// This is ideal for async contexts or when the result needs to be stored,
    /// as it avoids the lifetime constraints of `AsnInfoView`.
    pub fn lookup_owned(&self, ip: std::net::IpAddr) -> Option<AsnInfo> {
        self.lookup(ip).map(AsnInfo::from)
    }
}

impl Default for IpAsnMap {
    /// Creates a new, empty `IpAsnMap`.
    fn default() -> Self {
        // ...
    }
}
```

### **2.3. Builder API**

```rust
// In impl Builder
impl Builder {
    /// Creates a new builder with default, resilient settings.
    pub fn new() -> Self { /* ... */ }

    /// Loads data from a file path.
    /// Automatically handles gzip decompression by inspecting the file's magic bytes.
    pub fn with_file(self, path: &str) -> Result<Self, Error> { /* ... */ }

    /// Loads data from a URL. Requires the `fetch` feature.
    /// Automatically handles gzip decompression. Uses a blocking HTTP client.
    #[cfg(feature = "fetch")]
    pub fn with_url(self, url: &str) -> Result<Self, Error> { /* ... */ }

    /// Loads data from any source that implements `std::io::Read`.
    /// This method expects a plain, uncompressed text stream.
    pub fn with_source<R: std::io::Read>(self, reader: R) -> Self { /* ... */ }

    /// Enables strict parsing mode.
    /// If called, `build()` will return an `Err` on the first parse failure.
    pub fn strict(mut self) -> Self { /* ... */ }

    /// Sets a callback function to be invoked for each skipped line in resilient mode.
    /// The callback is stored until `build()` runs, so it must be `'static`.
    pub fn on_warning<F: Fn(Warning) + 'static>(mut self, callback: F) -> Self { /* ... */ }

    /// Builds the `IpAsnMap`, consuming the builder.
    /// This is a potentially expensive operation.
    pub fn build(self) -> Result<IpAsnMap, Error> { /* ... */ }
}
```

-----

## **3. Data Handling & Internal Architecture**

### **3.1. Source Data Format**

  * The initial implementation will exclusively parse the tab-separated format provided by `iptoasn.com`.
  * **Format**: `range_start\trange_end\tAS_number\tcountry_code\tAS_description`
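
As an illustration of the five-column layout, the following is a minimal sketch of a line parser built only on the standard library. The `RawRecord` type and the `parse_line` function are hypothetical names introduced here, not part of the specified API, and the `String` error is a placeholder for the error types defined in section 4.

```rust
use std::net::IpAddr;

/// One parsed row of the tab-separated source data (hypothetical helper type).
#[derive(Debug, PartialEq, Eq)]
pub struct RawRecord<'a> {
    pub range_start: IpAddr,
    pub range_end: IpAddr,
    pub asn: u32,
    pub country_code: &'a str,
    pub description: &'a str,
}

/// Splits a line into its five columns and parses the typed fields.
/// Returns a human-readable message on failure (a stand-in for a real error type).
pub fn parse_line(line: &str) -> Result<RawRecord<'_>, String> {
    // Tolerate a trailing newline or carriage return.
    let line = line.trim_end_matches(|c: char| c == '\r' || c == '\n');
    let fields: Vec<&str> = line.split('\t').collect();
    if fields.len() != 5 {
        return Err(format!("expected 5 columns, found {}", fields.len()));
    }
    let range_start = fields[0]
        .parse::<IpAddr>()
        .map_err(|_| format!("invalid start IP: {}", fields[0]))?;
    let range_end = fields[1]
        .parse::<IpAddr>()
        .map_err(|_| format!("invalid end IP: {}", fields[1]))?;
    let asn = fields[2]
        .parse::<u32>()
        .map_err(|_| format!("invalid ASN: {}", fields[2]))?;
    Ok(RawRecord {
        range_start,
        range_end,
        asn,
        country_code: fields[3],
        description: fields[4],
    })
}
```

The string fields borrow from the input line, so no allocation is needed until a record is actually stored.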

### **3.2. Gzip Decompression**

  * The `with_file()` and `with_url()` methods MUST transparently decompress
	gzipped content.
  * **Detection**: Decompression will be triggered by detecting the gzip magic
	number (`[0x1f, 0x8b]`) at the beginning of the stream, not by file extension.
  * **Dependency**: The `flate2` crate is recommended.
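
Magic-number detection can be sketched with the standard library alone by peeking at a buffered reader. The `looks_like_gzip` helper below is a hypothetical name, and the sketch assumes the first `fill_buf()` call yields at least two bytes when the stream has them, which holds for files and in-memory buffers but is not guaranteed for every reader.

```rust
use std::io::{self, BufRead};

/// The two-byte gzip magic number.
const GZIP_MAGIC: [u8; 2] = [0x1f, 0x8b];

/// Peeks at the start of a buffered stream without consuming it and reports
/// whether it begins with the gzip magic number.
///
/// Assumption: the first `fill_buf()` call returns at least two bytes if the
/// stream contains them. A reader that delivers one byte at a time would need
/// a loop or an explicit two-byte read instead.
pub fn looks_like_gzip<R: BufRead>(reader: &mut R) -> io::Result<bool> {
    let buf = reader.fill_buf()?;
    Ok(buf.starts_with(&GZIP_MAGIC))
}
```

Because `fill_buf()` does not advance the stream, the same reader can afterwards be handed either to a gzip decoder (for example from `flate2`) or directly to the line parser.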

### **3.3. Core Lookup Engine**

  * The internal storage engine will be an `ip_network_table::IpNetworkTable`
	(or a similar PATRICIA trie implementation) optimized for longest-prefix
	matching of IP network blocks.
  * A private `range_to_cidrs(start: IpAddr, end: IpAddr) -> Vec<IpNetwork>`
	utility will be implemented to convert start/end ranges into the minimal set of
	covering CIDR prefixes.

### **3.4. Memory & Performance Optimizations**

To meet performance goals, the data stored in the trie will be highly optimized.

  * **Country Code**: The 2-character country code will be stored as a `[u8;
	2]`. During parsing, non-standard values (`None`, `Unknown`, etc.) will be
	normalized to the user-assigned ISO code `ZZ`.
  * **Organization**: The organization description strings will be
	**interned**. A central `Vec<String>` will store each unique organization name
	once. The record in the trie will store a `u32` index pointing to this vector,
	avoiding massive string duplication.

-----

## **4. Error & Warning Handling**

### **4.1. Resilient vs. Strict Mode**

  * **Default (Resilient)**: By default, the `build()` process will skip any
	malformed lines. If an `on_warning` callback is configured, it will be called
	for each skipped line with a `Warning` payload.
  * **Strict Mode**: If `builder.strict()` is called, the `build()` process
	will fail fast, returning an `Error` on the first malformed line encountered.

### **4.2. `Error` Enum**

```rust
use std::net::IpAddr;

#[derive(Debug)]
pub enum Error {
    /// An error occurred during an I/O operation.
    Io(std::io::Error),

    /// A line in the data source was malformed (only in strict mode).
    Parse {
        line_number: usize,
        line_content: String,
        kind: ParseErrorKind,
    },
}

#[derive(Debug)]
pub enum ParseErrorKind {
    /// The line did not have the expected number of columns.
    IncorrectColumnCount { expected: usize, found: usize },
    /// A field could not be parsed as a valid IP address.
    InvalidIpAddress { field: String, value: String },
    /// The ASN field could not be parsed as a valid number.
    InvalidAsnNumber { value: String },
    /// The start IP address was greater than the end IP address.
    InvalidRange { start_ip: IpAddr, end_ip: IpAddr },
    /// The start and end IPs were of different families.
    IpFamilyMismatch,
}
```
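
Since the public API is meant to follow the Rust API Guidelines, the error type would typically also implement `Display`, `std::error::Error`, and `From<std::io::Error>`. The sketch below shows one possible shape on a reduced copy of the enum (the `kind` field is omitted for brevity); the message wording is an assumption, not part of the specification.

```rust
use std::fmt;

/// Reduced copy of the crate's `Error` enum, for illustration only.
#[derive(Debug)]
pub enum Error {
    Io(std::io::Error),
    Parse { line_number: usize, line_content: String },
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::Io(err) => write!(f, "I/O error: {}", err),
            Error::Parse { line_number, line_content } => {
                write!(f, "malformed line {}: {:?}", line_number, line_content)
            }
        }
    }
}

impl std::error::Error for Error {
    /// Exposes the underlying I/O error so callers can walk the error chain.
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            Error::Io(err) => Some(err),
            Error::Parse { .. } => None,
        }
    }
}

impl From<std::io::Error> for Error {
    /// Lets I/O failures propagate with the `?` operator.
    fn from(err: std::io::Error) -> Self {
        Error::Io(err)
    }
}
```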

### **4.3. `Warning` Enum**

```rust
#[derive(Debug)]
pub enum Warning {
    /// A line in the data source could not be parsed and was skipped.
    Parse {
        line_number: usize,
        line_content: String,
        message: String,
    },
    /// A line contained a start IP and end IP of different families.
    IpFamilyMismatch {
        line_number: usize,
        line_content: String,
    },
    // ... other non-fatal warnings as needed ...
}
```

-----

## **5. Cargo Features**

  * `fetch`:
    * Enables the `builder.with_url()` method.
    * Adds a dependency on `reqwest`, using its `blocking` client to ensure
	  the crate remains runtime-agnostic.
  * `serde`:
    * Enables serialization and deserialization for the `AsnInfo` struct.
    * Adds a dependency on `serde` and enables the `serde` feature in the `ip_network` crate.

-----

## **6. Async Compatibility**

  * **Runtime-Agnostic Design**: The core library API is **100% synchronous**.
	It has no dependency on `tokio`, `smol`, or any other async runtime.
  * **Primary Solution**: The `lookup_owned()` method is the primary solution for
    using the map in async contexts. It returns an owned `AsnInfo` struct that is
    `Send + Sync` and has no lifetime constraints, making it safe to store, move
    between threads, or use in async functions without `unsafe` code.
  * **Documentation**: The crate-level documentation MUST provide clear examples
    of using `lookup_owned()` in an async context. It should still mention the
    `spawn_blocking` pattern for the initial, potentially long-running `build()`
    call.
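
The reason an owned result matters can be shown without any async runtime. In the sketch below a plain `std::thread` stands in for an async task or a `spawn_blocking` call, and `StubMap` and `OwnedInfo` are hypothetical, exact-match stand-ins for `IpAsnMap` and `AsnInfo` (no longest-prefix matching is performed). The point is only that an owned value can leave the thread that performed the lookup.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::sync::Arc;
use std::thread;

/// Stand-in for `AsnInfo`: fully owned, so it is `Send` and has no lifetime.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OwnedInfo {
    pub asn: u32,
    pub organization: String,
}

/// Stand-in for `IpAsnMap` using exact-match lookups only.
#[derive(Default)]
pub struct StubMap {
    entries: HashMap<IpAddr, OwnedInfo>,
}

impl StubMap {
    pub fn insert(&mut self, ip: IpAddr, info: OwnedInfo) {
        self.entries.insert(ip, info);
    }

    /// Mirrors the shape of `lookup_owned`: the result borrows nothing.
    pub fn lookup_owned(&self, ip: IpAddr) -> Option<OwnedInfo> {
        self.entries.get(&ip).cloned()
    }
}

/// Performs the lookup on another thread and moves the owned result back.
/// A borrowed view could not be returned this way, because it would
/// reference data owned by the worker's `Arc`.
pub fn lookup_on_worker(map: Arc<StubMap>, ip: IpAddr) -> Option<OwnedInfo> {
    thread::spawn(move || map.lookup_owned(ip))
        .join()
        .expect("worker thread panicked")
}
```

In an async application the same shape applies: the map is shared through an `Arc`, and each task returns or stores owned values.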

-----

## **7. Performance & Benchmarking**

  * **Goals**:
      * Lookup Speed: `< 500` nanoseconds per lookup.
      * Memory Usage: A file with ~700,000 ranges should result in an
        in-memory map of approximately 150-200 MB.
  * **Tool**: Benchmarks MUST be implemented using the `criterion` crate.
  * **Benchmark Suite**: The suite MUST include benchmarks for:
    1.  The `build()` time for a large, real-world dataset.
    2.  `lookup()` performance for a random selection of IPv4 and IPv6
        addresses known to be in the dataset.
    3.  `lookup()` performance for IPs known to be unallocated (the "not found"
        case).
    4.  `lookup()` performance for a curated list of edge-case IPs (e.g., the
        first and last address of several network blocks).
-----

## **8. Future Development (Post-v1.0)**

### **8.1. Hot-Reloading**

A future version should include a mechanism for hot-reloading the dataset in a
long-running service.

  * **Change Detection**: Use HTTP `HEAD` requests to check `ETag` or
	`Last-Modified` headers to avoid downloading the full dataset unnecessarily.
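  * **Update Mechanism**: Use a "blue-green" strategy. When an update is
	detected, build the new map on a background thread. Once complete, atomically
	swap the new map into service by replacing a shared `Arc<IpAsnMap>`. An `Arc`
	alone cannot be reassigned through a shared reference, so it must sit behind
	a swappable handle such as an `RwLock<Arc<IpAsnMap>>`. This ensures zero
	downtime for lookups.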
  * **API**: This could be exposed via a new wrapper struct, e.g.,
	`UpdatingIpAsnMap`.
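
One way to realize the swap with only the standard library is an `RwLock` guarding an `Arc`. The generic `Swappable<T>` wrapper below is a hypothetical sketch, not the proposed `UpdatingIpAsnMap` API: readers take a cheap snapshot, and the updater replaces the pointer once the new value is fully built.

```rust
use std::sync::{Arc, RwLock};

/// Holds the currently active value and allows it to be replaced while
/// readers keep using the snapshot they already obtained.
pub struct Swappable<T> {
    current: RwLock<Arc<T>>,
}

impl<T> Swappable<T> {
    pub fn new(initial: T) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    /// Returns a snapshot of the active value. The read lock is held only
    /// long enough to clone the `Arc`, not for the duration of any lookup.
    pub fn load(&self) -> Arc<T> {
        let guard = self.current.read().expect("lock poisoned");
        Arc::clone(&*guard)
    }

    /// Puts a fully built replacement into service. Snapshots taken earlier
    /// stay valid until their last `Arc` is dropped.
    pub fn store(&self, next: T) {
        let mut guard = self.current.write().expect("lock poisoned");
        *guard = Arc::new(next);
    }
}
```

The expensive build happens before `store` is called, so the write lock is held only for a pointer assignment. A lock-free alternative, such as the `arc-swap` crate, could serve the same role if lock contention ever became measurable.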

-----

## **9. Testing Idioms**

To ensure a robust, deterministic, and parallel-safe test suite, the following idioms MUST be followed.

  * **Integration Tests (e.g., `tests/cli.rs`):**
    * **State Isolation:** Tests that invoke the CLI binary MUST NOT use `std::env::set_var` to configure the environment of the test process. This creates race conditions when tests are run in parallel.
    * **Correct Method:** Environment variables MUST be passed directly to the subprocess using `assert_cmd::Command::env("VAR", "VALUE")`. This isolates the environment to the specific command being run.
    * **Test Fixtures:** Use `rstest` fixtures and temporary directory helpers (like `tempfile`) to create hermetic environments for each test, ensuring no test can interfere with another's file system state.

  * **Unit Tests (e.g., `src/config.rs`):**
    * **The Challenge:** Some unit tests *must* modify the environment of the current process to test functions that directly read from it (e.g., `Config::load` reading `std::env::var("HOME")`).
    * **Correct Method:** These specific tests MUST be marked with the `#[serial]` attribute from the `serial_test` crate. This attribute guarantees that the test will run sequentially, not concurrently with any other `#[serial]` test, preventing state leakage.
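
The subprocess-scoped approach can be illustrated with `std::process::Command`, whose `env` method has the same shape as the one on `assert_cmd::Command`. The sketch below only configures a command and inspects it without spawning anything; the helper names and the program name used in the usage note are hypothetical.

```rust
use std::ffi::OsStr;
use std::process::Command;

/// Builds (without running) a command whose environment overrides `HOME`.
/// The override lives on the `Command` value only; the environment of the
/// calling test process is never touched.
pub fn command_with_home(program: &str, home: &str) -> Command {
    let mut cmd = Command::new(program);
    cmd.env("HOME", home);
    cmd
}

/// Returns the value explicitly configured for `key` on `cmd`, if any.
pub fn configured_env<'a>(cmd: &'a Command, key: &str) -> Option<&'a OsStr> {
    cmd.get_envs()
        .find(|(k, _)| *k == OsStr::new(key))
        .and_then(|(_, v)| v)
}
```

A test would call something like `command_with_home("ip2asn-cli", "/tmp/fixture-home")` and then run the command; two such tests can execute in parallel because neither one modifies shared process state.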