ip4sum -- Optimized IPv4 Internet Checksum
ip4sum is a highly optimized Rust implementation of the Internet checksum defined in RFC 1071 and updated in RFC 1141 and RFC 1624, used in IPv4, TCP, UDP, and ICMP headers.
A portable C99 implementation is also provided in the c/ directory.
Key Features
- Fast: Up to 5x faster than internet-checksum (Fuchsia/Google) on typical packet sizes
- No-std compatible: Zero dependencies, works in embedded and bare-metal environments
- Zero-allocation: All computation is done in-place on the stack
- Incremental API: Supports multi-part checksum computation for scattered packet data
- Portable C version: C99-compatible implementation with zero dependencies
Installation
cargo add ip4sum
Quick Start
One-shot Computation
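// Illustrative 20-byte IPv4 header with the checksum field (bytes 10-11) set to zero.
let data = [0x45u8, 0x00, 0x00, 0x73, 0x00, 0x00, 0x40, 0x00, 0x40, 0x11, 0x00, 0x00, 0xc0, 0xa8, 0x00, 0x01, 0xc0, 0xa8, 0x00, 0xc7];
let csum = ip4sum::checksum(&data);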
Incremental Computation
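use ip4sum::Checksum;
// The byte values below are illustrative; feed your own header parts.
let mut hasher = Checksum::new();
hasher.update(&[0x45, 0x00, 0x00, 0x73, 0x00, 0x00, 0x40, 0x00, 0x40, 0x11]);
hasher.update(&[0x00, 0x00]); // checksum field placeholder
hasher.update(&[0xc0, 0xa8, 0x00, 0x01, 0xc0, 0xa8, 0x00, 0xc7]);
let csum = hasher.finalize();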
C Version
/* one-shot */
uint16_t csum = ;
/* incremental */
ip4sum_checksum c = ;
;
;
uint16_t csum = ;
API Reference
checksum(data: &[u8]) -> u16
Compute the Internet checksum of a byte slice in one shot. Returns the 16-bit one's-complement checksum in network byte order. The checksum field in the input should be set to zero before calling.
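The "zero the field, compute, then store" workflow can be sketched with a plain RFC 1071 reference routine. This is not the crate's code: reference_checksum and example_header_checksum are hypothetical names, the header bytes are illustrative, and the sketch returns the checksum as the numeric value of the big-endian word, which may differ from the byte-order convention ip4sum uses for its return value.

```rust
/// Reference Internet checksum (RFC 1071): sum big-endian 16-bit words,
/// fold the carries back in, and return the one's complement.
/// Hypothetical helper for illustration; not part of the ip4sum API.
fn reference_checksum(data: &[u8]) -> u16 {
    let mut sum: u64 = 0;
    let mut chunks = data.chunks_exact(2);
    for pair in &mut chunks {
        sum += u64::from(u16::from_be_bytes([pair[0], pair[1]]));
    }
    // An odd trailing byte is treated as if padded with a zero byte.
    if let [last] = chunks.remainder() {
        sum += u64::from(*last) << 8;
    }
    // Fold the carries back into the low 16 bits (end-around carry).
    while sum > 0xffff {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    !(sum as u16)
}

/// Computes the checksum of an illustrative IPv4 header whose checksum
/// field (bytes 10-11) is zero, stores it, then re-checks the header.
fn example_header_checksum() -> (u16, u16) {
    let mut header = [
        0x45u8, 0x00, 0x00, 0x73, 0x00, 0x00, 0x40, 0x00, 0x40, 0x11,
        0x00, 0x00, // checksum field, zeroed before computing
        0xc0, 0xa8, 0x00, 0x01, 0xc0, 0xa8, 0x00, 0xc7,
    ];
    let csum = reference_checksum(&header);
    header[10..12].copy_from_slice(&csum.to_be_bytes());
    // A header carrying a correct checksum sums to zero after complementing.
    (csum, reference_checksum(&header))
}
```

For this header the sketch yields 0xb861, and re-running it over the header with the checksum stored yields 0, which is the usual receiver-side validity check.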
Checksum
An incremental checksum calculator.
| Method | Description |
|---|---|
| Checksum::new() | Create a new calculator with accumulator initialized to zero |
| update(&mut self, data: &[u8]) | Feed a slice of data into the running checksum |
| finalize(self) -> u16 | Consume the calculator and return the 16-bit checksum |
| reset(&mut self) | Reset the calculator to its initial state |
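To make the semantics of these four methods concrete, here is a minimal stand-in with the same method shapes. It is a sketch under stated assumptions, not the crate's implementation: SketchChecksum and sketch_multi_part are hypothetical names, the pending-byte handling for odd-length splits is just one simple way to get correct results, and the README does not say how ip4sum itself handles that case.

```rust
/// Hypothetical stand-in mirroring the method table above.
/// Not the ip4sum implementation.
struct SketchChecksum {
    sum: u64,
    /// High byte of a 16-bit word that was split across two update calls.
    pending: Option<u8>,
}

impl SketchChecksum {
    fn new() -> Self {
        Self { sum: 0, pending: None }
    }

    fn update(&mut self, mut data: &[u8]) {
        // Complete a word left unfinished by the previous call, if any.
        if let Some(high) = self.pending.take() {
            match data.split_first() {
                Some((&low, rest)) => {
                    self.sum += u64::from(u16::from_be_bytes([high, low]));
                    data = rest;
                }
                None => {
                    // Empty input: keep waiting for the low byte.
                    self.pending = Some(high);
                    return;
                }
            }
        }
        let mut chunks = data.chunks_exact(2);
        for pair in &mut chunks {
            self.sum += u64::from(u16::from_be_bytes([pair[0], pair[1]]));
        }
        if let [last] = chunks.remainder() {
            self.pending = Some(*last);
        }
    }

    fn finalize(self) -> u16 {
        let mut sum = self.sum;
        // A dangling byte at the very end is zero-padded.
        if let Some(high) = self.pending {
            sum += u64::from(high) << 8;
        }
        while sum > 0xffff {
            sum = (sum & 0xffff) + (sum >> 16);
        }
        !(sum as u16)
    }

    fn reset(&mut self) {
        *self = Self::new();
    }
}

/// Feeds an illustrative IPv4 header in three parts, with a zero
/// placeholder standing in for the checksum field.
fn sketch_multi_part() -> u16 {
    let mut c = SketchChecksum::new();
    c.update(&[0x45, 0x00, 0x00, 0x73, 0x00, 0x00, 0x40, 0x00, 0x40, 0x11]);
    c.update(&[0x00, 0x00]); // checksum field placeholder
    c.update(&[0xc0, 0xa8, 0x00, 0x01, 0xc0, 0xa8, 0x00, 0xc7]);
    c.finalize()
}
```

Feeding the data in parts gives the same result as summing it in one pass, which is what makes the incremental form usable for scattered packet data.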
Performance
Benchmarks run on Linux x86_64 with Rust 1.94 (-C opt-level=3 -C lto=fat -C codegen-units=1).
One-shot Checksum (checksum)
| Bytes | ip4sum | internet-checksum | Ratio |
|---|---|---|---|
| 20 | 7.2 ns | 33.6 ns | 4.7x |
| 40 | 6.8 ns | 37.5 ns | 5.5x |
| 64 | 6.0 ns | 9.7 ns | 1.6x |
| 128 | 10.3 ns | 14.6 ns | 1.4x |
| 256 | 13.3 ns | 25.4 ns | 1.9x |
| 512 | 10.6 ns | 79.0 ns | 7.5x |
| 1000 | 25.6 ns | 84.8 ns | 3.3x |
| 1500 | 24.3 ns | 126.2 ns | 5.2x |
Incremental Checksum (Checksum struct)
| Bytes | ip4sum | internet-checksum | Ratio |
|---|---|---|---|
| 20 | 4.3 ns | 20.0 ns | 4.7x |
| 64 | 6.8 ns | 10.1 ns | 1.5x |
| 256 | 6.4 ns | 25.7 ns | 4.0x |
| 1500 | 23.9 ns | 145.2 ns | 6.1x |
Multi-feed Incremental (20B header + 1480B payload)
| ip4sum | internet-checksum | Ratio |
|---|---|---|
| 25.3 ns | 140.0 ns | 5.5x |
Rust vs C Comparison
The Rust and C implementations use the same algorithm: a 64-bit wide accumulator with 32-bit reads in native byte order, deferring the carry fold and endian swap to a single final step.
The performance difference between the two comes down to how each compiler optimizes the same logical pattern. Rust's LLVM backend and C compilers (GCC, Clang, MSVC) apply different register allocation, loop vectorization, and instruction scheduling strategies to identical source-level logic. In practice, the Rust version compiled with -C lto=fat -C codegen-units=1 tends to produce tighter inner loops because whole-program optimization can inline and specialize more aggressively. The C version still benefits from decades of loop optimization work in mature C compilers.
Run benchmarks on your machine:
# Rust benchmarks
cargo bench
# C benchmarks
cd c && gcc -O2 checksum.c test_checksum.c -o test_checksum && ./test_checksum
Why It's Fast
The key insight is simplicity. The accumulator is a 64-bit integer, and 32-bit words are added to it with plain wrapping_add and no carry tracking. Each addition contributes at most 2^32 - 1, so a u64 accumulator cannot overflow until more than 2^32 words (~4.3 billion additions, about 16 GiB of data) have been summed; no realistic packet comes close to that limit. Carry folding is deferred to a single cheap step at the end.
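The approach described here can be sketched as follows. This is an illustration of the technique rather than the crate's actual source: wide_checksum is a hypothetical name, and the sketch returns the checksum as the numeric value of the big-endian word, which is an assumption about the return convention.

```rust
/// Sketch of the wide-accumulator technique: add native-endian 32-bit
/// words into a u64, then fold and byte-swap once at the end.
/// Hypothetical function; not copied from ip4sum.
fn wide_checksum(data: &[u8]) -> u16 {
    let mut acc: u64 = 0;
    let mut words = data.chunks_exact(4);
    for w in &mut words {
        // No carry tracking: the u64 has 32 bits of headroom.
        acc = acc.wrapping_add(u64::from(u32::from_ne_bytes([w[0], w[1], w[2], w[3]])));
    }
    // Up to three leftover bytes, zero-padded to a full word in memory order.
    let rest = words.remainder();
    if !rest.is_empty() {
        let mut tail = [0u8; 4];
        tail[..rest.len()].copy_from_slice(rest);
        acc = acc.wrapping_add(u64::from(u32::from_ne_bytes(tail)));
    }
    // Deferred carry fold: collapse the 64-bit sum down to 16 bits.
    while acc > 0xffff {
        acc = (acc & 0xffff) + (acc >> 16);
    }
    // The one's-complement sum is byte-order independent, so one final
    // conversion from native to big-endian order is enough
    // (a byte swap on little-endian targets, a no-op on big-endian ones).
    !u16::from_be(acc as u16)
}
```

The inner loop contains one load and one add per word, which leaves the compiler free to unroll or vectorize it. All byte-order handling sits outside the loop.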
In contrast, internet-checksum (Fuchsia/Google) introduces per-addition overhead:
- Manual carry tracking via overflowing_add + boolean carry propagation
- Option<u8> trailing byte field with branching on every add_bytes call
- Size-based dispatch to a separate add_bytes_small path using checked_add
- Macro-based loop unrolling with try_into().unwrap() in the expansion
- Multi-function normalize chain (normalize -> normalize_64 -> adc_u32 -> adc_u16)
None of these "optimizations" help for any input under 16 GB. The compiler generates tighter code from a simple wrapping_add loop than from manual carry management.
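The difference between the two styles can be seen in a reduced form. The sketch below is not internet-checksum's code and not ip4sum's; both function names are hypothetical. It contrasts a per-addition end-around carry with a widened accumulator that folds once, and the two produce the same 32-bit one's-complement sum.

```rust
/// Per-addition carry tracking: every add reports its carry, which is
/// fed back into the sum immediately (end-around carry).
fn sum_with_carry_tracking(words: &[u32]) -> u32 {
    let mut acc: u32 = 0;
    for &w in words {
        let (partial, carry) = acc.overflowing_add(w);
        // After an overflow, partial is at most 2^32 - 2, so adding the
        // carry cannot overflow a second time.
        acc = partial.wrapping_add(u32::from(carry));
    }
    acc
}

/// Deferred folding: widen to u64, add freely, fold the carries once.
fn sum_deferred(words: &[u32]) -> u32 {
    let mut acc: u64 = 0;
    for &w in words {
        acc = acc.wrapping_add(u64::from(w));
    }
    while acc > 0xffff_ffff {
        acc = (acc & 0xffff_ffff) + (acc >> 32);
    }
    acc as u32
}
```

The first loop has a data-dependent carry on every iteration; the second has a single add per word and pays for the carries only after the loop.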
Testing
# Rust tests
cargo test
# C tests
cd c && gcc -Wall -Wextra -Werror -pedantic -std=c99 -O2 checksum.c test_checksum.c -o test_checksum && ./test_checksum
Contributing
Contributions are welcome! Please:
- Run cargo +nightly fmt and cargo clippy before submitting
- Add tests for new functionality
- Update documentation as needed
License
Licensed under the MIT License.
Author: Khashayar Fereidani
Repository: github.com/fereidani/ip4sum