mrrc 0.7.6

A Rust library for reading, writing, and manipulating MARC bibliographic records in ISO 2709 binary format
Documentation
# MRRC Architecture

This document describes the architecture of the MRRC library and key design decisions.

## Overview

MRRC is a Rust library for reading, writing, and manipulating MARC bibliographic records with Python bindings via PyO3. The library is organized into three main components:

1. **Core Rust Library** - Pure Rust MARC record parsing and manipulation
2. **Python Wrapper** - PyO3 bindings providing Python access with GIL release for concurrency
3. **Benchmarking Infrastructure** - Comprehensive performance testing and profiling

## Core Architecture

### Record Types

MRRC supports three MARC record types:

- **Bibliographic Records** - Standard library catalog records (type 'a', 'c', 'm', etc.)
- **Authority Records** - Subject headings and name authority data (type 'z')
- **Holdings Records** - Physical item location and enumeration data (types 'x', 'y', 'v', 'u')

All record types share common infrastructure through the `MarcRecord` trait.

### Data Structure

A MARC record consists of:

1. **Leader** - 24-byte header with metadata
2. **Control Fields** (000-009) - Fixed-length fields
3. **Data Fields** (010+) - Variable-length fields with indicators and subfields

### Parser Architecture

The core parser uses a state machine approach:

```
ISO 2709 Binary Format
Record Boundary Scanner (finds 0x1D terminators)
Leader Parser (24 bytes)
Directory Parser (field offsets and lengths)
Field Parser (control fields vs data fields)
Subfield Parser (for data fields with indicators)
Character Decoder (MARC-8 or UTF-8)
Record Object
```

## Python Wrapper Architecture

For a higher-level overview of how the Rust and Python code relate, how
maturin and PyO3 fit in, and what builds what, see
[Project Layout](project-layout.md).

### GIL Release Strategy: Three-Phase Model

The Python wrapper implements a three-phase pattern for GIL management during every `read_record()` call:

```
Phase 1: Read bytes (GIL held)
   ↓
Phase 2: Parse bytes (GIL RELEASED) ← Concurrent work happens here
   ↓
Phase 3: Convert to Python object (GIL re-acquired)
```

**Phase 1 (GIL held):**
- Acquire raw record bytes from source
- Python file object: via Python `read()` method
- File path: via Rust `std::fs::File` (no GIL overhead)
- Bytes: already in memory (no I/O)
- Duration: Very short (I/O cached in kernel)

**Phase 2 (GIL released):** 
- Parse record bytes to MARC structure (CPU-intensive work)
- Uses `py.detach()` (PyO3 0.23+) to explicitly release GIL
- Creates Rust `ParseError` without Python objects
- SmallVec buffer handles most records inline
- Duration: ~90% of total parse time
- **Result: Multiple threads can parse concurrently**

**Phase 3 (GIL re-acquired):**
- Convert Rust `ParseError` to Python exception (if needed)
- Convert Rust `Record` to Python `PyRecord`
- Return to caller
- Duration: Negligible (quick object construction)

### Why GIL Release Matters

The Python GIL (Global Interpreter Lock) serializes all Python bytecode execution. Without GIL release during parsing:

**Without GIL Release (current state of pure pymarc)**:
```
Thread 1: Read bytes (GIL) → Parse (GIL) → Convert (GIL)
Thread 2: Waiting... → Waiting... → Waiting...
Result: Threading provides no speedup (1.0x)
```

**With GIL Release (pymrrc)**:
```
Thread 1: Read (GIL) → Parse (GIL RELEASED) → Convert
Thread 2:                Read (GIL) → Parse (GIL RELEASED) → Convert
Result: Threads parse in parallel (3.74x on 4 cores)
```

The key insight: parsing is CPU-intensive but doesn't need Python objects, so releasing the GIL enables true parallelism.

**Single-threaded benefit:** Even without multiple threads, Rust parsing is simply faster (~4x vs pymarc).

**Multi-threaded benefit:** With explicit `ThreadPoolExecutor`, the GIL release enables concurrent parsing across threads (additional 3.74x speedup on 4 cores).

### ReaderBackend Enum

The unified reader supports multiple input types via a backend enum:

```rust
enum ReaderBackend {
    RustFile(std::fs::File),        // Pure Rust I/O, zero GIL
    Cursor(io::Cursor<Vec<u8>>),    // In-memory, zero GIL
    PythonFile(PyObject),            // Python file object, GIL managed
}
```

**Advantages**:

- **Automatic Detection**: Input type determined at construction
- **Optimal Performance**: Each backend uses fastest available method
- **Backward Compatible**: Python file objects still work via GIL management
- **Zero-GIL Paths**: File paths and bytes bypass Python entirely

**Performance Impact**:

- File path: Pure Rust I/O, Phase 1 has minimal GIL hold
- Bytes: Zero I/O, Phase 1 is trivial
- File object: Requires GIL for `.read()`, but Phase 2 still releases it

### Batched Reader (Optimization)

For Python file objects, batching reduces GIL contention:

```
Without batching (N records):
  FOR i = 1 to N:
    Acquire GIL → Read 1 record → Release GIL → Parse

With batching (N records, batch size = 100):
  FOR batch = 1 to N/100:
    Acquire GIL → Read 100 records → Release GIL
    FOR record in batch:
      Parse record (GIL released)
```

Result: N/100 GIL acquisitions instead of N.

### SmallVec Optimization

MARC records vary in size (typically 500-4000 bytes). The SmallVec buffer:

```rust
SmallVec<[u8; 4096]>
```

**Benefits**:

- Inline storage for ~85-90% of records (no allocation)
- Dynamic heap allocation for oversized records
- <3% memory overhead
- Eliminates borrow checker issues in Phase 2

## Error Handling

### ParseError Enum

Custom error type allows error creation without GIL:

```rust
pub enum ParseError {
    InvalidRecord(String),
    InvalidLeader(String),
    InvalidDirectory(String),
    EncodingError(String),
}

impl From<ParseError> for PyErr {
    // Conversion happens after GIL re-acquisition in Phase 3
}
```

**Why Custom Error Type?**
- PyErr requires GIL to create
- ParseError can be created during Phase 2 (GIL released)
- Defers PyErr conversion to Phase 3 (after GIL re-acquired)

## Thread Safety

### Not Send/Sync by Design

The readers are intentionally **not** Send or Sync:

```rust
// Readers hold Python references (not Send/Sync)
pub struct PyMARCReader {
    reader: Option<ReaderType>,
    // ReaderType may contain PythonFile(PyObject) which is !Send
}
```

**Why?**
- Each thread needs its own GIL-aware reader
- Sharing readers across threads causes undefined behavior
- Forces correct usage pattern: one reader per thread

## Concurrency Model

### Two APIs for Different Use Cases

#### Standard MARCReader (Sequential - No Multi-Threading Benefit)

```python
from mrrc import MARCReader

# Simple sequential reading
reader = MARCReader("records.mrc")
for record in reader:
    process(record)
```

**Performance:**
- ✅ Single-threaded: **~4x faster than pymarc**
- ❌ Multi-threaded: **0.85x slowdown** (GIL contention)
- **Use when:** Sequential processing or single-file reads

#### ProducerConsumerPipeline (High-Performance Single-File Multi-Threading)

```python
from mrrc import ProducerConsumerPipeline

# Background producer thread reads file and parses with Rayon
pipeline = ProducerConsumerPipeline.from_file('large_file.mrc')

for record in pipeline:
    process(record)
```

**Verified Performance:**
- 2 threads: 2.0x speedup
- 4 threads: 3.74x speedup
- Scales with CPU core count

**How it works:**
- Background producer thread reads file in 512 KB chunks
- Bounded channel provides backpressure (1000 records)
- Rayon parses batches in parallel on all CPU cores
- Producer runs without GIL, eliminating contention

**Use when:** Processing a single large MARC file with maximum throughput from available cores

## Performance Characteristics

### Throughput (Records/Second)

| Mode | Throughput | Notes |
|------|-----------|-------|
| Sequential (1 thread) | 549,500 rec/s | Baseline |
| Parallel (2 threads) | ~1.1M rec/s | ~2.0x speedup |
| Parallel (4 threads) | ~2.0M rec/s | ~3.74x speedup |

### Memory Usage

| Scenario | Memory | Notes |
|----------|--------|-------|
| Per reader | ~4 KB | SmallVec buffer |
| Per record (memory) | ~4 KB | Typical MARC record |
| Overhead (4 readers) | ~16 KB | Negligible |

### GIL Contention

| Phase | GIL Status | Duration | Notes |
|-------|-----------|----------|-------|
| Phase 1 | Held | Short | Read bytes only |
| Phase 2 | Released | Long | Parsing (CPU-bound) |
| Phase 3 | Held | Short | Convert to Python |

## Character Encoding

### MARC-8 Support

MARC-8 is a legacy encoding with:
- Basic Latin (ASCII)
- ANSEL Extended Latin with diacritical marks
- Greek, Cyrillic, Arabic, Hebrew scripts
- East Asian support (Chinese, Japanese, Korean)
- Combining characters with Unicode NFC normalization

### UTF-8 Support

Modern MARC records use UTF-8 (detected from leader position 9).

### Automatic Detection

Character set detected from MARC leader:
- Position 9: ' ' = MARC-8, 'a' = UTF-8
- Decoder selected automatically
- Invalid bytes produce errors with context

## Format Conversions

### Supported Formats

- **JSON**: Generic field-based representation
- **MARCJSON**: Standard JSON-LD format (LOC spec)
- **XML**: Field/subfield XML structure
- **CSV**: Tabular export for spreadsheets
- **Dublin Core**: Simplified 15-element metadata
- **MODS**: Metadata Object Description Schema
- **BIBFRAME**: RDF/Linked Data (bidirectional conversion)

### Conversion Approach

Each format has:
1. **Serializer**: Record → Format bytes
2. **Deserializer**: Format bytes → Record
3. **Round-trip tests**: Ensure lossless conversion

## Testing

### Test Categories

1. **Unit Tests**: Individual components (parsers, builders, queries)
2. **Integration Tests**: End-to-end workflows (read → process → write)
3. **Compatibility Tests**: pymarc compatibility validation (75+ tests)
4. **Performance Tests**: Benchmarking with Criterion.rs and pytest-benchmark
5. **Encoding Tests**: MARC-8, UTF-8, multilingual records

### Test Fixtures

Located in `tests/data/`:
- `simple_book.mrc` - Basic bibliographic record
- `multi_records.mrc` - Multiple records in one file
- `simple_authority.mrc` - Authority record
- `simple_holdings.mrc` - Holdings record
- `with_control_fields.mrc` - Record with 008 field

### Benchmark Fixtures

Located in `tests/data/fixtures/`:
- `1k_records.mrc` (257 KB) - Quick benchmarks
- `10k_records.mrc` (2.5 MB) - Standard benchmarks

## Key Design Principles

1. **Rust-Idiomatic**: Uses iterators, Result types, ownership patterns
2. **Zero-Copy Where Possible**: Efficient memory usage for large workloads
3. **Format Flexibility**: Multiple serialization formats out of box
4. **Compatibility**: Maintains data fidelity with pymarc
5. **Performance**: Concurrent I/O with intelligent GIL management
6. **Safety**: GIL release without unsafe code (except PyO3 glue)

## References

**Performance & Benchmarking:**

- [Performance Tuning Guide]../guides/performance-tuning.md - Usage patterns and tuning
- [Benchmarking Results]../benchmarks/results.md - Detailed performance data
- [Performance FAQ]../benchmarks/faq.md - Quick Q&A about speedups

**Guides:**

- [Threading in Python]../guides/threading-python.md - Thread safety and GIL behavior

**External References:**

- [MARC Standard]https://www.loc.gov/marc/
- [ISO 2709]https://en.wikipedia.org/wiki/MARC_standards
- [PyO3 Documentation]https://pyo3.rs/