mrrc 0.7.6

A Rust library for reading, writing, and manipulating MARC bibliographic records in ISO 2709 binary format
# Threading and Thread Safety in MRRC

This document covers thread safety guarantees, GIL behavior, and safe usage patterns for MRRC. For parallel processing patterns and code examples, see the [Python Concurrency Tutorial](../tutorials/python/concurrency.md).

## GIL Behavior

### What is the GIL?

The Python Global Interpreter Lock (GIL) allows only one thread at a time to execute Python bytecode. This means:
- CPU-bound pure Python code gains nothing from extra cores when run in threads
- Blocking I/O and native extension code can release the GIL, letting other threads run in the meantime

### MRRC's GIL Release Policy

MRRC **releases the GIL** during:

- **File I/O**: `read_record()`, `write_record()` - Rust code runs natively (path-based sources; I/O through a Python file object holds the GIL)
- **Parsing**: Record parsing, field access - Rust implementation
- **Serialization**: Format conversions (JSON, XML, MARCJSON)

MRRC **holds the GIL** during:

- Python exception creation
- Type conversions between Python and Rust
- Python object allocation

**Implication**: I/O-bound workloads see near-linear speedup with threading. Workloads dominated by CPU-bound Python code still contend for the GIL and should use `multiprocessing`.
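
The effect of a released GIL can be observed without MRRC installed. The sketch below uses `time.sleep()`, which releases the GIL while it waits, as a stand-in for a GIL-releasing call such as `read_record()`. The names `fake_read`, `run_serial`, and `run_threaded` are hypothetical, and the timings say nothing about MRRC's actual performance.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_read(delay=0.2):
    """Stand-in for a GIL-releasing call; time.sleep() releases the GIL."""
    time.sleep(delay)
    return delay

def timed(fn):
    """Return (result, elapsed_seconds) for a zero-argument callable."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

def run_serial(n=4):
    # Calls run one after another, so the delays add up.
    return [fake_read() for _ in range(n)]

def run_threaded(n=4):
    # Each worker waits with the GIL released, so the delays overlap.
    with ThreadPoolExecutor(max_workers=n) as executor:
        return list(executor.map(lambda _: fake_read(), range(n)))

if __name__ == "__main__":
    _, serial = timed(run_serial)
    _, threaded = timed(run_threaded)
    print(f"serial: {serial:.2f}s  threaded: {threaded:.2f}s")
```

Replacing `time.sleep()` with a pure-Python loop would remove the gap between the two timings, which is the case where `multiprocessing` is the better fit.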

### Three-Phase GIL Management

Each record parse uses a three-phase pattern:

1. **Phase 1 (GIL Held)**: Acquire raw bytes from source - minimal work
2. **Phase 2 (GIL Released)**: Parse MARC bytes in Rust - CPU-intensive, other threads can run
3. **Phase 3 (GIL Re-acquired)**: Convert to Python objects, handle errors
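
The three phases can be pictured with a pure-Python analogy. This is not MRRC's implementation: in MRRC, Phase 2 runs in Rust with the GIL released, which Python code cannot do for itself. The function names and the `SAMPLE` record are assumptions made for illustration; the only MARC facts relied on are the ISO 2709 leader layout (record length in positions 0-4, base address of data in positions 12-16) and the terminator bytes.

```python
import io

# A minimal 26-byte ISO 2709 record: 24-byte leader, an empty directory
# closed by the field terminator 0x1E, then the record terminator 0x1D.
SAMPLE = b"00026nam a2200025   4500" + b"\x1e\x1d"

def acquire_bytes(stream):
    """Phase 1: read one record's raw bytes; returns None at end of input."""
    prefix = stream.read(5)  # leader positions 0-4 hold the record length
    if len(prefix) < 5:
        return None
    return prefix + stream.read(int(prefix) - 5)

def parse_bytes(raw):
    """Phase 2 stand-in: pure computation on bytes, touching no shared state."""
    if raw[-1:] != b"\x1d":
        raise ValueError("missing record terminator")
    return {"length": int(raw[0:5]), "base_address": int(raw[12:17])}

def to_objects(parsed):
    """Phase 3: build the Python-level result handed back to the caller."""
    return (parsed["length"], parsed["base_address"])

def read_all(stream):
    """Run the three phases in order for every record in the stream."""
    results = []
    while (raw := acquire_bytes(stream)) is not None:
        results.append(to_objects(parse_bytes(raw)))
    return results

if __name__ == "__main__":
    print(read_all(io.BytesIO(SAMPLE * 2)))
```

Because the Phase 2 function depends only on its input bytes, it is the kind of step that can safely run while other threads hold the GIL.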

## Thread Safety

### Is MRRC Thread-Safe?

**Yes, but with caveats:**

**Thread-safe:**

- Creating multiple `MARCReader` instances
- Each thread reading from different files
- Reading from different offsets in the same file (with `seek()`)
- Using `ProducerConsumerPipeline` (internally synchronized)

**Not thread-safe:**

- Sharing a single `MARCReader` between threads
- Concurrent modifications to the same `Record` object
- Concurrent writes to the same output file
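
When several threads must write to one output, a common approach is to serialize the writes behind a lock. The sketch below uses a hypothetical `LockedWriter` wrapper around a plain binary stream; it does not use MRRC's writer API, and the payloads are fake records that end in the ISO 2709 record terminator.

```python
import io
import threading

class LockedWriter:
    """Hypothetical helper: serializes writes to one shared binary stream."""

    def __init__(self, stream):
        self._stream = stream
        self._lock = threading.Lock()

    def write(self, data):
        # Only one thread at a time may write, so records never interleave.
        with self._lock:
            self._stream.write(data)

def write_many(writer, payload, count):
    """Write the same payload `count` times through the shared writer."""
    for _ in range(count):
        writer.write(payload)

def demo(num_threads=4, count=100):
    """Have several threads write distinct fake records to one stream."""
    out = io.BytesIO()
    writer = LockedWriter(out)
    threads = [
        threading.Thread(
            target=write_many,
            args=(writer, bytes([65 + i]) * 8 + b"\x1d", count),
        )
        for i in range(num_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return out.getvalue()

if __name__ == "__main__":
    print(len(demo()), "bytes written")
```

An alternative with the same effect is to give one thread sole ownership of the output and send it records through a `queue.Queue`.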

### Best Practice: Thread Confinement

Each thread should have its own reader:

```python
# GOOD: Each thread has its own reader with path-based GIL release
def process_file(filename):
    reader = MARCReader(filename)  # Path-based: Rust handles I/O, GIL released
    while record := reader.read_record():
        process(record)  # process record
```

```python
# BAD: Sharing reader across threads
reader = MARCReader("file.mrc")
def worker():
    while record := reader.read_record():  # Race condition!
        process(record)  # process record
```
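
Thread confinement combines naturally with a thread pool: submit one task per file and let each task create everything it needs. The sketch below runs without MRRC by using plain file handles as a stand-in for `MARCReader`, and it counts records by their ISO 2709 terminator byte. `count_records`, `count_all`, and `make_sample_files` are hypothetical names.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

RECORD_TERMINATOR = b"\x1d"  # ISO 2709 record terminator

def count_records(path):
    """Open a handle owned by this task only and count record terminators."""
    with open(path, "rb") as f:  # nothing here is shared between threads
        return f.read().count(RECORD_TERMINATOR)

def count_all(paths, max_workers=4):
    """Map each path to its record count, one task per file."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Each task creates and discards its own handle.
        return dict(zip(paths, executor.map(count_records, paths)))

def make_sample_files(directory, sizes):
    """Write throwaway files containing `n` fake records each."""
    paths = []
    for i, n in enumerate(sizes):
        path = os.path.join(directory, f"sample{i}.mrc")
        with open(path, "wb") as f:
            f.write((b"x" * 10 + RECORD_TERMINATOR) * n)
        paths.append(path)
    return paths

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        sample_paths = make_sample_files(directory, [3, 5, 7])
        print(count_all(sample_paths))
```

With MRRC, the body of `count_records` would instead construct a `MARCReader` from the path, as in the GOOD example above.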

## Memory Usage with Threading

### Per-Thread Overhead

- **Per-thread**: ~8 MB (OS-dependent)
- **Per-reader**: ~4 KB (parsing buffer)
- **Per-record in memory**: ~4 KB (typical)
- **Memory increase vs. single-threaded**: < 5%
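
These figures can be combined into a rough budget. The helper below is a back-of-the-envelope sketch, not a measurement: `estimate_memory_bytes` is a hypothetical function, the defaults are the approximate per-thread, per-reader, and per-record values listed above, and MB and KB are assumed to be binary units.

```python
MIB = 1024 * 1024
KIB = 1024

def estimate_memory_bytes(threads, readers, records_in_memory,
                          per_thread=8 * MIB,   # ~8 MB per thread (OS-dependent)
                          per_reader=4 * KIB,   # ~4 KB parsing buffer
                          per_record=4 * KIB):  # ~4 KB typical record
    """Sum the three overhead terms; every default is an approximation."""
    return (threads * per_thread
            + readers * per_reader
            + records_in_memory * per_record)

if __name__ == "__main__":
    # 4 threads, one reader each, 1000 records held in memory in total
    total = estimate_memory_bytes(threads=4, readers=4, records_in_memory=1000)
    print(f"{total / MIB:.1f} MiB")
```

In this estimate the per-thread term dominates unless many records are kept in memory at once.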

### Preventing Memory Leaks

```python
# GOOD: Pass path for GIL release; Rust handles file I/O and cleanup
def process_file(filename):
    reader = MARCReader(filename)
    while record := reader.read_record():
        process(record)
    # Rust file handle closed when reader is garbage collected

# ALSO GOOD: File object for explicit lifecycle control (but holds GIL during I/O)
def process_file(filename):
    with open(filename, 'rb') as f:
        reader = MARCReader(f)
        while record := reader.read_record():
            process(record)
    # File explicitly closed by context manager
```

## Common Gotchas

### String Operations Re-acquire GIL

Building or formatting `str` objects is Python work that needs the GIL, so keep it out of tight loops where possible:

```python
# Not ideal: Frequent Python operations
while record := reader.read_record():
    title = record.title  # GIL released
    msg = f"Processing: {title}"  # GIL re-acquired
    print(msg)  # GIL re-acquired

# Better: Minimize Python work per iteration
while record := reader.read_record():
    title = record.title
    do_heavy_processing(title)  # Batch Python work
```
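
One way to batch Python-side work is to group values and handle each group in a single step. The sketch below uses a hypothetical `batched` generator and plain strings in place of record titles; on Python 3.12 and later, `itertools.batched` offers a similar grouping that yields tuples.

```python
def batched(iterable, size):
    """Yield lists of up to `size` items from `iterable`."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch

def report_lines(titles, size=3):
    """Format one message per batch instead of one per record."""
    return [
        f"Processed {len(batch)} records, last: {batch[-1]}"
        for batch in batched(titles, size)
    ]

if __name__ == "__main__":
    for line in report_lines(["a", "b", "c", "d"]):
        print(line)
```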

### Callback Complexity

Avoid callback-heavy patterns:

```python
# Complex callback chain
executor.submit(lambda: process_file('f1.mrc', callback1))

# Simpler approach
results = executor.map(process_file, ['f1.mrc', 'f2.mrc'])
```
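
When results are needed as soon as each file finishes, futures can replace callbacks. The sketch below collects results with `as_completed()`; `process_file` here is a hypothetical stand-in that does no MARC work, and `process_all` is likewise an illustrative name.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_file(filename):
    """Hypothetical stand-in: returns the filename and its length."""
    return filename, len(filename)

def process_all(filenames, max_workers=4):
    """Return a dict of filename -> result (or the exception it raised)."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(process_file, name): name for name in filenames}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()[1]
            except Exception as exc:  # a worker's exception resurfaces here
                results[name] = exc
    return results

if __name__ == "__main__":
    print(process_all(["f1.mrc", "f2.mrc"]))
```

Handling the result in the submitting thread keeps error handling in one place instead of spreading it across callbacks.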

## Debugging Concurrent Programs

### Enable Thread Logging

```python
import logging

logging.basicConfig(
    format='%(threadName)-10s %(levelname)-8s %(message)s',
    level=logging.DEBUG
)

logger = logging.getLogger()

def process_file(filename):
    logger.info(f"Processing {filename}")
    # ... process ...
    logger.info(f"Done with {filename}")
```
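
The `%(threadName)s` field is more useful when pool threads have recognizable names. The sketch below assumes a `ThreadPoolExecutor` is used and sets its `thread_name_prefix`; `ListHandler`, `run_with_logging`, and the logger name `mrrc_demo` are hypothetical, and the handler collects lines in memory only so the output can be inspected.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

class ListHandler(logging.Handler):
    """Collects formatted log lines in memory so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))

def run_with_logging(filenames):
    """Log one line per filename from pool threads named marc_0, marc_1, ..."""
    handler = ListHandler()
    handler.setFormatter(
        logging.Formatter("%(threadName)s %(levelname)s %(message)s")
    )
    log = logging.getLogger("mrrc_demo")
    log.setLevel(logging.INFO)
    log.propagate = False  # keep demo output out of the root logger
    log.addHandler(handler)
    try:
        with ThreadPoolExecutor(max_workers=2, thread_name_prefix="marc") as executor:
            list(executor.map(lambda name: log.info("Processing %s", name), filenames))
    finally:
        log.removeHandler(handler)
    return handler.lines

if __name__ == "__main__":
    for line in run_with_logging(["a.mrc", "b.mrc"]):
        print(line)
```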

### Detect Deadlocks with Timeout

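```python
from concurrent.futures import ThreadPoolExecutor
import signal

def timeout_handler(signum, frame):
    raise TimeoutError("Processing timed out")

# SIGALRM is Unix-only, and the handler must be installed from the main thread
signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(30)  # 30-second overall timeout

try:
    with ThreadPoolExecutor(max_workers=4) as executor:
        # files: list of paths to process.
        # map() is lazy: its 5-second timeout (measured from the map() call)
        # is only enforced while results are consumed, so collect them here.
        results = list(executor.map(process_file, files, timeout=5))
finally:
    signal.alarm(0)  # Cancel timeout
```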
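
On platforms without `SIGALRM`, a per-future timeout is a portable way to spot tasks that never finish. The sketch below is one possible approach under that assumption; `find_stalled` is a hypothetical helper. The timeout only stops the wait: it does not cancel or kill the worker thread, and a thread that is truly deadlocked can still delay interpreter exit.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def find_stalled(tasks, timeout):
    """Run zero-argument callables; return names of those exceeding `timeout`.

    `tasks` maps a name to a callable. The timeout applies to each wait in
    turn, counted from the moment result() is called.
    """
    stalled = []
    executor = ThreadPoolExecutor(max_workers=len(tasks))
    try:
        futures = {name: executor.submit(fn) for name, fn in tasks.items()}
        for name, future in futures.items():
            try:
                future.result(timeout=timeout)
            except FutureTimeout:
                stalled.append(name)
    finally:
        # Do not block here on threads that may never finish.
        executor.shutdown(wait=False)
    return stalled

if __name__ == "__main__":
    demo_tasks = {"fast": lambda: None, "slow": lambda: time.sleep(0.5)}
    print(find_stalled(demo_tasks, timeout=0.1))
```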

## Future Work

- Convenience functions for `ProducerConsumerPipeline` configuration
- Integration of formal verification methods (in development at [mrrc-testbed](https://github.com/dchud/mrrc-testbed))

## References

- [Python Concurrency Tutorial](../tutorials/python/concurrency.md) - Parallel processing patterns and code examples
- [Performance Tuning](performance-tuning.md) - Performance tuning and benchmarks
- [Python GIL Documentation](https://docs.python.org/3/glossary.html#term-GIL)
- [`concurrent.futures` Documentation](https://docs.python.org/3/library/concurrent.futures.html)
- [PyO3 Threading Guide](https://pyo3.rs/latest/advanced/index.html)