mrrc 0.8.1

A Rust library for reading, writing, and manipulating MARC bibliographic records in ISO 2709 binary format
# Error Handling

mrrc raises a typed exception hierarchy with structured positional metadata
on every error: where in the byte stream the problem occurred, which record
it came from, the 001 control number, the field/subfield being parsed, and
the source filename when known. The class names and parent relationships
match pymarc's exception layer, so code written against `pymarc`'s exception
classes catches the same conditions in mrrc unchanged.

## Exception hierarchy

```
Exception
├── MrrcException                      (base)
│   ├── RecordLengthInvalid
│   ├── RecordLeaderInvalid
│   ├── BaseAddressInvalid
│   ├── BaseAddressNotFound
│   ├── RecordDirectoryInvalid
│   │   ├── InvalidIndicator    (mrrc)
│   │   ├── BadSubfieldCode     (mrrc)
│   │   └── InvalidField        (mrrc)
│   ├── EndOfRecordNotFound
│   │   └── TruncatedRecord     (mrrc)
│   ├── FieldNotFound
│   ├── FatalReaderError
│   ├── EncodingError           (mrrc)
│   ├── XmlError                (mrrc)
│   ├── JsonError               (mrrc)
│   └── WriterError             (mrrc)
└── OSError
    └── PyIOError                      (Python built-in, raised on I/O failure)

class BadSubfieldCodeWarning(UserWarning)
```

Classes marked **(mrrc)** are mrrc-specific subclasses that pymarc does not
have. Each one extends the closest pymarc parent so existing
pymarc-style `except` clauses keep catching the same conditions.
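
The catch behavior described above can be sketched with stand-in classes. The classes below are local models of part of the tree, not imports from mrrc, and the `classify` helper is hypothetical; the sketch only shows that a handler written against a pymarc-named parent also catches the mrrc-specific subclasses.

```python
# Stand-in classes that mirror part of the documented tree.
# They are NOT imported from mrrc; they only model parent/child links.
class MrrcException(Exception):
    pass


class RecordDirectoryInvalid(MrrcException):
    pass


class InvalidIndicator(RecordDirectoryInvalid):
    pass


class EndOfRecordNotFound(MrrcException):
    pass


class TruncatedRecord(EndOfRecordNotFound):
    pass


def classify(exc):
    """Return the name of the first pymarc-named handler that catches exc."""
    try:
        raise exc
    except RecordDirectoryInvalid:
        return "RecordDirectoryInvalid"
    except EndOfRecordNotFound:
        return "EndOfRecordNotFound"
    except MrrcException:
        return "MrrcException"


print(classify(InvalidIndicator("bad indicator")))  # RecordDirectoryInvalid
print(classify(TruncatedRecord("short record")))    # EndOfRecordNotFound
print(classify(MrrcException("anything else")))     # MrrcException
```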

## Choosing what to catch

| You want to… | Catch |
|---|---|
| Match pymarc's catch behavior exactly | The pymarc-named class (`RecordDirectoryInvalid`, `EndOfRecordNotFound`, etc.) — mrrc-specific subclasses are caught too. |
| Distinguish indicator errors from subfield errors | `InvalidIndicator` and `BadSubfieldCode` separately. |
| Catch every mrrc error, no matter the variant | `MrrcException`. |
| Catch only I/O errors | `OSError` (or its `IOError` alias). |

## Pymarc exception compatibility

This page covers exception **class names, hierarchy, and catch behavior**
only. The new positional attributes are additive: pymarc-style code that
inspects only `str(err)` keeps working without change.

Other compatibility surfaces — record APIs, reader/writer constructor
shapes, format coverage, and performance characteristics — are out of
scope for this page; consult the
[Python API reference](python-api.md) and
[Rust API reference](rust-api.md) for those.

### Exception name mapping

| pymarc class | mrrc class | Notes |
|---|---|---|
| `PymarcException` | `MrrcException` | Same role; alias if desired (see below). |
| `RecordLengthInvalid` | `RecordLengthInvalid` | Same name; gains positional attrs. |
| `RecordLeaderInvalid` | `RecordLeaderInvalid` | Same name; gains positional attrs. |
| `BaseAddressInvalid` | `BaseAddressInvalid` | Same name; gains positional attrs. |
| `BaseAddressNotFound` | `BaseAddressNotFound` | Same name; gains positional attrs. |
| `RecordDirectoryInvalid` | `RecordDirectoryInvalid` | Same name; gains positional attrs. Also catches new mrrc subclasses `InvalidIndicator`, `BadSubfieldCode`, `InvalidField`. |
| `EndOfRecordNotFound` | `EndOfRecordNotFound` | Same name; gains positional attrs. Also catches new subclass `TruncatedRecord`. |
| `FieldNotFound` | `FieldNotFound` | Same name; gains `record_control_number`, `record_index`. |
| `FatalReaderError` | `FatalReaderError` | Same name; reserved for catastrophic states. |
| `BadSubfieldCodeWarning` | `BadSubfieldCodeWarning` | Same name (UserWarning, not exception). |
| `IOError` / `OSError` | `OSError` (via `PyIOError`) | I/O errors map to Python's built-in. |

#### Pymarc names mrrc deliberately omits

The following pymarc classes are intentionally absent in mrrc. Each
row gives the rationale and the mrrc-equivalent behavior a port
should rely on instead.

| pymarc class | why mrrc doesn't have it | mrrc-equivalent behavior |
|---|---|---|
| `NoFieldsFound` | An empty `Record` is a valid in-memory state in mrrc; no exception is raised. | Check `record.get_fields()` length. |
| `WriteNeedsRecord` | `MARCWriter.write_record` is type-annotated; passing a non-Record is a static-type error. | Static type check (`pyright` / `mypy`). |
| `NoActiveFile` | `MARCWriter` is context-managed; operating on a closed writer raises plain `RuntimeError`. | Use a `with` block or check writer state. |
| `BadLeaderValue` | `mrrc.Leader` validates fields at construction. | Bad values raise `ValueError`. |
| `MissingLinkedFields` | 880-linkage validation isn't part of the parser. | Validate links in caller code. |

### Optional symbol-level aliases

For projects swapping `pymarc` imports to `mrrc` and wanting
`from pymarc import RecordLeaderInvalid`-style imports to keep working:

```python
from mrrc import MrrcException as PymarcException
from mrrc import (
    RecordLengthInvalid,
    RecordLeaderInvalid,
    BaseAddressInvalid,
    BaseAddressNotFound,
    RecordDirectoryInvalid,
    EndOfRecordNotFound,
    FieldNotFound,
    FatalReaderError,
    BadSubfieldCodeWarning,
)
```

The catch hierarchy behaves the same as in pymarc. Code outside the
exception layer (record manipulation, reader/writer APIs, format I/O)
may still need changes; consult the [Python API reference](python-api.md).

### What you gain on the exception layer

Three patterns, in order of effort:

**Same `except`, more context.** Existing pymarc-style code keeps working.
The same `except` clause now also gets structured attributes:

```python
try:
    for record in mrrc.MARCReader(open("harvest.mrc", "rb")):
        ...
except mrrc.RecordDirectoryInvalid as e:
    log.warning(
        "directory error in record %d (001=%s, field %s) at byte 0x%X",
        e.record_index, e.record_control_number, e.field_tag, e.byte_offset,
    )
```

**Opt-in granularity.** mrrc-aware code can catch the new subclasses
directly to make decisions on the specific error kind:

```python
try:
    ...
except mrrc.InvalidIndicator as e:
    log.warning(
        "Bad indicator at field %s ind%d in record %d",
        e.field_tag, e.indicator_position, e.record_index,
    )
except mrrc.BadSubfieldCode as e:
    log.warning("Bad subfield code 0x%02X at field %s", e.subfield_code, e.field_tag)
```

**Diagnostic dump.** The `detailed()` method produces a multi-line
diagnostic suitable for logs:

```python
try:
    ...
except mrrc.MrrcException as e:
    log.error(e.detailed())
```

```text
InvalidIndicator at record 847, field 245
  source:          harvest.mrc
  001:             ocm01234567
  indicator 0:     found b':', expected digit or space
  byte offset:     0x1C31 (7217) in stream
  record-relative: byte 42
```

### Subclass behavior reference

| If you `except` this class… | …you also catch these mrrc-specific subclasses |
|---|---|
| `RecordDirectoryInvalid` | `InvalidIndicator`, `BadSubfieldCode`, `InvalidField` |
| `EndOfRecordNotFound` | `TruncatedRecord` |
| `MrrcException` | All mrrc-specific exceptions |
| `OSError` | `PyIOError` (I/O failures) |

### `MARCReader.current_exception` / `current_chunk`

mrrc's `MARCReader` exposes pymarc-compatible `current_exception` and
`current_chunk` attributes. After each `__next__` step:

- `reader.current_chunk` holds the raw bytes of the record just read
  from the source (declared length per the leader). Set on every
  successful chunk read regardless of whether the parse step then
  succeeded or failed.
- `reader.current_exception` holds the typed `MrrcException` swallowed
  by `permissive=True`, or `None` on a clean read.

```python
reader = mrrc.MARCReader("harvest.mrc", permissive=True)
for record in reader:
    if record is None:
        log.warning(
            "skipped malformed record (%d bytes): %s",
            len(reader.current_chunk) if reader.current_chunk else 0,
            reader.current_exception,
        )
        continue
    process(record)
```

Two documented divergences from pymarc:

- **Encoding strictness.** mrrc raises `EncodingError` on invalid UTF-8
  in subfield values (swallowed via `current_exception` under
  `permissive=True`); pymarc applies lossy substitution silently. The
  iteration shape is identical (the bad record yields as `None` either
  way), so callers using `except Exception:` keep working.
- **`current_chunk` on byte-read errors.** When the underlying read
  of the next record's bytes fails before parsing begins (truncated
  stream, I/O error), `current_chunk` may be `None` even though
  `current_exception` is set. For parse failures of fully-read chunks
  (the common case), `current_chunk` carries the full record bytes as
  pymarc does.

### Known hierarchy divergences from pymarc

mrrc's exception class names match pymarc's, but two relationships in
the class tree differ. Existing `except` clauses written against a
specific class name (`except RecordDirectoryInvalid:`,
`except EndOfRecordNotFound:`, etc.) work in mrrc unchanged. The
divergences only matter for code that catches a *parent* class.

**`FatalReaderError` parentage.** In pymarc, `FatalReaderError` is the
parent of `RecordLengthInvalid`, `TruncatedRecord`, and
`EndOfRecordNotFound`; a pymarc loop can `except FatalReaderError:` to
catch any of those three, or `FatalReaderError` itself. In mrrc, `FatalReaderError` is a sibling
(reserved for the specific "recovered-error cap exceeded" case under
`recovery_mode="lenient"`/`"permissive"` with `with_max_errors`).
`except FatalReaderError:` in mrrc therefore catches only the
cap-exhausted case, not the malformed-record cases. To match pymarc's
catch surface, either enumerate the four classes —

```python
except (RecordLengthInvalid, TruncatedRecord, EndOfRecordNotFound,
        FatalReaderError):
    ...
```

— or catch the mrrc base, which is broader (every typed mrrc error):

```python
except MrrcException:
    ...
```
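
The `FatalReaderError` divergence can be modeled with stand-in classes arranged the way mrrc arranges them. Nothing below is imported from mrrc; `PYMARC_FATAL_SURFACE` and `is_caught` are hypothetical names used only for illustration. Keeping the four classes in one named tuple is one way to reuse the pymarc-equivalent catch surface across several `except` clauses.

```python
# Stand-ins arranged like the mrrc tree: FatalReaderError is a sibling,
# not the parent, of the malformed-record classes.
class MrrcException(Exception):
    pass


class RecordLengthInvalid(MrrcException):
    pass


class EndOfRecordNotFound(MrrcException):
    pass


class TruncatedRecord(EndOfRecordNotFound):
    pass


class FatalReaderError(MrrcException):
    pass


# Hypothetical constant: the set of classes a pymarc-style
# `except FatalReaderError:` would have caught.
PYMARC_FATAL_SURFACE = (
    RecordLengthInvalid,
    TruncatedRecord,
    EndOfRecordNotFound,
    FatalReaderError,
)


def is_caught(exc, handler):
    """True when `except handler:` would catch exc."""
    return isinstance(exc, handler)


malformed = RecordLengthInvalid("declared length is not numeric")
print(is_caught(malformed, FatalReaderError))      # False
print(is_caught(malformed, PYMARC_FATAL_SURFACE))  # True
```

In real code the tuple would be built from the classes exported by `mrrc` and used as `except PYMARC_FATAL_SURFACE:`.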

**`PymarcException` → `MrrcException`.** The base class name differs.
`from pymarc import PymarcException` fails at import; replace with
`from mrrc import MrrcException` (or alias on import; see *Optional
symbol-level aliases* above).

## Per-variant field reference

Each exception class accepts the following keyword arguments at construction
time (all optional). Attributes of the same name are populated by the parser
when the information is available; absent values stay `None`.

| Field | Type | Meaning |
|---|---|---|
| `record_index` | `int \| None` | 1-based position of the record in the input stream. |
| `record_control_number` | `str \| None` | Value of the 001 control field for the record being parsed. `None` for errors raised before 001 is decoded (invalid leader, invalid directory, pre-001 truncation). |
| `field_tag` | `str \| None` | Tag of the field being parsed (e.g., `"245"`). |
| `indicator_position` | `int \| None` | Indicator position (`0` or `1`), populated for `InvalidIndicator`. |
| `subfield_code` | `int \| None` | Offending subfield code byte, populated for `BadSubfieldCode`. |
| `found` | `bytes \| None` | The bad bytes that triggered the error, capped at 32 bytes. |
| `expected` | `str \| None` | Human-readable description of what was expected. |
| `byte_offset` | `int \| None` | Absolute byte offset within the input stream. |
| `record_byte_offset` | `int \| None` | Byte offset within the current record. |
| `source` | `str \| None` | Filename or stream identifier, populated when the reader was constructed via `from_path()` or given one via `with_source()`. |
| `bytes_near` | `bytes \| None` | Up to 32 bytes around the error offset, for hex-dump rendering. `None` when the parser did not have access to a buffer at error time. |
| `bytes_near_offset` | `int \| None` | Absolute stream offset of the first byte of `bytes_near`. |

Subclass-specific extras:

- `InvalidField`, `EncodingError`, `XmlError`, `JsonError`, `WriterError` add
  a `message: str | None` field carrying a human-readable description of the
  problem.
- `TruncatedRecord` adds `expected_length` and `actual_length` (both
  `int | None`) describing how far short the record was of its declared
  length.

### Always-present vs may-be-present per variant

The parser populates `record_index` and `byte_offset` on every parse-path
error; `record_control_number` whenever 001 is already decoded;
`source` whenever the reader was constructed via `with_source()` or
`from_path()`. Other fields are populated when applicable to the variant
(e.g., `indicator_position` only on `InvalidIndicator`).

`FieldNotFound` is an accessor error rather than a parse error; it carries
`field_tag`, `record_control_number`, and `record_index` but not byte
offsets.
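
Because any of these attributes can be `None`, logging code is easier to keep correct when it skips absent values instead of formatting them. A minimal sketch, assuming only the attribute names from the table above; `describe` and `CONTEXT_FIELDS` are hypothetical, and a `SimpleNamespace` stands in for a real exception instance.

```python
from types import SimpleNamespace

# Attribute names taken from the field table; the selection is arbitrary.
CONTEXT_FIELDS = (
    "source",
    "record_index",
    "record_control_number",
    "field_tag",
    "byte_offset",
)


def describe(err):
    """Build a one-line summary from whichever attributes are populated."""
    parts = []
    for name in CONTEXT_FIELDS:
        value = getattr(err, name, None)
        if value is None:
            continue  # absent values stay out of the log line
        if name == "byte_offset":
            parts.append(f"byte_offset=0x{value:X}")
        else:
            parts.append(f"{name}={value}")
    return ", ".join(parts) or "no positional context"


# Stand-in for an error raised before 001 was decoded, from an unnamed stream.
err = SimpleNamespace(
    source=None,
    record_index=847,
    record_control_number=None,
    field_tag="245",
    byte_offset=7217,
)
print(describe(err))  # record_index=847, field_tag=245, byte_offset=0x1C31
```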

## Position semantics by format

`byte_offset` and `record_byte_offset` mean different things depending on the
input format:

- **ISO 2709** (binary MARC). `byte_offset` is the absolute byte position in
  the input stream; `record_byte_offset` is relative to the start of the
  current record. This is the primary case.
- **MARCXML**. The underlying `quick_xml` parser does not expose a byte
  position from its deserializer error type, so `byte_offset` is currently
  `None`. Position information is available via the wrapped cause: walk
  `err.__cause__` for the original `quick_xml` error.
- **MARCJSON**. The wrapped `serde_json::Error` exposes line and column;
  `byte_offset` is currently `None` because translating (line, column) to a
  byte offset requires the original input bytes. Walk `err.__cause__` to
  read `cause.line` and `cause.column`.

When a format's underlying parser does not expose usable position
information, the field stays `None` rather than being fabricated.
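
Walking `__cause__` for MARCJSON position information might look like the following sketch. `find_line_column` is a hypothetical helper, and `FakeJsonCause` is a stand-in for the wrapped cause; the only assumption about the real cause is the `line` and `column` attributes named above.

```python
def find_line_column(err):
    """Return (line, column) from the first cause that carries both, else None."""
    seen = set()
    cause = err.__cause__
    while cause is not None and id(cause) not in seen:
        seen.add(id(cause))  # guard against a cyclic cause chain
        line = getattr(cause, "line", None)
        column = getattr(cause, "column", None)
        if line is not None and column is not None:
            return line, column
        cause = cause.__cause__
    return None


class FakeJsonCause(Exception):
    """Stand-in for the wrapped parser error; not a real mrrc class."""

    def __init__(self, message, line, column):
        super().__init__(message)
        self.line = line
        self.column = column


try:
    try:
        raise FakeJsonCause("expected value", line=3, column=14)
    except FakeJsonCause as cause:
        # ValueError stands in for mrrc.JsonError here.
        raise ValueError("invalid MARCJSON") from cause
except ValueError as err:
    position = find_line_column(err)

print(position)  # (3, 14)
```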

## Source filename plumbing

The `source` attribute on errors is populated when the reader was told its
input identity. There are two ways to set it:

```python
# 1. Builder method: any reader, any input source.
reader = mrrc.MARCReader(file_obj).with_source("harvest.mrc")

# 2. Convenience constructor: opens a file and sets source from the path.
reader = mrrc.MARCReader.from_path("harvest.mrc")
```

When neither is used (e.g., reading from `BytesIO`), `source` stays `None`
on emitted errors.

The same `with_source` / `from_path` pattern is available on
`AuthorityMARCReader` and `HoldingsMARCReader`.

## Validation level vs recovery mode

Two orthogonal axes govern parsing behavior:

- **`validation_level`**: *what counts as an error*.
- **`recovery_mode`**: *what to do when one fires*.

The single rule, statable in one sentence: **`structural` is lossy
across every reader; `strict_marc` is strict across every reader — every
reader behaves the same way at each level.**

Concretely:

| | `validation_level="structural"` (default) | `validation_level="strict_marc"` |
|---|---|---|
| ISO 2709 structural errors (E001–E007, E101, E106) | fire | fire |
| Indicator byte validation (E201, byte-level) | skipped | fires |
| Per-tag MARC 21 indicator semantics (E201, e.g. 245 ind1 ∈ {0,1}) | skipped | fires |
| Subfield-code byte validation (E202) | skipped | fires |
| MARC 21 leader semantics (E002, e.g. record_status ∈ {a,c,d,n,p}) | skipped | fires |
| UTF-8 strictness (E301) | lossy decode (`U+FFFD` substitution) across bibliographic + authority + holdings | strict decode raises across all three readers |

```python
reader = mrrc.MARCReader(
    file,
    validation_level="structural",   # or "strict_marc"
    recovery_mode="strict",          # or "lenient", "permissive"
)
```

The two axes compose. `(strict_marc, lenient)` means *I want byte-level
checks AND I want to keep iterating past one bad record* — strict_marc
makes E201/E202/E301 fire, lenient absorbs them via the per-stream
recovery cap.
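
The UTF-8 row of the table is the easiest one to see in isolation. The sketch below uses Python's own codec error handlers to show the difference between a lossy decode and a strict one; it illustrates the distinction only and is not mrrc's decoder.

```python
# 0xE9 is "é" in Latin-1 but is not a valid UTF-8 sequence at this position.
raw = b"Caf\xe9 society"

# Lossy decode: the bad byte becomes U+FFFD and decoding continues.
lossy = raw.decode("utf-8", errors="replace")
print(repr(lossy))  # 'Caf\ufffd society'

# Strict decode: the same byte raises, and the error carries its position.
try:
    raw.decode("utf-8", errors="strict")
except UnicodeDecodeError as exc:
    bad_offset = exc.start
    print(f"invalid UTF-8 at byte {bad_offset}")  # invalid UTF-8 at byte 3
```

The lossy result is silently different from the source bytes, which is why a strict level is worth considering when downstream systems compare or index the decoded values.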

## Recovery modes and errors

The `RecoveryMode` setting (`Strict` / `Lenient` / `Permissive`) controls
whether a malformed record raises immediately, is salvaged with partial
data, or is skipped. The structured positional metadata is populated
identically in all three modes — the modes only differ in whether the
error is propagated, suppressed, or used to inform a salvage attempt.

### Defaults: Python `permissive`, Rust `Strict`

The Python user surface (`mrrc.MARCReader`, `mrrc.AuthorityMARCReader`,
`mrrc.HoldingsMARCReader`) defaults to `recovery_mode="permissive"` —
the same default shape as pymarc / marc4j / libmarc. A fresh
`MARCReader(file)` iterates past per-record defects rather than aborting
on the first one, so users coming from those libraries get the
expected behavior without setting any kwarg.

The Rust core (`mrrc::MarcReader`) keeps the stricter `RecoveryMode::Strict`
default. Rust callers expect explicit error handling via `Result<T, E>`
and `?` propagation; flipping the default there would convert a loud
`Err` into a quiet `record.errors` field that the caller has to
remember to inspect.

#### A gentle case for choosing `strict` when feasible

Permissive mode is the more forgiving default, but it has a real cost
worth understanding before you ship it past a prototype:

- **Unsalvageable records yield as `None`.** When the parser can't make
  even partial sense of a record's bytes, the Python wrapper hands you
  `None` rather than skipping silently. A loop written as
  `for record in reader: process(record)` will pass `None` into
  `process` unless you guard with `if record is not None:` or iterate
  via `iter_with_errors()`. Worth being deliberate about.
- **Per-record diagnostics live on `record.errors`.** A clean iteration
  in permissive mode can still be hiding malformed records, because the
  errors are attached to the yielded record rather than raised. If nothing
  checks `record.errors`, defects are recorded but never seen.
- **`record.errors` accumulates up to `max_errors`.** Without an
  explicit `max_errors=N` kwarg, a pathological stream can fill memory
  with diagnostic objects before anyone notices. The Rust core caps
  at `DEFAULT_MAX_ERRORS` (10 000) per parse, but the Python
  wrapper-level cap defaults to disabled (see [Capping recovered errors with
  `max_errors`](#capping-recovered-errors-with-max_errors)).

If you control the input and quality matters more than throughput,
`recovery_mode="strict"` makes defects loud: a single bad record
raises a typed exception with full positional context. Pair it with
`permissive=True` for the pymarc-shape pattern of "yield `None` for
bad records, stash the exception on `current_exception`" without
losing the precise diagnostics.

```python
# Most forgiving (default): keep going, attach defects to record.errors
reader = mrrc.MARCReader(file)

# Pymarc-shape: yield None for failed parses, stash exception
reader = mrrc.MARCReader(file, permissive=True)

# Loudest: typed exception raised on first defect
reader = mrrc.MARCReader(file, recovery_mode="strict")
```

## Inspecting per-record errors

In `lenient` and `permissive` recovery modes, errors that would have
been raised under `strict` are instead **attached to the yielded
record** as `record.errors`. The list carries one typed exception per
recovered defect, with the same positional context (record_index,
byte_offset, field_tag, etc.) as if the error had been raised directly.

```python
reader = mrrc.MARCReader(file, recovery_mode="lenient")
for record in reader:
    if record.errors:
        for err in record.errors:
            log.warning(f"[{err.code}] {err}")
    process(record)
```

In `strict` mode `record.errors` is always `[]` — the parser raises on
the first error before the record is yielded. In `lenient` and
`permissive` it carries diagnostics for every defect the parser
recovered from (subject to `max_errors` cap).
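
For batch jobs, a per-run tally is often more useful than one log line per defect. A minimal sketch, assuming only that yielded records expose an `errors` list and that unsalvageable records may arrive as `None`; `tally_errors` is a hypothetical helper, and the records and exceptions below are stand-ins rather than mrrc objects.

```python
from collections import Counter
from types import SimpleNamespace


def tally_errors(records):
    """Count recovered errors by exception class name across yielded records."""
    counts = Counter()
    for record in records:
        if record is None:
            # Unsalvageable record; no per-record error list to read.
            counts["<unparsed>"] += 1
            continue
        for err in record.errors:
            counts[type(err).__name__] += 1
    return counts


# Stand-in records; built-in exceptions take the place of mrrc's typed ones.
records = [
    SimpleNamespace(errors=[]),
    SimpleNamespace(errors=[ValueError("stand-in"), KeyError("stand-in")]),
    None,
    SimpleNamespace(errors=[ValueError("stand-in")]),
]
summary = tally_errors(records)
print(dict(summary))
```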

### `iter_with_errors()`

`MARCReader.iter_with_errors()` is an alternate iterator yielding
`(record, errors)` tuples instead of bare records. Equivalent to
iterating + reading `record.errors`, but more discoverable for the
"give-me-everything-defective" use case:

```python
for record, errors in reader.iter_with_errors():
    if errors:
        log.warning(f"{len(errors)} issues parsing record")
    if record:
        process(record)
```

Under `permissive=True`, records that the parser cannot salvage at all
yield as `(None, [exception])` so even unsalvageable records are
observable. Without `iter_with_errors`, those records come back as a
bare `None`, and the diagnostic is reachable only through
`reader.current_exception`.

```python
reader = mrrc.MARCReader(file, permissive=True)
for record, errors in reader.iter_with_errors():
    if record is None:
        log.error(f"unsalvageable: {errors[0]}")
    else:
        process(record)
```

`AuthorityMARCReader` and `HoldingsMARCReader` expose `record.errors`
the same way (the load-bearing surface). They don't carry the
`iter_with_errors` convenience method — that's a pymarc-shape ergonomic
specific to `MARCReader`. Iterate normally and check `record.errors`:

```python
for record in mrrc.AuthorityMARCReader(file, recovery_mode="lenient"):
    if record.errors:
        log.warning(...)
```

### Capping recovered errors with `max_errors`

A pathological stream in `lenient` / `permissive` mode can accumulate diagnostics without bound — every malformed record adds one or more `MrrcException` instances to `record.errors`. Pass `max_errors=N` to `MARCReader` to cap the total recovered count across the stream; once the (N+1)-th recovered error lands, the next iteration raises `FatalReaderError` (E099) instead of yielding another record.

```python
reader = mrrc.MARCReader(file, recovery_mode="lenient", max_errors=100)
try:
    for record in reader:
        process(record)
except mrrc.FatalReaderError as e:
    log.error(f"stopped after {e.errors_seen} errors (cap={e.cap})")
```

- `max_errors=None` (the default) disables the wrapper-level cap.
- `max_errors=0` also disables the cap (matches the Rust API's no-cap sentinel).
- `max_errors=N` for any `N > 0` trips on the (N+1)-th recovered error.

The cap has no observable effect in `strict` mode: the first error raises before any recovered errors can count against it. `AuthorityMARCReader` and `HoldingsMARCReader` don't accept the kwarg; they use the Rust core's per-reader `DEFAULT_MAX_ERRORS` (10_000) directly.
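
The cap rules can be expressed as a toy model in pure Python. This is one reading of the rules above, not mrrc's implementation: `CapExceeded` stands in for `FatalReaderError`, the function names are hypothetical, and the model assumes the cap is checked at the start of each iteration, so a stream that ends right after the (N+1)-th error finishes without raising.

```python
class CapExceeded(Exception):
    """Stand-in for mrrc.FatalReaderError; not the real class."""


def iterate_with_cap(recovered_per_record, max_errors=None):
    """Yield record indexes until the recovered-error cap trips.

    recovered_per_record[i] is how many errors were recovered on record i.
    """
    cap_enabled = bool(max_errors)  # None and 0 both mean "no cap"
    seen = 0
    for index, recovered in enumerate(recovered_per_record):
        if cap_enabled and seen > max_errors:
            # The (N+1)-th error already landed on an earlier record.
            raise CapExceeded(f"stopped after {seen} errors (cap={max_errors})")
        seen += recovered
        yield index


def run(recovered_per_record, max_errors=None):
    """Collect yielded indexes and report whether the cap tripped."""
    yielded = []
    try:
        for index in iterate_with_cap(recovered_per_record, max_errors):
            yielded.append(index)
    except CapExceeded:
        return yielded, True
    return yielded, False


print(run([1, 1, 1, 0], max_errors=2))  # ([0, 1, 2], True)
print(run([1, 1, 1, 0], max_errors=0))  # ([0, 1, 2, 3], False)
print(run([1, 1, 1, 0]))                # ([0, 1, 2, 3], False)
```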

## Structured serialization (`to_dict` / `to_json`)

Every `MrrcException` exposes `to_dict()` and `to_json()` for emitting the
error into structured logging platforms (ELK, Datadog, Splunk,
JSON-line pipelines) without writing an adapter. The Rust side offers a
matching `MarcError::to_json_value()` / `to_json()` that produces the same
schema.

```python
try:
    ...
except mrrc.MrrcException as e:
    log.error(json.dumps({**e.to_dict(), "app": "ingest"}))
```

Sample output:

```python
>>> err.to_dict()
{
  "schema_version": 1,
  "class": "InvalidIndicator",
  "code": "E201",
  "slug": "invalid_indicator",
  "severity": "error",
  "help_url": "https://dchud.github.io/mrrc/reference/error-codes/#E201",
  "record_index": 847,
  "record_control_number": "ocm01234567",
  "field_tag": "245",
  "indicator_position": 0,
  "found": None,
  "found_hex": "3a",
  "expected": "digit or space",
  "byte_offset": 7217,
  "record_byte_offset": 42,
  "source": "harvest.mrc",
  "bytes_near": None,
  "bytes_near_hex": "323032336e79752020202020202020203a300030203020656e6720641e323435",
  "bytes_near_offset": 7201,
  "_cause": None
}
```

### Notes on the shape

- Bytes fields carry their data under a `_hex` suffix key (`found_hex`,
  `bytes_near_hex`); the bare key (`found`, `bytes_near`) stays `null` so
  the dict is JSON-serializable without a custom encoder. The `_hex`
  keys appear only when bytes were captured.
- `_cause` is always a string or `null`, never nested. For the full
  exception chain pass `include_traceback=True` or walk `__cause__`.
- The emitted bytes are bounded at capture time (`found` ≤ 32 bytes,
  `bytes_near` ≤ 32 bytes from the 16+16 hex-dump window), so payloads
  don't grow unboundedly.
- `schema_version: 1` is included so callers can branch on it later if
  the shape ever changes. Pre-1.0, the shape may still evolve.
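
A consumer that wants the raw bytes back can decode the `_hex` keys on its side of the pipeline. A minimal sketch, assuming the key naming described above; `restore_bytes` is a hypothetical helper, and the sample payload is a trimmed copy of the output shown earlier.

```python
def restore_bytes(payload):
    """Return a copy of payload with `found` / `bytes_near` decoded from hex."""
    restored = dict(payload)
    for key in ("found", "bytes_near"):
        hex_value = payload.get(f"{key}_hex")
        if hex_value is not None:
            restored[key] = bytes.fromhex(hex_value)
    return restored


# Trimmed sample; a real payload carries more keys.
sample = {
    "schema_version": 1,
    "class": "InvalidIndicator",
    "found": None,
    "found_hex": "3a",
    "byte_offset": 7217,
}
restored = restore_bytes(sample)
print(restored["found"])  # b':'
```

The input dict is left untouched, so the JSON-serializable form stays available for forwarding.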

### `include_traceback`

`to_dict(include_traceback=True)` adds a `traceback` key with formatted
traceback lines (only present when the exception was actually raised).
`to_json(include_traceback=True)` forwards the flag to `to_dict`.

## Hex dump in `detailed()`

When the parser captures a byte window around the error offset, the
exception's `detailed()` output appends a 32-byte hex + ASCII dump with a
caret pointing at the offending byte:

```text
InvalidIndicator at record 847, field 245
  source:          harvest.mrc
  001:             ocm01234567
  indicator 0:     found b':', expected digit or space
  byte offset:     0x1C31 (7217) in stream
  record-relative: byte 42

bytes near offset 0x1C31:
    0x1C21:  32 30 32 33 6e 79 75 20  20 20 20 20 20 20 20 20 |2023nyu         |
    0x1C31:  3a 30 00 30 20 30 20 65  6e 67 20 64 1e 32 34 35 |:0.0 0 eng d.245|
             ^^ offending byte
```

The window is up to 16 bytes before + 16 bytes after the error offset,
clamped at buffer boundaries. Non-printable bytes render as `.` in the
ASCII sidecar. The layout is fixed at 16 bytes per row, with an extra
space after the eighth byte for readability; the format is byte-for-byte
identical in Rust (`MarcError::detailed()`) and Python
(`MrrcException.detailed()`).

The `bytes_near` attribute on the exception is `None` when the parser
did not have access to a buffer at the point the error was raised
(e.g., for wrapping variants like `IoError` / `XmlError` / `JsonError`,
or for error paths that do not yet plumb the buffer through).
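
For tooling that only has `bytes_near` and `bytes_near_offset` (for example, from a `to_dict()` payload), the dump layout can be approximated as below. This is an illustrative re-rendering written against the sample output above, not the library's own code; `hex_dump` is a hypothetical helper, and rows shorter than 16 bytes are not padded.

```python
def hex_dump(window, window_offset, error_offset):
    """Render 16-byte rows of hex + ASCII with a caret under the error byte."""
    lines = []
    for row_start in range(0, len(window), 16):
        row = window[row_start:row_start + 16]
        left = " ".join(f"{b:02x}" for b in row[:8])
        right = " ".join(f"{b:02x}" for b in row[8:])
        # Printable ASCII is shown as-is; everything else becomes ".".
        ascii_col = "".join(chr(b) if 32 <= b < 127 else "." for b in row)
        address = window_offset + row_start
        prefix = f"    0x{address:04X}:  "
        lines.append(f"{prefix}{left}  {right} |{ascii_col}|")
        if address <= error_offset < address + len(row):
            column = error_offset - address
            # Three characters per byte, plus the extra gap after byte 8.
            pad = column * 3 + (1 if column >= 8 else 0)
            lines.append(" " * (len(prefix) + pad) + "^^ offending byte")
    return "\n".join(lines)


window = bytes.fromhex(
    "323032336e7975202020202020202020"
    "3a300030203020656e6720641e323435"
)
dump = hex_dump(window, window_offset=0x1C21, error_offset=0x1C31)
print(dump)
```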

## Pickle round-trip

Exception instances round-trip through `pickle` with all positional
attributes preserved (subclass extras like `expected_length`/`message`
included). For security, `__setstate__` whitelists incoming attribute names
against the per-class allowed set; a maliciously-crafted pickle that tries
to set arbitrary attributes (including method names) will raise `TypeError`
rather than silently shadowing methods on the instance.

This is a defense-in-depth measure only. As with any pickle-based
deserialization, do not unpickle data from untrusted sources — the
unpickling step itself is the relevant attack surface.
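
The whitelist idea can be sketched on a stand-in exception class. `SketchError` and its `_ALLOWED` set are hypothetical and much smaller than the real per-class sets; the sketch only shows the general shape of a `__setstate__` that rejects unknown attribute names.

```python
import pickle


class SketchError(Exception):
    """Stand-in exception; not an mrrc class."""

    # Hypothetical whitelist of attribute names accepted from pickle state.
    _ALLOWED = frozenset({"record_index", "byte_offset", "field_tag"})

    def __init__(self, message="", **context):
        super().__init__(message)
        for name in self._ALLOWED:
            setattr(self, name, context.get(name))

    def __setstate__(self, state):
        unexpected = set(state) - self._ALLOWED
        if unexpected:
            raise TypeError(
                f"unexpected attributes in pickle state: {sorted(unexpected)}"
            )
        self.__dict__.update(state)


original = SketchError("bad indicator", record_index=847, byte_offset=7217)
clone = pickle.loads(pickle.dumps(original))
print(clone.args, clone.record_index, clone.byte_offset)

# A state dict that tries to shadow a method name is rejected.
try:
    SketchError("x").__setstate__({"detailed": "shadowed"})
    rejected = False
except TypeError:
    rejected = True
print(rejected)  # True
```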