//! Unicode-capable system-font fallback for office round-trip.
//!
//! Base-14 PDF fonts cover little beyond Latin-1. Source PDFs that
//! contain Hebrew, Arabic, Devanagari, CJK, or most Latin Extended
//! characters embed their own fonts, which the writer typically
//! cannot re-embed (CID-only subsets, Type 1 programs, etc.). Without
//! a Unicode-capable fallback the renderer emits `.notdef` for every
//! such glyph, which shows up as `?` or missing-glyph boxes in the
//! round-trip PDF.
//!
//! This module locates a system font that covers a broad Unicode
//! range — DejaVu Sans on Linux, falling back to FreeSans or Noto
//! Sans — loads its raw bytes, and exposes them to the office
//! writer so they can be registered alongside the source's embedded
//! fonts. The caller decides per-span whether to route to the
//! fallback via `needs_unicode_fallback`.
//!
//! The font is loaded at most once per process (cached via
//! `OnceLock`). The bytes are cloned on each retrieval: a few
//! hundred KB, roughly once per round-trip, so this is not a hot path.
use std::sync::OnceLock;
/// Resource-name we register the Unicode fallback under. Stable
/// across docx / pptx / xlsx so the back-to-PDF code path can find
/// the font by name regardless of source format.
pub const UNICODE_FALLBACK_NAME: &str = "Pdfox-UnicodeFallback";
/// Resource-name for the CJK-capable fallback. Distinct from the
/// general fallback because the CJK font program is much larger
/// (4-19 MB) — we only register it when CJK text is actually
/// present in the document, to keep small-doc round-trip output
/// slim.
pub const UNICODE_FALLBACK_CJK_NAME: &str = "Pdfox-UnicodeFallback-CJK";
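// Element type assumed from the docs above: `Some(bytes)` on success,
// `None` when no candidate font was found, so the lookup runs once either way.
static CACHED_BYTES: OnceLock<Option<Vec<u8>>> = OnceLock::new();
static CACHED_CJK_BYTES: OnceLock<Option<Vec<u8>>> = OnceLock::new();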
/// Load (and cache) a system Unicode-capable font. Returns the raw
/// TTF bytes the office writer can hand to
/// `register_embedded_font` / `embed_font`.
///
/// First match wins from a fixed candidate list — DejaVu Sans (very
/// broad coverage, ships with most Linux distros), then GNU FreeSans
/// (BSD-compatible), then Chrome OS / Noto Sans, then Tinos /
/// Arimo. On systems with none of these the helper returns `None`
/// and the round-trip silently degrades to the existing
/// `?`-glyph behaviour rather than panicking.
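// Sketch only: the function body is missing from this listing. The name
// `load_unicode_fallback_sketch`, the helper `first_readable_sketch`, and
// every path in `CANDIDATES` are assumptions for illustration (typical
// Linux install locations), not the original candidate list.
pub fn load_unicode_fallback_sketch() -> Option<Vec<u8>> {
    const CANDIDATES: &[&str] = &[
        "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf",
        "/usr/share/fonts/truetype/freefont/FreeSans.ttf",
        "/usr/share/fonts/truetype/noto/NotoSans-Regular.ttf",
    ];
    // Function-local cache so the sketch stands alone; in the module itself
    // `CACHED_BYTES` would play this role.
    static CACHE: std::sync::OnceLock<Option<Vec<u8>>> = std::sync::OnceLock::new();
    // `get_or_init` runs the closure at most once per process; a `None`
    // result is cached too, so a missing font is not re-probed.
    CACHE
        .get_or_init(|| first_readable_sketch(CANDIDATES))
        .clone()
}

// First match wins: returns the bytes of the first path that can be read,
// or `None` when every candidate is missing or unreadable.
fn first_readable_sketch(paths: &[&str]) -> Option<Vec<u8>> {
    paths.iter().find_map(|path| std::fs::read(path).ok())
}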
/// Load (and cache) a system CJK-capable font. Used as a secondary
/// fallback when text contains Han / Hiragana / Katakana / Hangul
/// characters that the general Unicode fallback (DejaVu Sans /
/// FreeSans) doesn't cover.
///
/// Prefers a standalone TrueType file over a TTC since the office
/// writer's font pipeline doesn't yet handle TrueType Collections.
/// Returns `None` when no CJK font is found on the system — the
/// renderer then falls back to `.notdef` and CJK glyphs render
/// as the missing-glyph box (same behaviour as before this helper
/// existed).
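// Sketch only: the function body is missing from this listing. The names
// `load_cjk_fallback_sketch` and `standalone_first_sketch`, and every path
// in `CANDIDATES`, are assumptions for illustration. The sketch reads
// "prefers a standalone TrueType file over a TTC" as: try standalone files
// first and keep `.ttc` collections as a last resort.
pub fn load_cjk_fallback_sketch() -> Option<Vec<u8>> {
    const CANDIDATES: &[&str] = &[
        "/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc",
        "/usr/share/fonts/truetype/wqy/wqy-microhei.ttc",
        "/usr/share/fonts/truetype/droid/DroidSansFallbackFull.ttf",
    ];
    // Function-local cache so the sketch stands alone; in the module itself
    // `CACHED_CJK_BYTES` would play this role.
    static CACHE: std::sync::OnceLock<Option<Vec<u8>>> = std::sync::OnceLock::new();
    CACHE
        .get_or_init(|| {
            standalone_first_sketch(CANDIDATES)
                .iter()
                .find_map(|path| std::fs::read(path).ok())
        })
        .clone()
}

// Stable reorder: standalone font files keep their relative order and move
// ahead of `.ttc` collections (extension compared case-insensitively).
fn standalone_first_sketch<'a>(paths: &[&'a str]) -> Vec<&'a str> {
    let mut ordered = paths.to_vec();
    // `false` sorts before `true`, and `sort_by_key` is stable.
    ordered.sort_by_key(|path| path.to_ascii_lowercase().ends_with(".ttc"));
    ordered
}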
/// Returns `true` when the text contains at least one character in
/// the CJK Unicode ranges. CJK script needs a different fallback
/// font from Latin / Hebrew / Arabic because the general Unicode
/// fallback (DejaVu Sans) has zero Han / Hiragana / Hangul
/// coverage — routing CJK text to it would still emit `.notdef`.
///
/// Ranges covered (Unicode block names):
/// - U+3000..U+303F CJK Symbols and Punctuation
/// - U+3040..U+30FF Hiragana + Katakana
/// - U+31F0..U+31FF Katakana Phonetic Extensions
/// - U+3400..U+4DBF CJK Unified Ideographs Extension A
/// - U+4E00..U+9FFF CJK Unified Ideographs
/// - U+A000..U+A4CF Yi Syllables + Radicals (treated as CJK-region)
/// - U+AC00..U+D7AF Hangul Syllables
/// - U+F900..U+FAFF CJK Compatibility Ideographs
/// - U+FE30..U+FE4F CJK Compatibility Forms
/// - U+FF00..U+FFEF Halfwidth and Fullwidth Forms
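// Sketch only: the function body is missing from this listing. The name
// `contains_cjk_sketch` and the `&str` signature are assumptions; the
// ranges are copied one-for-one from the list in the doc comment above.
pub fn contains_cjk_sketch(text: &str) -> bool {
    text.chars().any(|c| {
        matches!(
            c as u32,
            0x3000..=0x303F      // CJK Symbols and Punctuation
                | 0x3040..=0x30FF // Hiragana + Katakana
                | 0x31F0..=0x31FF // Katakana Phonetic Extensions
                | 0x3400..=0x4DBF // CJK Unified Ideographs Extension A
                | 0x4E00..=0x9FFF // CJK Unified Ideographs
                | 0xA000..=0xA4CF // Yi Syllables + Radicals
                | 0xAC00..=0xD7AF // Hangul Syllables
                | 0xF900..=0xFAFF // CJK Compatibility Ideographs
                | 0xFE30..=0xFE4F // CJK Compatibility Forms
                | 0xFF00..=0xFFEF // Halfwidth and Fullwidth Forms
        )
    })
}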
/// Returns `true` when the supplied text contains at least one
/// character that base-14 PDF fonts (Helvetica / Times / Courier
/// + Symbol + ZapfDingbats) cannot render via WinAnsi encoding.
///
/// Base-14 fonts cover Latin-1 (U+0000..U+00FF) **plus** a handful
/// of typographic Unicode codepoints commonly mapped in WinAnsi:
/// curly quotes, em / en dash, bullet, ellipsis, trademark, etc.
/// Routing those to a Unicode-capable face would needlessly switch
/// font family on regular Western text (curly quotes appear in
/// almost every form / policy document), which produces a visible
/// regression even when the source text is otherwise pure Latin.
///
/// Empty strings, ASCII-only strings, and strings whose only
/// non-Latin-1 codepoints are in the WinAnsi-extra set return
/// `false`. Hebrew, Arabic, CJK, Devanagari, Greek, Cyrillic, and
/// Latin Extended-A/B return `true` and route to the Unicode
/// fallback.
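// Sketch only: a stand-in for `needs_unicode_fallback`, whose body is
// missing from this listing. The `&str` signature and the helper
// `is_winansi_extra_sketch` are assumptions. The extra set holds only the
// codepoints the doc comment names explicitly; whatever the original
// covers under "etc." is not shown here and is deliberately left out.
pub fn needs_unicode_fallback_sketch(text: &str) -> bool {
    text.chars()
        .any(|c| (c as u32) > 0xFF && !is_winansi_extra_sketch(c))
}

// Typographic codepoints above U+00FF that WinAnsi still maps.
fn is_winansi_extra_sketch(c: char) -> bool {
    matches!(
        c,
        '\u{2018}' | '\u{2019}'     // curly single quotes
            | '\u{201C}' | '\u{201D}' // curly double quotes
            | '\u{2013}' | '\u{2014}' // en dash, em dash
            | '\u{2022}'              // bullet
            | '\u{2026}'              // ellipsis
            | '\u{2122}'              // trademark
    )
}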