use PdfError;
use pdfium_render::prelude::*;
use std::ops::Deref;
use std::path::PathBuf;
use std::sync::{Mutex, MutexGuard, OnceLock};
/// Global singleton for the Pdfium instance.
///
/// The pdfium-render library only allows binding to the Pdfium library ONCE per process.
/// Subsequent calls to `Pdfium::bind_to_library()` or `Pdfium::bind_to_system_library()`
/// will fail with a library loading error because the dynamic library is already loaded.
///
/// Additionally, `Pdfium::new()` calls `FPDF_InitLibrary()` which must only be called once,
/// and when `Pdfium` is dropped, it calls `FPDF_DestroyLibrary()` which would invalidate
/// all subsequent PDF operations.
///
/// This singleton ensures:
/// 1. Library binding happens exactly once (on first access)
/// 2. `FPDF_InitLibrary()` is called exactly once
/// 3. The `Pdfium` instance is never dropped, so `FPDF_DestroyLibrary()` is never called
/// 4. All callers share the same `Pdfium` instance safely
///
/// CRITICAL: We use `&'static Pdfium` (a leaked reference) instead of `Pdfium` to prevent
/// the instance from being dropped during process exit. Without this, when Rust's runtime
/// cleans up static variables during process teardown, the Pdfium destructor runs and calls
/// `FPDF_DestroyLibrary()`, which can cause segfaults/SIGTRAP (exit code 201 on macOS) in
/// FFI scenarios, especially in Go tests where cgo cleanup happens in a specific order.
static PDFIUM_SINGLETON: OnceLock<Result<&'static Pdfium, String>> = OnceLock::new();
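// Illustrative sketch only, not this module's implementation: the leak-on-init
// pattern described above, reduced to the standard library. `Engine`, `ENGINE`
// and `engine()` are hypothetical stand-ins for `Pdfium`, `PDFIUM_SINGLETON`
// and the real accessor. It assumes initialization cannot fail.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for `Pdfium`; its `Drop` would tear the library down.
struct Engine {
    id: u32,
}

impl Drop for Engine {
    fn drop(&mut self) {
        // In the real code this is where `FPDF_DestroyLibrary()` would run.
        // Because the instance is leaked below, this is never reached.
        eprintln!("engine {} destroyed", self.id);
    }
}

/// Holds a leaked `&'static` reference, so there is no owned value to drop.
static ENGINE: OnceLock<&'static Engine> = OnceLock::new();

/// Returns the shared instance, creating and leaking it on first use.
fn engine() -> &'static Engine {
    *ENGINE.get_or_init(|| {
        // `Box::leak` gives up ownership, so the destructor can never run.
        let leaked: &'static Engine = Box::leak(Box::new(Engine { id: 1 }));
        leaked
    })
}
```

// Every call to `engine()` yields the same address; the closure runs at most once.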
/// Global mutex to serialize all PDFium operations.
///
/// PDFium is NOT thread-safe. While the pdfium-render library provides a safe Rust API,
/// the underlying C library can crash when accessed concurrently from multiple threads.
/// This is especially problematic in batch processing mode where multiple `spawn_blocking`
/// tasks may try to process PDFs simultaneously.
///
/// This mutex ensures that only one thread can be executing PDFium operations at any time.
/// While this serializes PDF processing and eliminates parallelism for PDFs, it prevents
/// crashes and ensures correctness.
///
/// # Performance Impact
///
/// In batch mode, PDFs will be processed sequentially rather than in parallel. However,
/// other document types (text, HTML, etc.) can still be processed in parallel, so
/// workloads with mixed document types lose parallelism only for their PDF portion.
///
/// # Alternatives Considered
///
/// 1. **Process-based parallelism**: Spawn separate processes for PDF extraction.
///    This would allow true parallelism but adds significant complexity and overhead.
///
/// 2. **Thread-local PDFium instances**: Not possible because the library only allows
///    binding once per process (`FPDF_InitLibrary` can only be called once).
///
/// 3. **Disable batch mode for PDFs**: Would require changes to the batch orchestration
///    to detect PDF types and process them differently.
static PDFIUM_OPERATION_LOCK: Mutex<()> = Mutex::new(());
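// Illustrative sketch only: how a single global mutex serializes work coming
// from several threads. `OPERATION_LOCK`, `with_lock` and `peak_concurrency`
// are hypothetical names, and the counters exist purely to observe that the
// guarded section never runs on two threads at once.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Hypothetical stand-in for `PDFIUM_OPERATION_LOCK`.
static OPERATION_LOCK: Mutex<()> = Mutex::new(());

/// Number of threads currently inside the guarded section.
static ACTIVE: AtomicUsize = AtomicUsize::new(0);
/// Highest value `ACTIVE` has ever reached.
static MAX_ACTIVE: AtomicUsize = AtomicUsize::new(0);

/// Runs `work` while holding the global lock, so calls never overlap.
fn with_lock<T>(work: impl FnOnce() -> T) -> T {
    // The mutex protects no data of its own, so a poisoned lock is recovered.
    let _guard = OPERATION_LOCK.lock().unwrap_or_else(|e| e.into_inner());
    work()
}

/// Spawns `threads` workers and returns the peak number seen inside the section.
fn peak_concurrency(threads: usize) -> usize {
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            thread::spawn(|| {
                with_lock(|| {
                    let now = ACTIVE.fetch_add(1, Ordering::SeqCst) + 1;
                    MAX_ACTIVE.fetch_max(now, Ordering::SeqCst);
                    // Give other workers a chance to run while we hold the lock.
                    thread::yield_now();
                    ACTIVE.fetch_sub(1, Ordering::SeqCst);
                })
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker panicked");
    }
    MAX_ACTIVE.load(Ordering::SeqCst)
}
```

// However many workers are spawned, the observed peak stays at one.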
/// Extract the bundled pdfium library and return its directory path.
///
/// This is only called on first initialization when the `bundled-pdfium` feature is enabled.

/// Load the Pdfium library and create bindings to it.
///
/// This function is only called once during singleton initialization.

/// Initialize the Pdfium singleton.
///
/// This function performs the one-time initialization:
/// 1. Extracts bundled library if using `bundled-pdfium` feature
/// 2. Creates bindings to the Pdfium library
/// 3. Creates and leaks the `Pdfium` instance to prevent cleanup during process exit
///
/// This is only called once, on first access to the singleton.
///
/// CRITICAL: We intentionally leak the Pdfium instance using `Box::leak()` to prevent
/// it from being dropped during process exit. If the instance were dropped, it would call
/// `FPDF_DestroyLibrary()` which causes segfaults/SIGTRAP in FFI scenarios (exit code 201
/// on macOS), particularly visible in Go tests where cgo cleanup order matters.

/// A handle to the global Pdfium instance with exclusive access.
///
/// This wrapper provides access to the singleton `Pdfium` instance. It implements
/// `Deref<Target = Pdfium>` so it can be used anywhere a `&Pdfium` is expected.
///
/// # Design
///
/// The handle holds an exclusive lock on PDFium operations via `PDFIUM_OPERATION_LOCK`.
/// When the handle is dropped, the lock is released, allowing other threads to
/// acquire PDFium access.
///
/// This design ensures:
/// - The Pdfium library is initialized exactly once
/// - The library is never destroyed during the process lifetime
/// - Only one thread can access PDFium at a time (thread safety)
/// - The lock is automatically released when the handle goes out of scope
///
/// # Thread Safety
///
/// PDFium is NOT thread-safe, so this handle serializes all PDFium operations.
/// While this prevents parallel PDF processing, it ensures correctness and
/// prevents crashes in batch processing scenarios.
pub
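// Illustrative sketch only: a handle that pairs a shared instance with the
// mutex guard granting exclusive access, as the design notes above describe.
// `Engine`, `Handle`, `acquire` and `is_locked` are hypothetical names; the
// real `PdfiumHandle` fields are not shown in this file.

```rust
use std::ops::Deref;
use std::sync::{Mutex, MutexGuard};

/// Hypothetical stand-in for `Pdfium`.
struct Engine {
    name: &'static str,
}

static ENGINE: Engine = Engine { name: "engine" };
static OPERATION_LOCK: Mutex<()> = Mutex::new(());

/// Pairs the shared instance with the guard that grants exclusive access.
struct Handle {
    engine: &'static Engine,
    // Never read; held only so the lock is released when the handle drops.
    _guard: MutexGuard<'static, ()>,
}

impl Deref for Handle {
    type Target = Engine;

    fn deref(&self) -> &Engine {
        self.engine
    }
}

/// Blocks until the lock is free, then returns a handle that owns the guard.
fn acquire() -> Handle {
    let guard = OPERATION_LOCK.lock().unwrap_or_else(|e| e.into_inner());
    Handle { engine: &ENGINE, _guard: guard }
}

/// Reports whether some handle currently holds the lock.
fn is_locked() -> bool {
    OPERATION_LOCK.try_lock().is_err()
}
```

// Field and method access goes through `Deref`, and dropping the handle is
// the only way the lock is released.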
/// Get a handle to the Pdfium library with lazy initialization.
///
/// The first call to this function triggers initialization of the global Pdfium singleton.
/// This includes:
/// - Extracting the bundled Pdfium library (if using the `bundled-pdfium` feature)
/// - Loading and binding to the Pdfium dynamic library
/// - Calling `FPDF_InitLibrary()` to initialize the library
///
/// Subsequent calls skip initialization and return a handle to the same singleton
/// instance, blocking only while another handle holds the operation lock.
///
/// # Arguments
///
/// * `map_err` - Function to convert error strings into `PdfError` variants
/// * `context` - Context string for error messages (e.g., "text extraction")
///
/// # Returns
///
/// A `PdfiumHandle` that provides access to the global `Pdfium` instance via `Deref`.
/// The handle can be used anywhere a `&Pdfium` reference is expected.
///
/// # Performance
///
/// - **First call**: Performs full initialization (~8-12ms for bundled extraction + binding)
/// - **Subsequent calls**: Skip initialization (a `OnceLock` read, ~nanoseconds) and then
///   wait only for the operation lock
///
/// This lazy initialization defers Pdfium setup until the first PDF is processed,
/// improving cold start time for non-PDF workloads.
///
/// # Thread Safety
///
/// This function is thread-safe but SERIALIZES access to PDFium:
/// - The `OnceLock` ensures initialization happens exactly once
/// - The `PDFIUM_OPERATION_LOCK` mutex ensures only one thread can access PDFium at a time
/// - The returned `PdfiumHandle` holds the mutex guard; when dropped, the lock is released
///
/// This serialization is necessary because PDFium is NOT thread-safe. Concurrent access
/// to PDFium from multiple threads causes crashes (segfaults, abort traps).
///
/// # Error Handling
///
/// If initialization fails (e.g., library not found, extraction failed), the error
/// is cached and returned on all subsequent calls. The process cannot recover from
/// a failed initialization; restart the process to retry.
///
/// # Example
///
/// ```ignore
/// {
///     // First call initializes the singleton and takes the operation lock
///     let pdfium = bind_pdfium(PdfError::TextExtractionFailed, "text extraction")?;
///
///     // Use it like a &Pdfium
///     let document = pdfium.load_pdf_from_byte_slice(bytes, None)?;
/// } // the handle is dropped here, releasing the lock
///
/// // Later calls skip initialization and reference the same underlying instance.
/// // Requesting a second handle while the first is still alive on the same
/// // thread would block on the operation lock.
/// let pdfium2 = bind_pdfium(PdfError::RenderingFailed, "page rendering")?;
/// ```
pub
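// Illustrative sketch only: caching the outcome of the first initialization,
// including a failure, and mapping the cached error string per caller.
// `Engine`, `Error`, `initialize` and `bind` are hypothetical stand-ins for
// `Pdfium`, `PdfError` and the real functions; the error message format and
// the always-failing initializer are assumptions made for the demonstration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical stand-in for `Pdfium`.
struct Engine;

/// Hypothetical stand-in for `PdfError`.
#[derive(Debug, PartialEq)]
enum Error {
    TextExtractionFailed(String),
    RenderingFailed(String),
}

/// Counts how often the initializer actually runs.
static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Stores the outcome of the first attempt, success or failure.
static ENGINE: OnceLock<Result<&'static Engine, String>> = OnceLock::new();

/// Always fails here, to show that the failure is cached.
fn initialize() -> Result<&'static Engine, String> {
    INIT_CALLS.fetch_add(1, Ordering::SeqCst);
    Err("library not found".to_string())
}

/// Returns the shared instance, or the cached failure wrapped by `map_err`.
fn bind(map_err: fn(String) -> Error, context: &str) -> Result<&'static Engine, Error> {
    match ENGINE.get_or_init(initialize) {
        Ok(engine) => Ok(*engine),
        Err(cause) => Err(map_err(format!("{context}: {cause}"))),
    }
}
```

// Each caller gets its own error variant and context, while the initializer
// itself runs only once for the lifetime of the process.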