meta_oxide 0.1.1

Universal metadata extraction library supporting 13 formats (HTML Meta, Open Graph, Twitter Cards, JSON-LD, Microdata, Microformats, RDFa, Dublin Core, Web App Manifest, oEmbed, rel-links, Images, SEO) with 7 language bindings
# MetaOxide C API Usage Guide

Complete guide for using the MetaOxide C library to extract metadata from HTML documents.

## Table of Contents

1. [Introduction](#introduction)
2. [Installation](#installation)
3. [Quick Start](#quick-start)
4. [API Reference](#api-reference)
5. [Memory Management](#memory-management)
6. [Error Handling](#error-handling)
7. [Thread Safety](#thread-safety)
8. [Examples](#examples)
9. [Performance Tips](#performance-tips)

## Introduction

MetaOxide provides a C API for extracting metadata from HTML documents. The library as a whole advertises 13 formats; the 11 listed below are the ones exposed as fields of `MetaOxideResult`:

- **Standard HTML meta tags** (title, description, keywords, etc.)
- **Open Graph** (Facebook, LinkedIn rich previews)
- **Twitter Cards** (Twitter rich previews)
- **JSON-LD** (Schema.org structured data)
- **Microdata** (HTML5 structured data)
- **Microformats** (h-card, h-entry, h-event, h-review, h-recipe, h-product, h-feed, h-adr, h-geo)
- **RDFa** (W3C structured data standard)
- **Dublin Core** (Library/archive metadata)
- **Web App Manifest** (PWA configuration)
- **oEmbed** (Embeddable content discovery)
- **rel-* links** (HTML link relationships)

All extraction functions return JSON strings for easy integration with any system.

## Installation

### Building from Source

```bash
# Clone the repository
git clone https://github.com/yfedoseev/meta_oxide
cd meta_oxide

# Build the C library
cargo build --release

# The library will be in target/release/
# - Linux: libmeta_oxide.so
# - macOS: libmeta_oxide.dylib
# - Windows: meta_oxide.dll

# The C header is in include/meta_oxide.h
```

### Linking Against MetaOxide

**GCC/Clang:**
```bash
gcc -I/path/to/meta_oxide/include \
    -L/path/to/meta_oxide/target/release \
    -o myapp myapp.c \
    -lmeta_oxide -lpthread -ldl -lm
```

**CMakeLists.txt:**
```cmake
find_library(META_OXIDE_LIB meta_oxide HINTS /path/to/target/release)
include_directories(/path/to/include)
target_link_libraries(myapp ${META_OXIDE_LIB} pthread dl m)
```

**Makefile:**
```makefile
CFLAGS = -I/path/to/include
LDFLAGS = -L/path/to/target/release -lmeta_oxide -lpthread -ldl -lm

myapp: myapp.c
	$(CC) $(CFLAGS) -o myapp myapp.c $(LDFLAGS)
```

## Quick Start

```c
#include "meta_oxide.h"
#include <stdio.h>

int main() {
    const char* html =
        "<html>"
        "<head>"
        "  <title>Example Page</title>"
        "  <meta name=\"description\" content=\"A test page\">"
        "  <meta property=\"og:image\" content=\"https://example.com/image.jpg\">"
        "</head>"
        "</html>";

    // Extract all metadata at once
    MetaOxideResult* result = meta_oxide_extract_all(html, "https://example.com");

    if (result == NULL) {
        fprintf(stderr, "Error: %s\n", meta_oxide_error_message());
        return 1;
    }

    // Access individual fields
    if (result->meta != NULL) {
        printf("Meta tags: %s\n", result->meta);
    }

    if (result->open_graph != NULL) {
        printf("Open Graph: %s\n", result->open_graph);
    }

    // Clean up (essential to prevent memory leaks!)
    meta_oxide_result_free(result);
    return 0;
}
```

## API Reference

### Core Function: Extract All Metadata

```c
MetaOxideResult* meta_oxide_extract_all(
    const char* html,      // HTML content (required)
    const char* base_url   // Base URL for resolving relative URLs (optional, can be NULL)
);
```

Extracts all supported metadata formats in a single call. Returns a `MetaOxideResult` struct containing JSON strings for each format, or `NULL` on error.

**Result Structure:**
```c
typedef struct MetaOxideResult {
    char* meta;           // Standard HTML meta tags (JSON object)
    char* open_graph;     // Open Graph metadata (JSON object)
    char* twitter;        // Twitter Card metadata (JSON object)
    char* json_ld;        // JSON-LD structured data (JSON array)
    char* microdata;      // Microdata items (JSON array)
    char* microformats;   // Microformats data (JSON object)
    char* rdfa;           // RDFa structured data (JSON array)
    char* dublin_core;    // Dublin Core metadata (JSON object)
    char* manifest;       // Web App Manifest discovery (JSON object)
    char* oembed;         // oEmbed endpoint discovery (JSON object)
    char* rel_links;      // rel-* link relationships (JSON object)
} MetaOxideResult;
```

Each field is either a JSON string or `NULL` if no data was found.
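
Because any field may be `NULL`, code that walks the result usually checks each pointer before use. The following is a minimal sketch of that check. `count_present_formats` is a hypothetical helper, not part of MetaOxide, and the struct is repeated from above only so the snippet compiles on its own; in real code, include `meta_oxide.h` instead of redeclaring it.

```c
#include <stddef.h>

// Repeated from the guide so this sketch is standalone.
// In real code, drop this typedef and include "meta_oxide.h".
typedef struct MetaOxideResult {
    char* meta;
    char* open_graph;
    char* twitter;
    char* json_ld;
    char* microdata;
    char* microformats;
    char* rdfa;
    char* dublin_core;
    char* manifest;
    char* oembed;
    char* rel_links;
} MetaOxideResult;

// Hypothetical helper: count how many formats produced data (non-NULL fields).
size_t count_present_formats(const MetaOxideResult* result) {
    if (result == NULL) {
        return 0;
    }
    const char* fields[] = {
        result->meta, result->open_graph, result->twitter,
        result->json_ld, result->microdata, result->microformats,
        result->rdfa, result->dublin_core, result->manifest,
        result->oembed, result->rel_links
    };
    size_t present = 0;
    for (size_t i = 0; i < sizeof fields / sizeof fields[0]; i++) {
        if (fields[i] != NULL) {
            present++;
        }
    }
    return present;
}
```

A caller might log this count after `meta_oxide_extract_all()` to see at a glance how much metadata a page carried.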

### Individual Extractors

For performance-critical applications where you only need specific metadata types:

```c
// Standard meta tags
char* meta_oxide_extract_meta(const char* html, const char* base_url);

// Social media metadata
char* meta_oxide_extract_open_graph(const char* html, const char* base_url);
char* meta_oxide_extract_twitter(const char* html, const char* base_url);

// Structured data
char* meta_oxide_extract_json_ld(const char* html, const char* base_url);
char* meta_oxide_extract_microdata(const char* html, const char* base_url);
char* meta_oxide_extract_microformats(const char* html, const char* base_url);
char* meta_oxide_extract_rdfa(const char* html, const char* base_url);

// Other formats
char* meta_oxide_extract_dublin_core(const char* html);
char* meta_oxide_extract_manifest(const char* html, const char* base_url);
char* meta_oxide_extract_oembed(const char* html, const char* base_url);
char* meta_oxide_extract_rel_links(const char* html, const char* base_url);
```

All functions return a JSON string or `NULL` on error. Strings must be freed with `meta_oxide_string_free()`.

### Manifest Parsing

```c
// Parse manifest.json content
char* meta_oxide_parse_manifest(const char* json, const char* base_url);
```

Parses a Web App Manifest JSON file and resolves all relative URLs.

### Error Handling

```c
// Get the last error code (0 = no error)
int meta_oxide_last_error(void);

// Get a human-readable error message
const char* meta_oxide_error_message(void);
```

### Memory Management

```c
// Free a MetaOxideResult struct
void meta_oxide_result_free(MetaOxideResult* result);

// Free a string returned by any extractor function
void meta_oxide_string_free(char* s);

// Free a ManifestDiscovery struct
void meta_oxide_manifest_discovery_free(ManifestDiscovery* discovery);
```

### Utility Functions

```c
// Get the library version
const char* meta_oxide_version(void);
```

## Memory Management

**Critical Rules:**

1. **All pointers returned by extractor and parser functions are owned by the caller** - you must free them
2. **Use the correct free function:**
   - `MetaOxideResult*` → `meta_oxide_result_free()`
   - `char*` (strings) → `meta_oxide_string_free()`
   - `ManifestDiscovery*` → `meta_oxide_manifest_discovery_free()`
3. **Never use `free()` directly** - always use MetaOxide's free functions
4. **NULL pointers are safe to free** - all free functions check for NULL

**Example:**
```c
// Good
MetaOxideResult* result = meta_oxide_extract_all(html, NULL);
if (result) {
    // Use result...
    meta_oxide_result_free(result);  // Correct!
}

// Bad - will leak memory!
MetaOxideResult* result = meta_oxide_extract_all(html, NULL);
// Missing meta_oxide_result_free(result);
```
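
When a JSON string has to outlive the `MetaOxideResult` it came from, one option is to duplicate it into memory your program allocates itself. The sketch below assumes nothing about MetaOxide's allocator. `json_copy` is a hypothetical helper, and the copy it returns belongs to `malloc`, so it is released with plain `free()` while the original still goes through MetaOxide's free functions.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: duplicate a JSON string into malloc-owned memory.
// Returns NULL if the input is NULL or the allocation fails.
char* json_copy(const char* json) {
    if (json == NULL) {
        return NULL;
    }
    size_t len = strlen(json) + 1;  // include the terminating NUL
    char* copy = malloc(len);
    if (copy != NULL) {
        memcpy(copy, json, len);
    }
    return copy;
}
```

Typical use would be `char* og = json_copy(result->open_graph);` followed by `meta_oxide_result_free(result);`, and later `free(og);` once the copy is no longer needed.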

**Valgrind Check:**
```bash
valgrind --leak-check=full ./your_program
# Should report: "All heap blocks were freed -- no leaks are possible"
```

## Error Handling

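MetaOxide signals failure by returning `NULL` and records the details in thread-local error state, which you query with `meta_oxide_last_error()` and `meta_oxide_error_message()`: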

```c
char* meta = meta_oxide_extract_meta(html, NULL);
if (meta == NULL) {
    // Error occurred
    int error_code = meta_oxide_last_error();
    const char* error_msg = meta_oxide_error_message();
    fprintf(stderr, "Error %d: %s\n", error_code, error_msg);
    return 1;
}

// Success - use meta
printf("%s\n", meta);
meta_oxide_string_free(meta);
```

**Error Codes:**
- `0` - No error (META_OXIDE_OK)
- `1` - HTML parsing error
- `2` - Invalid URL format
- `3` - Invalid UTF-8 string
- `4` - Memory allocation error
- `5` - JSON serialization error
- `6` - NULL pointer passed as argument
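
For logging or metrics it can help to turn the numeric code into a short label of your own. The sketch below simply mirrors the table above. `describe_error_code` is a hypothetical helper rather than a MetaOxide function, and the labels are assumptions taken from this list, not from the header.

```c
// Hypothetical helper: map the error codes listed in this guide to labels.
const char* describe_error_code(int code) {
    switch (code) {
        case 0: return "no error";
        case 1: return "HTML parsing error";
        case 2: return "invalid URL format";
        case 3: return "invalid UTF-8 string";
        case 4: return "memory allocation error";
        case 5: return "JSON serialization error";
        case 6: return "NULL pointer passed as argument";
        default: return "unknown error code";
    }
}
```

It would typically be called as `describe_error_code(meta_oxide_last_error())` right after a failed call.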

**Best Practices:**
- Always check for NULL returns
- Check errors immediately after failed calls
- Error state is thread-local (safe for multithreading)
- Error messages are human-readable and suitable for logging
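
One way to follow the "check immediately" advice is to copy the code and message into a value you own before making any further library call. This guide does not say how long the pointer returned by `meta_oxide_error_message()` stays valid, so the sketch assumes it may change on the next call in the same thread. `ErrorSnapshot` and `snapshot_error` are hypothetical names.

```c
#include <stdio.h>

// Hypothetical value type holding a private copy of the error state.
typedef struct ErrorSnapshot {
    int code;
    char message[256];
} ErrorSnapshot;

// Copy the code and message; long messages are truncated, NULL is tolerated.
ErrorSnapshot snapshot_error(int code, const char* message) {
    ErrorSnapshot snap;
    snap.code = code;
    snprintf(snap.message, sizeof snap.message, "%s",
             message != NULL ? message : "(no message)");
    return snap;
}
```

After a failed call, `ErrorSnapshot err = snapshot_error(meta_oxide_last_error(), meta_oxide_error_message());` captures the state so it can be reported later.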

## Thread Safety

**All MetaOxide functions are thread-safe:**

- No global state (except thread-local error storage)
- All functions are stateless
- Safe to call from multiple threads simultaneously
- No locking required by the caller

**Example Multithreaded Usage:**
```c
#include "meta_oxide.h"
#include <pthread.h>
#include <stdio.h>

void* worker_thread(void* arg) {
    const char* html = (const char*)arg;

    // Each thread has its own error state
    MetaOxideResult* result = meta_oxide_extract_all(html, NULL);
    if (result == NULL) {
        fprintf(stderr, "Thread error: %s\n", meta_oxide_error_message());
        return NULL;
    }

    // Process result...
    meta_oxide_result_free(result);
    return NULL;
}

int main() {
    pthread_t threads[4];
    const char* htmls[4] = { /* ... */ };

    // Safe to call from multiple threads
    for (int i = 0; i < 4; i++) {
        pthread_create(&threads[i], NULL, worker_thread, (void*)htmls[i]);
    }

    for (int i = 0; i < 4; i++) {
        pthread_join(threads[i], NULL);
    }
}
```

## Examples

### Example 1: News Article Metadata

```c
const char* html =
    "<html><head>"
    "<title>Breaking News: Example Event</title>"
    "<meta property=\"og:type\" content=\"article\">"
    "<meta property=\"og:image\" content=\"https://news.example/image.jpg\">"
    "<script type=\"application/ld+json\">"
    "{"
    "  \"@type\": \"NewsArticle\","
    "  \"headline\": \"Breaking News\","
    "  \"datePublished\": \"2025-01-15T10:00:00Z\","
    "  \"author\": {\"@type\": \"Person\", \"name\": \"Jane Reporter\"}"
    "}"
    "</script>"
    "</head></html>";

MetaOxideResult* result = meta_oxide_extract_all(html, "https://news.example");

// Parse the JSON-LD to get structured article data.
// Check result itself first: extract_all returns NULL on error.
if (result && result->json_ld) {
    // Use a JSON parser library to parse result->json_ld
    printf("JSON-LD: %s\n", result->json_ld);
}

meta_oxide_result_free(result);
```

### Example 2: Product Page

```c
const char* html =
    "<div itemscope itemtype=\"https://schema.org/Product\">"
    "  <h1 itemprop=\"name\">Awesome Product</h1>"
    "  <div itemprop=\"offers\" itemscope itemtype=\"https://schema.org/Offer\">"
    "    <span itemprop=\"price\">$29.99</span>"
    "    <span itemprop=\"priceCurrency\">USD</span>"
    "  </div>"
    "</div>";

char* microdata = meta_oxide_extract_microdata(html, "https://shop.example");
if (microdata) {
    printf("Product data: %s\n", microdata);
    meta_oxide_string_free(microdata);
}
```

### Example 3: Blog Post with Microformats

```c
const char* html =
    "<article class=\"h-entry\">"
    "  <h1 class=\"p-name\">My Blog Post</h1>"
    "  <time class=\"dt-published\" datetime=\"2025-01-15\">Jan 15, 2025</time>"
    "  <div class=\"p-author h-card\">"
    "    <img class=\"u-photo\" src=\"/photo.jpg\">"
    "    <a class=\"p-name u-url\" href=\"https://author.example\">Author Name</a>"
    "  </div>"
    "  <div class=\"e-content\">Post content...</div>"
    "</article>";

char* microformats = meta_oxide_extract_microformats(html, "https://blog.example");
if (microformats) {
    // Will contain h-entry with nested h-card for author
    printf("Microformats: %s\n", microformats);
    meta_oxide_string_free(microformats);
}
```

### Example 4: URL Resolution

```c
const char* html =
    "<head>"
    "  <link rel=\"canonical\" href=\"/articles/example\">"
    "  <meta property=\"og:image\" content=\"../images/featured.jpg\">"
    "</head>";

// Without a base URL, relative URLs are returned as-is,
// e.g. the canonical href stays "/articles/example"
char* links1 = meta_oxide_extract_rel_links(html, NULL);

// With a base URL, relative URLs are resolved against it,
// e.g. the canonical href becomes "https://example.com/articles/example"
char* links2 = meta_oxide_extract_rel_links(html, "https://example.com/blog/");

meta_oxide_string_free(links1);
meta_oxide_string_free(links2);
```

### Example 5: Error Recovery

```c
const char* malformed_html =
    "<html><head>"
    "<script type=\"application/ld+json\">{BROKEN JSON}</script>"
    "</head></html>";

MetaOxideResult* result = meta_oxide_extract_all(malformed_html, NULL);

if (result) {
    // MetaOxide continues extracting other formats even if one fails
    // json_ld will be NULL, but meta, og, twitter, etc. will work

    if (result->json_ld == NULL) {
        printf("JSON-LD failed (expected with broken JSON)\n");
    }

    if (result->meta != NULL) {
        printf("But meta tags were still extracted!\n");
    }

    meta_oxide_result_free(result);
}
```

## Performance Tips

### 1. Use Individual Extractors When Possible

If you only need specific metadata types, use individual extractors instead of `extract_all`:

```c
// Fast - only extracts Open Graph
char* og = meta_oxide_extract_open_graph(html, base_url);

// Slower - extracts everything
MetaOxideResult* result = meta_oxide_extract_all(html, base_url);
```

### 2. Reuse Parsed HTML (Future API)

Currently each call parses the HTML. If you need multiple extractions, use `extract_all` once:

```c
// Good - parse once
MetaOxideResult* result = meta_oxide_extract_all(html, base_url);

// Bad - parses HTML 3 times!
char* meta = meta_oxide_extract_meta(html, base_url);
char* og = meta_oxide_extract_open_graph(html, base_url);
char* twitter = meta_oxide_extract_twitter(html, base_url);
```

### 3. Batch Processing

Process multiple documents in parallel using threads, for example with OpenMP (compile with `-fopenmp` on GCC/Clang):

```c
#pragma omp parallel for
for (int i = 0; i < num_documents; i++) {
    MetaOxideResult* result = meta_oxide_extract_all(documents[i], NULL);
    if (result == NULL) {
        continue;  // Extraction failed; skip this document
    }
    // Process result...
    meta_oxide_result_free(result);
}
```

### 4. Memory Pool Pattern

For high-throughput scenarios, consider using a memory pool for JSON parsing:

```c
// Extract once
MetaOxideResult* result = meta_oxide_extract_all(html, base_url);

// Parse all JSON strings at once
// (parse_all_json is your own function, not part of MetaOxide)
parse_all_json(result);

// Free all at once
meta_oxide_result_free(result);
```
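
The snippet above shows the extract, parse, free order but not a pool itself. One common form is a bump (arena) allocator that your JSON parser's allocation hooks draw from and that is reset once per document. The sketch below is a generic illustration under that assumption. `Arena` and its functions are hypothetical names, not part of MetaOxide, and whether your JSON library accepts custom allocators depends on the library.

```c
#include <stddef.h>
#include <stdlib.h>

enum { ARENA_ALIGN = 16 };  // conservative alignment for any basic type

typedef struct Arena {
    unsigned char* base;
    size_t capacity;
    size_t used;
} Arena;

// Reserve one block up front; returns 1 on success, 0 on failure.
int arena_init(Arena* a, size_t capacity) {
    a->base = malloc(capacity);
    a->capacity = (a->base != NULL) ? capacity : 0;
    a->used = 0;
    return a->base != NULL;
}

// Hand out the next aligned chunk, or NULL when the block is exhausted.
void* arena_alloc(Arena* a, size_t size) {
    size_t start = (a->used + ARENA_ALIGN - 1) / ARENA_ALIGN * ARENA_ALIGN;
    if (start > a->capacity || size > a->capacity - start) {
        return NULL;
    }
    a->used = start + size;
    return a->base + start;
}

// Forget every allocation at once; the block itself is kept for reuse.
void arena_reset(Arena* a) {
    a->used = 0;
}

// Return the block to the system.
void arena_destroy(Arena* a) {
    free(a->base);
    a->base = NULL;
    a->capacity = 0;
    a->used = 0;
}
```

In a processing loop you would call `arena_reset()` after each document, which replaces many small `free()` calls with a single counter reset. The strings inside `MetaOxideResult` are still released with `meta_oxide_result_free()`.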

## Troubleshooting

### Linking Errors

```
undefined reference to `meta_oxide_extract_all`
```

**Solution:** Add `-lmeta_oxide` to the linker flags and make sure the `-L` path points to the directory that contains the library.

### Runtime Library Not Found

```
error while loading shared libraries: libmeta_oxide.so
```

**Solution:** Set `LD_LIBRARY_PATH`:
```bash
export LD_LIBRARY_PATH=/path/to/target/release:$LD_LIBRARY_PATH
./your_program
```

Or install the library system-wide:
```bash
sudo cp target/release/libmeta_oxide.so /usr/local/lib/
sudo ldconfig
```

### Memory Leaks

Use Valgrind to detect leaks:
```bash
valgrind --leak-check=full --show-leak-kinds=all ./your_program
```

Ensure every pointer returned by MetaOxide is released with its matching `meta_oxide_*_free()` function.

### Crashes with NULL Pointers

```c
// Bad - crashes if result is NULL
MetaOxideResult* result = meta_oxide_extract_all(html, NULL);
printf("%s\n", result->meta);  // CRASH!

// Good - check for NULL
MetaOxideResult* result = meta_oxide_extract_all(html, NULL);
if (result && result->meta) {
    printf("%s\n", result->meta);
}
meta_oxide_result_free(result);
```

## Next Steps

- See [C Binding Guide](c-binding-guide.md) for creating language bindings
- Check [examples/c_api_example.c](../examples/c_api_example.c) for more examples
- Run [tests/c_api_test.c](../tests/c_api_test.c) to verify your installation

## Support

- GitHub Issues: https://github.com/yfedoseev/meta_oxide/issues
- Documentation: https://github.com/yfedoseev/meta_oxide/tree/main/docs

## License

MIT OR Apache-2.0