//! # mnem-ingest
//!
//! Ingest pipeline for [mnem].
//!
//! Converts external source artifacts (Markdown, plain text, PDFs, and
//! chat-conversation exports) into the chunk-and-section intermediate
//! representation that downstream stages (extraction, embedding, graph
//! commit) consume.
//!
//! ## Scope (through Phase-B5c)
//!
//! - [`md::parse_markdown`] - `CommonMark` + GFM tables/code fences with
//! heading hierarchy preserved.
//! - [`text::parse_text`] - single-section pass-through for plain text.
//! - [`pdf::parse_pdf`] - pure-Rust text-layer extraction via
//! `pdf-extract`, page-boundary detection on form-feed.
//! - [`conversation::parse_conversation`] - `ChatGPT` / Claude / generic
//! JSON exports flattened into one [`Section`] per turn.
//! - [`chunk::chunk`] - three chunker strategies:
//!   - [`ChunkerKind::Paragraph`] - double-newline split.
//!   - [`ChunkerKind::Recursive`] - token-budgeted sliding window.
//!   - [`ChunkerKind::Session`] - contiguous conversation messages
//!     grouped until role returns to `user` or a cap is hit.
//! - [`chunk::auto_chunker`] - picks a sensible [`ChunkerKind`] per
//! [`SourceKind`].
//! - [`extract::RuleExtractor`] - entity extractor that delegates to the
//! configured [`mnem_ner_providers::NerProvider`] (default: capitalized-phrase
//! heuristic). Provider labels pass through unconditionally.
//! - [`pipeline::Ingester`] - end-to-end driver that writes Doc +
//! Chunk + Entity nodes and the relation edges between them into a
//! borrowed [`mnem_core::repo::Transaction`].
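//!
//! The `Paragraph` and `Session` strategies listed above can be pictured with
//! a standalone sketch. This is an illustration under stated assumptions, not
//! the crate's implementation: `split_paragraphs`, `group_sessions`, and the
//! `(role, text)` tuple shape are hypothetical names that do not exist in
//! `mnem-ingest`, and the real chunkers may handle whitespace and caps
//! differently.
//!
//! ```rust
//! /// Hypothetical helper: split on blank lines, dropping empty pieces.
//! fn split_paragraphs(text: &str) -> Vec<&str> {
//!     text.split("\n\n")
//!         .map(str::trim)
//!         .filter(|p| !p.is_empty())
//!         .collect()
//! }
//!
//! /// Hypothetical helper: start a new group whenever the role returns to
//! /// `user` or the current group already holds `cap` messages.
//! fn group_sessions<'a>(turns: &[(&'a str, &'a str)], cap: usize) -> Vec<Vec<&'a str>> {
//!     let mut groups: Vec<Vec<&'a str>> = Vec::new();
//!     for &(role, text) in turns {
//!         let start_new = match groups.last() {
//!             None => true,
//!             Some(g) => role == "user" || g.len() >= cap,
//!         };
//!         if start_new {
//!             groups.push(Vec::new());
//!         }
//!         if let Some(g) = groups.last_mut() {
//!             g.push(text);
//!         }
//!     }
//!     groups
//! }
//! ```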
//!
//! ## Optional extensions (Phase-B5e)
//!
//! - [`extract_llm::OllamaExtractor`] - schema-constrained NER via a
//! local Ollama server. Gated behind the `ollama` Cargo feature.
//! Hallucinated spans are re-verified against section text and
//! rejected; failures (timeout, schema-invalid output) degrade to an empty
//! `Vec` rather than an error, so the rule-based baseline remains
//! the load-bearing path.
//! - [`sidecar::Sidecar`] - escalation hook to an external
//! `docling` / `unstructured-ingest` CLI for PDFs whose text-layer
//! extraction is too thin. Gated behind `sidecar-docling` /
//! `sidecar-unstructured`.
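//!
//! The span re-verification described for `OllamaExtractor` amounts to keeping
//! only candidates whose text literally occurs in the section. A minimal
//! sketch, assuming a hypothetical `verify_spans` helper and plain `&str`
//! spans; the extractor's actual types and matching rules are not shown in
//! this file and may differ.
//!
//! ```rust
//! /// Hypothetical helper: keep only candidate spans found verbatim in `section`.
//! fn verify_spans<'a>(section: &str, candidates: &[&'a str]) -> Vec<&'a str> {
//!     candidates
//!         .iter()
//!         .copied()
//!         // Empty strings match everywhere, so reject them explicitly.
//!         .filter(|span| !span.is_empty() && section.contains(*span))
//!         .collect()
//! }
//! ```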
//!
//! ## Non-goals still outstanding
//!
//! - No CLI / MCP / HTTP wiring (Phase-B5d).
//!
//! ## Example
//!
//! ```
//! use mnem_ingest::{md::parse_markdown, chunk::{chunk, ChunkerKind}};
//!
//! let sections = parse_markdown("# Title\n\nFirst para.\n\nSecond para.").unwrap();
//! let chunks = chunk(&sections, &ChunkerKind::Paragraph);
//! assert!(!chunks.is_empty());
//! ```
//!
//! [mnem]: https://github.com/Uranid/mnem
pub use ;
pub use Error;
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;
// Re-export NerConfig so downstream crates (mnem-cli, mnem-mcp, mnem-http)
// can refer to `mnem_ingest::NerConfig` without a direct dep on
// mnem-ner-providers.
pub use NerConfig;
// Re-export Cid under the alias `IngestCid` so downstream crates can refer
// to `mnem_ingest::IngestCid` without having to pull mnem-core directly.
pub use Cid as IngestCid;