§mnem-ingest
Ingest pipeline for mnem.
Converts external source artifacts (Markdown, plain text, PDFs, and chat-conversation exports) into the chunk-and-section intermediate representation that downstream stages (extraction, embedding, graph commit) consume.
§Scope (through Phase-B5c)
- md::parse_markdown - CommonMark + GFM tables/code fences with heading hierarchy preserved.
- text::parse_text - single-section pass-through for plain text.
- pdf::parse_pdf - pure-Rust text-layer extraction via pdf-extract, page-boundary detection on form-feed.
- conversation::parse_conversation - ChatGPT / Claude / generic JSON exports flattened into one Section per turn.
- chunk::chunk - three chunker strategies:
  - ChunkerKind::Paragraph - double-newline split.
  - ChunkerKind::Recursive - token-budgeted sliding window.
  - ChunkerKind::Session - contiguous conversation messages grouped until role returns to user or a cap is hit.
- chunk::auto_chunker - picks a sensible ChunkerKind per SourceKind.
- extract::RuleExtractor - entity extractor that delegates to the configured mnem_ner_providers::NerProvider (default: capitalized-phrase heuristic). Provider labels pass through unconditionally.
- pipeline::Ingester - end-to-end driver that writes Doc + Chunk + Entity nodes and the relation edges between them into a borrowed mnem_core::repo::Transaction.
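The Paragraph and Session strategies described above can be illustrated with a minimal standalone sketch. It does not use the crate's own types: `Msg`, `split_paragraphs`, and `group_sessions` are hypothetical names introduced here, and the exact boundary rule (a new group starts when a "user" message follows a non-user message, or when the cap is reached) is an assumed reading of the one-line description, not the crate's confirmed behavior.

```rust
/// Hypothetical message type for this sketch; the crate's `Message` may differ.
#[derive(Debug, Clone, PartialEq)]
struct Msg {
    role: String,
    text: String,
}

/// Paragraph strategy (assumed): split on double newlines, drop empty pieces.
fn split_paragraphs(text: &str) -> Vec<&str> {
    text.split("\n\n")
        .map(str::trim)
        .filter(|p| !p.is_empty())
        .collect()
}

/// Session strategy (assumed): keep contiguous messages together until the
/// role returns to "user" after a non-user message, or until the group
/// already holds `cap` messages. A `cap` of 0 means "no cap" in this sketch.
fn group_sessions(messages: &[Msg], cap: usize) -> Vec<Vec<Msg>> {
    let mut sessions: Vec<Vec<Msg>> = Vec::new();
    let mut current: Vec<Msg> = Vec::new();

    for msg in messages {
        // True only when the previous message in the group was not from the user.
        let role_returned = msg.role == "user"
            && current.last().map_or(false, |prev| prev.role != "user");
        let cap_hit = cap > 0 && current.len() >= cap;

        if !current.is_empty() && (role_returned || cap_hit) {
            // Close the current group and start a fresh one.
            sessions.push(std::mem::take(&mut current));
        }
        current.push(msg.clone());
    }

    if !current.is_empty() {
        sessions.push(current);
    }
    sessions
}
```

Under these assumptions, a user/assistant/user/assistant/assistant exchange with a cap of 2 yields three groups: the first two turn pairs, then the trailing assistant message on its own because the cap was reached.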
§Optional extensions (Phase-B5e)
- extract_llm::OllamaExtractor - schema-constrained NER via a local Ollama server. Gated behind the ollama Cargo feature. Hallucinated spans are re-verified against section text and rejected; failures (timeout, schema-invalid) degrade to empty Vec rather than an error, so the rule-based baseline remains the load-bearing path.
- sidecar::Sidecar - escalation hook to an external docling / unstructured-ingest CLI for PDFs whose text-layer extraction is too thin. Gated behind sidecar-docling / sidecar-unstructured.
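The span re-verification and degrade-to-empty behavior described for the LLM extractor can be sketched as follows. This is an illustration under stated assumptions, not the crate's implementation: `Span`, `verify_spans`, and `extract_or_empty` are hypothetical names, spans are assumed to carry byte offsets plus the claimed text, and the LLM call is stood in for by a plain `Result`.

```rust
/// Hypothetical span shape for this sketch; the crate's `EntitySpan` may differ.
#[derive(Debug, Clone, PartialEq)]
struct Span {
    start: usize, // byte offset into the section text (assumed)
    end: usize,   // exclusive byte offset (assumed)
    text: String, // the surface text the model claims to have found
    label: String,
}

/// Keeps only spans whose byte range exists in `section` and whose claimed
/// text matches what is actually there. `str::get` returns `None` for
/// out-of-bounds, inverted, or non-char-boundary ranges, so those are dropped.
fn verify_spans(section: &str, candidates: Vec<Span>) -> Vec<Span> {
    candidates
        .into_iter()
        .filter(|s| section.get(s.start..s.end) == Some(s.text.as_str()))
        .collect()
}

/// Any failure (timeout, schema-invalid output, ...) yields an empty Vec
/// instead of an error, so a caller can carry on with a baseline extractor.
fn extract_or_empty(section: &str, llm_result: Result<Vec<Span>, String>) -> Vec<Span> {
    match llm_result {
        Ok(spans) => verify_spans(section, spans),
        Err(_) => Vec::new(),
    }
}
```

Returning an empty Vec on failure keeps the optional extractor from ever aborting an ingest run; the trade-off is that failures are silent unless logged separately.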
§Non-goals still outstanding
- No CLI / MCP / HTTP wiring (Phase-B5d).
§Example
use mnem_ingest::{md::parse_markdown, chunk::{chunk, ChunkerKind}};
let sections = parse_markdown("# Title\n\nFirst para.\n\nSecond para.").unwrap();
let chunks = chunk(&sections, &ChunkerKind::Paragraph);
assert!(!chunks.is_empty());
Re-exports§
pub use chunk::ChunkerKind;
pub use chunk::auto_chunker;
pub use chunk::chunk;
pub use error::Error;
pub use extract::EntitySpan;
pub use extract::Extractor;
pub use extract::RelationSpan;
pub use extract::RuleExtractor;
pub use extract_keybert::KEYBERT_RELATION_LABEL;
pub use extract_keybert::KeyBertAdapter;
pub use extract_llm::DEFAULT_OLLAMA_MODEL;
pub use extract_llm::DEFAULT_OLLAMA_URL;
pub use extract_llm::LLM_ENTITY_CONFIDENCE;
pub use extract_llm::LLM_RELATION_CONFIDENCE;
pub use extract_llm::OllamaExtractor;
pub use pipeline::EmbedText;
pub use pipeline::EmbedderArc;
pub use pipeline::Ingester;
pub use types::Chunk;
pub use types::ChunkerAuto;
pub use types::ConversationFormat;
pub use types::ExtractorConfig;
pub use types::IngestConfig;
pub use types::IngestResult;
pub use types::Message;
pub use types::Section;
pub use types::SourceKind;
Modules§
- chunk
- Chunker strategies.
- conversation
- Conversation-export parser.
- error
- Error type for the ingest pipeline.
- extract
- Entity + relation extraction over parsed Sections.
- extract_keybert
- Adapter that lets a mnem_extract::KeyBertExtractor drop into the crate::extract::Extractor slot on crate::pipeline::Ingester.
- extract_llm
- Optional LLM-backed Extractor that talks to a local Ollama server.
- md
- CommonMark / GFM parser that emits Sections.
- pdf
- PDF parser that emits Sections.
- pipeline
- End-to-end ingest orchestration.
- sidecar
- Optional sidecar escalation for scanned / text-layer-thin PDFs.
- text
- Plain-text parser.
- types
- Shared data types used throughout the ingest pipeline.
Structs§
- IngestCid
- A content identifier - CIDv1 wrapping a codec + multihash.
Enums§
- NerConfig
- NER provider selection.