
Crate mnem_ingest


§mnem-ingest

Ingest pipeline for mnem.

Converts external source artifacts (Markdown, plain text, PDFs, and chat-conversation exports) into the chunk-and-section intermediate representation that downstream stages (extraction, embedding, graph commit) consume.

§Scope (through Phase-B5c)

§Optional extensions (Phase-B5e)

  • extract_llm::OllamaExtractor - schema-constrained NER via a local Ollama server, gated behind the ollama Cargo feature. Every span the model returns is re-verified against the section text, and hallucinated spans are rejected. Failures (timeout, schema-invalid output) degrade to an empty Vec rather than an error, so the rule-based baseline remains the load-bearing path.
  • sidecar::Sidecar - escalation hook to an external docling / unstructured-ingest CLI for PDFs whose text-layer extraction is too thin. Gated behind the sidecar-docling / sidecar-unstructured Cargo features.
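
The re-verification and degrade-to-empty behaviour described for OllamaExtractor can be sketched roughly as follows. This is a minimal illustration, not the crate's implementation: CandidateSpan, verify_spans, and extract_or_empty are hypothetical names, and the real EntitySpan type may carry different fields.

```rust
/// Hypothetical stand-in for a span proposed by an LLM extractor.
/// (Assumption: the crate's real `EntitySpan` may look different.)
#[derive(Debug, Clone, PartialEq)]
pub struct CandidateSpan {
    /// Byte offset where the span starts in the section text.
    pub start: usize,
    /// Byte offset one past the end of the span.
    pub end: usize,
    /// The surface text the model claims to have found.
    pub text: String,
}

/// Keep only spans whose byte range exists in `section` and whose
/// claimed text matches the section text at that range exactly.
pub fn verify_spans(section: &str, candidates: Vec<CandidateSpan>) -> Vec<CandidateSpan> {
    candidates
        .into_iter()
        // `str::get` returns None for out-of-bounds ranges and for
        // ranges that do not fall on UTF-8 character boundaries.
        .filter(|c| section.get(c.start..c.end) == Some(c.text.as_str()))
        .collect()
}

/// Turn any extractor failure into an empty result instead of an error,
/// so a caller can fall back to another extractor.
pub fn extract_or_empty<E>(
    raw: Result<Vec<CandidateSpan>, E>,
    section: &str,
) -> Vec<CandidateSpan> {
    match raw {
        Ok(spans) => verify_spans(section, spans),
        Err(_) => Vec::new(),
    }
}
```

Under this sketch, a span claiming "Rob" at the byte range that actually holds "Bob" is dropped, as is any span whose range runs past the end of the section.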

§Non-goals still outstanding

  • No CLI / MCP / HTTP wiring (Phase-B5d).

§Example

use mnem_ingest::{md::parse_markdown, chunk::{chunk, ChunkerKind}};

// Parse a small Markdown document into sections.
let sections = parse_markdown("# Title\n\nFirst para.\n\nSecond para.").unwrap();
// Split the sections into chunks with the paragraph chunker.
let chunks = chunk(&sections, &ChunkerKind::Paragraph);
assert!(!chunks.is_empty());

Re-exports§

pub use chunk::ChunkerKind;
pub use chunk::auto_chunker;
pub use chunk::chunk;
pub use error::Error;
pub use extract::EntitySpan;
pub use extract::Extractor;
pub use extract::RelationSpan;
pub use extract::RuleExtractor;
pub use extract_keybert::KEYBERT_RELATION_LABEL;
pub use extract_keybert::KeyBertAdapter;
pub use extract_llm::DEFAULT_OLLAMA_MODEL;
pub use extract_llm::DEFAULT_OLLAMA_URL;
pub use extract_llm::LLM_ENTITY_CONFIDENCE;
pub use extract_llm::LLM_RELATION_CONFIDENCE;
pub use extract_llm::OllamaExtractor;
pub use pipeline::EmbedText;
pub use pipeline::EmbedderArc;
pub use pipeline::Ingester;
pub use types::Chunk;
pub use types::ChunkerAuto;
pub use types::ConversationFormat;
pub use types::ExtractorConfig;
pub use types::IngestConfig;
pub use types::IngestResult;
pub use types::Message;
pub use types::Section;
pub use types::SourceKind;

Modules§

chunk
Chunker strategies.
conversation
Conversation-export parser.
error
Error type for the ingest pipeline.
extract
Entity + relation extraction over parsed Sections.
extract_keybert
Adapter that lets a mnem_extract::KeyBertExtractor drop into the crate::extract::Extractor slot on crate::pipeline::Ingester.
extract_llm
Optional LLM-backed Extractor that talks to a local Ollama server.
md
CommonMark / GFM parser that emits Sections.
pdf
PDF parser that emits Sections.
pipeline
End-to-end ingest orchestration.
sidecar
Optional sidecar escalation for scanned / text-layer-thin PDFs.
text
Plain-text parser.
types
Shared data types used throughout the ingest pipeline.

Structs§

IngestCid
A content identifier - CIDv1 wrapping a codec + multihash.
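
For readers unfamiliar with the format, the binary layout of a CIDv1 (version, codec, then multihash) can be sketched as below. This follows the general multiformats layout and is not taken from IngestCid itself: put_uvarint and cid_v1_bytes are hypothetical helpers, and the digest is assumed to be computed elsewhere.

```rust
/// Append `value` as an unsigned LEB128 varint, the integer encoding
/// used by the multiformats specifications.
pub fn put_uvarint(mut value: u64, out: &mut Vec<u8>) {
    while value >= 0x80 {
        // Low 7 bits, with the continuation bit set.
        out.push(((value as u8) & 0x7f) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Assemble the binary form of a CIDv1:
/// <version = 1><codec><hash code><digest length><digest>.
/// The last three fields together form the multihash.
pub fn cid_v1_bytes(codec: u64, hash_code: u64, digest: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + digest.len());
    put_uvarint(1, &mut out); // CID version
    put_uvarint(codec, &mut out); // content codec, e.g. 0x55 for raw
    put_uvarint(hash_code, &mut out); // hash function, e.g. 0x12 for sha2-256
    put_uvarint(digest.len() as u64, &mut out); // digest length in bytes
    out.extend_from_slice(digest);
    out
}
```

With the raw codec (0x55) and a 32-byte sha2-256 digest (hash code 0x12), the result is 36 bytes long and begins with 0x01 0x55 0x12 0x20.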

Enums§

NerConfig
NER provider selection.