pub struct Corpus {
    pub documents: Vec<CorpusDocument>,
    pub name: String,
}
A corpus of documents to be ingested into a colony.
Fields
documents: Vec<CorpusDocument>
name: String
Implementations
impl Corpus
pub fn from_directory(path: &Path) -> Result<Self>
Load all .txt files from a directory.
Files are assigned positions in a grid layout and categories
are inferred from filename prefixes (e.g., cell_biology_01.txt
gets category “cell_biology”).
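The prefix rule and grid layout described above could be sketched as follows. This is only an illustration under assumptions: `infer_category` and `grid_position` are hypothetical helper names, the rule "drop a trailing `_<digits>` suffix from the file stem" is inferred from the single `cell_biology_01.txt` example, and the row-major grid is a guess at what "grid layout" means. The crate's actual implementation may differ.

```rust
use std::path::Path;

/// Hypothetical helper: derive a category from a file name by dropping a
/// trailing `_<digits>` suffix from the stem, e.g. `cell_biology_01.txt`
/// becomes `cell_biology`. Stems without such a suffix are returned whole.
fn infer_category(path: &Path) -> Option<String> {
    // `file_stem` strips the extension; `to_str` fails on non-UTF-8 names.
    let stem = path.file_stem()?.to_str()?;
    match stem.rsplit_once('_') {
        Some((prefix, suffix))
            if !prefix.is_empty()
                && !suffix.is_empty()
                && suffix.chars().all(|c| c.is_ascii_digit()) =>
        {
            Some(prefix.to_string())
        }
        _ => Some(stem.to_string()),
    }
}

/// Hypothetical row-major grid placement for the `index`-th file.
/// Returns `(row, column)`; a column count of 0 is treated as 1.
fn grid_position(index: usize, columns: usize) -> (usize, usize) {
    let columns = columns.max(1);
    (index / columns, index % columns)
}
```

Under these assumptions, `infer_category(Path::new("cell_biology_01.txt"))` yields `Some("cell_biology")`, and the eighth file (index 7) in a five-column grid lands at row 1, column 2.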
pub fn from_embedded() -> Self
Load corpus from disk or fall back to inline content.
Tries to load the expanded 100-document corpus from the poc/data/corpus/
directory. Falls back to an inline 20-document corpus if the directory
is not found (e.g., when running tests from a different working directory).
Topics: cell_biology, molecular_transport, genetics, quantum_computing. Ground-truth clusters enable measuring community detection purity.
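As a rough illustration of how ground-truth topics allow purity to be measured, the sketch below scores a predicted clustering against known categories. The function name `purity` and the `title -> cluster id` map for predictions are assumptions made for this example; only the `title -> category` shape comes from the documented `ground_truth` return type.

```rust
use std::collections::HashMap;

/// Purity of a predicted clustering (hypothetical helper, not part of the
/// documented API). Each predicted cluster is credited with the size of its
/// largest ground-truth category; purity is the credited total divided by
/// the number of documents present in both maps.
fn purity(
    truth: &HashMap<String, String>,
    predicted: &HashMap<String, String>,
) -> f64 {
    // cluster id -> (category -> count)
    let mut per_cluster: HashMap<&str, HashMap<&str, usize>> = HashMap::new();
    let mut total = 0usize;
    for (title, cluster) in predicted {
        // Documents without a ground-truth label are skipped.
        if let Some(category) = truth.get(title) {
            *per_cluster
                .entry(cluster.as_str())
                .or_default()
                .entry(category.as_str())
                .or_insert(0) += 1;
            total += 1;
        }
    }
    if total == 0 {
        return 0.0;
    }
    let majority: usize = per_cluster
        .values()
        .map(|counts| counts.values().copied().max().unwrap_or(0))
        .sum();
    majority as f64 / total as f64
}
```

A purity of 1.0 means every detected community contains documents from a single topic. Note that purity alone rewards over-splitting, since one cluster per document also scores 1.0.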
pub fn inline_corpus() -> Self
Inline fallback corpus with 20 documents across 4 topics. Used when the disk corpus directory is not available.
pub fn ground_truth(&self) -> HashMap<String, String>
Get the ground-truth category labels (for NMI computation). Returns a map of document title -> category.
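For context, a map of this shape can be fed into a normalized mutual information (NMI) calculation along the following lines. This is a sketch, not the crate's own code: the function names `nmi` and `entropy` are hypothetical, the predicted labels are assumed to arrive as a second `title -> cluster id` map, and normalization by the arithmetic mean of the two entropies is one common convention among several.

```rust
use std::collections::HashMap;

/// Shannon entropy (natural log) of a labelling given per-label counts.
fn entropy(counts: &HashMap<&str, usize>, n: f64) -> f64 {
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.ln()
        })
        .sum()
}

/// NMI between ground truth and a predicted labelling, both given as
/// `title -> label` maps. Only titles present in both maps are counted.
/// Assumption: normalise by the arithmetic mean of the two entropies.
fn nmi(truth: &HashMap<String, String>, predicted: &HashMap<String, String>) -> f64 {
    let mut joint: HashMap<(&str, &str), usize> = HashMap::new();
    let mut t_counts: HashMap<&str, usize> = HashMap::new();
    let mut p_counts: HashMap<&str, usize> = HashMap::new();
    let mut n = 0usize;
    for (title, t) in truth {
        if let Some(p) = predicted.get(title) {
            *joint.entry((t.as_str(), p.as_str())).or_insert(0) += 1;
            *t_counts.entry(t.as_str()).or_insert(0) += 1;
            *p_counts.entry(p.as_str()).or_insert(0) += 1;
            n += 1;
        }
    }
    if n == 0 {
        return 0.0;
    }
    let n = n as f64;
    // Mutual information: sum of p(x,y) * ln(p(x,y) / (p(x) p(y))).
    let mut mi = 0.0;
    for (&(t, p), &c) in &joint {
        let pxy = c as f64 / n;
        let px = t_counts[t] as f64 / n;
        let py = p_counts[p] as f64 / n;
        mi += pxy * (pxy / (px * py)).ln();
    }
    let denom = (entropy(&t_counts, n) + entropy(&p_counts, n)) / 2.0;
    if denom == 0.0 {
        // Both labellings consist of a single group: treat as identical.
        1.0
    } else {
        mi / denom
    }
}
```

With this convention the score is 1.0 when the predicted clusters match the categories up to renaming, and 0.0 when the two labellings are statistically independent.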
pub fn categories(&self) -> Vec<String>
Get unique categories in the corpus.
pub fn limit(self, max: usize) -> Self
Limit corpus to at most max documents, evenly sampled across categories.
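One plausible reading of "evenly sampled across categories" is a round-robin pick over per-category queues, sketched below. Everything here is an assumption for illustration: `Doc` is a stand-in for `CorpusDocument` (whose fields are not documented in this section), `limit_evenly` is a hypothetical free function, and the real `limit` may select documents differently.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Stand-in for `CorpusDocument`; the field names are hypothetical.
#[derive(Debug, Clone, PartialEq)]
struct Doc {
    title: String,
    category: String,
}

/// Keep at most `max` documents, taking one document per category in turn
/// so that no category is over-represented by more than one document.
fn limit_evenly(docs: Vec<Doc>, max: usize) -> Vec<Doc> {
    if docs.len() <= max {
        return docs;
    }
    // Group by category, preserving the original order within each group.
    // A BTreeMap keeps the category order deterministic.
    let mut groups: BTreeMap<String, VecDeque<Doc>> = BTreeMap::new();
    for doc in docs {
        groups.entry(doc.category.clone()).or_default().push_back(doc);
    }
    let mut kept = Vec::with_capacity(max);
    while kept.len() < max {
        let mut took_any = false;
        for queue in groups.values_mut() {
            if kept.len() == max {
                break;
            }
            if let Some(doc) = queue.pop_front() {
                kept.push(doc);
                took_any = true;
            }
        }
        if !took_any {
            break; // every queue is empty
        }
    }
    kept
}
```

With six documents split evenly between two categories, `limit_evenly(docs, 4)` keeps two from each. Taking `self` by value, as the documented `limit` does, lets the documents be moved rather than cloned.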
pub fn ingest_into(&self, colony: &mut Colony)
Ingest all documents into a colony.