Lint-AI
Lint-AI is a system for analyzing and aligning large corpora of AI-generated documentation.
As AI systems produce increasing amounts of documentation--task records, traces, logs, decisions, and reports--these artifacts often become inconsistent, outdated, or misaligned with each other. Lint-AI addresses this by treating documentation as a network of facts, rather than isolated text.
How it works
Lint-AI processes documentation in several stages:
1. Fact Extraction
Extracts entities, concepts, and claims from each document.
2. Concept & Entity Resolution
Identifies when different documents refer to the same concept using different terms.
3. Fact Graph Construction
Builds a network of normalized facts with context such as:
- source document
- time
- confidence
- status (current, deprecated, proposed)
4. Misalignment Detection
Identifies potential issues such as:
- contradictions
- terminology drift
- scope conflicts
- unsupported claims
- missing required context
5. AI Review
Routes suspicious cases to an AI reviewer that verifies and explains the issue with context.
Instead of enforcing rigid templates, Lint-AI focuses on understanding and comparing what documents actually say, enabling systematic detection of misalignment at scale.
The result is a continuous alignment layer that helps ensure AI-generated work remains consistent, interpretable, and trustworthy over time.
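As a rough illustration of the fact-graph idea, the sketch below models one normalized fact with the context fields listed above (source document, time, confidence, status) and flags a simple kind of contradiction: two current facts about the same subject and predicate with different values. The `Fact` shape, its field names, and the `contradictions` function are assumptions made for this example, not Lint-AI's actual data model.

```rust
use std::collections::HashMap;

/// Lifecycle status of a fact, mirroring the statuses listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Current,
    Deprecated,
    Proposed,
}

/// One normalized fact plus its context (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Fact {
    pub subject: String,   // resolved concept or entity
    pub predicate: String, // normalized relation or attribute
    pub value: String,     // claimed value
    pub source: String,    // source document
    pub time: u64,         // e.g. Unix seconds
    pub confidence: f32,   // 0.0..=1.0
    pub status: Status,
}

/// Return pairs of *current* facts that share a subject and predicate
/// but claim different values. Deprecated and proposed facts are skipped.
pub fn contradictions(facts: &[Fact]) -> Vec<(&Fact, &Fact)> {
    let mut groups: HashMap<(&str, &str), Vec<&Fact>> = HashMap::new();
    for f in facts.iter().filter(|f| f.status == Status::Current) {
        groups
            .entry((f.subject.as_str(), f.predicate.as_str()))
            .or_default()
            .push(f);
    }

    let mut out = Vec::new();
    for group in groups.values() {
        for (i, a) in group.iter().enumerate() {
            for b in &group[i + 1..] {
                if a.value != b.value {
                    out.push((*a, *b));
                }
            }
        }
    }
    out
}
```

In a pipeline like the one described, pairs returned by such a check would be candidates for the AI review stage rather than confirmed errors.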
Why Lint-AI?
AI systems don't just produce outputs--they produce documentation about their work.
Over time, this creates a growing body of:
- task summaries
- decision notes
- traces and logs
- generated reports
Without alignment:
- terminology drifts
- definitions conflict
- outdated concepts persist
- claims become unsupported or inconsistent
Reading individual documents is not enough. The problem is system-level consistency.
Lint-AI addresses this by analyzing documentation collectively, not in isolation.
Vision
As AI systems perform more work, they will continuously generate documentation describing their actions, decisions, and outputs.
Lint-AI aims to ensure that this growing body of AI-generated knowledge remains:
- consistent
- traceable
- interpretable
- aligned over time
Under the hood
Lint-AI builds on techniques such as:
- concept extraction
- corpus-wide matching
- terminology analysis
These are used as part of a larger system for fact extraction and alignment reasoning.
Usage
How To Use
Run the linter against a docs directory:
Tier 0 and Tier 1 Outputs
Show Tier 0 ingestion records:
Write a Tier 0 index JSON:
Show Tier 1 key entities:
Use spaCy for Tier 1 entities (falls back to heuristic if unavailable):
Show Tier 1 important terms:
Available term rankers:
- yake
- rake
- cvalue
- textrank
Index and Query
Build and print the in-memory hybrid index:
Query the corpus (index is built automatically behind the scenes):
Generate LLM-ready retrieval context (same index/query engine, different output schema):
--llm-context emits chunk-focused output for LLM grounding (top_chunks plus a citation policy), while --query stays doc-focused.
Chunk selection strategy for --llm-context:
Default is all (global chunk scoring).
Export graph for visualization (Graphviz DOT):
Export chunk-level graph (DOT):
Export entity-level graph (DOT):
Export graph as JSON (for D3/Cytoscape integration):
Export interactive Cytoscape.js HTML:
Note: Cytoscape HTML exports load ./cytoscape.min.js from the same directory as the HTML file.
Show chunk graph stats:
Export seed entity ontology graph (JSON):
Query output includes:
- query
- elapsed_ms
- result_count
- results
Chunking options:
The query pipeline uses hybrid scoring with:
- BM25 lexical scoring
- key-entity overlap
- important-term overlap
- topic/doc-type boosts when available
- score breakdown output for transparency
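One way to picture how these signals might combine is a weighted sum that keeps each contribution visible. The sketch below is an assumption-laden illustration: the `ScoreBreakdown` struct, the `hybrid_score` function, the overlap definition, and the weights are all hypothetical and are not taken from the lint-ai source. The BM25 value is treated as an input computed elsewhere.

```rust
use std::collections::HashSet;

/// Per-signal contributions, kept so a result can explain its own score.
#[derive(Debug, Clone, PartialEq)]
pub struct ScoreBreakdown {
    pub bm25: f64,
    pub entity_overlap: f64,
    pub term_overlap: f64,
    pub boost: f64,
    pub total: f64,
}

// Hypothetical weights, chosen only for illustration.
const ENTITY_WEIGHT: f64 = 2.0;
const TERM_WEIGHT: f64 = 1.0;

/// Fraction of query items that also appear in the document.
/// Returns 0.0 when the query set is empty.
pub fn overlap<'a>(query: &HashSet<&'a str>, doc: &HashSet<&'a str>) -> f64 {
    if query.is_empty() {
        return 0.0;
    }
    query.intersection(doc).count() as f64 / query.len() as f64
}

/// Combine a precomputed BM25 score with entity overlap, term overlap,
/// and an optional topic/doc-type boost (pass 0.0 when unavailable).
pub fn hybrid_score<'a>(
    bm25: f64,
    query_entities: &HashSet<&'a str>,
    doc_entities: &HashSet<&'a str>,
    query_terms: &HashSet<&'a str>,
    doc_terms: &HashSet<&'a str>,
    boost: f64,
) -> ScoreBreakdown {
    let entity_overlap = overlap(query_entities, doc_entities);
    let term_overlap = overlap(query_terms, doc_terms);
    let total = bm25 + ENTITY_WEIGHT * entity_overlap + TERM_WEIGHT * term_overlap + boost;
    ScoreBreakdown { bm25, entity_overlap, term_overlap, boost, total }
}
```

Returning the breakdown alongside the total is what makes the ranking inspectable: a caller can see whether a document ranked highly because of lexical match, shared entities, or a boost.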
Lexical expansion data is kept as small checked-in JSON subsets under
data/lexical/. The upstream ConceptNet assertions dump is large, roughly
hundreds of MB compressed and about 1.2 GB extracted, so the full raw file is
not committed to this repo. WordNet is much smaller, typically tens of MB
depending on the package.
Download locations:
- ConceptNet assertions: https://s3.amazonaws.com/conceptnet/downloads/2019/edges/conceptnet-assertions-5.7.0.csv.gz
- ConceptNet download docs: https://github.com/commonsense/conceptnet5/wiki/Downloads
- Princeton WordNet downloads: https://wordnet.princeton.edu/
To regenerate the checked-in subsets from local upstream downloads:
Seed terms live in data/lexical/seed_terms.txt. Edit that file to widen or
narrow the lexical coverage, then rerun the generator. The script accepts
either the WordNet dict/ directory itself or the parent WordNet package
directory that contains dict/.
The generated JSON writes back to:
- data/lexical/wordnet_subset.json
- data/lexical/conceptnet_subset.json
More detail is in docs/lexical-data.md.
Chunk strategy details: docs/chunk-strategy.md
Artifact indexing and update model: docs/artifact-indexing.md
Temporal fact / assertion layer: docs/artifact-indexing.md (see "Temporal Fact / Assertion Layer")
Library Use
lint-ai can also be used as a library for artifact-oriented indexing.
The current public model is:
- IndexStore: mutable artifact-facing facade
  - owns source documents, cached derived records, tombstones, internal Tantivy lexical state, and refresh lifecycle
- MemoryIndex: built immutable query structure
  - optimized for semantic and hybrid search signals
Typical flow:
- Normalize external content into SourceDocument
- Insert or update it inside IndexStore
- Call query(...), which refreshes the semantic MemoryIndex when needed and merges it with Tantivy BM25 hits
Example:
use ;
For corpus-local persistence under .lint-ai/, use:
use Path;
use ;
let index = for_corpus?;
If you already have fully prepared DocRecord values and want the built search
structure directly, use lint_ai::index::MemoryIndex.
Advanced
By default the linter skips files larger than 5 MB and stops after 50,000 files. Override these limits:
Limit directory traversal depth:
Limit total bytes read across the corpus:
The tool will automatically scope to /path/to/repo/docs/** when that folder exists.
Example with a local repo:
Show the inferred concept inventory:
Show Markdown headings per file (structure/architecture hints):
Debug phrase matches (prints matched text fragments and concepts):
Coordinator + Workers
lint-service can run as a coordinator in front of multiple long-running lint-client workers.
Components
- lint-service: gRPC coordinator + HTTP gateway/UI
- lint-client: worker process that executes lint-ai
- lint-dispatch: dispatch CLI that sends one request to the coordinator and returns aggregated JSON
Start coordinator
LINT_SERVICE_ADDR=127.0.0.1:50051 \
LINT_HTTP_ADDR=127.0.0.1:8080 \
Start a worker
LINT_AI_PATH=/home/louis/sources/lint-ai/target/debug/lint-ai \
LINT_WORKER_ADDR=127.0.0.1:50052 \
LINT_WORKER_ID=worker-1 \
LINT_WORKER_PATH=/home/louis/sources/openclaw/docs \
LINT_HTTP_ADDR=http://127.0.0.1:8080 \
Workers send heartbeats to the coordinator every 5s. The coordinator keeps a presence table and drops stale workers automatically.
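The heartbeat and presence-table behavior can be sketched as a map from worker ID to last-seen time, pruned against a staleness threshold. This is only an illustration: the `PresenceTable` type, its methods, and the threshold are hypothetical, and the README does not say how long the coordinator waits before treating a worker as stale.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Tracks when each worker was last heard from (hypothetical type).
pub struct PresenceTable {
    last_seen: HashMap<String, Instant>,
    stale_after: Duration, // assumed threshold, e.g. a few heartbeat intervals
}

impl PresenceTable {
    pub fn new(stale_after: Duration) -> Self {
        PresenceTable { last_seen: HashMap::new(), stale_after }
    }

    /// Record a heartbeat from `worker_id` at time `now`.
    pub fn heartbeat(&mut self, worker_id: &str, now: Instant) {
        self.last_seen.insert(worker_id.to_string(), now);
    }

    /// Remove workers whose last heartbeat is older than `stale_after`
    /// and return their IDs in sorted order.
    pub fn prune(&mut self, now: Instant) -> Vec<String> {
        let stale_after = self.stale_after;
        let mut dropped: Vec<String> = self
            .last_seen
            .iter()
            .filter(|(_, seen)| now.saturating_duration_since(**seen) > stale_after)
            .map(|(id, _)| id.clone())
            .collect();
        for id in &dropped {
            self.last_seen.remove(id);
        }
        dropped.sort();
        dropped
    }

    /// IDs of workers currently in the table, sorted.
    pub fn live_workers(&self) -> Vec<String> {
        let mut ids: Vec<String> = self.last_seen.keys().cloned().collect();
        ids.sort();
        ids
    }
}
```

Passing `now` in explicitly, instead of calling `Instant::now()` inside the table, keeps the pruning logic easy to test with synthetic timestamps.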
Dispatch a query
LINT_SERVICE_ADDR=http://127.0.0.1:50051 \
HTTP gateway and UI
- GET /: web UI (workers + recent jobs + top results)
- GET /api/workers: current worker presence
- GET /api/jobs: recent dispatch jobs
- POST /api/dispatch: run dispatch via HTTP
- POST /api/worker/heartbeat: worker heartbeat endpoint
/api/dispatch accepts:
Optional tenant routing header:
x-tenant-id: <tenant>
If the license is configured with a tenant, dispatch checks the x-tenant-id header before running.
Analyze a corpus and emit a suggested lint-ai.json:
Example analysis output (Openclaw channels):
Suggested config:
{
"stopwords": ["group messages", "pairing", "channel routing"],
"ignore_sections": ["unscoped", "related"],
"ignore_crossref_sections": ["unscoped", "related"],
"ignore_paths": [],
"allowlist_concepts": []
}
Stats:
pages: 31
top concepts:
- group messages (25)
- pairing (25)
- channel routing (22)
- slack (11)
- telegram (11)
- signal (10)
- whatsapp (10)
- discord (9)
- troubleshooting (9)
- line (8)
- imessage (7)
- matrix (6)
- zalo (4)
- irc (3)
- location (3)
top sections:
- configuration (41)
- setup (35)
- unscoped (31)
- security (28)
- related (22)
- troubleshooting (22)
- bundled plugin (14)
- routing (14)
- overview (10)
- notes (4)
Configuration
You can place a lint-ai.json file in the target root (or pass --config /path/to/lint-ai.json)
to control filters.
Use --strict-config to fail fast if the config is invalid.
Limit config size:
Example used for Openclaw channels (reduce false positives by skipping "Related" sections and ignoring generic terms):
Run it:
Development
Build
Test
Contributing
- Fork the repo and create a feature branch.
- Make changes with tests where appropriate.
- Run cargo test.
- Open a PR.
Concept Examples
Concept inventory (derived from filenames in docs/channels/**):
discord
slack
telegram
whatsapp
group messages
channel routing
Concepts grouped by section (aggregated across the corpus):
Section: setup
- pairing (4)
- signal (3)
- feishu (2)
- zalo (2)
Section: configuration
- pairing (7)
- signal (6)
- feishu (3)
- groups (3)
Section: related
- channel routing (21)
- groups (21)
- pairing (21)
Surface forms (for matching text to a concept):
group messages
group-messages
group_messages
groupmessages
group message
group messages
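A minimal sketch of how surface forms like these could be generated from a concept name follows. The `surface_forms` function and its naive singularization rule (drop a trailing "s" from the last word) are assumptions for illustration, not necessarily how lint-ai derives them; the sketch also removes duplicate forms.

```rust
use std::collections::HashSet;

/// Generate lowercase matching variants for a multi-word concept:
/// space-, hyphen-, and underscore-joined, concatenated, and a naive singular.
pub fn surface_forms(concept: &str) -> Vec<String> {
    let words: Vec<String> = concept
        .split_whitespace()
        .map(|w| w.to_lowercase())
        .collect();
    if words.is_empty() {
        return Vec::new();
    }

    let mut forms = vec![
        words.join(" "),
        words.join("-"),
        words.join("_"),
        words.concat(),
    ];

    // Naive singular: strip one trailing ASCII 's' from the last word,
    // leaving words such as "class" untouched.
    if let Some(last) = words.last() {
        if last.len() > 1 && last.ends_with('s') && !last.ends_with("ss") {
            let mut singular = words.clone();
            let n = singular.len();
            singular[n - 1] = last[..last.len() - 1].to_string();
            forms.push(singular.join(" "));
        }
    }

    // Drop duplicates while preserving first-seen order.
    let mut seen = HashSet::new();
    forms.retain(|f| seen.insert(f.clone()));
    forms
}
```

For a single-word concept such as "slack", all joined variants collapse to the same string, so only one form is returned.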
Output Examples
Sample findings now include severity tags and link-debt signals:
Missing cross-ref in docs/channels/discord.md -> [[signal]] (high)
Low link density in docs/channels/location.md (outgoing 1, avg 4.2)
Unreachable page: docs/channels/legacy.md
Orphan page: docs/channels/unused.md
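The distinction between the last two findings can be illustrated on a simple link graph. The sketch assumes that an "orphan" is a page with no incoming links from other pages and that an "unreachable" page is one that cannot be reached from a root page by following links; these definitions, the `LinkGraph` alias, and both function names are assumptions for this example, not lint-ai's documented behavior.

```rust
use std::collections::{BTreeMap, BTreeSet, VecDeque};

/// Page path -> pages it links to (hypothetical representation).
pub type LinkGraph = BTreeMap<String, Vec<String>>;

/// Pages that no other page links to. The root is never reported.
pub fn orphan_pages(graph: &LinkGraph, root: &str) -> Vec<String> {
    let mut linked: BTreeSet<&str> = BTreeSet::new();
    for (page, targets) in graph {
        for t in targets {
            if t != page {
                linked.insert(t.as_str()); // self-links do not count
            }
        }
    }
    graph
        .keys()
        .filter(|p| p.as_str() != root && !linked.contains(p.as_str()))
        .cloned()
        .collect()
}

/// Pages that a breadth-first walk from `root` never visits.
pub fn unreachable_pages(graph: &LinkGraph, root: &str) -> Vec<String> {
    let mut seen: BTreeSet<&str> = BTreeSet::new();
    let mut queue: VecDeque<&str> = VecDeque::new();
    if graph.contains_key(root) {
        seen.insert(root);
        queue.push_back(root);
    }
    while let Some(page) = queue.pop_front() {
        if let Some(targets) = graph.get(page) {
            for t in targets {
                if seen.insert(t.as_str()) {
                    queue.push_back(t.as_str());
                }
            }
        }
    }
    graph
        .keys()
        .filter(|p| !seen.contains(p.as_str()))
        .cloned()
        .collect()
}
```

Under these definitions every orphan other than the root is also unreachable, but not the reverse: two pages that link only to each other are unreachable from the root without being orphans.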
Orphan detection example command: