markbase 0.4.0

A high-performance CLI tool for indexing and querying Markdown files for AI agents. Obsidian-compatible.
# Markbase Developer Guide

This document is intended for markbase developers and development agents, covering architecture design, development standards, and implementation details.

## 1. Project Overview

Markbase is a high-performance CLI tool for scanning, parsing, and indexing Markdown notes into a DuckDB database, enabling instant metadata queries with Obsidian compatibility.

**Core Value**: Provides structured Markdown knowledge base access for AI Agents.

## 2. Tech Stack

- **Language**: Rust 1.85+ (2024 edition)
- **CLI Framework**: clap v4.5 (derive feature)
- **Database**: DuckDB (bundled with `duckdb` crate)
- **File Traversal**: walkdir v2.5
- **Parsing**: gray_matter (frontmatter), regex (wiki-links/tags), serde_yaml (frontmatter rewrite)
- **Serialization**: serde, serde_json

**Design Principle**: Minimize dependencies, optimize binary size (`strip = true` in Cargo.toml)

## 3. Design Principles

**Design Principle 1 - Name Uniqueness**: Note names must be unique across the entire vault, regardless of directory location. This enables simple `[[note-name]]` linking without path ambiguity.

**Design Principle 2 - Obsidian Link Format**: Wiki-links use filename only (no path, no extension). Example: `[[my-note]]` not `[[notes/my-note.md]]`. Frontmatter links must be quoted: `related: "[[target]]"`.

## 4. Data Model

### 4.1 Schema

Database location: `{{base-dir}}/.markbase/markbase.duckdb`

```sql
CREATE TABLE notes (
    path       TEXT PRIMARY KEY,        -- Path relative to base-dir
    folder     TEXT,                    -- Directory path
    name       TEXT,                    -- File name without extension
    ext        TEXT,                    -- Extension
    size       INTEGER,                 -- Size in bytes
    ctime      TIMESTAMPTZ,             -- Creation time
    mtime      TIMESTAMPTZ,             -- Modification time
    tags       VARCHAR[],               -- Tag array
    links      VARCHAR[],               -- Wiki-links array
    backlinks  VARCHAR[],               -- Backlink array
    embeds     VARCHAR[],               -- Embed array
    properties JSON                     -- Frontmatter properties
);
```

**Indexes**:
```sql
CREATE INDEX idx_mtime ON notes(mtime);
CREATE INDEX idx_folder ON notes(folder);
CREATE INDEX idx_name ON notes(name);
```

### 4.2 Field Resolution

When querying fields, the system uses explicit namespace prefixes:

| Prefix | Namespace | Resolves to |
|--------|-----------|-------------|
| `file.` | File properties | Native database columns |
| `note.` | Note (frontmatter) properties | `properties` JSON column |
| *(none)* | Shorthand for `note.` | `properties` JSON column |

**Resolution Rules:**
1. If field starts with `file.` → direct column access (e.g., `file.name` → `name`)
2. If field starts with `note.` → frontmatter JSON extraction
3. Bare identifiers → frontmatter JSON extraction (shorthand for `note.*`)
4. Nested paths: `a.b.c` → `json_extract_string(properties, '$."a"."b"."c"')`
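
The four rules above can be sketched as one small function. This is a minimal illustration under stated assumptions, not the actual `translate_identifier()`: the name `translate_field` is hypothetical, and the sketch ignores type casts and `list_contains()` handling.

```rust
/// Hypothetical sketch of field-name translation (not the real translator).
fn translate_field(field: &str) -> String {
    // Rule 1: a `file.` prefix maps straight to a native column.
    if let Some(column) = field.strip_prefix("file.") {
        return column.to_string();
    }
    // Rules 2 and 3: a `note.` prefix and bare identifiers both read frontmatter.
    let key = field.strip_prefix("note.").unwrap_or(field);
    // Rule 4: each dot-separated segment becomes a quoted JSON path step.
    let json_path: String = key
        .split('.')
        .map(|segment| format!(".\"{segment}\""))
        .collect();
    format!("json_extract_string(properties, '${json_path}')")
}
```

Quoting every path segment keeps keys containing spaces or punctuation addressable, which is presumably why the generated SQL uses `$."a"."b"` rather than `$.a.b`.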

## 5. Core Module Responsibilities

### 5.1 Module Overview

```
src/
├── main.rs          # CLI entry point, argument parsing and command dispatch
├── db.rs            # DuckDB connection management, schema initialization, CRUD
├── scanner.rs       # index command driver, directory traversal, incremental update, backlink computation
├── extractor.rs     # Single file parsing: frontmatter, wiki-links, tags
├── output.rs        # Shared YAML list / Markdown table formatters
├── creator.rs       # note new command, template rendering
├── renamer.rs       # note rename command, link updates
├── verifier.rs      # note verify command, MTS schema validation
├── renderer/
│   ├── mod.rs    # note render command, .base embed expansion, Step 0-5 pipeline
│   ├── filter.rs # Base filter → DuckDB SQL translation; column/sort translation; ThisContext
│   └── output.rs # list / table output formatting; ColumnMeta definition
├── describe.rs      # template describe command
├── lib.rs           # Library exports
└── query/
    ├── mod.rs       # Query output orchestration (table/list)
    ├── detector.rs  # SQL/expression mode detection, security validation
    ├── translator.rs # Field name translation
    ├── error_map.rs # DuckDB error mapping
    └── executor.rs  # Query execution orchestration
```

### 5.2 Key Design Decisions

**`scanner.rs`**:
- Uses WalkDir iterator for sequential processing (avoids memory spikes)
- Incremental update logic: compare `mtime + size`, skip unchanged files
- Deletion detection: after traversal, compare DB with filesystem, remove deleted entries
- Conflict handling: skip duplicate files with warning
- Backlinks: optional second traversal, disabled by default
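
The incremental update and deletion detection rules can be sketched as a pure planning step. This is an illustration only: `Fingerprint`, `Plan`, and `plan_index` are hypothetical names, and the sketch assumes both the index and the directory walk have already been reduced to path-keyed maps.

```rust
use std::collections::HashMap;

/// Size and modification time recorded for one file (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Fingerprint {
    mtime_secs: u64,
    size: u64,
}

/// What one indexing pass would do (hypothetical type).
#[derive(Debug, Default, PartialEq, Eq)]
struct Plan {
    upsert: Vec<String>,
    delete: Vec<String>,
    unchanged: usize,
}

fn plan_index(
    indexed: &HashMap<String, Fingerprint>,
    on_disk: &HashMap<String, Fingerprint>,
) -> Plan {
    let mut plan = Plan::default();
    for (path, current) in on_disk {
        match indexed.get(path) {
            // Same mtime and size: skip the file.
            Some(previous) if previous == current => plan.unchanged += 1,
            // New or changed file: parse it and upsert the row.
            _ => plan.upsert.push(path.clone()),
        }
    }
    // Deletion detection: rows in the DB with no file on disk.
    for path in indexed.keys() {
        if !on_disk.contains_key(path) {
            plan.delete.push(path.clone());
        }
    }
    // Sort for deterministic output, since HashMap iteration order is arbitrary.
    plan.upsert.sort();
    plan.delete.sort();
    plan
}
```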

**`extractor.rs`**:
- Stateless parser, no database awareness
- Merges content tags (`#tag`) and frontmatter tags
- Extracts links from body (`[[...]]`, `![[...]]`) and frontmatter (`[[...]]`)
- Shared regex patterns (`EMBED_RE`, `WIKILINK_RE`) exposed as public constants
- Code blocks in body are excluded from link matching
- Returns `ExtractedContent` with `links`, `embeds`, `tags`, `frontmatter`
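
As a rough illustration of body link extraction, the sketch below scans for `[[...]]` and `![[...]]` while skipping fenced code blocks. It deliberately departs from the real module in several ways, all of them assumptions: it uses plain string scanning instead of the shared regex patterns, `BodyLinks` and `extract_body_links` are hypothetical names, embeds are kept out of `links`, and inline code spans are not handled.

```rust
/// Fence marker (three backticks), written with escapes to keep this example readable.
const FENCE: &str = "\u{60}\u{60}\u{60}";

/// Hypothetical result type; the real `ExtractedContent` also carries tags and frontmatter.
#[derive(Debug, Default, PartialEq)]
struct BodyLinks {
    links: Vec<String>,
    embeds: Vec<String>,
}

fn extract_body_links(body: &str) -> BodyLinks {
    let mut found = BodyLinks::default();
    let mut in_fence = false;
    for line in body.lines() {
        // Toggle on fence delimiters and skip everything inside a code block.
        if line.trim_start().starts_with(FENCE) {
            in_fence = !in_fence;
            continue;
        }
        if in_fence {
            continue;
        }
        let mut rest = line;
        while let Some(open) = rest.find("[[") {
            let after_open = &rest[open + 2..];
            let Some(close) = after_open.find("]]") else { break };
            // Keep only the note name: drop `#anchor` and `|alias` suffixes.
            let inner = &after_open[..close];
            let name = inner.split(['|', '#']).next().unwrap_or("").trim();
            // A `!` directly before `[[` marks an embed.
            let is_embed = rest[..open].ends_with('!');
            if !name.is_empty() {
                if is_embed {
                    found.embeds.push(name.to_string());
                } else {
                    found.links.push(name.to_string());
                }
            }
            rest = &after_open[close + 2..];
        }
    }
    found
}
```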

**`db.rs`**:
- Owns DuckDB connection, implements Drop trait to ensure closure
- Uses `INSERT OR REPLACE` for upsert
- Row value access via `duckdb::types::Value`; column order must match the schema

**`query/detector.rs`**:
- Mode detection: starts with `SELECT` → SQL mode, otherwise → expression mode
- Security validation: reject non-SELECT statements, multi-statement injection
- [`is_file_property()`](src/query/detector.rs:148): returns true for `file.*` properties
- [`note_field_key()`](src/query/detector.rs:160): strips `note.` prefix if present
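
Mode detection and the security check can be sketched roughly as follows. The names `QueryMode`, `detect_mode`, and `validate_sql` are hypothetical, and two behaviors are assumptions rather than documented facts: the `SELECT` keyword is matched case-insensitively, and any semicolon other than a trailing one is rejected, even inside a string literal.

```rust
#[derive(Debug, PartialEq, Eq)]
enum QueryMode {
    Sql,
    Expression,
}

fn detect_mode(input: &str) -> QueryMode {
    let head = input.trim_start();
    let starts_with_select = head
        .get(..6)
        .is_some_and(|kw| kw.eq_ignore_ascii_case("select"));
    // Require a word boundary so a field such as `selected` stays an expression.
    let at_boundary = head
        .get(6..)
        .and_then(|rest| rest.chars().next())
        .map_or(true, char::is_whitespace);
    if starts_with_select && at_boundary {
        QueryMode::Sql
    } else {
        QueryMode::Expression
    }
}

fn validate_sql(sql: &str) -> Result<(), String> {
    if detect_mode(sql) != QueryMode::Sql {
        return Err("only SELECT statements are allowed".to_string());
    }
    // Trailing semicolons are tolerated; any other one is treated as a second statement.
    let body = sql.trim_end().trim_end_matches(';');
    if body.contains(';') {
        return Err("multiple statements are not allowed".to_string());
    }
    Ok(())
}
```

Rejecting every interior semicolon is conservative: it refuses some harmless queries, but it avoids having to tokenize SQL just to find statement boundaries.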

**`query/translator.rs`**:
- [`translate_identifier()`](src/query/translator.rs:240): handles three cases:
  - `file.*` prefix → direct column access
  - `note.*` prefix → frontmatter JSON extraction
  - Bare identifiers → frontmatter JSON extraction (shorthand for `note.*`)
- Preserve type cast syntax (`::INTEGER`, `::TIMESTAMP`)

**`verifier.rs`**:
- Stateless validator, reads from DB and filesystem but never writes
- Business-level errors (note not found, template missing) are returned as VerifyIssue, not Err
- Reuses WIKILINK_RE from extractor.rs for link parsing
- All output routing (stdout vs stderr) is handled by main.rs, not verifier.rs

**`renderer/`**:
- Stateless pipeline: reads DB and filesystem, never writes
- filter.rs: `link(this)` → `'[[name]]'` string literal; bare column names always resolve
  to note.* (json_extract_string), never direct DB columns
- .base files indexed as non-md: name column contains full filename including extension
  (e.g. "opps.base"), query must NOT strip the extension
- db.query() called with usize::MAX limit, bypassing executor.rs default 1000
- order field = SELECT columns; sort field = ORDER BY (independent, not related to order)

## 6. Command Internal Logic

For detailed usage, see README.md; this section explains implementation details.

### Automatic indexing

**Flow**:
1. Traverse directory tree (WalkDir)
2. For each file:
   - Compare `mtime + size` in DB, skip if unchanged
   - For `.md` files: call `extractor.rs` to parse content, extract links/embeds/tags
   - For non-`.md` files: `name` includes extension (e.g., `image.png`), no content parsing
   - Insert/update DB
3. Deletion detection: entries in DB but not in filesystem → delete
4. Conflict detection: files with same name → warn and skip
5. Optional backlink computation: when enabled, traverse all notes' `links` and populate target note's `backlinks`
6. Commit transaction

This flow runs automatically before DB-backed commands such as `query`, `note verify`, `note render`, and `template list`.
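
Step 5 of the flow, backlink computation, amounts to inverting the `links` relation. A minimal sketch, assuming notes are keyed by their unique name and that links to notes missing from the index are simply ignored; `compute_backlinks` is a hypothetical name:

```rust
use std::collections::BTreeMap;

/// Inverts a `note name -> outgoing links` map into `note name -> backlinks`.
fn compute_backlinks(
    links: &BTreeMap<String, Vec<String>>,
) -> BTreeMap<String, Vec<String>> {
    let mut backlinks: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for (source, targets) in links {
        for target in targets {
            // Only notes that exist in the index receive backlinks.
            if links.contains_key(target) {
                let entry = backlinks.entry(target.clone()).or_default();
                // A note linking twice to the same target counts once.
                if !entry.contains(source) {
                    entry.push(source.clone());
                }
            }
        }
    }
    backlinks
}
```

A `BTreeMap` is used here only so that iteration order, and therefore the order of each backlink list, is deterministic.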

### `query`

**Two input modes**:
- **Expression mode**: `author == 'Tom'` → `SELECT * FROM notes WHERE author == 'Tom'`
- **SQL mode**: `SELECT path FROM notes WHERE ...` → execute directly (field names only translated)

**Field translation**:
```sql
-- Expression: author == 'John' (bare = frontmatter)
-- Translates to:
SELECT * FROM notes WHERE json_extract_string(properties, '$."author"') = 'John'

-- Expression: file.name == 'readme' (file.* = direct column)
-- Translates to:
SELECT * FROM notes WHERE name = 'readme'

-- SQL: SELECT file.name, author FROM notes WHERE author = 'John'
-- Translates to:
SELECT name, json_extract_string(properties, '$."author"') FROM notes WHERE json_extract_string(properties, '$."author"') = 'John'
```

**Special handling**:
- `file.*` fields in `list_contains()` use native column access
- `note.*` or bare fields in `list_contains()` use `(properties->'$."field"')::VARCHAR[]` for frontmatter arrays

**Error mapping**: DuckDB errors converted to user-friendly messages (unknown field, type cast failure, etc.)

### `note rename`

**Flow**:
1. Find note by name (not path)
2. Uniqueness check: if file with same name exists → fail
3. Rename file
4. Full vault scan: update `[[old-name]]` and `![[old-name]]` in body and frontmatter
5. Preserve aliases, anchors, and block IDs: `[[old-name#Heading|alias]]` → `[[new-name#Heading|alias]]`
6. Reindex affected notes
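
Steps 4 and 5 can be illustrated with a small rewriting function. This is a sketch under assumptions, not the code in `renamer.rs`: `rewrite_links` is a hypothetical name, it uses plain string scanning, and it does not skip code blocks.

```rust
/// Replaces the target of every `[[old...]]` / `![[old...]]` link with `new`,
/// leaving `#anchor`, `#^block-id`, and `|alias` suffixes untouched.
fn rewrite_links(text: &str, old: &str, new: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(open) = rest.find("[[") {
        let Some(len) = rest[open + 2..].find("]]") else { break };
        let inner = &rest[open + 2..open + 2 + len];
        // The target ends at the first `#` (anchor or block ID) or `|` (alias).
        let split_at = inner.find(['#', '|']).unwrap_or(inner.len());
        let (target, suffix) = inner.split_at(split_at);
        // Copy everything up to and including `[[` (this keeps a leading `!`).
        out.push_str(&rest[..open + 2]);
        // Only an exact name match is rewritten, so `old-name-2` is left alone.
        out.push_str(if target == old { new } else { target });
        out.push_str(suffix);
        out.push_str("]]");
        rest = &rest[open + 2 + len + 2..];
    }
    out.push_str(rest);
    out
}
```

Because the same function works on any text, it could be applied to both the body and quoted frontmatter values such as `related: "[[target]]"`.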

## 7. Performance Targets

| Metric | Target |
|--------|--------|
| Cold start index 10,000 notes | < 5s |
| Complex query latency (10k rows) | < 50ms |
| Complex expression compilation | < 100ms |

**Optimization strategies**:
- Incremental updates avoid full scans
- Index optimization (mtime, folder, name)
- Binary size optimization (`strip = true`)

## 8. Constraints and Security

- **Single writer**: DuckDB constraint, only one indexing pass can write at a time
- **Query security**: Only SELECT statements allowed, reject multi-statement injection
- **Error handling**: Comprehensive use of `Result` and `?` operator
- **Thread safety**: Use `Mutex<Database>` for multi-threaded scenarios
- **Graceful shutdown**: `Database` implements Drop trait
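
The single-writer and thread-safety constraints can be sketched with a stand-in type. Everything here is illustrative: the `Database` below only counts writes and does not wrap a DuckDB connection, and `index_in_parallel` is a hypothetical helper.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Stand-in for the real `Database`; it only counts writes.
struct Database {
    writes: usize,
}

impl Database {
    fn upsert(&mut self, _path: &str) {
        self.writes += 1;
    }
}

impl Drop for Database {
    fn drop(&mut self) {
        // The real type would close its DuckDB connection here.
    }
}

/// Spawns one thread per path; the mutex lets only one of them write at a time.
fn index_in_parallel(db: &Arc<Mutex<Database>>, paths: Vec<String>) {
    let handles: Vec<_> = paths
        .into_iter()
        .map(|path| {
            let db = Arc::clone(db);
            thread::spawn(move || {
                // A poisoned lock means another writer panicked; skip the write.
                if let Ok(mut guard) = db.lock() {
                    guard.upsert(&path);
                }
            })
        })
        .collect();
    for handle in handles {
        // Ignore panics from worker threads in this sketch.
        let _ = handle.join();
    }
}
```

When the last `Arc` clone goes out of scope, `Drop` runs exactly once, which is what makes it a reasonable place for connection cleanup.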

## 9. Development Commands

```bash
# Development build
cargo build

# Run
export MARKBASE_BASE_DIR=./notes
cargo run -- index
cargo run -- query "file.name == 'readme'"

# Test
cargo test
cargo test -- --nocapture

# Release build
cargo build --release

# Code checks
cargo clippy -- -D warnings
cargo fmt --check
```

## 10. Project Structure

```
markbase/
├── Cargo.toml           # Dependency configuration
├── Cargo.lock           # Dependency lock file
├── README.md            # User documentation
├── AGENTS.md            # This file
├── spec/                # Design specifications
│   ├── index_incremental_design.md  # Index incremental update design
│   ├── links_design.md              # Links, backlinks, and embeds design
│   ├── properties_design.md         # Properties and query translation design
│   ├── query_design.md              # Query command design
│   └── template_schema.md           # MTS (Markdown Template Schema)
├── src/
│   ├── main.rs          # CLI entry point
│   ├── lib.rs           # Library exports
│   ├── constants.rs     # Shared constants (DB schema, field names, etc.)
│   ├── db.rs            # Database operations
│   ├── scanner.rs       # Index scanning
│   ├── extractor.rs     # Content extraction
│   ├── creator.rs       # Note creation
│   ├── renamer.rs       # Note renaming
│   ├── verifier.rs      # Note verify command, MTS schema validation
│   ├── describe.rs      # Template description
│   ├── renderer/        # Note render system
│   │   ├── mod.rs       # note render command, .base embed expansion pipeline
│   │   ├── filter.rs    # Base filter → DuckDB SQL translation; column/sort translation
│   │   └── output.rs    # list / table output formatting; ColumnMeta definition
│   └── query/           # Query system
│       ├── mod.rs       # Output formatting
│       ├── detector.rs  # Mode detection
│       ├── translator.rs # Field translation
│       ├── error_map.rs # Error mapping
│       └── executor.rs  # Query execution
└── target/              # Build output
```

### Design Documents (spec/)

The `spec/` directory contains detailed design specifications that complement AGENTS.md:

| Document | Purpose |
|----------|---------|
| `index_incremental_design.md` | Three-phase incremental indexing algorithm, backlinks recomputation, conflict handling |
| `links_design.md` | Wiki-links, embeds, and backlinks extraction rules, regex patterns, rename rewrite logic |
| `properties_design.md` | File vs Note property namespaces, query translation rules, SQL generation |
| `query_design.md` | Query command user interface, expression vs SQL mode, security restrictions |
| `template_schema.md` | MTS v1.11 specification for template-based knowledge management |

**When to consult spec documents:**
- Before modifying extraction logic → see `links_design.md`
- Before changing query behavior → see `properties_design.md` and `query_design.md`
- Before optimizing indexing → see `index_incremental_design.md`
- When implementing MTS templates → see `template_schema.md`

## 11. Development Status

### Completed ✅

- Core indexing functionality
- Index all files (not just .md)
- Query system (SQL mode + expression mode)
- Field translation and security validation
- Multiple output formats (table/list)
- Backlink tracking
- Incremental update and deletion detection
- Note creation (template support)
- Note renaming (link updates)
- Template management
- Note schema verification (note verify)
- Note rendering with .base embed expansion (note render)

### Technical Debt

- Unit test output content verification (query/mod.rs)
- Negative stderr assertions in integration tests
- Performance benchmarking (10k notes target)
- Parallel index processing
- Configuration file support
- Query result caching

### Test Coverage

| Module | Coverage Scope |
|--------|----------------|
| `detector.rs` | Mode detection, expression splitting, security validation |
| `translator.rs` | Field translation, reserved fields, nested paths, type casts, array handling |
| `error_map.rs` | DuckDB error mapping |
| `executor.rs` | Query execution, error wrapping |
| `extractor.rs` | Frontmatter, tags, links, embeds parsing |
| `db.rs` | CRUD operations, queries |
| `scanner.rs` | Scanning, indexing, backlinks, deletion detection, conflict handling |
| `query/mod.rs` | Output formatting |
| `creator.rs` | Template parsing, note creation |
| `renamer.rs` | Link updates, renaming |
| `verifier.rs` | Note not found, no templates, location mismatch, required fields, type/enum/link validation |
| `renderer/filter.rs` | Filter translation, column/sort translation, ThisContext, merge_filters |
| `renderer/output.rs` | list/table format, is_name_col, is_list_col, empty results |
| `renderer/mod.rs` | CLI integration: note render happy path, dry-run, base not found, link(this) |
| `main.rs` | CLI argument parsing |

## 12. Development Workflow

### 12.1 Branch Strategy

**Never develop directly on `main` branch**

Create feature branches:
```bash
git checkout -b feat/<description>
git checkout -b fix/<description>
```

Branch prefixes: `feat/`, `fix/`, `refactor/`, `test/`, `docs/`

### 12.2 Pre-commit Checks

**Must pass the following checks**:

```bash
cargo clippy -- -D warnings  # Lint
cargo test                   # Test
cargo fmt --check            # Format
```

Do not commit if any check fails.

### 12.3 Documentation Sync

**When to update README.md**:
- Commands, options, or behavior changes
- Query operators or function changes
- Environment variable changes

**When to update AGENTS.md**:
- Dependencies or tech stack changes
- Data model changes
- Architecture or algorithm changes
- Performance targets or constraint changes
- Development status changes

### 12.4 Commit Message Convention

```
<type>(<scope>): <summary>

# Examples
feat(query): add exists() function support
fix(scanner): handle symlink cycles in walkdir
refactor(db): simplify upsert_note params
test(compiler): add coverage for nested JSON paths
docs(readme): update query operator table
```

### 12.5 Definition of Done

Task completion requires:

- [ ] Branch synced with main
- [ ] `cargo clippy -- -D warnings` passes
- [ ] `cargo test` passes
- [ ] `cargo fmt --check` passes
- [ ] User-visible behavior changes updated in README.md
- [ ] Architecture changes updated in AGENTS.md

## 13. Testing Strategy

**Test Value**: Verify meaningful behavior, not just code coverage

**Must write tests**:
- New feature → core behavior + key boundaries
- Bug fix → regression test
- Public API changes → review existing tests

**Good test characteristics**:
- Verify correct output or side effects
- Cover edge cases and error paths
- Breakable by incorrect implementations (naive implementations cannot pass)

**Avoid**:
- Writing tests solely for coverage
- Duplicating existing tests
- Being overly sensitive to unrelated refactors

## 14. Rust Best Practices

### 14.1 Error Handling

- Use `Box<dyn std::error::Error>` for command-level errors propagated to `main()` (consistent with existing modules)
- Reserve `thiserror` for structured error types that need to be matched by callers (currently none in this codebase)
- No `.unwrap()` / `.expect()` in non-test code
- Error messages should explain failure reason: `"failed to open {path}: {source}"` not `"io error"`
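
A short sketch of these rules in combination, using only the standard library. The function names `read_note` and `count_lines` are hypothetical and exist only for illustration.

```rust
use std::error::Error;
use std::fs;
use std::path::Path;

/// Reads a note, attaching the path to any I/O error.
fn read_note(path: &Path) -> Result<String, Box<dyn Error>> {
    fs::read_to_string(path)
        .map_err(|source| format!("failed to open {}: {source}", path.display()).into())
}

/// Propagates the contextual error with `?` instead of unwrapping.
fn count_lines(path: &Path) -> Result<usize, Box<dyn Error>> {
    let text = read_note(path)?;
    Ok(text.lines().count())
}
```

The context is added once, at the point where the path is known, so callers further up can use `?` without losing the reason for the failure.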

### 14.2 Dependency Management

- Check existing dependencies before adding new ones
- Use `cargo add <crate>` instead of manually editing `Cargo.toml`
- Use `cargo add <crate> --features <feature>` when features are needed

### 14.3 Code Style

- Keep functions short and focused
- Consider extracting any code block that needs a comment to explain it into its own function
- Prioritize correctness, optimize based on measurements

## 15. Code Reuse

**Respect module boundaries**:
- Each module has a clear responsibility
- Don't put logic in the wrong module just to avoid refactoring
- Move code explicitly when boundaries need to change

**Check before adding new functionality**:
- Does the logic already exist? Search first
- Does it belong to an existing module?
- Does the new module have a single clear responsibility?

**New CLI subcommands**:
- Follow existing `main.rs` patterns
- Implementation in separate files, not inline in `main.rs`

## 16. CLI UX Principles

### 16.1 Output Structure

- Default output provides structured summary (counts, status, warnings)
- `--verbose` for process details
- No uninformative confirmations (e.g., "Done!")

### 16.2 Output Targets

- Query results and structured data → stdout (supports piping)
- Warnings, errors, diagnostic info → stderr

### 16.3 Exit Codes

- Success → `0`
- Errors (affecting expected outcome) → non-zero
- Non-fatal warnings → `0` (but must report to stderr)
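
The output-target and exit-code rules can be combined in one place. A minimal sketch, assuming a hypothetical `Outcome` type and `report` function; the concrete non-zero value `1` is also an assumption, since the rules above only require it to be non-zero.

```rust
/// Hypothetical summary of a finished command.
enum Outcome {
    Success,
    Warnings(Vec<String>),
    Failure(String),
}

/// Writes diagnostics to stderr and returns the process exit code.
fn report(outcome: &Outcome) -> u8 {
    match outcome {
        Outcome::Success => 0,
        Outcome::Warnings(warnings) => {
            // Non-fatal: still exit 0, but every warning goes to stderr.
            for warning in warnings {
                eprintln!("warning: {warning}");
            }
            0
        }
        Outcome::Failure(reason) => {
            eprintln!("error: {reason}");
            1
        }
    }
}
```

In `main`, the returned value could be converted with `std::process::ExitCode::from(code)`, which keeps stdout free for query results and other pipeable data.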

### 16.4 Output Examples

```
# index
Indexing ./notes...
  ✓ 142 files indexed  (3 new, 5 updated, 0 errors)  [1.2s]
  ⚠ Skipped: notes/broken.md — invalid frontmatter (line 4)

# query
path                      mtime
────────────────────────  ───────────────────
./notes/task-a.md         2025-01-10 09:00:00
./notes/task-b.md         2025-01-12 14:30:00

2 results
```