# ocloc Plan
_A concise, actionable plan for the agent to build a `cloc`-like CLI in Rust._
---
## Project Goal
Build `ocloc`, a reliable, fast, and testable CLI tool that counts lines of code with per-language breakdowns, supports common ignore rules, is parallelized for performance, and can output human-friendly and machine-readable formats (table, CSV, JSON).
## Guiding Principles
- Keep the tool simple and correct before optimizing.
- Make parsing deterministic and well-tested.
- Favor explicit ownership and small functions to keep the borrow checker manageable.
- Incrementally add features behind CLI flags.
## High-level Milestones
1. Minimal working prototype (already scaffolded)
2. Correct comment/blank/code classification with unit tests
3. File-system filters: extensions, `.gitignore`, file size, explicit include/exclude
4. Parallel processing with rayon and a configurable thread pool
5. Output formats: pretty table, JSON, CSV
6. CLI polish: progress bar, verbose logging, dry-run, config file
7. CI, tests, linting, and release process
## Minimum Viable Product (MVP)
- Walk a directory recursively and list files to analyze.
- For each file count: total, blank, comment, code lines.
- Support common extensions: rs, py, js, ts, c, cpp, java, go, sh, pl, html, css.
- Single-threaded or rayon-based processing (either OK for MVP).
- Print a human-readable summary and per-extension breakdown.
- Include unit tests for the parser.
## Feature Breakdown (agent action items)
### 1. Core file traversal
- Use `walkdir` to recurse directories.
- Follow symbolic links only when `--follow-symlinks` is provided.
- Skip binary files by size or simple heuristic (non-UTF8 first chunk).
- Produce a stream/Vec of `PathBuf` to analyze.
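The binary-skipping heuristic above could be sketched as follows, using only the standard library. The names `looks_binary`, `is_probably_binary`, and `SNIFF_LEN`, as well as the 8 KiB prefix size and the extra NUL-byte check, are assumptions made for illustration and are not part of the plan.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Number of leading bytes to inspect (hypothetical value for this sketch).
const SNIFF_LEN: usize = 8192;

/// Returns true when the chunk looks like binary data.
pub fn looks_binary(chunk: &[u8]) -> bool {
    // A NUL byte is valid UTF-8 but very unusual in source text,
    // so this sketch treats it as a sign of binary content.
    if chunk.contains(&0) {
        return true;
    }
    match std::str::from_utf8(chunk) {
        Ok(_) => false,
        // `error_len() == None` means the input ended in the middle of a
        // multi-byte character, which is expected when reading a fixed-size
        // prefix, so only a genuinely invalid byte counts as binary.
        Err(e) => e.error_len().is_some(),
    }
}

/// Reads at most `SNIFF_LEN` bytes from `path` and applies the heuristic.
pub fn is_probably_binary(path: &Path) -> io::Result<bool> {
    let mut buf = Vec::with_capacity(SNIFF_LEN);
    File::open(path)?
        .take(SNIFF_LEN as u64)
        .read_to_end(&mut buf)?;
    Ok(looks_binary(&buf))
}
```

Keeping the byte-level check separate from the file I/O makes the heuristic testable with plain byte slices.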
### 2. Comment & blank detection engine
- Create a `Language` struct:
```rust
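struct Language {
    /// Canonical display name, e.g. "Rust".
    name: &'static str,
    /// File extensions without the leading dot.
    extensions: &'static [&'static str],
    /// Markers that start a comment running to end of line.
    line_markers: &'static [&'static str],
    /// Optional (open, close) pair for block comments.
    block_markers: Option<(&'static str, &'static str)>,
}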
```
- Provide a language registry loaded from JSON (`assets/languages.json`) and a helper to find language by extension or special filename. Use `include_str!` + `once_cell::sync::Lazy` to parse once at startup and build fast lookup maps.
- The analyzer should be line-based and maintain a minimal `State` for block comments and string-literal heuristics when needed.
- Edge cases to test explicitly:
- Block comment start and end on same line.
- Nested block comments when the language allows them (or document that nested comments are unsupported).
- Triple-quoted strings in Python that are not comments.
- Shebang lines for scripts that imply language.
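One possible shape for the line-based analyzer is sketched below. It reuses the `Language` fields from above; `LineKind`, `State`, `classify_line`, and the `C_LIKE` constant are hypothetical names introduced here. The sketch deliberately ignores string literals and nested block comments, so a block opener inside a string would be misclassified.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LineKind { Blank, Comment, Code }

pub struct Language {
    pub name: &'static str,
    pub extensions: &'static [&'static str],
    pub line_markers: &'static [&'static str],
    pub block_markers: Option<(&'static str, &'static str)>,
}

/// Example entry used for illustration only.
pub const C_LIKE: Language = Language {
    name: "C",
    extensions: &["c", "h"],
    line_markers: &["//"],
    block_markers: Some(("/*", "*/")),
};

/// State carried from one line to the next.
#[derive(Debug, Default)]
pub struct State { pub in_block: bool }

pub fn classify_line(line: &str, lang: &Language, state: &mut State) -> LineKind {
    let mut rest = line.trim();
    if rest.is_empty() {
        return LineKind::Blank;
    }
    let mut saw_code = false;
    while !rest.is_empty() {
        if state.in_block {
            // Inside a block comment: look for the closing marker on this line.
            let close = lang.block_markers.map(|(_, c)| c).unwrap_or("");
            match rest.find(close) {
                Some(i) => {
                    state.in_block = false;
                    rest = rest[i + close.len()..].trim_start();
                }
                None => break, // the comment continues on the next line
            }
            continue;
        }
        // Earliest line-comment marker and block opener in the remaining text.
        let line_pos = lang.line_markers.iter().filter_map(|m| rest.find(*m)).min();
        let block_pos = lang.block_markers.and_then(|(open, _)| rest.find(open));
        match (line_pos, block_pos) {
            (None, None) => { saw_code = true; break; }
            (Some(lp), bp) if bp.map_or(true, |b| lp <= b) => {
                // Any text before the marker is code; the rest is a line comment.
                if lp > 0 { saw_code = true; }
                break;
            }
            (_, Some(bp)) => {
                if bp > 0 { saw_code = true; }
                let open_len = lang.block_markers.map_or(0, |(o, _)| o.len());
                state.in_block = true;
                rest = &rest[bp + open_len..];
            }
            (Some(_), None) => unreachable!("covered by the guarded arm above"),
        }
    }
    if saw_code { LineKind::Code } else { LineKind::Comment }
}
```

A line is reported as `Code` as soon as any non-comment text appears on it, so `x = 1; /* note */` counts as code while `/* note */` alone counts as a comment.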
### 3. Per-file analyzer API
- Signature: `fn analyze_file(path: &Path) -> Result<FileCounts>`
- `FileCounts`:
```rust
struct FileCounts { files: usize, total: usize, code: usize, comment: usize, blank: usize }
```
- Analyzer must be well-covered by unit tests using temporary files and string fixtures.
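To keep the analyzer easy to test with string fixtures, the counting logic can work on any `BufRead` and leave file opening to a thin wrapper. The sketch below is a simplification under stated assumptions: `analyze_reader` is a hypothetical helper, the extra `line_marker` parameter stands in for the language lookup, and only line comments are handled.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;

#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct FileCounts {
    pub files: usize,
    pub total: usize,
    pub code: usize,
    pub comment: usize,
    pub blank: usize,
}

/// Counts lines from any buffered reader.
/// Simplification: only one line-comment marker, no block comments.
pub fn analyze_reader<R: BufRead>(reader: R, line_marker: &str) -> io::Result<FileCounts> {
    let mut counts = FileCounts { files: 1, ..FileCounts::default() };
    for line in reader.lines() {
        // `lines()` yields an error for invalid UTF-8; it is propagated here.
        let line = line?;
        let trimmed = line.trim();
        counts.total += 1;
        if trimmed.is_empty() {
            counts.blank += 1;
        } else if trimmed.starts_with(line_marker) {
            counts.comment += 1;
        } else {
            counts.code += 1;
        }
    }
    Ok(counts)
}

/// Thin wrapper that opens the file and delegates to `analyze_reader`.
pub fn analyze_file(path: &Path, line_marker: &str) -> io::Result<FileCounts> {
    analyze_reader(BufReader::new(File::open(path)?), line_marker)
}
```

Because `&[u8]` implements `BufRead`, a unit test can call `analyze_reader("...".as_bytes(), "//")` without touching the file system.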
### 4. Aggregation and output
- Aggregate results per-language (by canonical language name) and global totals.
- Output modes:
- `--summary` (default): pretty table grouped by language
- `--json`: machine-readable
- `--csv`: flat rows
- Implement output behind a trait `Formatter` with implementations for Table, JSON, CSV. This makes testing easier.
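A minimal sketch of the `Formatter` trait with two implementations follows. The method name `format`, the `BTreeMap` input (chosen so rows come out in a deterministic order), and the column layout are assumptions for illustration. The CSV writer does not escape language names that contain commas or quotes.

```rust
use std::collections::BTreeMap;
use std::fmt::Write;

#[derive(Debug, Default, Clone, Copy)]
pub struct FileCounts {
    pub files: usize,
    pub total: usize,
    pub code: usize,
    pub comment: usize,
    pub blank: usize,
}

/// Turns aggregated per-language counts into a printable string.
pub trait Formatter {
    fn format(&self, per_lang: &BTreeMap<String, FileCounts>) -> String;
}

pub struct CsvFormatter;
pub struct TableFormatter;

impl Formatter for CsvFormatter {
    fn format(&self, per_lang: &BTreeMap<String, FileCounts>) -> String {
        let mut out = String::from("language,files,total,code,comment,blank\n");
        for (lang, c) in per_lang {
            // Writing into a String cannot fail, so the result is ignored.
            let _ = writeln!(
                out, "{},{},{},{},{},{}",
                lang, c.files, c.total, c.code, c.comment, c.blank
            );
        }
        out
    }
}

impl Formatter for TableFormatter {
    fn format(&self, per_lang: &BTreeMap<String, FileCounts>) -> String {
        let mut out = format!(
            "{:<12}{:>8}{:>8}{:>8}{:>8}\n",
            "Language", "files", "code", "comment", "blank"
        );
        for (lang, c) in per_lang {
            let _ = writeln!(
                out, "{:<12}{:>8}{:>8}{:>8}{:>8}",
                lang, c.files, c.code, c.comment, c.blank
            );
        }
        out
    }
}
```

Returning a `String` instead of printing directly lets tests compare output exactly, and the CLI can pick an implementation at runtime through a `Box<dyn Formatter>`.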
### 5. CLI and UX
- Use `clap` v4 with an `Args` struct.
- Flags to include:
- `--path <PATH>` default `.`
- `--ext rs,py,js` to limit by extensions
- `--ignore-file <PATH>` support `.gitignore` and custom ignore
- `--threads <N>` or `--jobs` to control rayon threadpool
- `--json` / `--csv` / `--pretty`
- `--follow-symlinks`
- `--min-size` and `--max-size` for files
- `--progress` enable progress bar
- `--verbose` for debug logging
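The value passed to `--ext` needs to be normalized before it can be compared against file paths. A dependency-free sketch of that step is shown below; `parse_ext_list` and `has_allowed_ext` are hypothetical helper names, and the tolerance for leading dots, spaces, and mixed case is an assumption about the desired behavior.

```rust
use std::path::Path;

/// Splits a value such as `rs,py,js` into a sorted, de-duplicated list of
/// lowercase extensions without leading dots.
pub fn parse_ext_list(raw: &str) -> Vec<String> {
    let mut exts: Vec<String> = raw
        .split(',')
        .map(|s| s.trim().trim_start_matches('.').to_ascii_lowercase())
        .filter(|s| !s.is_empty())
        .collect();
    exts.sort();
    exts.dedup();
    exts
}

/// Returns true when the path's extension is in the allowed list
/// (compared case-insensitively). Paths without an extension never match.
pub fn has_allowed_ext(path: &Path, allowed: &[String]) -> bool {
    path.extension()
        .and_then(|e| e.to_str())
        .map(|e| allowed.iter().any(|a| a.eq_ignore_ascii_case(e)))
        .unwrap_or(false)
}
```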
### 6. Parallelism and performance
- Use rayon's `par_iter()` (from `rayon::prelude`) on the collected file list.
- Beware of shared mutable state; return per-file `FileCounts` and reduce with `.reduce`.
- Add benchmarks for typical repo sizes (1000 files, 100k lines) and tune thread pool.
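The map-then-reduce shape described above is sketched here with `std::thread::scope` instead of rayon, purely so the example has no dependencies. With rayon the same idea would be written as `files.par_iter().map(analyze).reduce(FileCounts::default, |a, b| a + b)`. The function name `parallel_sum` and the fixed chunking strategy are assumptions for illustration.

```rust
use std::ops::Add;
use std::thread;

#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct FileCounts {
    pub files: usize,
    pub total: usize,
    pub code: usize,
    pub comment: usize,
    pub blank: usize,
}

impl Add for FileCounts {
    type Output = FileCounts;
    fn add(self, o: FileCounts) -> FileCounts {
        FileCounts {
            files: self.files + o.files,
            total: self.total + o.total,
            code: self.code + o.code,
            comment: self.comment + o.comment,
            blank: self.blank + o.blank,
        }
    }
}

/// Applies `analyze` to every item on up to `threads` worker threads and
/// sums the results. No state is shared: each worker returns its own subtotal.
pub fn parallel_sum<T, F>(items: &[T], threads: usize, analyze: F) -> FileCounts
where
    T: Sync,
    F: Fn(&T) -> FileCounts + Sync,
{
    let threads = threads.max(1);
    // Ceiling division, at least 1 so `chunks` never receives zero.
    let chunk_len = ((items.len() + threads - 1) / threads).max(1);
    let analyze = &analyze;
    thread::scope(|s| {
        let handles: Vec<_> = items
            .chunks(chunk_len)
            .map(|chunk| {
                s.spawn(move || {
                    chunk
                        .iter()
                        .map(analyze)
                        .fold(FileCounts::default(), |a, b| a + b)
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .fold(FileCounts::default(), |a, b| a + b)
    })
}
```

Because addition of counts is associative and `FileCounts::default()` is its identity, the result does not depend on how the work is split across threads.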
### 7. Ignores and heuristics
- Implement `.gitignore` parsing using the `ignore` crate, or a simple parser that supports patterns and negations.
- Use file size and first-chunk UTF-8 check to skip binaries.
### 8. Testing
- Unit tests for analyzer logic using inline fixtures.
- Integration tests using a `tests/fixtures` folder with small example repos.
- CI: GitHub Actions matrix for stable toolchain; run `cargo test`, `cargo clippy`, `cargo fmt -- --check`.
### 9. Packaging and release
- Use `cargo-release`, or write release notes manually.
- Provide `install` instructions: `cargo install --path .`.
- Optionally publish to crates.io once stable.
## Data model and types (reference)
- `Language` (described above)
- `FileCounts` (per-file or aggregated)
- `AnalyzeResult { per_lang: HashMap<String, FileCounts>, totals: FileCounts, files_analyzed: usize }`
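The types listed above might fit together as in the following sketch. The `merge` and `record` method names are hypothetical, and deriving `files_analyzed` from each `FileCounts::files` value is an assumption about how the fields relate.

```rust
use std::collections::HashMap;

#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct FileCounts {
    pub files: usize,
    pub total: usize,
    pub code: usize,
    pub comment: usize,
    pub blank: usize,
}

impl FileCounts {
    /// Adds another set of counts into this one, field by field.
    pub fn merge(&mut self, other: &FileCounts) {
        self.files += other.files;
        self.total += other.total;
        self.code += other.code;
        self.comment += other.comment;
        self.blank += other.blank;
    }
}

#[derive(Debug, Default)]
pub struct AnalyzeResult {
    pub per_lang: HashMap<String, FileCounts>,
    pub totals: FileCounts,
    pub files_analyzed: usize,
}

impl AnalyzeResult {
    /// Records one file's counts under its canonical language name and
    /// keeps the global totals in step.
    pub fn record(&mut self, language: &str, counts: &FileCounts) {
        self.per_lang
            .entry(language.to_string())
            .or_default()
            .merge(counts);
        self.totals.merge(counts);
        self.files_analyzed += counts.files;
    }
}
```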
## Acceptance criteria
- `cargo test` passes and includes tests for comment parsing edge cases.
- Command `./target/release/ocloc .` returns plausible totals on a small repo.
- `--json` produces valid JSON of the shape `{ "totals": {...}, "languages": { "Rust": {...} } }`.
- Respect `.gitignore` by default when present.
- Reasonable performance on medium repos (e.g., tens of thousands of lines) courtesy of parallelism.
## Developer tasks (first sprint, itemized)
1. Implement `Language` registry and lookup by extension and filename.
2. Implement `analyze_file` with line and block comment handling. Add unit tests.
3. Implement file traversal using `walkdir` and basic filtering. Add integration test that runs analyzer on `tests/fixtures/simple_repo`.
4. Wire up CLI with `clap` and a `run` function that aggregates results.
5. Implement Table and JSON formatters and add tests for their outputs.
6. Add GitHub Actions workflow: run tests, clippy, fmt check on push and PR.
## Example commands the agent can run locally
- `cargo test --lib`
- `cargo run --release -- . --json > sample.json`
- `cargo clippy -- -D warnings`
- `cargo fmt -- --check`
## Notes and risks
- Python triple-quoted strings are hard to perfectly classify without AST parsing; document known limitations and aim for heuristic correctness.
- Nested block comments are language-specific; initially document as unsupported or implement per-language rules.
- `.gitignore` pattern support can get complex; leverage the `ignore` crate to avoid reimplementation.
## Extensions and future work
- Per-file language detection using content heuristics or `enry`/`linguist`-style detection.
- Add a language-agnostic token-based complexity estimate.
- Add a server mode to stream results over HTTP for large-scale analysis.
- Add a `--watch` mode to update counts incrementally.
---