ocloc 0.4.1

Fast, reliable lines-of-code counter with JSON/CSV output
Documentation
# ocloc Plan 2 — Diff Mode & CI Integration

A concrete plan to introduce a “diff mode” for ocloc so teams can measure lines-of-code deltas by language between two Git refs (or the working tree) and integrate results into CI as tables, JSON, or Markdown summaries.

---

## Objective

- Report per-language and per-file LOC deltas (code, comment, blank, total) between a base and head.
- Make it easy to run in CI and gate builds based on thresholds.
- Keep performance high by only analyzing changed files and reading blobs directly from Git.

## Scope

- New CLI subcommand: `ocloc diff` (plus supporting flags)
- Git integration using `git2` (no shelling out to `git`)
- Analyzer refactor to support in-memory content (`analyze_reader`)
- New diff formatters (table, JSON, CSV, Markdown)
- CI-friendly thresholds and non-zero exit on violations
- Tests: unit + integration with a temporary Git repo

## CLI Design

- Subcommand: `ocloc diff`
- Sources:
  - `--base <rev>`: base commit/tag/ref
  - `--head <rev>`: head commit/tag/ref (default: `HEAD`)
  - `--merge-base <rev>`: compare `merge-base(HEAD, <rev>)` to `HEAD`
  - `--staged`: compare `HEAD` vs index
  - `--working-tree`: compare index vs working tree
- Filters and output:
  - `--ext rs,py,...` filter by extensions
  - `--by-file` include per-file rows
  - `--json`, `--csv`, `--markdown`, default to table
  - `--summary-only` hide per-file details
- CI guardrails:
  - `--max-code-added <N>`
  - `--max-total-changed <N>`
  - `--max-files <N>`
  - `--fail-on-threshold` (non-zero exit if any threshold exceeded)

## What To Measure

- Per-language deltas: files changed, code_added, code_removed, comment_added, blank_added, total_net
- Per-file deltas (optional): status (A/M/D/R), language, signed deltas
- Global totals and change metadata: files_added, files_deleted, files_modified, files_renamed

## Approach

- Use `git2` to:
  - Resolve base/head commits and diffs
  - Enumerate changed files with statuses (A/M/D/R)
  - Fetch file contents as blobs in both versions without writing to disk
- Counting strategy:
  - Analyze both sides and compute deltas:
    - Added: base = 0s; head = analyze(head blob)
    - Deleted: base = analyze(base blob); head = 0s
    - Modified: delta = head − base (per metric)
    - Renamed: treat under head path/language; include rename metadata
  - Aggregate per-language (default: head language; consider option to attribute deletions to base language)
  - Parallelize per file with `rayon`

## Analyzer Refactor

- Add `fn analyze_reader<R: std::io::BufRead>(rdr: R, path_hint: &Path) -> Result<FileCounts>`
  - `path_hint` is used for language detection (extension/special name)
  - Share logic with `analyze_file` (which becomes a thin wrapper)
- Keep behavior identical across `analyze_file` and `analyze_reader` (unit tests to assert parity)

## Output Formats

- Table: aligned columns with signed integers and a “Net” column
- JSON: CI-friendly schema (see sketch)
- CSV: flat rows; suitable for spreadsheets
- Markdown: table suitable for PR comments / job summaries

### JSON Schema (sketch)

```bash
{
  "base": { "ref": "abc123", "short": "abc1234" },
  "head": { "ref": "def456", "short": "def4567" },
  "files": 42,
  "files_added": 5,
  "files_deleted": 3,
  "files_modified": 32,
  "files_renamed": 2,
  "languages": {
    "Rust": { "files": 3, "code_added": 90, "code_removed": 10, "comment_added": 5, "blank_added": 5, "total_net": 80 }
  },
  "by_file": [
    { "path": "src/lib.rs", "status": "M", "language": "Rust", "delta": { "code": 10, "comment": 1, "blank": 0, "total": 11 } }
  ],
  "totals": { "code_added": 120, "code_removed": 25, "comment_added": 7, "blank_added": 4, "total_net": 106 }
}
```

## CI Integration

- GitHub Actions (example):

```bash
- uses: actions/checkout@v4
  with:
    fetch-depth: 0
- name: Build ocloc
  run: cargo build --release --locked
- name: LOC diff (table + JSON)
  run: |
    BASE="${{ github.event.pull_request.base.sha || github.event.before }}"
    HEAD="${{ github.sha }}"
    ./target/release/ocloc diff --base "$BASE" --head "$HEAD" --json > loc_diff.json
    ./target/release/ocloc diff --base "$BASE" --head "$HEAD" --markdown > loc_diff.md
    if [ -n "${GITHUB_STEP_SUMMARY:-}" ]; then
      cat loc_diff.md >> "$GITHUB_STEP_SUMMARY"
    fi
- name: Gate on LOC increase
  run: |
    CODE_ADDED=$(jq '.totals.code_added' loc_diff.json)
    if [ "$CODE_ADDED" -gt 2500 ]; then
      echo "Too many code lines added: $CODE_ADDED" >&2
      exit 1
    fi
```

- GitLab/Jenkins: consume JSON/CSV similarly; print Markdown/table in logs or MR comments.

## Testing Plan

- Unit tests
  - `analyze_reader` vs `analyze_file` parity on identical content
- Integration tests
  - Create a temp Git repo and cover: add, modify, delete, rename
  - Assert per-language and per-file deltas and totals
  - Verify JSON keys and signed values
- Edge cases
  - Shebang-only files without extensions
  - Rename across languages (`.txt` → `.md`): attributed under new language, net totals correct
  - Skip binaries (as with traversal heuristic)

## Performance Considerations

- Only analyze changed files
- Read blobs in memory (no temp files)
- Parallelize with rayon; avoid shared mutable state
- Reuse buffers in analyzer to minimize allocations

## Roadmap

- v0.2: Core diff mode
  - `ocloc diff --base <rev> --head <rev>` with table/json/csv
  - `analyze_reader` refactor
  - `git2` integration and initial tests
- v0.3: Working tree / staged support
  - `--staged`, `--working-tree`, `--merge-base`
- v0.4: Markdown formatter + GH Summary helper
  - `--markdown` output + README examples
  - Makefile `make diff BASE=<rev> HEAD=<rev>` (default `HEAD~1...HEAD`)
- v0.5: CI guardrails
  - Threshold flags + `--fail-on-threshold`
  - Optional per-language thresholds
- v0.6+: Enhancements
  - PR comment script using JSON output
  - SARIF or custom annotations if desired

## Docs & DX

- README: add “Diff Mode” and “CI Integration” sections with examples
- Update `documentation/` as features land
- Keep clippy, fmt, tests green

## Makefile Targets (planned)

- `make diff BASE=<rev> HEAD=<rev>` → runs `ocloc diff` with table output
- `make diff-json BASE=<rev> HEAD=<rev>` → writes `loc_diff.json`

## Risks / Known Limitations

- Language detection for renamed files: attribution defaults to head language (documented)
- Diff accuracy may vary with complex generated files or vendored code; consider future include/exclude patterns for diff mode
- Nested block comments remain out of scope (same as current analyzer)

---

This plan complements PLAN1 and focuses on CI workflows and developer insight into per-language changes per commit/PR while keeping performance and determinism as core goals.