trusty-git-analytics

Analyze git repositories to measure developer productivity — classify commit work types, track weekly velocity, and export CSV/JSON/Markdown reports.

What It Does

tga walks one or more local git repositories, collects every commit into a SQLite database, classifies each commit into a work category (feature, bugfix, refactor, etc.) using a four-tier rule cascade, then aggregates the results into per-author and per-week reports. It is a Rust port of the Python gitflow-analytics tool and keeps the same YAML config schema and SQLite schema, so existing config files work without modification.

Quick Start

Installation

# From crates.io (once published)
cargo install tga-cli

# From source
git clone https://github.com/bobmatnyc/trusty-git-analytics
cd trusty-git-analytics
cargo build --release
# Binary: ./target/release/tga

Run Your First Analysis

Step 1 — Create a config.yaml:

repositories:
  - path: ~/code/my-project
    name: my-project

output:
  directory: ./reports
  formats: [csv, json, markdown]

Step 2 — Run the full pipeline:

tga analyze --config config.yaml

Step 3 — Find reports in ./reports/:

reports/
├── authors.csv         # Per-author commit summary
├── weekly_activity.csv # Week-by-week breakdown
├── report.json         # Full structured payload
└── report.md           # Narrative Markdown report

Configuration

Minimal config.yaml

repositories:
  - path: ~/code/my-repo
    name: my-repo

All other sections are optional. When output.formats is omitted, all three formats (CSV, JSON, Markdown) are written.

Full config reference

Key	Type	Default	Description
`repositories`	list	required	Repos to analyze
`developer_aliases`	map	`{}`	Canonical name → list of emails/aliases
`team`	object	—	Alternative to `developer_aliases`; roster with email
`output.directory`	path	`./reports`	Where reports are written
`output.formats`	list	`[csv, json, markdown]`	`csv`, `json`, and/or `markdown`
`output.include_unclassified`	bool	`false`	Include commits with no category
`output.include_merges`	bool	`false`	Include merge commits
`output.include_files`	bool	`false`	Include file-level change detail
`classification.rules_file`	path	—	Path to custom rules YAML/JSON
`classification.use_llm`	bool	`false`	Enable LLM fallback tier
`classification.llm_model`	string	`gpt-4o-mini`	LLM model identifier
`classification.confidence_threshold`	float	`0.7`	Minimum acceptance confidence
`github.token`	string	`$GITHUB_TOKEN`	GitHub PAT for PR fetch
`github.org`	string	—	Org slug for org-wide PR queries
`github.repo`	string	—	Single repo slug (`owner/name`)
`github.fetch_prs`	bool	`false`	Fetch pull request metadata
`jira.url`	string	—	JIRA base URL
`jira.username`	string	—	JIRA API username (email for Cloud)
`jira.token`	string	—	JIRA API token
`jira.project_key`	string	—	Project key filter (e.g. `API`)
`cache.directory`	path	—	Cache directory (supports `~`)
`version`	string	—	Schema version; stored for compatibility
`profile`	string	—	Named profile; stored for compatibility

Paths support ~ expansion. Config files from the Python gitflow-analytics tool load without changes — unknown keys are silently ignored.

developer_aliases vs team.members

developer_aliases (Python-compatible flat map):

developer_aliases:
  "Alice Smith":
    - "alice@company.com"
    - "asmith@company.com"
    - "alice@personal.dev"
  "Bob Jones":
    - "bob@company.com"
    - "129991831+bobgithub@users.noreply.github.com"

team.members (structured roster with canonical email):

team:
  members:
    - name: Alice Smith
      email: alice@company.com
      aliases:
        - asmith@company.com
        - alice@personal.dev

When developer_aliases is non-empty it takes precedence over team.members. Use developer_aliases when migrating an existing Python config file; use team.members for new setups where canonical email matters for downstream tooling.
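
The precedence rule above can be sketched as follows. This is a minimal illustration, not tga's actual code: the TeamMember struct and the build_identity_map and resolve_author functions are hypothetical names, and lowercasing emails before lookup is an assumption the README does not confirm.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory form of one `team.members` entry.
struct TeamMember {
    name: String,
    email: String,
    aliases: Vec<String>,
}

/// Build an email -> canonical name lookup.
/// A non-empty `developer_aliases` map wins; otherwise `team.members` is used.
/// ASSUMPTION: emails are compared case-insensitively (lowercased here).
fn build_identity_map(
    developer_aliases: &HashMap<String, Vec<String>>,
    team_members: &[TeamMember],
) -> HashMap<String, String> {
    let mut lookup = HashMap::new();
    if !developer_aliases.is_empty() {
        for (name, emails) in developer_aliases {
            for email in emails {
                lookup.insert(email.to_lowercase(), name.clone());
            }
        }
    } else {
        for member in team_members {
            lookup.insert(member.email.to_lowercase(), member.name.clone());
            for alias in &member.aliases {
                lookup.insert(alias.to_lowercase(), member.name.clone());
            }
        }
    }
    lookup
}

/// Resolve a commit author; unknown emails fall back to the raw author name.
fn resolve_author(lookup: &HashMap<String, String>, email: &str, raw_name: &str) -> String {
    lookup
        .get(&email.to_lowercase())
        .cloned()
        .unwrap_or_else(|| raw_name.to_string())
}
```

With both sections present, resolve_author returns the name from developer_aliases; team.members is consulted only when the alias map is empty.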

Example: multi-repo config with GitHub

See configs/duetto-contractors.yaml for a working real-world example that covers multiple repositories, developer aliases, and CSV+Markdown output.

CLI Reference

All subcommands accept these global flags:

Flag	Default	Description
`--config <PATH>`	`config.yaml`	Path to config YAML
`--database <PATH>`	`tga.db`	Path to SQLite database
`-v / -vv / -vvv`	warnings only	Increase log verbosity

tga analyze

Run the full pipeline: collect → classify → report.

tga analyze [--config <PATH>] [--database <PATH>] [--output <DIR>]
            [--skip-collect] [--skip-classify]

Flag	Description
`--skip-collect`	Skip Stage 1; use commits already in the database
`--skip-classify`	Skip Stage 2; use existing classifications
`--output <DIR>`	Override `output.directory` from config

# Full pipeline
tga analyze --config config.yaml

# Re-run reports only (commits already collected and classified)
tga analyze --skip-collect --skip-classify --output ./reports-v2

tga collect

Stage 1: extract commits from git repositories into the database.

tga collect [--config <PATH>] [--database <PATH>]
            [--repos <NAME,...>] [--since <DATE>] [--until <DATE>]

Flag	Description
`--repos <NAME,...>`	Comma-separated list of repository names to collect; others are skipped
`--since <DATE>`	Collect commits on or after this ISO 8601 date (overrides config)
`--until <DATE>`	Collect commits on or before this ISO 8601 date (overrides config)

tga collect --repos my-project --since 2024-01-01 --until 2024-03-31
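
Both bounds are inclusive, per the flag descriptions. A minimal sketch of that window check, assuming dates are plain zero-padded YYYY-MM-DD strings (the in_window function is a hypothetical helper, not tga's API):

```rust
/// Returns true when `commit_date` (YYYY-MM-DD) lies inside the inclusive
/// [since, until] window. A missing bound leaves that side open.
/// Zero-padded ISO 8601 calendar dates sort lexicographically, so plain
/// string comparison is sufficient for this sketch.
fn in_window(commit_date: &str, since: Option<&str>, until: Option<&str>) -> bool {
    let after_start = since.map_or(true, |s| commit_date >= s);
    let before_end = until.map_or(true, |u| commit_date <= u);
    after_start && before_end
}
```

For the command above, commits dated 2024-01-01 and 2024-03-31 are both kept, while 2024-04-01 is excluded.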

tga classify

Stage 2: run the classification cascade over collected commits.

tga classify [--config <PATH>] [--database <PATH>]
             [--rules <PATH>] [--use-llm]

Flag	Description
`--rules <PATH>`	Override `classification.rules_file` from config
`--use-llm`	Enable LLM fallback regardless of config setting

tga classify --rules ./custom-rules.yaml --use-llm

tga report

Stage 3: generate reports from classified commits.

tga report [--config <PATH>] [--database <PATH>]
           [--output <DIR>] [--formats <FMT,...>]

Flag	Description
`--output <DIR>`	Override `output.directory` from config
`--formats <FMT,...>`	Comma-separated: `csv`, `json`, `markdown`

tga report --output ./q1-reports --formats csv,json

Pipeline Architecture

git repos ───┐
             │  tga-collect   SQLite (tga.db)  tga-classify   SQLite   tga-report
GitHub API ──┼─────────────► [commits]        ──────────────► [classif]─────────► CSV
JIRA API ────┘ (libgit2,      [authors]          (rules +              ► JSON
                reqwest)      [pull_requests]     LLM fallback)        ► Markdown

Stage 1 — collect (tga-collect): opens each repository with libgit2, walks the configured branch, extracts commit metadata and diff stats, resolves author identities, optionally fetches GitHub PR metadata via the REST API, and writes everything to SQLite.

Stage 2 — classify (tga-classify): reads unclassified commits from the database, runs each message through the four-tier cascade (see below), and writes a classification verdict back. Tiers 1–3 run across commits in parallel via Rayon; within a single commit the tiers are still tried in order.

Stage 3 — report (tga-report): reads the classified database, aggregates per-author and per-week statistics, and writes the configured output formats to the output directory.
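
As a rough illustration of how the three stages relate to the --skip-collect and --skip-classify flags of tga analyze, the sketch below computes which stages would run. The Stage enum and planned_stages function are hypothetical names chosen for this example, not part of the tga codebase.

```rust
/// The three pipeline stages, in execution order.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Stage {
    Collect,
    Classify,
    Report,
}

/// Decide which stages `tga analyze` would run for a given flag combination.
/// Reporting always runs; the two skip flags drop the earlier stages.
fn planned_stages(skip_collect: bool, skip_classify: bool) -> Vec<Stage> {
    let mut stages = Vec::new();
    if !skip_collect {
        stages.push(Stage::Collect);
    }
    if !skip_classify {
        stages.push(Stage::Classify);
    }
    stages.push(Stage::Report);
    stages
}
```

Passing both skip flags leaves only the report stage, which matches the "re-run reports only" invocation in the CLI reference.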

Classification

Four-Tier Cascade

Each commit message is tested against tiers in order. The first match wins.

Tier 1 — Exact (Aho-Corasick): builds a single finite-state machine from all keyword lists across all rules, then scans the message in O(n) time. Matches feat:, fix:, chore:, etc. Confidence: 0.85–0.95.

Tier 2 — Regex: applies pre-compiled regex patterns from rules. Handles anchored conventional-commit patterns (^feat(\([^)]*\))?!?:) and JIRA ticket IDs (\b[A-Z][A-Z0-9]+-\d+\b).

Tier 3 — Fuzzy heuristics: detects merge commits (via is_merge flag or "Merge pull request" prefix) and reverts (via "Revert" prefix). No external dependencies.

Tier 4 — LLM fallback (optional, async): calls an OpenAI-compatible API when tiers 1–3 all fail. Reads OPENAI_API_KEY from the environment. Disabled by default; enable with classification.use_llm: true or --use-llm.
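
The first-match-wins behaviour of tiers 1 to 3 can be sketched with the standard library alone. This is an illustration under stated assumptions, not the tga-classify implementation: plain substring search stands in for Aho-Corasick, a hand-written prefix check stands in for the regex tier, the Verdict and Commit types are hypothetical, the "merge" category name is invented for the example, and the confidence values for tiers 2 and 3 are placeholders. Tier 4 (the LLM call) is omitted.

```rust
/// Outcome of classification. `tier` records which tier matched (0 = none).
#[derive(Debug, PartialEq)]
struct Verdict {
    category: &'static str,
    confidence: f64,
    tier: u8,
}

/// Minimal commit view used by this sketch (hypothetical shape).
struct Commit {
    message: String,
    is_merge: bool,
}

/// Tier 1 stand-in: plain substring search. The real tier uses Aho-Corasick.
fn tier1_keywords(c: &Commit) -> Option<Verdict> {
    const KEYWORDS: &[(&str, &str)] =
        &[("feat:", "feature"), ("fix:", "bugfix"), ("chore:", "chore")];
    KEYWORDS
        .iter()
        .find(|(kw, _)| c.message.contains(*kw))
        .map(|&(_, category)| Verdict { category, confidence: 0.9, tier: 1 })
}

/// Tier 2 stand-in: hand-rolled check equivalent to `^TYPE(\([^)]*\))?!?:`.
/// The real tier applies pre-compiled regex patterns.
fn tier2_scoped_prefix(c: &Commit) -> Option<Verdict> {
    const TYPES: &[(&str, &str)] = &[("feat", "feature"), ("fix", "bugfix")];
    for &(ty, category) in TYPES {
        if let Some(rest) = c.message.strip_prefix(ty) {
            // Optional "(scope)" group.
            let rest = match rest.strip_prefix('(') {
                Some(inner) => match inner.find(')') {
                    Some(end) => &inner[end + 1..],
                    None => continue,
                },
                None => rest,
            };
            // Optional "!" breaking-change marker, then the mandatory colon.
            let rest = rest.strip_prefix('!').unwrap_or(rest);
            if rest.starts_with(':') {
                return Some(Verdict { category, confidence: 0.8, tier: 2 });
            }
        }
    }
    None
}

/// Tier 3: merge and revert heuristics as described above.
fn tier3_heuristics(c: &Commit) -> Option<Verdict> {
    if c.is_merge || c.message.starts_with("Merge pull request") {
        Some(Verdict { category: "merge", confidence: 0.8, tier: 3 })
    } else if c.message.starts_with("Revert") {
        Some(Verdict { category: "revert", confidence: 0.8, tier: 3 })
    } else {
        None
    }
}

/// Try each tier in order; `find_map` stops at the first tier that matches.
fn classify(c: &Commit) -> Verdict {
    let tiers: [fn(&Commit) -> Option<Verdict>; 3] =
        [tier1_keywords, tier2_scoped_prefix, tier3_heuristics];
    tiers
        .iter()
        .find_map(|tier| tier(c))
        .unwrap_or(Verdict { category: "uncategorized", confidence: 0.0, tier: 0 })
}
```

Note that ordering matters: a message such as `Revert "feat: x"` contains the tier 1 keyword feat: and would be classified there before the revert heuristic is ever consulted.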

Default Rules

ID	Category	Keywords / Patterns
`cc-feat`	`feature`	`feat:`, `feature:`, `^feat(...)?!?:`
`cc-fix`	`bugfix`	`fix:`, `bugfix:`, `hotfix`, `^fix(...)?!?:`
`cc-chore`	`chore`	`chore:`, `^chore(...)?!?:`
`cc-docs`	`documentation`	`docs:`, `doc:`, `^docs?(...)?!?:`
`cc-refactor`	`refactor`	`refactor:`, `^refactor(...)?!?:`
`cc-test`	`test`	`test:`, `tests:`, `^tests?(...)?!?:`
`cc-ci`	`ci`	`ci:`, `^ci(...)?!?:`
`cc-perf`	`performance`	`perf:`, `^perf(...)?!?:`
`cc-style`	`style`	`style:`, `^style(...)?!?:`
`cc-build`	`build`	`build:`, `^build(...)?!?:`
`cc-revert`	`revert`	`revert:`, `^revert(...)?!?:`
`breaking-change`	`breaking`	`breaking change`, `breaking-change`
`jira-ticket`	`feature` (ticketed)	`\b[A-Z][A-Z0-9]+-\d+\b`
`kw-bug`	`bugfix`	`bug`, `defect`
`kw-security`	`bugfix` (security)	`security`, `cve-`, `vulnerability`

Commits that match no rule are assigned category uncategorized with confidence 0.0.

Custom Rules File

Supply your own rules alongside the defaults:

# my-rules.yaml
version: "1.0"
rules:
  - id: my-deploy
    category: deployment
    keywords:
      - "deploy:"
      - "release:"
    patterns:
      - "(?i)^deploy(ment)?:"
    priority: 80
    confidence: 0.9

tga classify --rules ./my-rules.yaml
# or in config.yaml:
# classification:
#   rules_file: ./my-rules.yaml
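
One way custom rules might interact with the defaults is sketched below. The README does not define how priority is interpreted, so this example assumes that a larger priority wins when several rules match; the Rule struct, the best_match and merged_rules functions, and the priority of 50 used for a default rule in the usage note are all hypothetical. Regex patterns are left out to keep the sketch dependency-free.

```rust
/// Hypothetical in-memory form of one rule; field names mirror the YAML keys.
struct Rule {
    id: &'static str,
    category: &'static str,
    keywords: Vec<&'static str>,
    priority: u32,
    confidence: f64,
}

/// Custom rules are appended to the defaults ("alongside", per the text above).
fn merged_rules(defaults: Vec<Rule>, custom: Vec<Rule>) -> Vec<Rule> {
    defaults.into_iter().chain(custom).collect()
}

/// Return the matching rule with the highest priority, if any.
/// ASSUMPTION: a larger `priority` wins; the source does not define the ordering.
fn best_match<'a>(rules: &'a [Rule], message: &str) -> Option<&'a Rule> {
    rules
        .iter()
        .filter(|r| r.keywords.iter().any(|kw| message.contains(*kw)))
        .max_by_key(|r| r.priority)
}
```

Under these assumptions, a message that matches both a default rule at priority 50 and the my-deploy rule at priority 80 would be categorized as deployment.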

Output Formats

CSV

Two files are written when csv is in the format list:

authors.csv — one row per author:

Column	Description
`name`	Canonical author name
`email`	Canonical author email
`commit_count`	Total commits
`insertions`	Total lines added
`deletions`	Total lines deleted
`files_changed`	Total files changed
`first_commit`	ISO 8601 timestamp of earliest commit
`last_commit`	ISO 8601 timestamp of most recent commit
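
The per-author roll-up behind authors.csv can be approximated as below. This is a sketch, not the tga-report code: CommitRow, AuthorSummary, summarize, and to_csv are hypothetical names, timestamps are assumed to be UTC ISO 8601 strings in one consistent format (so string comparison orders them), files_changed is summed per commit rather than de-duplicated, and CSV quoting is omitted.

```rust
use std::collections::BTreeMap;

/// Hypothetical flattened commit record, already mapped to a canonical author.
struct CommitRow {
    author: String,
    email: String,
    insertions: u64,
    deletions: u64,
    files_changed: u64,
    timestamp: String, // e.g. "2024-01-05T10:00:00Z"
}

/// One output row, minus the author name (used as the map key).
#[derive(Debug, Default, PartialEq)]
struct AuthorSummary {
    email: String,
    commit_count: u64,
    insertions: u64,
    deletions: u64,
    files_changed: u64,
    first_commit: String,
    last_commit: String,
}

/// Fold commits into one summary per author, sorted by author name.
fn summarize(commits: &[CommitRow]) -> BTreeMap<String, AuthorSummary> {
    let mut out: BTreeMap<String, AuthorSummary> = BTreeMap::new();
    for c in commits {
        let s = out.entry(c.author.clone()).or_insert_with(|| AuthorSummary {
            email: c.email.clone(),
            first_commit: c.timestamp.clone(),
            last_commit: c.timestamp.clone(),
            ..Default::default()
        });
        s.commit_count += 1;
        s.insertions += c.insertions;
        s.deletions += c.deletions;
        s.files_changed += c.files_changed;
        if c.timestamp < s.first_commit {
            s.first_commit = c.timestamp.clone();
        }
        if c.timestamp > s.last_commit {
            s.last_commit = c.timestamp.clone();
        }
    }
    out
}

/// Render the summaries with the column order documented above.
/// Simplification: fields are not quoted or escaped.
fn to_csv(summaries: &BTreeMap<String, AuthorSummary>) -> String {
    let mut csv = String::from(
        "name,email,commit_count,insertions,deletions,files_changed,first_commit,last_commit\n",
    );
    for (name, s) in summaries {
        csv.push_str(&format!(
            "{},{},{},{},{},{},{},{}\n",
            name, s.email, s.commit_count, s.insertions, s.deletions,
            s.files_changed, s.first_commit, s.last_commit
        ));
    }
    csv
}
```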

weekly_activity.csv — one row per week/author/repository bucket:

Column	Description
`week`	ISO week label, e.g. `2024-W03`
`author`	Author name
`repository`	Repository name
`commit_count`	Commits in this bucket
`insertions`	Lines added in this bucket
`deletions`	Lines deleted in this bucket
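
The week column uses ISO week labels, which do not always share the calendar year of the date (late December and early January can belong to the neighbouring ISO year). The following sketch derives the label from a calendar date using only the standard library. The iso_week_label and iso_weeks_in_year functions are hypothetical helpers for illustration, valid for Gregorian dates with a positive year; they are not taken from tga-report, which may well use a date library instead.

```rust
/// Number of ISO weeks (52 or 53) in `year`.
fn iso_weeks_in_year(year: i32) -> u32 {
    // p(y) is the weekday of 31 December (0 = Sunday). A year has 53 weeks
    // when it ends on a Thursday or the previous year ends on a Wednesday.
    let p = |y: i32| (y + y / 4 - y / 100 + y / 400).rem_euclid(7);
    if p(year) == 4 || p(year - 1) == 3 {
        53
    } else {
        52
    }
}

/// Format a Gregorian date as an ISO week label such as "2024-W03".
fn iso_week_label(year: i32, month: u32, day: u32) -> String {
    const DAYS_BEFORE_MONTH: [u32; 12] =
        [0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334];
    // Sakamoto's month offsets for weekday calculation.
    const T: [i32; 12] = [0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4];

    let leap = (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
    // Day of the year, 1-based.
    let ordinal = DAYS_BEFORE_MONTH[(month - 1) as usize] + day + u32::from(leap && month > 2);

    // Weekday via Sakamoto's method: 0 = Sunday .. 6 = Saturday.
    let y = if month < 3 { year - 1 } else { year };
    let dow = (y + y / 4 - y / 100 + y / 400 + T[(month - 1) as usize] + day as i32)
        .rem_euclid(7);
    // ISO weekday: 1 = Monday .. 7 = Sunday.
    let iso_dow = if dow == 0 { 7 } else { dow as u32 };

    let week = (ordinal + 10 - iso_dow) / 7;
    let (iso_year, iso_week) = if week < 1 {
        // Belongs to the last week of the previous ISO year.
        (year - 1, iso_weeks_in_year(year - 1))
    } else if week > iso_weeks_in_year(year) {
        // Belongs to week 1 of the next ISO year.
        (year + 1, 1)
    } else {
        (year, week)
    };
    format!("{}-W{:02}", iso_year, iso_week)
}
```

For example, a commit dated 2024-01-15 lands in the 2024-W03 bucket, while one dated 2024-12-30 is reported under 2025-W01.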

JSON

report.json — full structured payload:

{
  "generated_at": "2024-03-15T10:00:00Z",
  "period_start": "2024-01-01T00:00:00Z",
  "period_end":   "2024-03-14T23:59:59Z",
  "total_commits": 347,
  "total_authors": 8,
  "category_breakdown": { "feature": 120, "bugfix": 45, ... },
  "authors": [
    {
      "name": "Alice Smith",
      "email": "alice@company.com",
      "commit_count": 87,
      "insertions": 4200,
      "deletions": 1100,
      "files_changed": 310,
      "categories": { "feature": 50, "bugfix": 20, ... },
      "first_commit": "...",
      "last_commit": "..."
    }
  ],
  "repositories": [
    {
      "name": "my-project",
      "commit_count": 347,
      "author_count": 8,
      "insertions": 18000,
      "deletions": 6000,
      "top_categories": [["feature", 120], ["bugfix", 45]]
    }
  ],
  "weekly_activity": [
    {
      "week": "2024-W03",
      "author": "Alice Smith",
      "repository": "my-project",
      "commit_count": 12,
      "insertions": 500,
      "deletions": 120,
      "categories": { "feature": 8, "bugfix": 4 }
    }
  ]
}
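
The category_breakdown and top_categories fields are both simple counts over classified commits. A minimal sketch of how such values could be derived is shown below; the function names mirror the JSON keys but are hypothetical, and the alphabetical tie-break is an assumption added so the output is deterministic.

```rust
use std::collections::HashMap;

/// Count commits per category (the shape of `category_breakdown`).
fn category_breakdown(categories: &[&str]) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    for cat in categories {
        *counts.entry(cat.to_string()).or_insert(0) += 1;
    }
    counts
}

/// Return up to `limit` (category, count) pairs, highest count first
/// (the shape of `top_categories`).
/// ASSUMPTION: ties are broken alphabetically by category name.
fn top_categories(counts: &HashMap<String, u64>, limit: usize) -> Vec<(String, u64)> {
    let mut pairs: Vec<(String, u64)> =
        counts.iter().map(|(k, v)| (k.clone(), *v)).collect();
    pairs.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    pairs.truncate(limit);
    pairs
}
```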

Markdown

report.md — a narrative report containing a summary header, per-author commit table, category breakdown, and weekly activity section. Suitable for pasting into Confluence or a PR description.

Development

Build and Test

# Build everything
cargo build

# Build release binary
cargo build --release

# Run all tests
cargo test

# Lint (zero warnings required)
cargo clippy -- -D warnings

# Format check (CI gate)
cargo fmt --check

# Auto-format
cargo fmt

# Generate rustdoc
cargo doc --open

Running Against Real Repos

configs/duetto-contractors.yaml is a working example that analyzes three repositories using developer_aliases. Adjust paths to match your local checkout:

tga analyze --config configs/duetto-contractors.yaml --database duetto.db

CI Gates

The GitHub Actions workflow (ci.yml) requires:

cargo fmt --all -- --check
cargo clippy --workspace --all-targets -- -D warnings
cargo test --workspace
RUSTDOCFLAGS="-D warnings" cargo doc --workspace --no-deps

Crate Structure

Crate	Purpose	crates.io
`tga-core`	Shared types, config, DB schema, migrations, error types	`tga-core`
`tga-collect`	Stage 1: git extraction (libgit2), GitHub/JIRA clients	`tga-collect`
`tga-classify`	Stage 2: four-tier classification cascade	`tga-classify`
`tga-report`	Stage 3: CSV/JSON/Markdown output	`tga-report`
`tga-cli`	Binary entry point (`tga`), clap CLI	`tga-cli`

License

MIT

tga 0.1.0