trusty-git-analytics

Analyze git repositories to measure developer productivity — classify commit work types, track weekly velocity, and export CSV/JSON/Markdown reports.

What It Does

tga walks one or more local git repositories, collects every commit into a SQLite database, classifies each commit into a work category (feature, bugfix, refactor, etc.) using a seven-tier classification cascade, then aggregates the results into per-author, per-week, DORA, velocity, and quality reports. It is a feature-complete Rust port of gitflow-analytics with the same YAML config schema and the same SQLite schema — existing config files work without modification.

Installation

From crates.io (recommended)

cargo install tga

This installs the tga binary to ~/.cargo/bin/. Ensure ~/.cargo/bin is in your PATH.

From source

git clone https://github.com/bobmatnyc/trusty-tools
cd trusty-tools/crates/trusty-git-analytics
cargo install --path .

Verify installation

tga --version
tga --help

Quick Start

Run Your First Analysis

Step 1 — Create a config.yaml:

repositories:
  - path: ~/code/my-project
    name: my-project

output:
  directory: ./reports
  formats: [csv, json, markdown]

Step 2 — Run the full pipeline:

tga analyze --config config.yaml

Step 3 — Find reports in ./reports/:

A full run writes 14 files: 9 CSV, 4 JSON, and 1 Markdown report. The most commonly used ones are:

reports/
├── authors.csv         # Per-author commit summary
├── weekly_activity.csv # Week-by-week breakdown
├── ... (7 more CSV files: DORA, velocity, quality, etc.)
├── report.json         # Full structured payload
├── ... (3 more JSON files)
└── report.md           # Narrative Markdown report

Configuration

Minimal config.yaml

repositories:
  - path: ~/code/my-repo
    name: my-repo

All other sections are optional. When output.formats is omitted, all three formats (CSV, JSON, Markdown) are written.

Full config reference

Key	Type	Default	Description
`repositories`	list	required	Repos to analyze
`developer_aliases`	map	`{}`	Canonical name → list of emails/aliases
`team`	object	—	Alternative to `developer_aliases`; roster with email
`output.directory`	path	`./reports`	Where reports are written
`output.formats`	list	`[csv, json, markdown]`	`csv`, `json`, and/or `markdown`
`output.include_unclassified`	bool	`false`	Include commits with no category
`output.include_merges`	bool	`false`	Include merge commits
`output.include_files`	bool	`false`	Include file-level change detail
`classification.rules_file`	path	—	Path to custom rules YAML/JSON
`classification.use_llm`	bool	`false`	Enable LLM fallback tier
`classification.llm_model`	string	`gpt-4o-mini`	LLM model identifier
`classification.confidence_threshold`	float	`0.7`	Minimum acceptance confidence
`classification.llm_fallback_threshold`	float	`0.0`	Commits with confidence above this value skip the LLM tier
`classification.llm_fallback_concurrency`	uint	`8`	Max concurrent LLM requests during fallback
`github.token`	string	`$GITHUB_TOKEN`	GitHub PAT for PR fetch
`github.org`	string	—	Org slug for org-wide PR queries
`github.repo`	string	—	Single repo slug (`owner/name`)
`github.fetch_prs`	bool	`false`	Fetch pull request metadata
`github.ticket_regex`	string	—	Override regex for detecting GitHub ticket refs in commit messages
`jira.url`	string	—	JIRA base URL
`jira.username`	string	—	JIRA API username (email for Cloud)
`jira.token`	string	—	JIRA API token
`jira.project_key`	string	—	Project key filter (e.g. `API`)
`jira.ticket_regex`	string	—	Override regex for detecting JIRA ticket refs in commit messages
`linear.ticket_regex`	string	—	Override regex for detecting Linear ticket refs in commit messages
`pm.azure_devops.organization_url`	string	—	ADO org URL (e.g. `https://dev.azure.com/myorg`)
`pm.azure_devops.pat`	string	—	Azure DevOps Personal Access Token
`pm.azure_devops.project`	string	—	Default ADO project name
`pm.azure_devops.fetch_on_reference`	bool	`false`	Fetch work items when `AB#N` refs appear in commits
`pm.azure_devops.fetch_prs`	bool	`false`	Fetch ADO pull requests and reviewer data
`pm.azure_devops.ticket_regex`	string	`AB#(\d+)`	Override regex for detecting ADO work item refs in commit messages
`pm.bitbucket.workspace`	string	—	Bitbucket Cloud workspace slug
`pm.bitbucket.repo_slug`	string	—	Repository slug within the workspace
`pm.bitbucket.fetch_prs`	bool	`false`	Fetch Bitbucket Cloud pull request metadata
`pm.bitbucket.token`	string	`$BITBUCKET_TOKEN`	Bearer token (App password or OAuth)
`pm.bitbucket.username`	string	—	Atlassian account username for Basic auth
`pm.bitbucket.app_password`	string	—	Atlassian App password for Basic auth (alternative to `token`)
`cache.directory`	path	—	Cache directory (supports `~`)
`version`	string	—	Schema version; stored for compatibility
`profile`	string	—	Named profile; stored for compatibility

Paths support ~ expansion. Config files from the Python gitflow-analytics tool load without changes — unknown keys are silently ignored.

developer_aliases vs team.members

developer_aliases (Python-compatible flat map):

developer_aliases:
  "Alice Smith":
    - "alice@company.com"
    - "asmith@company.com"
    - "alice@personal.dev"
  "Bob Jones":
    - "bob@company.com"
    - "129991831+bobgithub@users.noreply.github.com"

team.members (structured roster with canonical email):

team:
  members:
    - name: Alice Smith
      email: alice@company.com
      aliases:
        - asmith@company.com
        - alice@personal.dev

When developer_aliases is non-empty it takes precedence over team.members. Use developer_aliases when migrating an existing Python config file; use team.members for new setups where canonical email matters for downstream tooling.

Example: multi-repo config with GitHub

See configs/example-config.yaml for a working example that covers multiple repositories, developer aliases, and CSV+Markdown output.

CLI Reference

All subcommands accept these global flags:

Flag	Default	Description
`--config <PATH>`	`config.yaml`	Path to config YAML
`--database <PATH>`	`tga.db`	Path to SQLite database
`-v / -vv / -vvv`	warnings only	Increase log verbosity

tga analyze

Run the full pipeline: collect → classify → report.

tga analyze [--config <PATH>] [--database <PATH>] [--output <DIR>]
            [--skip-collect] [--skip-classify] [--weeks <N>]
            [--from <DATE>] [--to <DATE>] [--dry-run]
            [--validate-only] [--no-validate]

Flag	Description
`--skip-collect`	Skip Stage 1; use commits already in the database
`--skip-classify`	Skip Stage 2; use existing classifications
`--output <DIR>`	Override `output.directory` from config
`--weeks <N>`	Limit collection to the last N weeks (overrides config `start_date`)
`--from <DATE>`	Start date for collection (ISO 8601 `YYYY-MM-DD`); mutually exclusive with `--weeks`
`--to <DATE>`	End date for collection (ISO 8601 `YYYY-MM-DD`); defaults to today
`--dry-run`	Perform all steps against an in-memory database; the on-disk database is left untouched
`--validate-only`	Run configuration validation and exit (0 on success, 1 on errors)
`--no-validate`	Skip pre-flight configuration validation

# Full pipeline
tga analyze --config config.yaml

# Re-run reports only (commits already collected and classified)
tga analyze --skip-collect --skip-classify --output ./reports-v2

tga collect

Stage 1: extract commits from git repositories into the database.

tga collect [--config <PATH>] [--database <PATH>]
            [--repos <NAME,...>] [--from <DATE>] [--to <DATE>] [--weeks <N>]
            [--since <DATE>] [--until <DATE>] [--dry-run]
            [--force-refresh-prs] [--validate-only] [--no-validate]

Flag	Description
`--repos <NAME,...>`	Comma-separated list of repository names to collect; others are skipped
`--from <DATE>`	Collect commits on or after this ISO 8601 date; mutually exclusive with `--weeks`
`--to <DATE>`	Collect commits on or before this ISO 8601 date; defaults to today
`--since <DATE>`	Legacy alias for `--from` (Python-predecessor compatibility); `--from` takes precedence
`--until <DATE>`	Legacy alias for `--to` (Python-predecessor compatibility); `--to` takes precedence
`--weeks <N>`	Limit collection to the last N weeks; `--weeks` takes precedence over `--from`/`--to`
`--dry-run`	Run collection against an in-memory database; the on-disk database is left untouched
`--force-refresh-prs`	Re-fetch ADO pull requests even when already cached (backfills pre-v1.0.9 rows)
`--validate-only`	Run configuration validation and exit (0 on success, 1 on errors)
`--no-validate`	Skip pre-flight configuration validation

tga collect --repos my-project --from 2024-01-01 --to 2024-03-31
tga collect --weeks 4   # collect last 4 weeks across all repos

tga classify

Stage 2: run the classification cascade over collected commits.

tga classify [--config <PATH>] [--database <PATH>]
             [--rules <PATH>] [--use-llm] [--backfill-complexity]

Flag	Description
`--rules <PATH>`	Override `classification.rules_file` from config
`--use-llm`	Enable LLM fallback regardless of config setting
`--backfill-complexity`	Fill missing complexity scores (1–5) for already-classified commits via the LLM, without re-running the full cascade; category, confidence, and method are left untouched

tga classify --rules ./custom-rules.yaml --use-llm

tga report

Stage 3: generate reports from classified commits.

tga report [--config <PATH>] [--database <PATH>]
           [--output <DIR>] [--formats <FMT,...>]

Flag	Description
`--output <DIR>`	Override `output.directory` from config
`--formats <FMT,...>`	Comma-separated: `csv`, `json`, `markdown`

tga report --output ./q1-reports --formats csv,json

tga pr-metrics

Aggregate pull-request metrics per engineer from the pull_requests cache.

tga pr-metrics [--config <PATH>] [--database <PATH>]
               [--weeks <N>] [--csv] [--output <PATH>]

Flag	Description
`--weeks <N>`	Limit metrics to PRs created within the last N weeks
`--csv`	Emit CSV instead of an aligned text table
`--output <PATH>`	Write output to a file (CSV with `--csv`, otherwise the text table)

tga pr-metrics --weeks 12 --csv --output pr-metrics.csv

The pr_comments_given and avg_revisions columns are reserved for future use and currently always output 0.0; the underlying review-comment and revision-count data is not yet tracked.

tga backfill

Retroactive maintenance operations that update existing commit rows in place (outside the normal collect → classify → report pipeline).

tga backfill <SUBCOMMAND> [--dry-run]

Subcommand	Description
`ai-detection`	Re-run LLM classification on low-confidence prior LLM verdicts
`revert-flags`	Scan commit messages for revert patterns and set `is_revert`
`ticket-ids`	Scan commit messages for ticket refs and update `ticket_id`/`ticketed`

Flag	Description
`--dry-run`	Report how many rows would change without writing

tga backfill revert-flags --dry-run
tga backfill ticket-ids

tga override

Manage manual classification overrides (Tier 0). Rows here pin a commit's verdict regardless of what the rule-based or LLM tiers would produce.

tga override <SUBCOMMAND>

Subcommand	Description
`add <SHA> <WORK_TYPE> <CHANGE_TYPE> [--notes <TEXT>] [--repo <PATH>]`	Insert (or replace) an override row for a commit SHA
`list [--repo <PATH>]`	List every override row, optionally filtered by repository
`remove <SHA> [--yes]`	Delete the override row(s) for a SHA (`--yes` skips confirmation)

tga override add abc1234 feature feature --notes "manual review"
tga override list
tga override remove abc1234 --yes

Pipeline Architecture

git repos ──┐
             │   collect      SQLite (tga.db)   classify       SQLite    report
GitHub API ──┼──────────────► [commits]        ──────────────► [classif]─────────► CSV (×9)
JIRA API ────┤   (libgit2,    [authors]         (7-tier                 ► JSON
Linear API ──┤   reqwest)     [pull_requests]   cascade,                ► Markdown
ADO API  ────┘                [work_items]      Rayon-parallel)

Stage 1 — collect (tga::collect): opens each repository with libgit2, walks the configured branch, extracts commit metadata and diff stats, resolves author identities, fetches GitHub PR / JIRA issue / Linear / Azure DevOps work item metadata via REST/GraphQL, and writes everything to SQLite.

Stage 2 — classify (tga::classify): reads unclassified commits from the database, runs each message through the seven-tier cascade (see below), and writes a classification verdict back. Rule-based tiers execute in parallel via Rayon.

Stage 3 — report (tga::report): reads the classified database, aggregates per-author, per-week, DORA, velocity, and quality statistics, and writes the configured output formats to the output directory.

Classification

Seven-Tier Cascade

Each commit message is tested against tiers in order. The first tier to produce a confident result wins.

Tier 0 — Manual Override (confidence 1.0): looks up the (commit_hash, repo_path) pair in the classification_overrides table. Managed via tga override add|list|remove.

Tier 1.5 — Issue Type (confidence 0.90): when the commit has ticket references resolving to rows in issue_cache, maps the upstream issue type (bug, story, task, spike, etc.) directly to a change_type.

Tier 3 — JIRA Project Mapping (confidence 0.95): when jira_project_mappings is configured, maps the JIRA project key prefix of any [A-Z]+-\d+ reference to a change_type.

Tier 4 — Exact (Aho-Corasick): builds a single finite-state machine from every keyword list across every rule and scans the message in O(n) time. Matches feat:, fix:, chore:, etc. Confidence 0.85–0.95.

Tier 5 — Regex: applies pre-compiled regex patterns from the rule set. Handles anchored conventional-commit patterns (^feat($[^)]*$)?!?:) and JIRA ticket IDs (\b[A-Z][A-Z0-9]+-\d+\b).

Tier 6 — Fuzzy heuristics: detects merge commits (via is_merge flag or Merge pull request prefix) and reverts (via Revert prefix). No external dependencies.

Tier 7 — LLM fallback (optional, async): calls an OpenAI-compatible API (OpenRouter by default, AWS Bedrock behind the bedrock cargo feature) when tiers 0–6 leave a commit in a fallthrough category. Disabled by default; enable with analysis.llm_classification.enabled: true or --use-llm. Results are only accepted when confidence >= confidence_threshold (default 0.7).

Default Rules

ID	Category	Keywords / Patterns
`cc-feat`	`feature`	`feat:`, `feature:`, `^feat(...)?!?:`
`cc-fix`	`bugfix`	`fix:`, `bugfix:`, `hotfix`, `^fix(...)?!?:`
`cc-chore`	`chore`	`chore:`, `^chore(...)?!?:`
`cc-docs`	`documentation`	`docs:`, `doc:`, `^docs?(...)?!?:`
`cc-refactor`	`refactor`	`refactor:`, `^refactor(...)?!?:`
`cc-test`	`test`	`test:`, `tests:`, `^tests?(...)?!?:`
`cc-ci`	`ci`	`ci:`, `^ci(...)?!?:`
`cc-perf`	`performance`	`perf:`, `^perf(...)?!?:`
`cc-style`	`style`	`style:`, `^style(...)?!?:`
`cc-build`	`build`	`build:`, `^build(...)?!?:`
`cc-revert`	`revert`	`revert:`, `^revert(...)?!?:`
`breaking-change`	`breaking`	`breaking change`, `breaking-change`
`jira-ticket`	`feature` (ticketed)	`\b[A-Z][A-Z0-9]+-\d+\b`
`kw-bug`	`bugfix`	`bug`, `defect`
`kw-security`	`bugfix` (security)	`security`, `cve-`, `vulnerability`

Commits that match no rule are assigned category uncategorized with confidence 0.0.

Custom Rules File

Supply your own rules alongside the defaults:

# my-rules.yaml
version: "1.0"
rules:
  - id: my-deploy
    category: deployment
    keywords:
      - "deploy:"
      - "release:"
    patterns:
      - "(?i)^deploy(ment)?:"
    priority: 80
    confidence: 0.9

tga classify --rules ./my-rules.yaml
# or in config.yaml:
# classification:
#   rules_file: ./my-rules.yaml

Output Formats

CSV

Nine CSV files are written when csv is in the format list (per-author, weekly activity, DORA metrics, velocity, quality, and related breakdowns). The two most commonly used are documented below.

authors.csv — one row per author:

Column	Description
`name`	Canonical author name
`email`	Canonical author email
`commit_count`	Total commits
`insertions`	Total lines added
`deletions`	Total lines deleted
`files_changed`	Total files changed
`first_commit`	ISO 8601 timestamp of earliest commit
`last_commit`	ISO 8601 timestamp of most recent commit

weekly_activity.csv — one row per week/author/repository bucket:

Column	Description
`week`	ISO week label, e.g. `2024-W03`
`author`	Author name
`repository`	Repository name
`commit_count`	Commits in this bucket
`insertions`	Lines added in this bucket
`deletions`	Lines deleted in this bucket

JSON

Four JSON files are written when json is in the format list: report.json, velocity_summary.json, quality_summary.json, and dora_summary.json. The primary one is documented below.

report.json — full structured payload:

{
  "generated_at": "2024-03-15T10:00:00Z",
  "period_start": "2024-01-01T00:00:00Z",
  "period_end":   "2024-03-14T23:59:59Z",
  "total_commits": 347,
  "total_authors": 8,
  "category_breakdown": { "feature": 120, "bugfix": 45, ... },
  "authors": [
    {
      "name": "Alice Smith",
      "email": "alice@company.com",
      "commit_count": 87,
      "insertions": 4200,
      "deletions": 1100,
      "files_changed": 310,
      "categories": { "feature": 50, "bugfix": 20, ... },
      "first_commit": "...",
      "last_commit": "..."
    }
  ],
  "repositories": [
    {
      "name": "my-project",
      "commit_count": 347,
      "author_count": 8,
      "insertions": 18000,
      "deletions": 6000,
      "top_categories": [["feature", 120], ["bugfix", 45]]
    }
  ],
  "weekly_activity": [
    {
      "week": "2024-W03",
      "author": "Alice Smith",
      "repository": "my-project",
      "commit_count": 12,
      "insertions": 500,
      "deletions": 120,
      "categories": { "feature": 8, "bugfix": 4 }
    }
  ]
}

Markdown

report.md — a narrative report containing a summary header, per-author commit table, category breakdown, and weekly activity section. Suitable for pasting into Confluence or a PR description.

Development

Build and Test

# Build everything
cargo build

# Build release binary
cargo build --release

# Run all tests
cargo test

# Lint (zero warnings required)
cargo clippy -- -D warnings

# Format check (CI gate)
cargo fmt --check

# Auto-format
cargo fmt

# Generate rustdoc
cargo doc --open

Running Against Real Repos

configs/example-config.yaml is a working example that analyzes repositories using developer_aliases. Copy it, adjust paths and names to match your setup, then run:

tga analyze --config configs/example-config.yaml --database tga.db

CI Gates

The GitHub Actions workflow (ci.yml) requires:

cargo fmt -- --check
cargo clippy --all-targets -- -D warnings
cargo test
cargo doc --no-deps with RUSTDOCFLAGS="-D warnings"

Crate Structure

Single tga crate (consolidated from the original 5-crate workspace):

Module	Path	Purpose
`tga::core`	`src/core/`	Shared types, config, DB schema, migrations, error types
`tga::collect`	`src/collect/`	Stage 1: git extraction (libgit2), GitHub/JIRA/Linear/ADO clients, `PmAdapter` trait
`tga::classify`	`src/classify/`	Stage 2: seven-tier classification cascade
`tga::report`	`src/report/`	Stage 3: CSV/JSON/Markdown output
`commands` (binary-private)	`src/commands/`	Subcommand handlers wired into `src/main.rs`

License

Non-commercial use only. See LICENSE for terms.

Elastic License 2.0

tga 1.0.13