trusty-git-analytics
Analyze git repositories to measure developer productivity — classify commit work types, track weekly velocity, and export CSV/JSON/Markdown reports.
What It Does
tga walks one or more local git repositories, collects every commit into a SQLite database, classifies each commit into a work category (feature, bugfix, refactor, etc.) using a seven-tier classification cascade, then aggregates the results into per-author, per-week, DORA, velocity, and quality reports. It is a feature-complete Rust port of gitflow-analytics with the same YAML config schema and the same SQLite schema — existing config files work without modification.
Installation
From crates.io (recommended)
This installs the tga binary to ~/.cargo/bin/. Ensure ~/.cargo/bin is in your PATH.
From source
Verify installation
Quick Start
Run Your First Analysis
Step 1 — Create a config.yaml:
repositories:
  - path: ~/code/my-project
    name: my-project
output:
  directory: ./reports
  formats:
Step 2 — Run the full pipeline:
Step 3 — Find reports in ./reports/:
A full run writes 14 files: 9 CSV, 4 JSON, and 1 Markdown report. The most commonly used ones are:
reports/
├── authors.csv # Per-author commit summary
├── weekly_activity.csv # Week-by-week breakdown
├── ... (7 more CSV files: DORA, velocity, quality, etc.)
├── report.json # Full structured payload
├── ... (3 more JSON files)
└── report.md # Narrative Markdown report
Configuration
Minimal config.yaml
repositories:
  - path: ~/code/my-repo
    name: my-repo
All other sections are optional. When output.formats is omitted, all three formats (CSV, JSON, Markdown) are written.
Full config reference
| Key | Type | Default | Description |
|---|---|---|---|
| repositories | list | required | Repos to analyze |
| developer_aliases | map | {} | Canonical name → list of emails/aliases |
| team | object | — | Alternative to developer_aliases; roster with email |
| output.directory | path | ./reports | Where reports are written |
| output.formats | list | [csv, json, markdown] | csv, json, and/or markdown |
| output.include_unclassified | bool | false | Include commits with no category |
| output.include_merges | bool | false | Include merge commits |
| output.include_files | bool | false | Include file-level change detail |
| classification.rules_file | path | — | Path to custom rules YAML/JSON |
| classification.use_llm | bool | false | Enable LLM fallback tier |
| classification.llm_model | string | gpt-4o-mini | LLM model identifier |
| classification.confidence_threshold | float | 0.7 | Minimum acceptance confidence |
| classification.llm_fallback_threshold | float | 0.0 | Commits with confidence above this value skip the LLM tier |
| classification.llm_fallback_concurrency | uint | 8 | Max concurrent LLM requests during fallback |
| github.token | string | $GITHUB_TOKEN | GitHub PAT for PR fetch. Required scopes: public_repo for public repos, repo for private repos. Without a token, GitHub rate-limits anonymous traffic to 60 requests/hour and most PRs will be missed. |
| github.org | string | — | Org slug for org-wide PR queries |
| github.repo | string | — | Single repo slug (owner/name) |
| github.fetch_prs | bool | false | Fetch pull request metadata. Must be true for tga pr-metrics to return data; when left at the default, tga collect writes zero rows to pull_requests (issue #211). |
| github.ticket_regex | string | — | Override regex for detecting GitHub ticket refs in commit messages |
| jira.url | string | — | JIRA base URL |
| jira.username | string | — | JIRA API username (email for Cloud) |
| jira.token | string | — | JIRA API token |
| jira.project_key | string | — | Project key filter (e.g. API) |
| jira.ticket_regex | string | — | Override regex for detecting JIRA ticket refs in commit messages |
| linear.ticket_regex | string | — | Override regex for detecting Linear ticket refs in commit messages |
| pm.azure_devops.organization_url | string | — | ADO org URL (e.g. https://dev.azure.com/myorg) |
| pm.azure_devops.pat | string | — | Azure DevOps Personal Access Token |
| pm.azure_devops.project | string | — | Default ADO project name |
| pm.azure_devops.fetch_on_reference | bool | false | Fetch work items when AB#N refs appear in commits |
| pm.azure_devops.fetch_prs | bool | false | Fetch ADO pull requests and reviewer data |
| pm.azure_devops.ticket_regex | string | AB#(\d+) | Override regex for detecting ADO work item refs in commit messages |
| pm.bitbucket.workspace | string | — | Bitbucket Cloud workspace slug |
| pm.bitbucket.repo_slug | string | — | Repository slug within the workspace |
| pm.bitbucket.fetch_prs | bool | false | Fetch Bitbucket Cloud pull request metadata |
| pm.bitbucket.token | string | $BITBUCKET_TOKEN | Bearer token (App password or OAuth) |
| pm.bitbucket.username | string | — | Atlassian account username for Basic auth |
| pm.bitbucket.app_password | string | — | Atlassian App password for Basic auth (alternative to token) |
| cache.directory | path | — | Cache directory (supports ~) |
| version | string | — | Schema version; stored for compatibility |
| profile | string | — | Named profile; stored for compatibility |
| dora.deployment_source | string | git_tags | Source for tga deployments collect. One of git_tags, github_releases, github_actions, manual. |
| dora.deployment_tag_pattern | regex | ^v?[0-9]+\.[0-9]+\.[0-9]+(-...)?$ | Tags matching this regex are ingested as deployments. |
| dora.production_branch | string | main | Default branch that production deployments come from. |
| dora.failure_signals | list | [] | One signal per entry; work_type (classification category) and/or commit_message_pattern (regex) + within_hours window. |
| dora.datadog_dir | path | — | Directory of Datadog incident exports (.json). Currently a stub; the JIRA SRE path is the default MTTR source. |
| jira.jira_project_mappings | map | {} | Project key → work type (issue #206). Fires as Tier 1.6 and outranks the generic ticket regex. |
| jira.jira_project_mapping_confidence | float | 0.88 | Per-verdict confidence for the JIRA mapping tier. |
Paths support ~ expansion. Config files from the Python gitflow-analytics tool load without changes — unknown keys are silently ignored.
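As an illustration of what ~ expansion means for path-valued keys, here is a minimal sketch. The function name expand_tilde and the choice to pass the home directory in explicitly are assumptions made for this example, not tga's actual implementation.

```rust
use std::path::PathBuf;

/// Hypothetical helper: expand a leading `~` or `~/` against `home`.
/// Any other form (including `~user/...`) is returned unchanged.
pub fn expand_tilde(path: &str, home: &str) -> PathBuf {
    if path == "~" {
        // A bare tilde is the home directory itself.
        PathBuf::from(home)
    } else if let Some(rest) = path.strip_prefix("~/") {
        // `~/code/my-repo` becomes `<home>/code/my-repo`.
        PathBuf::from(home).join(rest)
    } else {
        // Relative and absolute paths pass through untouched.
        PathBuf::from(path)
    }
}
```

Under these assumptions, expand_tilde("~/code/my-repo", "/home/alice") yields /home/alice/code/my-repo, while ./reports is left as written.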
developer_aliases vs team.members
developer_aliases (Python-compatible flat map):
developer_aliases:
  "Alice Smith":
    - "alice@company.com"
    - "asmith@company.com"
    - "alice@personal.dev"
  "Bob Jones":
    - "bob@company.com"
    - "129991831+bobgithub@users.noreply.github.com"
team.members (structured roster with canonical email):
team:
  members:
    - name: Alice Smith
      email: alice@company.com
      aliases:
        - asmith@company.com
        - alice@personal.dev
When developer_aliases is non-empty it takes precedence over team.members. Use developer_aliases when migrating an existing Python config file; use team.members for new setups where canonical email matters for downstream tooling.
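The precedence rule above can be sketched as follows. This is a simplified model, not tga's code: both sources are represented as a map from canonical name to addresses, the function names are hypothetical, and the case-insensitive matching is an assumption of the sketch.

```rust
use std::collections::HashMap;

/// Simplified model: canonical name -> every email/alias for that person.
pub type AliasMap = HashMap<String, Vec<String>>;

/// A non-empty `developer_aliases` wins; otherwise fall back to the roster.
pub fn effective_aliases<'a>(developer_aliases: &'a AliasMap, team_members: &'a AliasMap) -> &'a AliasMap {
    if developer_aliases.is_empty() {
        team_members
    } else {
        developer_aliases
    }
}

/// Invert the map so a commit email can be looked up directly.
pub fn build_alias_index(aliases: &AliasMap) -> HashMap<String, String> {
    let mut index = HashMap::new();
    for (canonical, addresses) in aliases {
        for address in addresses {
            // Lowercasing is an assumption of this sketch.
            index.insert(address.to_lowercase(), canonical.clone());
        }
    }
    index
}

/// Return the canonical name, or the email itself when nothing matches.
pub fn canonical_name<'a>(index: &'a HashMap<String, String>, email: &'a str) -> &'a str {
    index
        .get(&email.to_lowercase())
        .map(String::as_str)
        .unwrap_or(email)
}
```

With the Alice Smith entry from the example config, a commit authored as asmith@company.com resolves to Alice Smith, and an unknown address is passed through unchanged.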
Example: multi-repo config with GitHub
See configs/example-config.yaml for a working example that covers multiple repositories, developer aliases, and CSV+Markdown output.
CLI Reference
All subcommands accept these global flags:
| Flag | Default | Description |
|---|---|---|
| --config <PATH> | config.yaml | Path to config YAML |
| --database <PATH> | tga.db | Path to SQLite database |
| -v / -vv / -vvv | warnings only | Increase log verbosity |
tga analyze
Run the full pipeline: collect → classify → report.
| Flag | Description |
|---|---|
| --skip-collect | Skip Stage 1; use commits already in the database |
| --skip-classify | Skip Stage 2; use existing classifications |
| --output <DIR> | Override output.directory from config |
| --weeks <N> | Limit collection to the last N weeks (overrides config start_date) |
| --from <DATE> | Start date for collection (ISO 8601 YYYY-MM-DD); mutually exclusive with --weeks |
| --to <DATE> | End date for collection (ISO 8601 YYYY-MM-DD); defaults to today |
| --dry-run | Perform all steps against an in-memory database; the on-disk database is left untouched |
| --validate-only | Run configuration validation and exit (0 on success, 1 on errors) |
| --no-validate | Skip pre-flight configuration validation |
# Full pipeline
# Re-run reports only (commits already collected and classified)
tga collect
Stage 1: extract commits from git repositories into the database.
| Flag | Description |
|---|---|
| --repos <NAME,...> | Comma-separated list of repository names to collect; others are skipped |
| --from <DATE> | Collect commits on or after this ISO 8601 date; mutually exclusive with --weeks |
| --to <DATE> | Collect commits on or before this ISO 8601 date; defaults to today |
| --since <DATE> | Legacy alias for --from (Python-predecessor compatibility); --from takes precedence |
| --until <DATE> | Legacy alias for --to (Python-predecessor compatibility); --to takes precedence |
| --weeks <N> | Limit collection to the last N weeks; --weeks takes precedence over --from/--to |
| --dry-run | Run collection against an in-memory database; the on-disk database is left untouched |
| --force-refresh-prs | Re-fetch ADO pull requests even when already cached (backfills pre-v1.0.9 rows) |
| --validate-only | Run configuration validation and exit (0 on success, 1 on errors) |
| --no-validate | Skip pre-flight configuration validation |
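The precedence among the date flags described in the table (--weeks over --from/--to, and --from/--to over their legacy aliases) can be modelled roughly as below. The Window type and resolve_window function are hypothetical names for this sketch; the real argument parser may be structured differently.

```rust
/// How the collection window ends up being chosen (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Window {
    /// `--weeks N` was given; explicit dates are ignored.
    LastWeeks(u32),
    /// Explicit bounds. `None` for `to` stands for "today".
    Range { from: Option<String>, to: Option<String> },
}

/// Apply the documented precedence: --weeks, then --from/--to,
/// then the legacy --since/--until aliases.
pub fn resolve_window(
    weeks: Option<u32>,
    from: Option<&str>,
    to: Option<&str>,
    since: Option<&str>,
    until: Option<&str>,
) -> Window {
    if let Some(n) = weeks {
        return Window::LastWeeks(n);
    }
    Window::Range {
        // `or` only consults the legacy alias when the new flag is absent.
        from: from.or(since).map(String::from),
        to: to.or(until).map(String::from),
    }
}
```

For instance, passing both --from 2024-01-01 and --since 2023-01-01 resolves to a range starting 2024-01-01 in this model.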
tga classify
Stage 2: run the classification cascade over collected commits.
| Flag | Description |
|---|---|
| --rules <PATH> | Override classification.rules_file from config. Custom rules default to priority 110 (above built-in 100) and are standalone by default (extend_defaults: false). |
| --use-llm | Enable LLM fallback regardless of config setting |
| --backfill-complexity | Fill missing complexity scores (1–5) for already-classified commits via the LLM, without re-running the full cascade; category, confidence, and method are left untouched |
| --no-external | Disable all external classification sources (JIRA, GitHub Issues) for this run, even if configured in the rules file or config |
tga report
Stage 3: generate reports from classified commits.
| Flag | Description |
|---|---|
| --output <DIR> | Override output.directory from config |
| --formats <FMT,...> | Comma-separated: csv, json, markdown |
tga pr-metrics
Aggregate pull-request metrics per engineer from the pull_requests cache.
| Flag | Description |
|---|---|
| --weeks <N> | Limit metrics to PRs created within the last N weeks |
| --csv | Emit CSV instead of an aligned text table |
| --output <PATH> | Write output to a file (CSV with --csv, otherwise the text table) |
The pr_comments_given and avg_revisions columns are reserved for future
use and currently always output 0.0; the underlying review-comment and
revision-count data is not yet tracked.
tga backfill
Retroactive maintenance operations that update existing commit rows in place
(outside the normal collect → classify → report pipeline).
| Subcommand | Description |
|---|---|
| ai-detection | Re-run LLM classification on low-confidence prior LLM verdicts |
| revert-flags | Scan commit messages for revert patterns and set is_revert |
| ticket-ids | Scan commit messages for ticket refs and update ticket_id/ticketed |

| Flag | Description |
|---|---|
| --dry-run | Report how many rows would change without writing |
tga override
Manage manual classification overrides (Tier 0). Rows here pin a commit's verdict regardless of what the rule-based or LLM tiers would produce.
| Subcommand | Description |
|---|---|
| add <SHA> <WORK_TYPE> <CHANGE_TYPE> [--notes <TEXT>] [--repo <PATH>] | Insert (or replace) an override row for a commit SHA |
| list [--repo <PATH>] | List every override row, optionally filtered by repository |
| remove <SHA> [--yes] | Delete the override row(s) for a SHA (--yes skips confirmation) |
tga rules
Introspect the classification rule set (issue #209). Useful for tuning rules and answering "why was this commit classified as X?".
tga deployments / tga incidents / tga dora
DORA metrics infrastructure (issues #207, #208, #212, #213).
# Step 1: ingest deployment events (default: git tags matching dora.deployment_tag_pattern)
# Step 2 (optional): ingest production incidents from JIRA SRE issues
# Step 3: compute Deployment Frequency, Lead Time, Change Failure Rate, MTTR.
# Also rebuilds the `deployment_failures` derived join from the current
# `dora.failure_signals` config — safe to re-run after a config edit.
The four DORA metrics land in pre-computed SQL views
(v_deployment_frequency, v_lead_time, v_change_failure_rate,
v_mttr) so dashboards can read them directly without re-aggregating.
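To make the role of dora.failure_signals more concrete, here is a rough sketch of a Change Failure Rate calculation. It assumes that a deployment counts as failed when a matching failure-signal commit lands within within_hours after it; that reading of the window, the function name, and the use of Unix-second timestamps are all assumptions of this example rather than documented tga behaviour.

```rust
/// Hypothetical CFR calculation over Unix-second timestamps.
/// Returns `None` when there are no deployments to divide by.
pub fn change_failure_rate(
    deployments: &[i64],
    failure_signals: &[i64],
    within_hours: i64,
) -> Option<f64> {
    if deployments.is_empty() {
        return None;
    }
    let window_secs = within_hours * 3600;
    // A deployment is counted as failed if any signal falls strictly
    // after it and no later than the end of the window (assumed semantics).
    let failed = deployments
        .iter()
        .filter(|&&deployed_at| {
            failure_signals
                .iter()
                .any(|&signal_at| signal_at > deployed_at && signal_at - deployed_at <= window_secs)
        })
        .count();
    Some(failed as f64 / deployments.len() as f64)
}
```

With three deployments and one hotfix commit arriving an hour after the first, this model reports one failed deployment out of three.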
Tag & Release-Branch Reachability (issue #279)
tga collect automatically populates fact_commit_reachability with four new columns that distinguish "deployed via cherry-pick to a release branch and tagged" from "abandoned WIP":
| Column | Type | Meaning |
|---|---|---|
| on_any_tag | boolean | true if any git tag reaches this commit |
| reachable_from_tags | JSON array | Tag names that contain this commit |
| on_release_branch | boolean | true if commit is on any configured release branch |
| release_branches | JSON array | Matching release-branch names |
This resolves a systematic blind spot: bug fixes and security patches that were cherry-picked to release/* branches, tagged for production, and never merged to main previously showed on_default_branch = false, which made them indistinguishable from abandoned WIP. An apparent 32% "merged rate" for bug fixes turns out to be much higher once deployed-via-tag commits are counted.
Config:
reachability:
  track_tags: true # default: true
  track_release_branches: true # default: true
  release_branch_patterns: # default: ["release/*", "hotfix/*", "chore/release-*", "v*"]
    - "release/*"
    - "hotfix/*"
    - "chore/release-*"
    - "v*"
Set track_tags: false to skip the tag scan (useful for trunk-based repos with thousands of tags). The default is true because the scan is O(repo + refs), not O(repo × refs × commits).
Optionally disable the scan for a single run:
Useful derived queries:
-- What % of bug fixes actually shipped (via any path)?
SELECT
COUNT(*) FILTER (WHERE on_default_branch OR on_any_tag OR on_release_branch)
* 100.0 / COUNT(*) AS shipped_pct
FROM classifications cl
JOIN commits c ON c.classification_id = cl.id
JOIN fact_commit_reachability fcr ON fcr.commit_sha = c.sha
WHERE cl.category = 'bug_fix';
-- Cherry-pick rate: commits reachable from tags but not on main
SELECT repo, COUNT(*)
FROM fact_commit_reachability
WHERE on_any_tag AND NOT on_default_branch
GROUP BY repo;
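The same predicates can be applied in application code once the rows are loaded. The sketch below mirrors the two queries above; the Reachability struct is a hypothetical in-memory view holding only the boolean columns, not a type exported by tga.

```rust
/// Hypothetical in-memory view of one `fact_commit_reachability` row.
#[derive(Debug, Clone, Copy)]
pub struct Reachability {
    pub on_default_branch: bool,
    pub on_any_tag: bool,
    pub on_release_branch: bool,
}

impl Reachability {
    /// Shipped via any path: default branch, tag, or release branch.
    pub fn shipped(&self) -> bool {
        self.on_default_branch || self.on_any_tag || self.on_release_branch
    }

    /// Reachable from a tag but never merged to the default branch.
    pub fn tag_only(&self) -> bool {
        self.on_any_tag && !self.on_default_branch
    }
}

/// Percentage of rows that shipped; `None` for an empty input.
pub fn shipped_pct(rows: &[Reachability]) -> Option<f64> {
    if rows.is_empty() {
        return None;
    }
    let shipped = rows.iter().filter(|r| r.shipped()).count();
    Some(shipped as f64 * 100.0 / rows.len() as f64)
}
```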
Pipeline Architecture
git repos ───┐
             │    collect        SQLite (tga.db)       classify           SQLite     report
GitHub API ──┼──────────────► [commits]        ──────────────► [classif] ─────────► CSV (×9)
JIRA API ────┤   (libgit2,    [authors]          (7-tier                          ► JSON
Linear API ──┤    reqwest)    [pull_requests]     cascade,                        ► Markdown
ADO API ─────┘                [work_items]        Rayon-parallel)
Stage 1 — collect (tga::collect): opens each repository with libgit2, walks the configured branch, extracts commit metadata and diff stats, resolves author identities, fetches GitHub PR / JIRA issue / Linear / Azure DevOps work item metadata via REST/GraphQL, and writes everything to SQLite.
Stage 2 — classify (tga::classify): reads unclassified commits from the database, runs each message through the seven-tier cascade (see below), and writes a classification verdict back. Rule-based tiers execute in parallel via Rayon.
Stage 3 — report (tga::report): reads the classified database, aggregates per-author, per-week, DORA, velocity, and quality statistics, and writes the configured output formats to the output directory.
Classification
Seven-Tier Cascade
Each commit message is tested against tiers in order. The first tier to produce a confident result wins.
Tier 0 — Manual Override (confidence 1.0): looks up the (commit_hash, repo_path) pair in the classification_overrides table. Managed via tga override add|list|remove.
Tier 1.5 — Issue Type (confidence 0.90): when the commit has ticket references resolving to rows in issue_cache, maps the upstream issue type (bug, story, task, spike, etc.) directly to a change_type.
Tier 1.6 — JIRA Project Mapping (default confidence 0.88, override via jira.jira_project_mapping_confidence): when jira.jira_project_mappings is configured, maps the JIRA project key prefix of any [A-Z]+-\d+ reference to a change_type. Fires before the regex tier so project mappings outrank the generic jira-ticket regex rule (issue #206). Example config:
jira:
  jira_project_mappings:
    TQL: bug_fix
    APEX: integration
    INFRA: platform_infrastructure
    SEC: security
  jira_project_mapping_confidence: 0.88 # optional, default 0.88
Tier-0 manual overrides and exact-keyword conventional-commit prefixes (e.g. fix:) still beat this tier.
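A minimal sketch of the project-key lookup described for Tier 1.6 follows. It scans for [A-Z]+-\d+ references by hand so that it needs no external crates; the function names are hypothetical, and returning the first mapped key is an assumption, since the behaviour for several mapped keys in one message is not specified here.

```rust
use std::collections::HashMap;

/// Collect the project-key prefix of every `[A-Z]+-\d+` reference.
pub fn project_keys(message: &str) -> Vec<&str> {
    let bytes = message.as_bytes();
    let mut keys = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        if !bytes[i].is_ascii_uppercase() {
            i += 1;
            continue;
        }
        // Consume the run of uppercase letters that could be a key.
        let start = i;
        while i < bytes.len() && bytes[i].is_ascii_uppercase() {
            i += 1;
        }
        let key_end = i;
        // A reference needs '-' followed by at least one digit.
        if i + 1 < bytes.len() && bytes[i] == b'-' && bytes[i + 1].is_ascii_digit() {
            keys.push(&message[start..key_end]);
            i += 1;
            while i < bytes.len() && bytes[i].is_ascii_digit() {
                i += 1;
            }
        }
    }
    keys
}

/// Map the first configured project key to its work type (assumed policy).
pub fn classify_by_project<'a>(
    message: &str,
    mappings: &'a HashMap<String, String>,
    confidence: f64,
) -> Option<(&'a str, f64)> {
    project_keys(message)
        .into_iter()
        .find_map(|key| mappings.get(key))
        .map(|work_type| (work_type.as_str(), confidence))
}
```

With the example mappings, a message mentioning SEC-42 would be assigned security at the configured confidence, while a message with no ticket reference falls through to the later tiers.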
Tier 4 — Exact (Aho-Corasick): builds a single finite-state machine from every keyword list across every rule and scans the message in O(n) time. Matches feat:, fix:, chore:, etc. Confidence 0.85–0.95.
Tier 5 — Regex: applies pre-compiled regex patterns from the rule set. Handles anchored conventional-commit patterns (^feat(\([^)]*\))?!?:) and JIRA ticket IDs (\b[A-Z][A-Z0-9]+-\d+\b).
Tier 6 — Fuzzy heuristics: detects merge commits (via is_merge flag or Merge pull request prefix) and reverts (via Revert prefix). No external dependencies.
Tier 7 — LLM fallback (optional, async): calls an OpenAI-compatible API (OpenRouter by default, AWS Bedrock behind the bedrock cargo feature) when tiers 0–6 leave a commit in a fallthrough category. Disabled by default; enable with analysis.llm_classification.enabled: true or --use-llm. Results are only accepted when confidence >= confidence_threshold (default 0.7).
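The "first confident tier wins" control flow can be sketched as below. Only two toy tiers are shown, the confidence values are illustrative, and applying a single minimum confidence to every tier is an assumption of this sketch; the names Verdict, Tier, and classify are hypothetical and do not describe the tga::classify API.

```rust
/// Outcome of classifying one commit message (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Verdict {
    pub category: &'static str,
    pub confidence: f64,
    pub method: &'static str,
}

/// A tier either produces a verdict or passes on the message.
pub type Tier = fn(&str) -> Option<Verdict>;

/// Toy stand-in for the exact-keyword tier: conventional-commit prefixes.
pub fn exact_prefix(message: &str) -> Option<Verdict> {
    let lower = message.to_lowercase();
    for (keyword, category) in [("feat:", "feature"), ("fix:", "bugfix"), ("chore:", "chore")] {
        if lower.starts_with(keyword) {
            return Some(Verdict { category, confidence: 0.90, method: "exact" });
        }
    }
    None
}

/// Toy stand-in for the heuristic tier: detect reverts by prefix.
pub fn revert_heuristic(message: &str) -> Option<Verdict> {
    if message.starts_with("Revert") {
        Some(Verdict { category: "revert", confidence: 0.80, method: "heuristic" })
    } else {
        None
    }
}

/// Run tiers in order; the first sufficiently confident verdict wins.
pub fn classify(message: &str, tiers: &[Tier], min_confidence: f64) -> Verdict {
    for tier in tiers {
        if let Some(verdict) = tier(message) {
            if verdict.confidence >= min_confidence {
                return verdict;
            }
        }
    }
    // No tier matched: uncategorized with confidence 0.0.
    Verdict { category: "uncategorized", confidence: 0.0, method: "none" }
}
```

Ordering the slice is what encodes tier precedence here: placing a manual-override lookup first would let it pin the verdict before any rule-based tier runs.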
Default Rules
| ID | Category | Keywords / Patterns |
|---|---|---|
| cc-feat | feature | feat:, feature:, ^feat(...)?!?: |
| cc-fix | bugfix | fix:, bugfix:, hotfix, ^fix(...)?!?: |
| cc-chore | chore | chore:, ^chore(...)?!?: |
| cc-docs | documentation | docs:, doc:, ^docs?(...)?!?: |
| cc-refactor | refactor | refactor:, ^refactor(...)?!?: |
| cc-test | test | test:, tests:, ^tests?(...)?!?: |
| cc-ci | ci | ci:, ^ci(...)?!?: |
| cc-perf | performance | perf:, ^perf(...)?!?: |
| cc-style | style | style:, ^style(...)?!?: |
| cc-build | build | build:, ^build(...)?!?: |
| cc-revert | revert | revert:, ^revert(...)?!?: |
| breaking-change | breaking | breaking change, breaking-change |
| jira-ticket | feature (ticketed) | \b[A-Z][A-Z0-9]+-\d+\b |
| kw-bug | bugfix | bug, defect |
| kw-security | bugfix (security) | security, cve-, vulnerability |
Commits that match no rule are assigned category uncategorized with confidence 0.0.
Custom Rules File
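Supply your own rules in a separate file. The file is standalone by default; set extend_defaults: true to load its rules alongside the built-in defaults: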
# my-rules.yaml
version: "1.0"
rules:
  - id: my-deploy
    category: deployment
    keywords:
      - "deploy:"
      - "release:"
    patterns:
      - "(?i)^deploy(ment)?:"
    priority: 80
    confidence: 0.9
# or in config.yaml:
# classification:
#   rules_file: ./my-rules.yaml
Rule Priority and extend_defaults (issue #259)
Custom rules default to priority 110 — one step above the highest built-in rule priority (100). This means user rules win over the default ruleset without needing an explicit priority: entry in every rule.
Custom rule files also default to standalone mode (extend_defaults: false): only the rules in the file are applied. Opt in to merging with the built-in defaults by adding extend_defaults: true.
# my-rules.yaml — standalone by default (no built-in rules loaded)
extend_defaults: false # optional: explicitly document the default
rules:
  - id: my-deploy
    category: deployment
    keywords:
    # priority defaults to 110 — beats the built-in cc-feat (100)
# my-addons.yaml — augments the defaults
extend_defaults: true
rules:
  - id: my-payments
    category: payments
    keywords:
    confidence: 0.92
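How the default priority and extend_defaults interact can be sketched as follows. The Rule struct is reduced to the fields needed for ordering, and effective_rules is a hypothetical helper; breaking ties by file order is an assumption of this sketch.

```rust
/// Reduced rule record: only what is needed to show ordering.
#[derive(Debug, Clone, PartialEq)]
pub struct Rule {
    pub id: String,
    pub priority: u32,
}

/// Priority assigned to a custom rule that omits `priority:`.
pub const CUSTOM_DEFAULT_PRIORITY: u32 = 110;

/// Combine custom rules with the built-ins according to `extend_defaults`.
/// `custom` pairs each rule id with its optional explicit priority.
pub fn effective_rules(
    builtin: &[Rule],
    custom: &[(&str, Option<u32>)],
    extend_defaults: bool,
) -> Vec<Rule> {
    let mut rules: Vec<Rule> = custom
        .iter()
        .map(|(id, priority)| Rule {
            id: id.to_string(),
            priority: priority.unwrap_or(CUSTOM_DEFAULT_PRIORITY),
        })
        .collect();
    if extend_defaults {
        // Opt-in: load the built-in rules alongside the custom ones.
        rules.extend_from_slice(builtin);
    }
    // Highest priority first; the stable sort keeps input order on ties.
    rules.sort_by(|a, b| b.priority.cmp(&a.priority));
    rules
}
```

In this model a custom rule without an explicit priority sorts ahead of cc-feat (100), while a rule that sets priority: 80 sorts behind it once the defaults are merged in.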
Multi-source classification (issue #260)
tga classify can consult external ticket systems — JIRA Cloud/Server and
GitHub Issues — as high-confidence classification signals before the
commit-message rule tiers run.
Priority model (highest to lowest):
- Manual overrides (tga override add)
- External sources: JIRA issue type / GitHub Issues labels (confidence 0.92)
- Custom regex rules (default priority 110)
- Built-in TGA rules (priority 100)
- LLM fallback (when use_llm: true)
Configuration — add a sources: list to config.yaml under
classification::
classification:
  sources:
    # JIRA source
    - type: jira
      base_url: "https://yourco.atlassian.net"
      token_env: JIRA_API_TOKEN # env var carrying the API token
      username: "you@yourco.com" # omit for Bearer-only tokens
      project_keys: # empty = all projects
      field_mappings:
        issue_type:
          Bug: bug_fix
          Story: new_feature
          Task: tech_debt_refactoring
          Epic: new_feature
        labels:
          ktlo: tech_debt_refactoring
          security: security
        components:
          Platform: platform_infrastructure
    # GitHub Issues source
    - type: github_issues
      repo: "acme/widgets"
      token_env: GITHUB_TOKEN
      label_mappings:
        bug: bug_fix
        enhancement: new_feature
        dependencies: tech_debt_refactoring
        documentation: documentation
Credentials are read from the named environment variables at runtime — never store tokens directly in config files:
Caching: each unique ticket is fetched at most once per tga classify
run (in-memory cache, never persisted to disk). On HTTP failure or missing
token, the source is skipped with a warning and the pipeline falls through
to commit-message rules.
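The per-run caching behaviour can be pictured with a small in-memory cache like the one below. TicketCache and get_or_fetch are hypothetical names, the fetch step is passed in as a closure instead of making a real HTTP call, and remembering failed lookups so they are not retried within the run is an assumption of this sketch.

```rust
use std::collections::HashMap;

/// In-memory, per-run cache of ticket lookups (never written to disk).
pub struct TicketCache {
    entries: HashMap<String, Option<String>>,
    /// Number of times the fetch closure was actually invoked.
    pub fetches: usize,
}

impl TicketCache {
    pub fn new() -> Self {
        TicketCache { entries: HashMap::new(), fetches: 0 }
    }

    /// Return the cached issue type, fetching at most once per ticket.
    /// `None` means "no external signal": the caller falls through to
    /// the commit-message rules.
    pub fn get_or_fetch<F>(&mut self, ticket: &str, fetch: F) -> Option<String>
    where
        F: FnOnce(&str) -> Result<String, String>,
    {
        if let Some(cached) = self.entries.get(ticket) {
            return cached.clone();
        }
        self.fetches += 1;
        // A failed fetch is stored as `None` so it is not retried.
        let result = fetch(ticket).ok();
        self.entries.insert(ticket.to_string(), result.clone());
        result
    }
}
```

A second lookup of the same ticket is served from the map without invoking the closure again, which is what keeps the number of upstream requests bounded by the number of distinct tickets in the run.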
Disabling external sources: use --no-external to skip all sources for
a run (useful in CI or offline environments):
See examples/multi-source-config.yaml
for a fully annotated example.
Deferred sources (planned for a future release): Linear, Shortcut, Confluence, Datadog.
Output Formats
CSV
Nine CSV files are written when csv is in the format list (per-author,
weekly activity, DORA metrics, velocity, quality, and related breakdowns).
The two most commonly used are documented below.
authors.csv — one row per author:
| Column | Description |
|---|---|
| name | Canonical author name |
| email | Canonical author email |
| commit_count | Total commits |
| insertions | Total lines added |
| deletions | Total lines deleted |
| files_changed | Total files changed |
| first_commit | ISO 8601 timestamp of earliest commit |
| last_commit | ISO 8601 timestamp of most recent commit |
weekly_activity.csv — one row per week/author/repository bucket:
| Column | Description |
|---|---|
| week | ISO week label, e.g. 2024-W03 |
| author | Author name |
| repository | Repository name |
| commit_count | Commits in this bucket |
| insertions | Lines added in this bucket |
| deletions | Lines deleted in this bucket |
JSON
Four JSON files are written when json is in the format list:
report.json, velocity_summary.json, quality_summary.json, and
dora_summary.json. The primary one is documented below.
report.json — full structured payload:
Markdown
report.md — a narrative report containing a summary header, per-author commit table, category breakdown, and weekly activity section. Suitable for pasting into Confluence or a PR description.
Development
Build and Test
# Build everything
# Build release binary
# Run all tests
# Lint (zero warnings required)
# Format check (CI gate)
# Auto-format
# Generate rustdoc
Running Against Real Repos
configs/example-config.yaml is a working example that analyzes repositories using developer_aliases. Copy it, adjust paths and names to match your setup, then run:
CI Gates
The GitHub Actions workflow (ci.yml) requires:
- cargo fmt -- --check
- cargo clippy --all-targets -- -D warnings
- cargo test
- cargo doc --no-deps with RUSTDOCFLAGS="-D warnings"
Crate Structure
Single tga crate (consolidated from the original 5-crate workspace):
| Module | Path | Purpose |
|---|---|---|
| tga::core | src/core/ | Shared types, config, DB schema, migrations, error types |
| tga::collect | src/collect/ | Stage 1: git extraction (libgit2), GitHub/JIRA/Linear/ADO clients, PmAdapter trait |
| tga::classify | src/classify/ | Stage 2: seven-tier classification cascade |
| tga::report | src/report/ | Stage 3: CSV/JSON/Markdown output |
| commands (binary-private) | src/commands/ | Subcommand handlers wired into src/main.rs |
License
Non-commercial use only. See LICENSE for terms.
Elastic License 2.0