paceflow 0.2.3 - Docs.rs

# Data Model Migration Proposal

## Goal
Replace the current Rust-centric metric calculation flow with a data model that separates:

- `metadata_*` tables for enrichment context
- `event_*` tables for analytic occurrences at explicit grain
- `view_*` SQL views for metric outputs

Rust should ingest raw data, attach context, and materialize derived events. SQL should aggregate those events into descriptive metrics.

## Why Change
Today, metric logic is spread across Rust code that mixes:

- source parsing
- context derivation
- git inspection
- attribution logic
- aggregation
- report formatting

This makes metrics harder to test, harder to evolve, and harder to recompute at different scopes.

The replacement model is:

`raw sources -> metadata tables -> event tables -> SQL metric views`

## Design Principles
### 1. Every table has one declared grain
Do not mix session-grain, commit-grain, and task-session-grain rows in the same table.

### 2. Metadata is not metric input
`metadata_*` tables describe entities and reusable context. They are joined onto events but should not usually be aggregated directly.

### 3. Events are metric input
`event_*` tables represent one analytic occurrence each and should already contain the dimensions needed for grouping.

### 4. SQL owns ratios and rollups
Rust produces rows. SQL computes counts, sums, averages, rates, and status thresholds.

## Proposed Schema
### Metadata tables
These are the main enrichment tables.

- `metadata_repositories`
  - one row per repo root
- `metadata_sessions`
  - one row per AI session
- `metadata_files`
  - one row per canonical file path within a repo
- `metadata_commits`
  - one row per git commit
- `metadata_commit_file_changes`
  - one row per `(commit, file)`
- `metadata_tasks`
  - one row per recognized task key
- `metadata_branches`
  - one row per observed branch

### Event tables
These are the main analytic inputs.

- `event_session_messages`
  - grain: one row per ingested message
- `event_session_code_changes`
  - grain: one row per accepted or extracted file write operation
- `event_session_quality`
  - grain: one row per session
  - includes user turn count, repeated-debug-loop flag, mid-session-error-paste flag
- `event_commit_attributions`
  - grain: one row per `(commit, provider, session)`
  - includes matched lines and attribution shares
- `event_commit_outcomes`
  - grain: one row per commit
  - includes heavy-AI flag, merged-to-mainline flag, reverted-later flag
- `event_commit_churn`
  - grain: one row per commit
  - includes AI-added lines reaching mainline and churned lines within the churn window
- `event_task_commits`
  - grain: one row per `(task, commit)`
  - includes branch name, fallback flag, confidence
- `event_task_sessions`
  - grain: one row per `(task, session)`
  - includes session weight and commit-within-window flag

### SQL views
Views expose metric-ready outputs with descriptive names.

- `view_repository_commit_quality`
- `view_global_session_quality`
- `view_task_execution_quality`
- `view_repository_churn_summary`
- `view_session_productivity`

## Mapping From Current Tables
Current tables can be migrated roughly as follows:

- `events` -> split into `metadata_sessions`, `event_session_messages`, `event_session_code_changes`
- `change_sessions` -> merge into `metadata_sessions` or keep as provider-source metadata
- `change_ops` -> `event_session_code_changes`
- `git_commits` -> `metadata_commits`
- `git_commit_file_diffs` -> `metadata_commit_file_changes`
- `commit_ai_session_attributions` -> `event_commit_attributions`
- `commit_task_attributions` -> `event_task_commits`
- derived Rust outputs from `metrics/quality.rs` and `task_stats.rs` -> replace with `event_*` builders plus `view_*`

## Implementation Plan
### Phase 1: Introduce the new schema
Add the new `metadata_*`, `event_*`, and `view_*` schema alongside the current one. Do not remove old tables yet.

### Phase 2: Split existing ingestion outputs by grain
Refactor ingest so provider parsing still writes raw events, but then materialization jobs populate:

- `metadata_sessions`
- `event_session_messages`
- `event_session_code_changes`

### Phase 3: Move commit association outputs into event tables
Keep the existing commit-matching logic, but change persistence targets to:

- `metadata_commits`
- `metadata_commit_file_changes`
- `event_commit_attributions`
- `event_task_commits`

### Phase 4: Materialize session and commit outcome events
Move logic that currently lives in Rust metric code into event builders:

- session behavior builder -> `event_session_quality`
- commit outcome builder -> `event_commit_outcomes`
- churn builder -> `event_commit_churn`
- task-session linker -> `event_task_sessions`

At this stage, Rust still computes expensive booleans such as “merged to mainline” or “reverted later”, but it stores them as event fields rather than final metrics.

### Phase 5: Replace Rust aggregations with SQL views
Rewrite reporting commands to query SQL views instead of computing metrics in Rust.

Examples:

- `stats` reads from `view_global_session_quality`, `view_repository_commit_quality`, and `view_session_productivity`
- `task-stats` reads from `view_task_execution_quality`

### Phase 6: Remove old metric code
After parity is verified, retire:

- direct aggregation logic in [`src/metrics/quality.rs`](/home/tadas/Work/paceflow/ai-engineering-analytics/src/metrics/quality.rs)
- task metric aggregation logic in [`src/commands/task_stats.rs`](/home/tadas/Work/paceflow/ai-engineering-analytics/src/commands/task_stats.rs)

Keep only:

- ingestion
- enrichment
- materialization
- rendering

## Concrete Responsibility Split
### Rust responsibilities
- parse provider data
- inspect git history
- resolve repo, file, branch, and task context
- compute expensive derived flags and windows once
- write `metadata_*` and `event_*` rows

### SQL responsibilities
- counts
- sums
- averages
- rates
- filtered rollups
- grouping by repo, task, branch, provider, and time
- optional grading/status logic

## Example View Shapes
### `view_global_session_quality`
Built from `event_session_quality`.

- total sessions
- average user prompts per session
- repeated-debug-loop session rate
- mid-session-error-paste session rate

### `view_repository_commit_quality`
Built from `event_commit_outcomes` and `event_commit_churn`.

- heavy-AI commit count
- heavy-AI reverted-later rate
- heavy-AI merged-to-mainline rate
- churn rate for AI-added code after merge

### `view_task_execution_quality`
Built from `event_task_sessions` and `event_task_commits`.

- average user prompts for task-linked sessions
- repeated-debug-loop rate for task-linked sessions
- mid-session-error-paste rate for task-linked sessions
- session-with-commit-within-window rate
- commit count per task

## Migration Risks
- Reusing the current `events` table as both metadata and event source will keep the model muddy unless it is clearly split.
- Some current report outputs are not metrics and should stay outside views.

Examples:

- dominant branch selection
- `vs staging` diff output
- display formatting

- Backfill logic must preserve row grain exactly, or SQL views will overcount.

## Recommendation
Implement the new model incrementally, with old and new outputs running in parallel until the SQL views match today’s reports.

The target architecture should be:

`provider and git inputs -> metadata_* tables -> event_* tables -> view_* metric outputs -> CLI rendering`

This gives a clean separation between enrichment data and analytic events, keeps each table at one grain, and moves metric logic into SQL where it is easier to inspect, test, and extend.