paceflow 0.2.1 - Docs.rs

# Data Model Migration Proposal V2

## Goal
Rework the current analytics pipeline into four explicit layers:

- `metadata_*` for enrichment context
- `fact_*` for atomic source-derived records
- `event_*` for enriched analytic events at reporting grain
- `view_*` for SQL metric outputs

The target architecture is:

`raw sources -> metadata_* + fact_* -> Rust-built event_* -> SQL-built view_* -> CLI rendering`

This replaces the current approach where Rust both derives facts and computes final metrics.

## Operating Model
There should be two clearly separated phases.

### Build phase: Rust
Rust owns ingestion, source interpretation, and event materialization.

Flow:

1. run `ingest`
2. read provider files, repositories, git history, and local source artifacts
3. normalize the raw inputs
4. write or update `metadata_*` tables
5. write or update `fact_*` tables
6. read `metadata_*` and `fact_*`
7. build or refresh `event_*`

Output of the build phase:

- reference context in `metadata_*`
- atomic source-derived records in `fact_*`
- enriched analytic rows in `event_*`

Rust still does not calculate final metric ratios or reporting rollups.

### Read phase: SQL query layer
SQL owns aggregation and final metric calculation.

Flow:

1. read from `event_*`
2. create or refresh `view_*`
3. read from `view_*` to derive metrics
4. display query results in the CLI

Output of the read phase:

- `view_*` as final reporting surfaces

## Why This Version Is Better
The earlier proposal split data into metadata and events. This version adds a separate `fact_*` layer, which is important because “source-derived truth” and “report-ready analytic event” are not the same thing.

The separation becomes:

- `metadata_*`: what things are
- `fact_*`: what was directly observed or derived from one source pass
- `event_*`: what happened in business/analytic terms after enrichment
- `view_*`: aggregated outputs for reporting

This gives:

- cleaner grain boundaries
- easier recomputation when enrichment rules change
- better remote-sync options
- better auditability

## Layer Definitions
### `metadata_*`
Reference and enrichment tables. These should be stable and reusable across many pipelines.

Examples:

- `metadata_repositories`
- `metadata_sessions`
- `metadata_files`
- `metadata_commits`
- `metadata_tasks`
- `metadata_branches`

These tables are not metric inputs by themselves. They provide keys, labels, classification, and context.

### `fact_*`
Atomic source-derived records. These are close to the original observed data and should preserve one clear grain.

Examples:

- `fact_session_message`
  - grain: one row per ingested message
- `fact_session_code_change`
  - grain: one row per accepted or extracted file write
- `fact_commit`
  - grain: one row per scanned commit
- `fact_commit_file_change`
  - grain: one row per `(commit, file)`
- `fact_commit_session_match`
  - grain: one row per `(commit, provider, session)` raw attribution result

These tables should avoid unnecessary denormalization. They are the durable analytic source of truth after parsing.

### `event_*`
Curated, enriched, denormalized event tables used for metrics and remote analytics.

Examples:

- `event_session_quality`
  - grain: one row per session
- `event_commit_outcome`
  - grain: one row per commit
- `event_commit_churn`
  - grain: one row per commit
- `event_task_commit`
  - grain: one row per `(task, commit)`
- `event_task_session`
  - grain: one row per `(task, session)`

These tables should already carry the business dimensions needed for reporting, such as repo, provider, task key, branch name, and timestamps.

### `view_*`
SQL views that compute counts, averages, rates, filters, and presentation-friendly rollups.

Examples:

- `view_global_session_quality`
- `view_repository_commit_quality`
- `view_repository_churn_summary`
- `view_task_execution_quality`
- `view_session_productivity`

## Design Principles
### 1. One table, one grain
Every table must declare its grain. Mixed-grain tables cause double counting and ambiguous SQL.

### 2. Metrics come from `event_*`, not from raw `fact_*`
`fact_*` tables preserve source truth. `event_*` tables encode the canonical enriched interpretation for analysis.

### 3. SQL owns aggregation
Rust prepares `metadata_*`, `fact_*`, and `event_*`. SQL computes rates and rollups and exposes final reporting views.

### 4. Expensive interpretation happens once
Rust should do source-specific work once and persist the result into `metadata_*`, `fact_*`, and `event_*`.

Examples:

- repeated debug loop
- mid-session error paste
- heavy AI commit
- merged to mainline
- reverted later
- commit within session window

## Proposed Schema
### Metadata tables
- `metadata_repositories`
  - grain: one row per repository
- `metadata_sessions`
  - grain: one row per session
- `metadata_files`
  - grain: one row per `(repository, relative_path)`
- `metadata_tasks`
  - grain: one row per task key
- `metadata_branches`
  - grain: one row per `(repository, branch_name)`

### Fact tables
- `fact_session_message`
  - grain: one row per message
  - keys: session id, provider, timestamp
- `fact_session_code_change`
  - grain: one row per code change operation
  - keys: session id, provider, file, timestamp, parser/call id
- `fact_commit`
  - grain: one row per commit
  - keys: repository, commit sha
- `fact_commit_file_change`
  - grain: one row per `(commit, file)`
- `fact_commit_session_match`
  - grain: one row per `(commit, provider, session)`
  - includes matched lines and attribution shares
- `fact_task_commit_assignment`
  - grain: one row per `(task, commit)`
  - includes source, fallback flag, confidence

### Event tables
- `event_session_quality`
  - grain: one row per session
  - fields:
    - repository
    - provider
    - session id
    - task key if known
    - session start/end
    - user turn count
    - repeated debug loop flag
    - mid-session error paste flag
- `event_session_productivity`
  - grain: one row per session
  - fields:
    - accepted lines added
    - accepted lines removed
    - accepted total changed lines
    - user word count
- `event_commit_outcome`
  - grain: one row per commit
  - fields:
    - repository
    - commit sha
    - heavy AI flag
    - merged to mainline flag
    - reverted later flag
    - total matched AI lines
    - commit total changed lines
- `event_commit_churn`
  - grain: one row per commit
  - fields:
    - repository
    - commit sha
    - AI-added lines that reached mainline
    - AI-added lines removed within churn window
- `event_task_commit`
  - grain: one row per `(task, commit)`
  - fields:
    - repository
    - task key
    - branch name
    - commit sha
    - fallback flag
    - confidence
- `event_task_session`
  - grain: one row per `(task, session)`
  - fields:
    - repository
    - task key
    - provider
    - session id
    - attribution weight
    - commit-within-window flag
    - session quality fields copied in for simpler SQL

## Mapping From Current Implementation
### Current `events`
Split into:

- `metadata_sessions`
- `fact_session_message`
- `fact_session_code_change`

### Current `change_sessions`
Move into session-related metadata or keep as provider-source metadata if needed for lineage.

### Current `change_ops`
Becomes `fact_session_code_change`.

### Current `git_commits`
Becomes `fact_commit` or `metadata_commits` depending on whether you want commit rows treated as source facts or reference entities. I recommend `fact_commit`, since commit ingestion is itself an observed dataset.

### Current `git_commit_file_diffs`
Becomes `fact_commit_file_change`.

### Current `commit_ai_session_attributions`
Becomes `fact_commit_session_match`.

### Current `commit_task_attributions`
Becomes `fact_task_commit_assignment`, then feeds `event_task_commit`.

### Current Rust metric code
Replace with Rust-built `event_*` materializers and SQL `view_*` objects:

- [`src/metrics/quality.rs`](/home/tadas/Work/paceflow/ai-engineering-analytics/src/metrics/quality.rs) -> mostly replaced by builders for `event_session_quality`, `event_commit_outcome`, `event_commit_churn`
- [`src/commands/task_stats.rs`](/home/tadas/Work/paceflow/ai-engineering-analytics/src/commands/task_stats.rs) -> mostly replaced by builder for `event_task_session` plus SQL view queries

## Migration Plan
### Phase 1: Add new schema side by side
Create the new `metadata_*`, `fact_*`, `event_*`, and `view_*` objects without removing current tables.

### Phase 2: Materialize fact tables from existing ingestion
Keep the current provider and git parsing flow, but persist outputs into `fact_*` tables with explicit grain.

### Phase 3: Build event materializers
Add Rust materializers that read `metadata_*` and `fact_*` and create or refresh:

- `event_session_quality`
- `event_session_productivity`
- `event_commit_outcome`
- `event_commit_churn`
- `event_task_commit`
- `event_task_session`

These materializers are where reporting-grain joins and denormalization happen.

### Phase 4: Add SQL metric views
Create descriptive `view_*` objects such as:

- `view_global_session_quality`
- `view_repository_commit_quality`
- `view_repository_churn_summary`
- `view_task_execution_quality`
- `view_session_productivity`

These views should compute counts, averages, and rates only from `event_*`.

### Phase 5: Switch CLI commands to read-only mode
Update `stats` and `task-stats` to query `view_*` outputs and format results. The SQL layer becomes the source of truth for metric values.

### Phase 6: Remove old Rust metric aggregation
After parity is verified, retire direct metric computation in Rust.

## Remote Sync / Data Lake Compatibility
This design is friendly to remote sync.

### What to sync by default
- all `event_*` tables
- key `metadata_*` tables needed for joins and labels

Recommended minimum metadata sync:

- `metadata_repositories`
- `metadata_sessions`
- `metadata_tasks`
- `metadata_branches`

Add `metadata_files` if remote file-level analysis matters.

### What to sync optionally
- `fact_*` tables for audit, replay, debugging, or model rebuilds

### What not to sync unless needed
- `view_*` tables
- local ingest cursors
- parse error logs
- transient caches

The main sync contract should be `event_*`, because those rows are already enriched and analytically meaningful.

## Example Responsibility Split
### Rust
- parse provider data
- scan git history
- attach source-level repo/task/branch/file/session context where needed
- write `metadata_*`
- write `fact_*`
- build or refresh `event_*`

### SQL
- compute averages
- compute rates
- group by repository, task, provider, branch, and time
- expose final reporting views

## Recommendation
Adopt this four-layer model.

It gives the clearest separation of concerns:

- `metadata_*` for context
- `fact_*` for source-derived truth
- `event_*` for canonical analytic events
- `view_*` for metric outputs

This is the best fit for future SQL-driven reporting and later remote data lake sync.