converge-analytics 0.1.2

Analytics and ML training for Converge agents using Burn and Polars

Converge Analytics

An analytics-focused agent stack built on the Converge agent framework that automates the complete machine learning pipeline—from raw data to trained models to deployment decisions. It demonstrates how agents can drive systems toward convergence by iteratively improving model quality through data splitting, feature engineering, training, evaluation, and monitoring.

Key Characteristics

  • Data-aware: understands dataset schemas, missing values, and data quality.
  • Feature-first: generates deterministic feature vectors from raw inputs (Polars).
  • Model-capable: trains or loads models and runs inference (Burn).
  • Traceable: emits facts/proposals with provenance and confidence.
  • Composable: agents can be chained (features → embeddings → training → inference).
  • Repeatable: deterministic runs with versioned data and models.
  • Convergent: iterative training loop that expands data until quality threshold is met.
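The "traceable" and "composable" properties above hinge on agents exchanging facts that carry provenance and confidence. The real fact types live in converge-core and are not shown here; this std-only sketch uses hypothetical `Fact`/`Provenance` stand-ins purely to illustrate the shape of the contract.

```rust
// Illustrative only: `Fact` and `Provenance` are hypothetical stand-ins
// for the converge-core types, which may differ.
#[derive(Debug, Clone)]
struct Provenance {
    agent: String,      // which agent emitted the fact
    input_version: u32, // version of the data it was derived from
}

#[derive(Debug, Clone)]
struct Fact {
    kind: String,           // e.g. "FeatureSpec", "EvaluationReport"
    payload: String,        // serialized content (JSON in the real pipeline)
    confidence: f64,        // 0.0..=1.0
    provenance: Provenance, // who produced it, from what input
}

fn emit(agent: &str, kind: &str, payload: &str, confidence: f64) -> Fact {
    Fact {
        kind: kind.to_string(),
        payload: payload.to_string(),
        confidence,
        provenance: Provenance { agent: agent.to_string(), input_version: 1 },
    }
}

fn main() {
    // A downstream agent can chain on this fact by kind, checking provenance.
    let fact = emit("FeatureEngineeringAgent", "FeatureSpec", "{\"numeric\":[\"sqft\"]}", 0.9);
    println!("{} from {} (confidence {})", fact.kind, fact.provenance.agent, fact.confidence);
}
```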

Architecture Overview

The project is built on three core technologies:

Component         Technology      Purpose
Data Processing   Polars          CSV/Parquet loading, feature extraction
Deep Learning     Burn            MLP models, inference
Orchestration     Converge Core   Agent framework, fact management

What This Repo Provides

Core Modules

  • engine.rs — Feature extraction agent using Polars (CSV/Parquet → feature vectors)
  • model.rs — Inference agent using Burn MLP (feature vectors → predictions)
  • training.rs — Complete training pipeline with 10 specialized agents
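The raw-input → feature-vector step that engine.rs performs with Polars can be pictured with a std-only sketch. This is not the Polars code path; the CSV layout, all-numeric columns, and skip-on-missing policy here are assumptions made for illustration.

```rust
// Std-only sketch of deterministic feature extraction: parse one CSV row
// into an f32 feature vector, returning None if any cell is missing or
// non-numeric (the real engine.rs does this with Polars dataframes).
fn extract_features(csv_row: &str) -> Option<Vec<f32>> {
    csv_row
        .split(',')
        .map(|cell| cell.trim().parse::<f32>().ok())
        .collect() // Option<Vec<f32>>: None if any cell failed to parse
}

fn main() {
    let rows = ["8.3252,41.0,6.98", "bad,41.0,6.98"];
    for row in rows {
        match extract_features(row) {
            Some(v) => println!("features: {:?}", v),
            None => println!("skipped row with unparseable value"),
        }
    }
}
```

Because the mapping is a pure function of the row, the same versioned input always yields the same feature vectors, which is what makes runs repeatable.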

Training Pipeline

A 10-agent pipeline that orchestrates the full ML lifecycle:

  1. DatasetAgent — Downloads datasets, splits into train/val/infer
  2. DataValidationAgent — Quality checks, drift detection, outlier counts
  3. FeatureEngineeringAgent — Defines feature specs and interactions
  4. HyperparameterSearchAgent — Plans hyperparameter search
  5. ModelTrainingAgent — Trains models (baseline mean predictor)
  6. ModelEvaluationAgent — Computes MAE, success_ratio metrics
  7. SampleInferenceAgent — Runs inference on sample data
  8. ModelRegistryAgent — Catalogs versioned models with metrics
  9. MonitoringAgent — Health status based on thresholds
  10. DeploymentAgent — Deploy/hold/retrain decisions
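The final deploy/hold/retrain step can be sketched as a pure decision function over the metrics the earlier agents emit. The thresholds and the drift input here are hypothetical; only the 0.75 success_ratio bar comes from the convergence loop described below, and training.rs may weigh things differently.

```rust
// Hypothetical sketch of DeploymentAgent's decision rule; the actual
// inputs and thresholds in training.rs may differ.
#[derive(Debug, PartialEq)]
enum Decision {
    Deploy,
    Hold,
    Retrain,
}

fn decide(success_ratio: f64, drift: f64) -> Decision {
    if drift > 0.2 {
        Decision::Retrain // data has moved too far; retrain before serving
    } else if success_ratio >= 0.75 {
        Decision::Deploy // quality bar met (same threshold as the convergence loop)
    } else {
        Decision::Hold // keep the current model and wait for more data
    }
}

fn main() {
    println!("{:?}", decide(0.8, 0.05)); // Deploy
}
```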

See AGENTS.md for detailed documentation of each agent.

Convergence Loop

The training_flow example demonstrates iterative convergence:

Iteration 1: Train on 500 rows → Evaluate
    ↓
Quality Check: success_ratio >= 0.75?
    ↓ No
Expansion: Double max_rows (up to dataset size)
    ↓
Repeat (up to 5 iterations)
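The loop above can be sketched in a few lines. The `evaluate` function here is a made-up stand-in that improves with more rows (the real pipeline computes success_ratio from MAE on the validation split); the 500-row start, doubling rule, 0.75 threshold, and 5-iteration cap match the diagram.

```rust
// Stand-in evaluation: pretend quality improves with more data.
// The real pipeline derives success_ratio from validation-set error.
fn evaluate(rows_used: usize) -> f64 {
    (rows_used as f64 / 8000.0).min(0.9)
}

// Convergence loop: train on max_rows rows, check quality, and double
// the training slice (capped at dataset size) until success_ratio >= 0.75
// or 5 iterations have run. Returns (iterations_run, final_score).
fn converge(dataset_size: usize) -> (usize, f64) {
    let mut max_rows = 500usize;
    let mut score = 0.0;
    for iteration in 1..=5 {
        let rows = max_rows.min(dataset_size);
        score = evaluate(rows);
        if score >= 0.75 {
            return (iteration, score); // quality bar met: converged
        }
        max_rows *= 2; // expand the data and try again
    }
    (5, score) // iteration cap reached without converging
}

fn main() {
    // 20_640 rows is the size of the full California Housing dataset.
    println!("{:?}", converge(20_640));
}
```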

Data Flow

Seeds (intent to train)
  ↓
DatasetAgent → DatasetSplit (train/val/infer files)
  ↓
DataValidationAgent → DataQualityReport (stats, drift)
  ↓
FeatureEngineeringAgent → FeatureSpec (numeric/categorical, interactions)
  ↓
HyperparameterSearchAgent → HyperparameterSearchPlan & Result
  ↓
ModelTrainingAgent → BaselineModel (JSON with mean)
  ↓
ModelEvaluationAgent → EvaluationReport (MAE, success_ratio)
  ↓
SampleInferenceAgent → InferenceSample (predictions vs actuals)
  ↓
ModelRegistryAgent → ModelRegistryRecord (versioned metadata)
  ↓
MonitoringAgent → MonitoringReport (health status)
  ↓
DeploymentAgent → DeploymentDecision (deploy/hold/retrain)

Project Structure

converge-analytics/
├── Cargo.toml                    # Project manifest
├── src/
│   ├── lib.rs                    # Module exports
│   ├── engine.rs                 # Feature extraction (Polars)
│   ├── model.rs                  # Inference (Burn MLP)
│   └── training.rs               # 10-agent training pipeline
├── examples/
│   ├── agent_loop.rs             # Feature → inference demo
│   └── training_flow.rs          # Full training with convergence
├── tests/
│   ├── agent_flow.rs             # Integration: feature → inference
│   └── polars_burn_integration.rs # Polars + Burn interop
├── data/                         # Training data storage
├── models/                       # Trained model artifacts
├── datasets/                     # Dataset documentation
└── Justfile                      # Dataset loading commands

Current Implementation Status

Fully Implemented:

  • Dataset download and splitting (California Housing from HuggingFace)
  • Polars-based feature extraction
  • Baseline mean model training
  • MAE evaluation with success_ratio
  • Sample inference with predictions vs actuals
  • Multi-iteration convergence loop

Simplified/Mock (for learning purposes):

  • Hyperparameter search is heuristic (no actual trials)
  • Feature specs defined but not applied to data
  • Baseline model is mean predictor (not learned)
  • Drift detection is simple mean-delta
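The three simplified pieces above are small enough to sketch directly. This is an illustrative reconstruction, not the training.rs code: the success tolerance, the exact success_ratio definition (share of predictions within a tolerance of the actual value), and all numbers are assumptions.

```rust
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

// "Training" the baseline model is just storing the target mean.
fn train_baseline(targets: &[f64]) -> f64 {
    mean(targets)
}

// MAE plus success_ratio, assumed here to be the share of predictions
// within `tol` of the actual value.
fn evaluate(model_mean: f64, actuals: &[f64], tol: f64) -> (f64, f64) {
    let errs: Vec<f64> = actuals.iter().map(|a| (a - model_mean).abs()).collect();
    let mae = mean(&errs);
    let success_ratio = errs.iter().filter(|e| **e <= tol).count() as f64 / errs.len() as f64;
    (mae, success_ratio)
}

// Simple mean-delta drift check: absolute difference between the means
// of the training and validation targets.
fn mean_delta_drift(train: &[f64], val: &[f64]) -> f64 {
    (mean(train) - mean(val)).abs()
}

fn main() {
    let model = train_baseline(&[2.0, 4.0, 6.0]);
    let (mae, sr) = evaluate(model, &[3.0, 5.0], 1.5);
    println!("mean model = {model}, MAE = {mae}, success_ratio = {sr}");
}
```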

Roadmap

  • Wire feature specs into training/inference
  • Replace mean baseline with learned models using engineered features
  • Implement real hyperparameter search with actual model trials
  • Add persistent artifact storage and evaluation versioning
  • Improve monitoring with statistical drift and bias checks
  • Integrate deployment orchestration with retraining triggers

Suggested Datasets

Each dataset has a short write-up under datasets/.

Mobility and travel:

  • nyu-mll/nyc-taxi-trip-duration (datasets/nyc-taxi-trip-duration.md)
  • UberMovement/uber-movement (datasets/uber-movement.md)
  • OpenFlights (datasets/openflights.md)

Housing and pricing:

  • gvlassis/california_housing (datasets/california-housing.md)
  • kraina/airbnb (datasets/airbnb.md)
  • kaggle/airbnb (datasets/kaggle-airbnb.md)

Procurement and sourcing:

  • theRACER/Procurement (datasets/procurement.md)
  • mhimchak/supply_chain_data (datasets/supply-chain.md)
  • shuyan/walmart-trips (datasets/walmart-trips.md)

Marketing and social:

  • cardiffnlp/tweet_eval (datasets/tweet-eval.md)
  • megagonlabs/finance-tweet (datasets/finance-tweet.md)
  • mteb/twitter (datasets/mteb-twitter.md)
  • mteb/reddit (datasets/mteb-reddit.md)

Quickstart

cargo test
cargo run --example agent_loop
cargo run --example training_flow

License

MIT. See LICENSE.

Development Tools

This project is built with the help of modern AI-assisted development tools:

Agent Tooling (Examples)

This project supports a tool-agnostic agent workflow. Claude, Codex, Gemini, Cursor, and similar tools are optional frontends; the shared contract is visible task state, scoped changes, validation, and explicit handoffs.

  • Claude Code - Example interactive coding agent
  • Cursor - Example AI-powered IDE workflow
  • Antigravity - Example AI pair-programming tool
  • Frontier models (Claude / Gemini / others) - Use any provider that fits the task and team policy

Version Control & Task Tracking

  • Jujutsu (jj) - Use jj on top of Git for day-to-day version control (commit/diff/rebase/undo)
  • Task tracking (tool-agnostic) - Use GitHub Issues, Jira, Linear, or a repo-local TASKS.md
# Quick workflow (agent-friendly)
jj status                 # See changes
jj diff                   # Review changes
jj commit -m "message"    # Commit
jj git push               # Push via git remote
# Update tracker or TASKS.md with status + handoff

Key Rust Crates

Crate                Purpose
tokio                Async runtime
axum                 HTTP framework
serde / serde_json   Serialization
thiserror            Error handling
tracing              Structured logging
rayon                Parallel computation
proptest             Property-based testing
burn                 ML/deep learning (converge-llm)
tonic / prost        gRPC support