lawkit
Multi-law statistical analysis toolkit - uncover hidden patterns and detect anomalies with confidence
日本語版 README | 中文版 README | English README
A next-generation statistical analysis toolkit that detects anomalies, patterns, and insights using multiple statistical laws. Perfect for fraud detection, data quality assessment, and business intelligence.
# Traditional tools analyze one pattern at a time
# lawkit provides comprehensive multi-law analysis
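Concretely, the contrast might look like this; a minimal sketch assuming `benf` and `lawkit` are on PATH (the dataset here is synthetic, built on the fly):

```shell
# build a small demo dataset (squares spread across magnitudes,
# the kind of multi-scale data these laws expect)
seq 1 200 | awk '{print $1 * $1}' > transactions.csv

# traditional single-law check: Benford only
if command -v benf >/dev/null; then
  benf transactions.csv
fi

# lawkit multi-law comparison: all five laws plus contradiction detection
if command -v lawkit >/dev/null; then
  lawkit compare transactions.csv
fi
```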
Key Features
- Multi-Law Analysis: Benford, Pareto, Zipf, Normal, Poisson distributions
- International Input: Parse numbers in English, Japanese, Chinese, Hindi, Arabic formats
- Smart Integration: Compare multiple laws for comprehensive insights
- High Performance: Built in Rust with parallel processing
- Rich Output: Text, JSON, CSV, YAML, TOML, XML formats
- Meta-Chaining: Analyze trends in statistical patterns over time
- Advanced Outlier Detection: LOF, Isolation Forest, DBSCAN, Ensemble methods
- Time Series Analysis: Trend detection, seasonality, changepoint analysis
- Memory Efficient: Streaming mode for large datasets with chunked processing
Performance Benchmarks
# Benchmark on 100K data points
| Dataset Size | Single Law | Multi-Law | Memory Usage |
|---|---|---|---|
| 1K points | 8ms | 25ms | 2.1MB |
| 10K points | 45ms | 180ms | 8.4MB |
| 100K points | 180ms | 850ms | 32MB |
| 1M points | 2.1s | 9.2s | 128MB |
Architecture
Multi-Law Analysis Pipeline
graph TB
subgraph "Input Processing"
A[Raw Data] --> B[International Number Parser]
B --> C[Format Detector]
C --> D[Data Validator]
end
subgraph "Statistical Analysis Engine"
D --> E[Benford's Law]
D --> F[Pareto Analysis]
D --> G[Zipf's Law]
D --> H[Normal Distribution]
D --> I[Poisson Distribution]
end
subgraph "Integration Layer"
E --> J[Risk Assessor]
F --> J
G --> J
H --> J
I --> J
J --> K[Contradiction Detector]
K --> L[Recommendation Engine]
end
subgraph "Output Generation"
L --> M[Multi-format Output]
M --> N[CLI Display]
M --> O[JSON/CSV Export]
M --> P[Integration APIs]
end
Meta-Chaining: Advanced Pattern Tracking
One of lawkit's unique capabilities is meta-chaining - analyzing how statistical patterns evolve over time by comparing analysis results themselves.
graph LR
subgraph "Time Series Analysis"
A[data_jan.csv] --> D1[lawkit compare]
B[data_feb.csv] --> D1
D1 --> R1[analysis_jan_feb.json]
B --> D2[lawkit compare]
C[data_mar.csv] --> D2
D2 --> R2[analysis_feb_mar.json]
R1 --> D3[lawkit compare]
R2 --> D3
D3 --> M[Meta-Analysis Report]
end
subgraph "Insights Generated"
M --> |"Pattern Evolution"| T1[Trend Detection]
M --> |"Anomaly Tracking"| T2[Risk Escalation]
M --> |"Quality Drift"| T3[Process Monitoring]
M --> |"Fraud Progression"| T4[Security Alerts]
end
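The chain in the diagram can be scripted directly. A sketch, with two assumptions flagged: that `lawkit compare` accepts multiple input files (as the diagram's two-inputs-per-node suggests), and that it can re-ingest its own JSON reports for the meta step; the monthly files here are synthesized for illustration:

```shell
# synthesize three monthly exports (values are illustrative)
for m in jan feb mar; do
  seq 1 100 | awk '{print $1 * 37}' > "data_${m}.csv"
done

if command -v lawkit >/dev/null; then
  # pairwise comparisons between consecutive months
  lawkit compare data_jan.csv data_feb.csv --format json > analysis_jan_feb.json
  lawkit compare data_feb.csv data_mar.csv --format json > analysis_feb_mar.json
  # meta-chaining: compare the comparison reports themselves
  lawkit compare analysis_jan_feb.json analysis_feb_mar.json
fi
```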
Quick Start
Installation
# Install from crates.io
# Or download pre-built binaries
Basic Usage
# Single law analysis
# Multi-law comparison (recommended)
# Advanced analysis with filtering
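Concretely, assuming per-law subcommands named after each law (only `lawkit compare` is spelled out elsewhere in this README, so treat `lawkit benf` as an assumption and check `lawkit --help`):

```shell
# small sample file for the examples
seq 1 300 | awk '{print $1 * 3}' > data.csv

if command -v lawkit >/dev/null; then
  lawkit benf data.csv                                   # single law analysis
  lawkit compare data.csv                                # multi-law comparison (recommended)
  lawkit benf data.csv --filter ">=100" --format json    # filtered, machine-readable
fi
```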
Supported Statistical Laws
1. Benford's Law
Use Case: Fraud detection in financial data
Detects unnatural digit distributions that may indicate data manipulation.
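The expected distribution behind this test follows directly from the law, P(d) = log10(1 + 1/d), and can be computed with nothing but awk; the first three values match the "expected" percentages shown in the sample output later in this README:

```shell
# expected Benford first-digit frequencies: P(d) = log10(1 + 1/d)
awk 'BEGIN {
  for (d = 1; d <= 9; d++)
    printf "%d: %.1f%%\n", d, 100 * log(1 + 1 / d) / log(10)
}'
```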
2. Pareto Analysis (80/20 Rule)
Use Case: Business prioritization and inequality measurement
Identifies the vital few that drive the majority of results.
3. Zipf's Law
Use Case: Frequency analysis and text mining
Analyzes power-law distributions in rankings and frequencies.
4. Normal Distribution
Use Case: Quality control and outlier detection
Statistical process control and anomaly detection.
5. Poisson Distribution
Use Case: Event occurrence and rare event modeling
Models and predicts discrete event occurrences.
Data Generation & Testing
lawkit includes powerful data generation capabilities for education, testing, and demonstration purposes.
Generate Sample Data
Create statistically accurate sample data following specific laws:
# Generate 1000 Benford's law samples
# Generate Pareto distribution (80/20 rule)
# Generate Zipf distribution with custom parameters
# Generate normal distribution data
# Generate Poisson event data
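As a sketch: the awk one-liner below is a portable stand-in (numbers spread uniformly on a log scale follow Benford's law), while the built-in generator's subcommand and flag names are assumptions, not confirmed by this README, so check `lawkit --help` for the real spelling:

```shell
# portable stand-in: log-uniform numbers follow Benford's law
awk 'BEGIN {
  srand(42)
  for (i = 0; i < 1000; i++)
    print int(exp(rand() * log(100000)) + 1)
}' > benford_sample.csv

# built-in generator (hypothetical invocation)
if command -v lawkit >/dev/null; then
  lawkit generate benf --samples 1000
fi
```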
Generate-to-Analysis Pipeline
Combine generation with analysis for testing and validation:
# Test Benford's law detection
# Verify Pareto principle with generated data
# Validate statistical methods
# Test with fraud injection
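The pipe itself is the point of this pattern; here a local awk generator stands in for the (assumed) built-in one, feeding analysis directly on stdin:

```shell
# local generator piped straight into analysis
gen() {
  awk 'BEGIN {
    srand(7)
    for (i = 0; i < 500; i++)
      print int(exp(rand() * log(10000)) + 1)
  }'
}

if command -v benf >/dev/null; then
  gen | benf --verbose
else
  gen | head -3    # preview when benf is absent
fi
```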
Self-Testing
Run comprehensive self-tests to verify lawkit functionality:
# Run all self-tests
# Test specific functionality
Advanced Analysis Features
lawkit 2.1 introduces sophisticated analysis capabilities:
# Advanced outlier detection with ensemble methods
# LOF (Local Outlier Factor) for complex patterns
# Time series analysis for trend detection
# Parallel processing for large datasets
# Memory-efficient streaming for huge files
Educational Use Cases
Perfect for teaching statistical concepts:
# Demonstrate central limit theorem
# Show Pareto principle in action
# Compare different distributions
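The central limit theorem demo above can be sketched like this: averages of many uniform samples approach a normal distribution, which the normality check should then confirm (the `lawkit normal` subcommand name is an assumption):

```shell
# each line is the mean of 30 uniform draws; means cluster normally
for i in 1 2 3 4 5; do
  awk -v seed="$i" 'BEGIN {
    srand(seed)
    s = 0
    for (j = 0; j < 30; j++) s += rand()
    print s / 30
  }'
done > sample_means.txt

if command -v lawkit >/dev/null; then
  lawkit normal sample_means.txt    # test the means for normality
fi
```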
International Numeral Support
Supported Number Formats
1. Full-width Digits
2. Kanji Numerals (Basic)
3. Kanji Numerals (Positional)
4. Mixed Patterns
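lawkit performs this normalization internally; the full-width-digit case can be illustrated in plain awk (literal string replacement, so it is locale-safe), with the `benf` pipe guarded in case the binary is not installed:

```shell
# full-width digits -> ASCII, the same mapping lawkit applies on input
echo "１２３４５６" | awk '{
  n = split("０ １ ２ ３ ４ ５ ６ ７ ８ ９", zen, " ")
  for (d = 0; d < n; d++) gsub(zen[d + 1], d)
  print
}'

# Japanese-formatted numbers can be fed straight to benf
if command -v benf >/dev/null; then
  echo "１２３４５６ ７８９０１２" | benf
fi
```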
Conversion Rules
| Kanji | Number | Notes |
|---|---|---|
| 一 | 1 | Basic digit |
| 十 | 10 | Tens place |
| 百 | 100 | Hundreds place |
| 千 | 1000 | Thousands place |
| 万 | 10000 | Ten thousands place |
| 一千二百三十四 | 1234 | Positional notation |
Decimal Numbers
# Only numbers ≥ 1 are analyzed
# Result: 1, (excluded), 1 (numbers < 1 are excluded)
Negative Numbers
# Uses absolute value's first digit
# Result: 1, 4, 7
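The extraction rule described above (take the absolute value, exclude numbers below 1, keep the leading significant digit) is easy to mirror in awk:

```shell
# first significant digit: absolute value, skip numbers < 1
printf '123.45\n-45.67\n0.5\n78.9\n' | awk '{
  v = ($1 < 0 ? -$1 : $1)    # absolute value
  if (v < 1) next            # numbers below 1 are excluded
  while (v >= 10) v /= 10    # peel down to the leading digit
  print int(v)
}'
```

On the four inputs above this prints 1, 4, and 7, matching the documented result.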
Chinese Numeral Compatibility
Current implementation supports basic Chinese numerals that are identical to Japanese kanji:
Supported (Basic Forms)
- 一二三四五六七八九 (1-9) - Same as Japanese
- 十百千 (10, 100, 1000) - Positional markers
Planned Support
- Financial forms: 壹貳參肆伍陸柒捌玖 (anti-fraud variants)
- Traditional: 萬 (10,000) vs Japanese 万
- Regional variants: Traditional vs Simplified Chinese
Other Numeral Systems
lawkit focuses on the most widely used numeral systems in international business and financial analysis:
- Core Support: English (ASCII), Japanese (full-width, kanji), Chinese (simplified/traditional)
- Additional Input: Hindi (Devanagari), Arabic (Eastern Arabic-Indic numerals)
- Business Focus: International accounting standards, cross-border financial analysis
- Output Language: English (for maximum compatibility and searchability)
Note: Input supports all major numeral systems globally. Documentation focuses on primary business markets (English/Japanese/Chinese). Output standardized to English for international compatibility.
Installation
From Source
Binary Releases
Download from releases page
Quick Start
# Analyze CSV file
# Analyze website data with curl
# Pipe data
# JSON output for automation
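The four patterns above, spelled out; the sample file is generated on the spot and the URL is a placeholder:

```shell
# sample data for the examples
seq 1 250 | awk '{print $1 * 13}' > financial_data.csv

if command -v benf >/dev/null; then
  benf financial_data.csv                          # analyze a CSV file
  curl -s https://api.example.com/data | benf      # analyze web data
  cat financial_data.csv | benf                    # pipe local data
  benf financial_data.csv --format json            # JSON output for automation
fi
```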
Usage
Basic Syntax
Input Methods
- File path: `benf financial_data.xlsx`
- Web data with curl: `curl -s https://api.example.com/data | benf`
- String: `benf "123 456 789 101112"`
- Pipe: `cat data.txt | benf`

Priority: File > String > Pipe
Options
| Option | Description |
|---|---|
| `--format <FORMAT>` | Output format: text, csv, json, yaml, toml, xml |
| `--quiet` | Minimal output (numbers only) |
| `--verbose` | Detailed statistics |
| `--input-encoding <ENCODING>` | Input character encoding (default: auto-detect) |
| `--filter <RANGE>` | Filter numbers (e.g., `--filter ">=100"`) |
| `--threshold <LEVEL>` | Alert threshold: low, medium, high, critical |
| `--min-count <NUMBER>` | Minimum number of data points required for analysis |
| `--help`, `-h` | Show help |
| `--version`, `-V` | Show version |
Supported File Formats
| Format | Extensions | Notes |
|---|---|---|
| Microsoft Excel | .xlsx, .xls | Spreadsheet data |
| Microsoft Word | .docx, .doc | Document analysis |
| Microsoft PowerPoint | .pptx, .ppt | Presentation data |
| OpenDocument | .ods, .odt | OpenOffice/LibreOffice files |
| PDF | .pdf | Text extraction |
| CSV/TSV | .csv, .tsv | Structured data |
| JSON/XML | .json, .xml | API responses |
| YAML/TOML | .yaml, .toml | Configuration files |
| HTML | .html | Web pages |
| Text | .txt | Plain text |
Real-World Usage Examples
Benf follows the Unix philosophy and composes well with standard Unix tools for processing multiple files:
Financial Audit Workflows
# Quarterly financial audit - check all Excel reports
# Monthly expense report validation
# Tax document verification (high-precision analysis)
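A sketch of the quarterly-audit loop, driven by benf's documented exit codes (10 = high risk, 11 = critical); the folder name is illustrative:

```shell
# scan a quarter's report folder and build a follow-up list
mkdir -p reports_q4

if command -v benf >/dev/null; then
  find reports_q4 -name "*.xlsx" | while read -r file; do
    rc=0
    benf "$file" --quiet || rc=$?
    if [ "$rc" -ge 10 ]; then
      echo "REVIEW NEEDED: $file (exit $rc)"
    fi
  done
fi
```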
Automated Monitoring & Alerts
# Daily monitoring script for accounting system exports
#!/bin/bash
ALERT_EMAIL="audit@company.com"
# Continuous integration fraud detection
# Real-time folder monitoring with inotify
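The daily-monitoring idea can be sketched as a cron-able script; the export path and mail recipient are illustrative, and the 10/11 exit codes are the ones documented in the Exit Codes table:

```shell
#!/bin/bash
# nightly scan of fresh accounting exports; escalate on risk exit codes
ALERT_EMAIL="audit@company.com"
EXPORT_DIR="${EXPORT_DIR:-/var/exports}"

if [ -d "$EXPORT_DIR" ] && command -v benf >/dev/null && command -v mail >/dev/null; then
  find "$EXPORT_DIR" -name "*.csv" -mtime -1 | while read -r f; do
    rc=0
    benf "$f" --quiet || rc=$?
    case "$rc" in
      10) echo "HIGH risk: $f"     | mail -s "benf alert"    "$ALERT_EMAIL" ;;
      11) echo "CRITICAL risk: $f" | mail -s "benf CRITICAL" "$ALERT_EMAIL" ;;
    esac
  done
fi
```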
Large-Scale Data Processing
# Process entire corporate filesystem for compliance audit
# Archive analysis - process historical data efficiently
# Network storage scanning with progress tracking
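A minimal sketch of tree-wide processing: `find -print0` handles odd file names, and `xargs -P` fans the work out across CPU cores, one JSON result per input file (directory names are illustrative):

```shell
# walk a tree and analyze every CSV in parallel
mkdir -p archive out
printf '11\n23\n35\n' > archive/a.csv
printf '12\n24\n36\n' > archive/b.csv

if command -v benf >/dev/null; then
  find archive -name "*.csv" -print0 |
    xargs -0 -P "$(nproc 2>/dev/null || echo 4)" -I{} \
      sh -c 'benf "$1" --format json > "out/$(basename "$1").json"' _ {}
fi
```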
Advanced Reporting & Analytics
# Risk distribution analysis across departments
# Time-series risk analysis (requires date-sorted files)
# Statistical summary generation
# Comparative analysis between periods
Integration with Other Tools
# Git pre-commit hook for data validation
#!/bin/bash
# .git/hooks/pre-commit
# Database import validation
# Slack/Teams webhook integration
# Excel macro integration (save as macro-enabled workbook)
# VBA code to call benf from Excel:
# Shell "benf """ & ActiveWorkbook.FullName & """ --format json > benf-result.json"
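The pre-commit hook mentioned above can be sketched in full, keyed off the documented exit codes (10 and above mean elevated risk); it assumes `benf` is on PATH and degrades to a no-op otherwise:

```shell
#!/bin/bash
# .git/hooks/pre-commit -- abort the commit if a staged CSV looks anomalous
THRESHOLD=10

if command -v benf >/dev/null && git rev-parse --git-dir >/dev/null 2>&1; then
  for f in $(git diff --cached --name-only --diff-filter=ACM -- '*.csv'); do
    rc=0
    benf "$f" --quiet || rc=$?
    if [ "$rc" -ge "$THRESHOLD" ]; then
      echo "pre-commit: $f failed Benford screening (exit $rc); commit aborted" >&2
      exit 1
    fi
  done
fi
```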
Specialized Use Cases
# Election audit (checking vote counts)
# Scientific data validation
# Supply chain invoice verification
Web & API Analysis Integration
# Financial API monitoring - real-time fraud detection
#!/bin/bash
API_BASE="https://api.company.com"
ENDPOINTS=("daily-transactions" "expense-reports" "payroll-data")
# Stock market data validation
# Government data integrity checking
# Cryptocurrency exchange analysis
# Banking API compliance monitoring
#!/bin/bash
BANKS=("bank1.api.com" "bank2.api.com" "bank3.api.com")
# E-commerce fraud detection pipeline
# Alert webhook
# International financial data analysis
Advanced GNU parallel Integration
GNU parallel provides powerful features for high-performance batch processing:
# High-performance parallel processing with load balancing
# Dynamic load adjustment based on system resources
# Progress monitoring with ETA and throughput stats
# Retry failed jobs automatically
# Distribute work across multiple machines (SSH)
# Memory-conscious processing for large datasets
# Smart workload distribution with different file types
# Resource-aware batch processing with custom slots
# Complex pipeline with conditional processing
# Benchmark and performance optimization
# Advanced filtering and routing based on results
# Cross-platform compatibility testing
Key GNU parallel Features Leveraged:
- `--eta`: shows estimated completion time
- `--progress`: real-time progress bar
- `--load 80%`: CPU load-aware scheduling
- `--memfree 1G`: memory-aware processing
- `--retries 3`: automatic retry for failed jobs
- `--sshloginfile`: distribute across multiple servers
- `--joblog`: detailed execution logging
- `--bar`: visual progress indicator
- `-j+0`: use all CPU cores optimally
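Several of these flags combine naturally; a sketch with a self-generated batch folder (the lawkit side is guarded so the pipeline degrades gracefully when the binaries are absent):

```shell
# batch analysis with progress bar, load-aware scheduling,
# automatic retries, and a job log for later inspection
mkdir -p batch
printf '17\n29\n31\n' > batch/x.csv
printf '18\n20\n42\n' > batch/y.csv

if command -v parallel >/dev/null && command -v benf >/dev/null; then
  find batch -name "*.csv" |
    parallel --bar --load 80% --retries 3 --joblog benf_jobs.log \
      'benf {} --format json > {.}.json'
fi
```

`{.}` is parallel's input-without-extension placeholder, so `batch/x.csv` yields `batch/x.json`.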
Output
Default Output
Benford's Law Analysis Results
Dataset: financial_data.csv
Numbers analyzed: 1,247
Risk Level: HIGH ⚠️
First Digit Distribution:
1: ████████████████████████████ 28.3% (expected: 30.1%)
2: ████████████████████ 20.1% (expected: 17.6%)
3: ██████████ 9.8% (expected: 12.5%)
...
Statistical Tests:
Chi-square: 23.45 (p-value: 0.003)
Mean Absolute Deviation: 2.1%
Verdict: SIGNIFICANT DEVIATION DETECTED
JSON Output
Examples
Fraud Detection
# Monitor sales data
# Real-time log monitoring
# Batch analysis
Japanese Numerals
# Full-width digits
# Kanji numerals
# Mixed patterns
Web Analysis with curl
# Financial website analysis
# API endpoint with authentication
# Handle redirects and cookies
# Proxy usage (curl handles all proxy scenarios)
# SSL/TLS options
# Multiple endpoints processing
# Rate limiting and retries
# POST requests with JSON data
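The division of labor above in one sketch: curl handles transport concerns (auth, redirects, proxies, retries), benf only ever sees the number stream on stdin. URLs, the proxy host, and `TOKEN` are placeholders:

```shell
API="https://api.example.com"

if command -v benf >/dev/null; then
  # authenticated endpoint
  curl -s -H "Authorization: Bearer ${TOKEN:-}" "$API/data" | benf
  # follow redirects, keep cookies
  curl -s -L --cookie-jar cookies.txt "$API/report" | benf
  # explicit proxy
  curl -s --proxy http://proxy.internal:8080 "$API/data" | benf
  # retries with a delay between attempts
  curl -s --retry 3 --retry-delay 2 "$API/data" | benf
  # POST a JSON query
  curl -s -X POST -H "Content-Type: application/json" \
    -d '{"period":"Q4"}' "$API/query" | benf
fi
```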
Automation with curl Integration
# Daily fraud check from web API
#!/bin/bash
# Multi-source monitoring script
#!/bin/bash
# Webhook integration
#!/bin/bash
WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
Risk Levels
| Level | Chi-square p-value | Interpretation |
|---|---|---|
| LOW | p > 0.1 | Normal distribution |
| MEDIUM | 0.05 < p ≤ 0.1 | Slight deviation |
| HIGH | 0.01 < p ≤ 0.05 | Significant deviation |
| CRITICAL | p ≤ 0.01 | Strong evidence of manipulation |
Common Use Cases
- Accounting audits: Detect manipulated financial records
- Tax investigations: Identify suspicious income declarations
- Election monitoring: Verify vote count authenticity
- Insurance claims: Spot fraudulent claim patterns
- Scientific data: Validate research results
- Quality control: Monitor manufacturing data
⚠️ Important Limitations
Benford's Law does NOT apply to:
- Constrained ranges: Adult heights (1.5-2.0m), ages (0-100), temperatures
- Sequential data: Invoice numbers, employee IDs, zip codes
- Assigned numbers: Phone numbers, social security numbers, lottery numbers
- Small datasets: Less than 30-50 numbers (insufficient for statistical analysis)
- Single-source data: All from same process/machine with similar magnitudes
- Rounded data: Heavily rounded amounts (e.g., all ending in 00)
Best suited for:
- Multi-scale natural data: Financial transactions, populations, physical measurements
- Diverse sources: Mixed data from different processes/timeframes
- Large datasets: 100+ numbers for reliable analysis
- Unmanipulated data: Naturally occurring, not artificially constrained
Historical Background
Discovery and Development:
- 1881: Simon Newcomb first observed the phenomenon while studying logarithm tables
- 1938: Physicist Frank Benford rediscovered and formalized the law with extensive research
- 1972: First application to accounting and fraud detection in academic literature
- 1980s: Major accounting firms began using Benford's Law as a standard audit tool
- 1990s: Mark Nigrini popularized its use in forensic accounting and tax fraud detection
- 2000s+: Expanded to election monitoring, scientific data validation, and financial crime investigation
Modern Applications:
- Used by IRS for tax audit screening
- Standard tool in Big Four accounting firms
- Applied in election fraud detection (notably 2009 Iran election analysis)
- Employed in anti-money laundering investigations
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General error |
| 2 | Invalid arguments |
| 3 | File/network error |
| 10 | High risk detected |
| 11 | Critical risk detected |
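These codes let scripts branch on risk without parsing any output; a sketch against a synthetic ledger (note that codes 10 and 11 are "failures" to the shell, hence the `|| rc=$?` capture):

```shell
# branch on benf's documented exit codes
seq 1 100 | awk '{print $1 * 9}' > ledger.csv

if command -v benf >/dev/null; then
  rc=0
  benf ledger.csv --quiet || rc=$?
  case "$rc" in
    0)  echo "clean" ;;
    10) echo "high risk -- schedule a review" ;;
    11) echo "critical risk -- escalate now" ;;
    *)  echo "analysis failed (code $rc)" ;;
  esac
fi
```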
Configuration
Benf is designed to work with standard Unix tools. For web data access, use curl with its rich configuration options:
# Proxy configuration (handled by curl)
# Use curl's proxy options directly
# SSL/TLS configuration
Benf itself requires no configuration files or environment variables.
Documentation
For comprehensive documentation, see:
- English Documentation - Complete user guide and API reference
Contributing
See CONTRIBUTING.md for development guidelines.
License
MIT License - see LICENSE file.