# lawkit
> **Multi-law statistical analysis toolkit - Uncover hidden patterns and detect anomalies with confidence**
[CI](https://github.com/kako-jun/lawkit/actions/workflows/ci.yml)
[Crates.io](https://crates.io/crates/lawkit)
[lawkit-core](https://crates.io/crates/lawkit-core)
[npm](https://www.npmjs.com/package/lawkit-js)
[PyPI](https://pypi.org/project/lawkit-python/)
[Documentation](https://github.com/kako-jun/lawkit/tree/main/docs/index.md)
[docs.rs](https://docs.rs/lawkit-core)
[License](LICENSE)
A next-generation statistical analysis toolkit that detects anomalies, patterns, and insights using multiple statistical laws. Perfect for fraud detection, data quality assessment, and business intelligence.
```bash
# Traditional tools analyze one pattern at a time
$ other-tool data.csv # Single statistical analysis
# lawkit provides comprehensive multi-law analysis
$ lawkit compare --laws all data.csv
Benford's Law: ⚠️ MEDIUM risk (chi-square: 15.2)
Pareto Analysis: ✅ Normal distribution (Gini: 0.31)
Zipf's Law: ❌ HIGH risk (correlation: 0.45)
Normal Distribution: ✅ Gaussian (p-value: 0.12)
Poisson Distribution: ⚠️ MEDIUM risk (λ=2.3)
Recommendation: Focus on Zipf analysis - unusual frequency pattern detected
```
## ✨ Key Features
- **Multi-Law Analysis**: Benford, Pareto, Zipf, Normal, Poisson distributions
- **International Input**: Parse numbers in English, Japanese, Chinese, Hindi, Arabic formats
- **Smart Integration**: Compare multiple laws for comprehensive insights
- **High Performance**: Built in Rust with parallel processing
- **Rich Output**: Text, JSON, CSV, YAML, TOML, XML formats
- **Meta-Chaining**: Analyze trends in statistical patterns over time
- **Advanced Outlier Detection**: LOF, Isolation Forest, DBSCAN, Ensemble methods
- **Time Series Analysis**: Trend detection, seasonality, changepoint analysis
- **Memory Efficient**: Streaming mode for large datasets with chunked processing
## Performance Benchmarks
```bash
# Benchmark on 100K data points
Traditional single-law tools: ~2.1s
lawkit (single law): ~180ms (11.7x faster)
lawkit (multi-law compare): ~850ms (2.5x faster than sequential)
```
| Dataset size | Single law | Multi-law compare | Memory |
|--------------|------------|-------------------|--------|
| 1K points | 8ms | 25ms | 2.1MB |
| 10K points | 45ms | 180ms | 8.4MB |
| 100K points | 180ms | 850ms | 32MB |
| 1M points | 2.1s | 9.2s | 128MB |
## Architecture
### Multi-Law Analysis Pipeline
```mermaid
graph TB
subgraph "Input Processing"
A[Raw Data] --> B[International Number Parser]
B --> C[Format Detector]
C --> D[Data Validator]
end
subgraph "Statistical Analysis Engine"
D --> E[Benford's Law]
D --> F[Pareto Analysis]
D --> G[Zipf's Law]
D --> H[Normal Distribution]
D --> I[Poisson Distribution]
end
subgraph "Integration Layer"
E --> J[Risk Assessor]
F --> J
G --> J
H --> J
I --> J
J --> K[Contradiction Detector]
K --> L[Recommendation Engine]
end
subgraph "Output Generation"
L --> M[Multi-format Output]
M --> N[CLI Display]
M --> O[JSON/CSV Export]
M --> P[Integration APIs]
end
```
### Meta-Chaining: Advanced Pattern Tracking
One of lawkit's unique capabilities is **meta-chaining** - analyzing how statistical patterns evolve over time by comparing analysis results themselves.
```mermaid
graph LR
subgraph "Time Series Analysis"
A[data_jan.csv] --> D1[lawkit compare]
B[data_feb.csv] --> D1
D1 --> R1[analysis_jan_feb.json]
B --> D2[lawkit compare]
C[data_mar.csv] --> D2
D2 --> R2[analysis_feb_mar.json]
R1 --> D3[lawkit compare]
R2 --> D3
D3 --> M[Meta-Analysis Report]
end
subgraph "Insights Generated"
M --> |"Pattern Evolution"| T1[Trend Detection]
M --> |"Anomaly Tracking"| T2[Risk Escalation]
M --> |"Quality Drift"| T3[Process Monitoring]
M --> |"Fraud Progression"| T4[Security Alerts]
end
```
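A minimal command-line sketch of this loop, assuming that reports saved as JSON can be fed back in as ordinary input (all file names here are illustrative):
```bash
# Compare adjacent periods and save the reports (file names are illustrative)
lawkit compare --laws all data_jan.csv --format json > analysis_jan.json
lawkit compare --laws all data_feb.csv --format json > analysis_feb.json

# Meta-chain: feed the saved reports back in as input and track how the
# statistical patterns themselves drift between periods
cat analysis_jan.json analysis_feb.json | lawkit compare --laws all
```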
## Quick Start
### Installation
```bash
# Install from crates.io
cargo install lawkit
# Or download pre-built binaries
wget https://github.com/kako-jun/lawkit/releases/latest/download/lawkit-linux-x86_64.tar.gz
tar -xzf lawkit-linux-x86_64.tar.gz
```
### Basic Usage
```bash
# Single law analysis
lawkit benf data.csv
lawkit pareto sales.csv
lawkit normal measurements.csv
# Multi-law comparison (recommended)
lawkit compare --laws benf,pareto data.csv
lawkit compare --laws all financial_data.csv
# Advanced analysis with filtering
lawkit compare --laws all --filter ">=1000" --format json data.csv
```
## Supported Statistical Laws
### 1. Benford's Law
**Use Case**: Fraud detection in financial data
```bash
lawkit benf transactions.csv --threshold high
```
Detects unnatural digit distributions that may indicate data manipulation.
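For reference, Benford's Law predicts that the leading digit d appears with probability P(d) = log10(1 + 1/d): digit 1 should lead about 30.1% of values while digit 9 leads only about 4.6%, and the chi-square statistic measures how far the observed digit frequencies deviate from this curve.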
### 2. Pareto Analysis (80/20 Rule)
**Use Case**: Business prioritization and inequality measurement
```bash
lawkit pareto customer_revenue.csv --verbose
```
Identifies the vital few that drive the majority of results.
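The Gini coefficient in the sample output summarizes this concentration on a 0-1 scale: 0 means results are spread evenly across contributors, while values near 1 mean a single item accounts for nearly everything; an idealized continuous 80/20 Pareto distribution works out to a Gini of roughly 0.76.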
### 3. Zipf's Law
**Use Case**: Frequency analysis and text mining
```bash
lawkit zipf --text document.txt
lawkit zipf website_traffic.csv
```
Analyzes power-law distributions in rankings and frequencies.
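Zipf's Law predicts that the frequency of the rank-r item is proportional to 1/r^s, with exponent s ≈ 1 for natural language, so the second-ranked item should appear about half as often as the first; the reported correlation measures how well the observed rank-frequency curve fits this power law.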
### 4. Normal Distribution
**Use Case**: Quality control and outlier detection
```bash
lawkit normal --quality-control --spec-limits 9.5,10.5 production.csv
lawkit normal --outliers process_data.csv
```
Statistical process control and anomaly detection.
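A common convention here (and the basis of classic control charts) is to standardize each value as z = (x - μ)/σ and flag |z| > 3 as an outlier, since under a true normal distribution only about 0.3% of values fall outside ±3σ.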
### 5. Poisson Distribution
**Use Case**: Event occurrence and rare event modeling
```bash
lawkit poisson --predict --rare-events incident_data.csv
```
Models and predicts discrete event occurrences.
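The underlying model is P(X = k) = λ^k · e^(-λ) / k!, where λ is the mean event rate per interval (the λ = 2.3 shown in the sample output); the fit is most meaningful when events occur independently at a roughly constant average rate.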
## Data Generation & Testing
lawkit includes powerful data generation capabilities for education, testing, and demonstration purposes.
### Generate Sample Data
Create statistically accurate sample data following specific laws:
```bash
# Generate 1000 Benford's law samples
lawkit generate benf --samples 1000
# Generate Pareto distribution (80/20 rule)
lawkit generate pareto --samples 5000 --concentration 0.8
# Generate Zipf distribution with custom parameters
lawkit generate zipf --samples 2000 --exponent 1.0 --vocabulary-size 1000
# Generate normal distribution data
lawkit generate normal --samples 1000 --mean 100 --stddev 15
# Generate Poisson event data
lawkit generate poisson --samples 500 --lambda 2.5
```
### Generate-to-Analysis Pipeline
Combine generation with analysis for testing and validation:
```bash
# Test Benford's law detection (pipelines below are illustrative)
lawkit generate benf --samples 1000 | lawkit benf

# Verify Pareto principle with generated data
lawkit generate pareto --samples 5000 --concentration 0.8 | lawkit pareto --verbose

# Validate statistical methods against data with known properties
lawkit generate normal --samples 1000 --mean 100 --stddev 15 | lawkit normal

# Test with fraud injection: manipulate some generated values, then re-analyze
```
### Self-Testing
Run comprehensive self-tests to verify lawkit functionality:
```bash
# Run all self-tests
lawkit selftest
```
### Advanced Analysis Features
lawkit 2.1 introduces sophisticated analysis capabilities:
```bash
# Advanced outlier detection with ensemble methods
lawkit normal dataset.csv --outliers --outlier-method ensemble
# LOF (Local Outlier Factor) for complex patterns
lawkit normal financial_data.csv --outliers --outlier-method lof
# Time series analysis for trend detection
lawkit normal timeseries_data.csv --enable-timeseries --timeseries-window 20
# Parallel processing for large datasets
lawkit compare large_dataset.csv --parallel --threads 8
# Memory-efficient streaming for huge files
lawkit benf massive_file.csv --streaming --chunk-size 50000
```
### Educational Use Cases
Perfect for teaching statistical concepts:
```bash
# Demonstrate central limit theorem
for i in {1..5}; do
lawkit generate normal --samples 1000 --mean 100 --stddev 15 |
lawkit normal --verbose
done
# Show Pareto principle in action
lawkit generate pareto --samples 10000 --concentration 0.8 | lawkit pareto --verbose
# Compare different distributions
lawkit generate benf --samples 1000 > benf_data.txt
lawkit generate normal --samples 1000 > normal_data.txt
lawkit compare --laws benf,normal benf_data.txt
lawkit compare --laws benf,normal normal_data.txt
```
## International Numeral Support
### Supported Number Formats
#### 1. Full-width Digits
```bash
# Illustrative input: full-width digits are parsed like ASCII digits
echo "１２３ ４５６７ ８９０１２" | lawkit benf
```
#### 2. Kanji Numerals (Basic)
```bash
# Illustrative input: basic kanji digits
echo "一 二 三 四 五 六 七 八 九" | lawkit benf
```
#### 3. Kanji Numerals (Positional)
```bash
# Illustrative input: positional notation such as 一千二百三十四 (= 1234)
echo "一千二百三十四 五千六百七十八" | lawkit benf
```
#### 4. Mixed Patterns
```bash
# Illustrative input: ASCII digits combined with kanji place markers
echo "123万 45万6千" | lawkit benf
```
### Conversion Rules
| Character | Value | Role |
|-----------|-------|------|
| 一 | 1 | Basic digit |
| 十 | 10 | Tens place |
| 百 | 100 | Hundreds place |
| 千 | 1000 | Thousands place |
| 万 | 10000 | Ten thousands place |
| 一千二百三十四 | 1234 | Positional notation |
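In the positional form, each basic digit multiplies the place marker that follows it and the products are summed: 一千二百三十四 is read as 1×1000 + 2×100 + 3×10 + 4 = 1234.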
#### Decimal Numbers
```bash
# Only numbers ≥ 1 are analyzed (illustrative input)
echo "0.5 12.34 567.8" | lawkit benf
```
#### Negative Numbers
```bash
# The absolute value's first digit is used (illustrative input)
echo "-123 -456 789" | lawkit benf
```
### Chinese Numeral Compatibility
The current implementation supports basic Chinese numerals that are identical to their Japanese kanji equivalents:
#### Supported (Basic Forms)
- 一二三四五六七八九 (1-9) - Same as Japanese
- 十百千 (10, 100, 1000) - Positional markers
#### Planned Support
- **Financial forms**: 壹貳參肆伍陸柒捌玖 (anti-fraud variants)
- **Traditional**: 萬 (10,000) vs Japanese 万
- **Regional variants**: Traditional vs Simplified Chinese
### Other Numeral Systems
lawkit focuses on the most widely used numeral systems in international business and financial analysis:
- **Core Support**: English (ASCII), Japanese (full-width, kanji), Chinese (simplified/traditional)
- **Additional Input**: Hindi (Devanagari), Arabic (Eastern Arabic-Indic numerals)
- **Business Focus**: International accounting standards, cross-border financial analysis
- **Output Language**: English (for maximum compatibility and searchability)
> **Note**: Input supports all major numeral systems globally. Documentation focuses on primary business markets (English/Japanese/Chinese). Output standardized to English for international compatibility.
## Installation
### From Source
```bash
git clone https://github.com/kako-jun/benf
cd benf
cargo build --release
cp target/release/benf /usr/local/bin/
```
### Binary Releases
Download from [releases page](https://github.com/kako-jun/benf/releases)
## Quick Start
```bash
# Analyze CSV file
benf data.csv
# Analyze website data with curl (URL is illustrative)
curl -s https://example.com/financial-data | benf
# Pipe data
cat data.txt | benf
# JSON output for automation
benf data.csv --format json
```
## Usage
### Basic Syntax
```bash
benf [OPTIONS] [INPUT]
```
### Input Methods
1. **File path**: `benf financial_data.xlsx`
2. **Web data with curl**: `curl -s https://api.example.com/data | benf`
3. **String**: `benf "123 456 789 101112"`
4. **Pipe**: `cat data.txt | benf`
Priority: File > String > Pipe
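A quick way to see this precedence in action (file names and contents are illustrative):
```bash
# A file argument wins even when data is also piped in
cat other_numbers.txt | benf data.csv        # analyzes data.csv, ignores the pipe
# A string argument likewise takes precedence over piped input
cat other_numbers.txt | benf "123 456 789"   # analyzes the string
```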
### Options
| Option | Description |
|--------|-------------|
| `--format <FORMAT>` | Output format: text, csv, json, yaml, toml, xml |
| `--quiet` | Minimal output (numbers only) |
| `--verbose` | Detailed statistics |
| `--input-encoding <ENCODING>` | Input character encoding (default: auto-detect) |
| `--filter <RANGE>` | Filter numbers (e.g., `--filter ">=100"`) |
| `--threshold <LEVEL>` | Alert threshold: low, medium, high, critical |
| `--min-count <NUMBER>` | Minimum number of data points required for analysis |
| `--help, -h` | Show help |
| `--version, -V` | Show version |
### Supported File Formats
| Format | Extensions | Notes |
|--------|------------|-------|
| Microsoft Excel | .xlsx, .xls | Spreadsheet data |
| Microsoft Word | .docx, .doc | Document analysis |
| Microsoft PowerPoint | .pptx, .ppt | Presentation data |
| OpenDocument | .ods, .odt | OpenOffice/LibreOffice files |
| PDF | .pdf | Text extraction |
| CSV/TSV | .csv, .tsv | Structured data |
| JSON/XML | .json, .xml | API responses |
| YAML/TOML | .yaml, .toml | Configuration files |
| HTML | .html | Web pages |
| Text | .txt | Plain text |
## Real-World Usage Examples
Benf follows the Unix philosophy and composes well with standard Unix tools for processing multiple files:
### Financial Audit Workflows
```bash
# Quarterly financial audit - check all Excel reports
find ./Q4-2024 -name "*.xlsx" | while read file; do
echo "Auditing: $file"
benf "$file" --filter ">=1000" --threshold critical --verbose
echo "---"
done
# Monthly expense report validation
for dept in accounting marketing sales; do
echo "Department: $dept"
find "./expenses/$dept" -name "*.xlsx" -exec benf {} --format json \; | \
jq '.risk_level' | sort | uniq -c
done
# Tax document verification (high-precision analysis)
find ./tax-filings -name "*.pdf" | parallel benf {} --min-count 50 --format csv | \
awk -F, '$3=="Critical" {print "🚨 CRITICAL:", $1}'
```
### Automated Monitoring & Alerts
```bash
# Daily monitoring script for accounting system exports
#!/bin/bash
ALERT_EMAIL="audit@company.com"
find /exports/daily -name "*.csv" -newer /var/log/last-benf-check | while read file; do
    risk=$(benf "$file" --format json | jq -r '.risk_level')
    if [ "$risk" = "Critical" ] || [ "$risk" = "High" ]; then
        echo "$file: $risk risk" | mail -s "Benford Alert" "$ALERT_EMAIL"
    fi
done

# Continuous integration fraud detection (completion is illustrative)
find ./uploaded-reports -name "*.xlsx" -mtime -1 | \
    xargs -I {} benf {} --threshold high --format json

# Real-time folder monitoring with inotify (requires inotify-tools)
inotifywait -m ./financial-uploads -e create --format '%f' | while read file; do
    echo "$(date): Analyzing $file" >> /var/log/benf-monitor.log
    benf "./financial-uploads/$file" --threshold high || \
        echo "$(date): ALERT - Suspicious file: $file" >> /var/log/fraud-alerts.log
done
```
### Large-Scale Data Processing
```bash
# Process entire corporate filesystem for compliance audit (completion is illustrative)
find /corporate-data -type f \( -name "*.xlsx" -o -name "*.csv" -o -name "*.pdf" \) | \
    parallel benf {} --format json

# Archive analysis - process historical data efficiently
find ./archives/2020-2024 -name "*.xlsx" -print0 | \
    xargs -0 -I {} benf {} --format csv

# Network storage scanning with progress tracking
total_files=$(find /nfs/financial-data -name "*.xlsx" | wc -l)
find /nfs/financial-data -name "*.xlsx" | nl | while read num file; do
    echo "[$num/$total_files] Processing: $(basename "$file")"
    benf "$file" --format json | jq -r '"File: \(.dataset), Risk: \(.risk_level), Numbers: \(.numbers_analyzed)"'
done
```

### Advanced Reporting & Analytics
```bash
# Risk distribution analysis across departments
for dept in */; do
echo "=== Department: $dept ==="
find "$dept" -name "*.xlsx" | xargs -I {} benf {} --format json 2>/dev/null | \
jq -r '.risk_level' | sort | uniq -c | awk '{print $2": "$1" files"}'
echo
done
# Time-series risk analysis (requires date-sorted files)
find ./monthly-reports -name "202[0-4]-*.xlsx" | sort | while read file; do
month=$(basename "$file" .xlsx)
risk=$(benf "$file" --format json 2>/dev/null | jq -r '.risk_level // "N/A"')
echo "$month,$risk"
done > risk-timeline.csv
# Statistical summary generation
{
echo "file,risk_level,numbers_count,chi_square,p_value"
find ./audit-sample -name "*.xlsx" | while read file; do
benf "$file" --format json 2>/dev/null | \
jq -r '"\(.dataset),\(.risk_level),\(.numbers_analyzed),\(.statistics.chi_square),\(.statistics.p_value)"'
done
} > statistical-analysis.csv
# Comparative analysis between periods
echo "Comparing Q3 vs Q4 risk levels..."
q3_high=$(find ./Q3-2024 -name "*.xlsx" | xargs -I {} benf {} --format json 2>/dev/null | jq -c 'select(.risk_level=="High" or .risk_level=="Critical")' | wc -l)
q4_high=$(find ./Q4-2024 -name "*.xlsx" | xargs -I {} benf {} --format json 2>/dev/null | jq -c 'select(.risk_level=="High" or .risk_level=="Critical")' | wc -l)
echo "Q3 high-risk files: $q3_high"
echo "Q4 high-risk files: $q4_high"
echo "Change: $((q4_high - q3_high))"
```
### Integration with Other Tools
```bash
# Git pre-commit hook for data validation
#!/bin/bash
# .git/hooks/pre-commit
changed_files=$(git diff --cached --name-only --diff-filter=A | grep -E '\.(xlsx|csv|pdf)$')
for file in $changed_files; do
if ! benf "$file" --min-count 10 >/dev/null 2>&1; then
echo "β οΈ Warning: $file may contain suspicious data patterns"
benf "$file" --format json | jq '.risk_level'
fi
done
# Database import validation
psql -c "COPY suspicious_files FROM STDIN CSV HEADER" <<< $(
echo "filename,risk_level,chi_square,p_value"
find ./import-data -name "*.csv" | while read file; do
benf "$file" --format json 2>/dev/null | \
jq -r '"\(.dataset),\(.risk_level),\(.statistics.chi_square),\(.statistics.p_value)"'
done
)"
# Slack/Teams webhook integration
high_risk_files=$(find ./daily-uploads -name "*.xlsx" -mtime -1 | \
xargs -I {} benf {} --format json 2>/dev/null | \
jq -r 'select(.risk_level=="High" or .risk_level=="Critical") | .dataset')
if [ -n "$high_risk_files" ]; then
curl -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"π¨ High-risk files detected:\n$high_risk_files\"}" \
$SLACK_WEBHOOK_URL
fi
# Excel macro integration (save as macro-enabled workbook)
# VBA code to call benf from Excel:
# Shell "benf """ & ActiveWorkbook.FullName & """ --format json > benf-result.json"
```
### Specialized Use Cases
```bash
# Election audit - checking vote counts (completion is illustrative)
find ./election-data -name "*.csv" | parallel benf {} --min-count 100 --threshold low | \
    grep -i "risk"

# Scientific data validation
find ./research-data -name "*.xlsx" | while read file; do
    lab=$(dirname "$file" | xargs basename)
    result=$(benf "$file" --format json | jq -r '.risk_level')
    echo "$lab,$file,$result"
done

# Supply chain invoice verification
find ./invoices/2024 -name "*.pdf" | \
    parallel 'vendor=$(dirname {} | xargs basename); benf {} --format json | \
        jq --arg v "$vendor" "{vendor: \$v, file: .dataset, risk: .risk_level}"' > invoice-analysis.jsonl

# Insurance claim analysis
find ./claims -name "*.xlsx" | while read file; do
    claim_id=$(basename "$file" .xlsx)
    benf "$file" --filter ">=1000" --format json | \
        jq --arg id "$claim_id" '{claim_id: $id, risk_assessment: .risk_level, total_numbers: .numbers_analyzed}'
done
```

### Web & API Analysis Integration
```bash
# Financial API monitoring - real-time fraud detection
#!/bin/bash
API_BASE="https://api.company.com"
ENDPOINTS=("daily-transactions" "expense-reports" "payroll-data")
for endpoint in "${ENDPOINTS[@]}"; do
echo "Analyzing $endpoint..."
curl -s -H "Authorization: Bearer $API_TOKEN" "$API_BASE/$endpoint" | \
benf --format json --min-count 10 | \
jq -r 'select(.risk_level=="High" or .risk_level=="Critical") |
"π¨ \(.dataset): \(.risk_level) risk (\(.numbers_analyzed) numbers)"'
done
# Stock market data validation (endpoint is illustrative)
curl -s "https://api.market-data.example.com/prices" | benf --format json | \
    jq 'if .risk_level == "Critical" then "⚠️ Unusual price pattern detected" else "✅ Normal price distribution" end'

# Government data integrity checking (source is illustrative)
curl -s "https://data.gov.example.com/spending.csv" | benf --format csv | \
    awk -F, '$3=="Critical" {print "Department:", $1, "Risk:", $3}'
# Cryptocurrency exchange analysis
for exchange in binance coinbase kraken; do
    echo "Checking $exchange volume data..."
    curl -s "https://api.$exchange.com/v1/ticker/24hr" | \
        jq -r '.[].volume' | benf --min-count 20 --format json | \
        jq --arg ex "$exchange" '{exchange: $ex, risk: .risk_level, numbers: .numbers_analyzed}'
done
# Banking API compliance monitoring
#!/bin/bash
BANKS=("bank1.api.com" "bank2.api.com" "bank3.api.com")
DATE=$(date +%Y-%m-%d)
{
echo "bank,endpoint,risk_level,chi_square,p_value,timestamp"
for bank in "${BANKS[@]}"; do
for endpoint in deposits withdrawals transfers; do
result=$(curl -s "https://$bank/$endpoint?date=$DATE" | benf --format json 2>/dev/null)
if [ $? -eq 0 ]; then
echo "$result" | jq -r --arg bank "$bank" --arg ep "$endpoint" \
'"\($bank),\($ep),\(.risk_level),\(.statistics.chi_square),\(.statistics.p_value),\(now)"'
fi
done
done
} > banking-compliance-$(date +%Y%m%d).csv
# E-commerce fraud detection pipeline (order feed is illustrative)
if curl -s "https://shop.example.com/api/daily-orders" | \
    benf --threshold critical --format json | \
    jq -e '.risk_level == "Critical"' >/dev/null; then
    # Alert webhook
    curl -X POST "https://alerts.company.com/webhook" \
        -H "Content-Type: application/json" \
        -d '{"alert": "Critical fraud pattern in daily orders", "timestamp": "'$(date -Iseconds)'"}'
fi
# International financial data analysis
declare -A REGIONS=([us]="api.us.finance.com" [eu]="api.eu.finance.com" [asia]="api.asia.finance.com")
for region in "${!REGIONS[@]}"; do
echo "Processing $region region..."
curl -s "https://${REGIONS[$region]}/financial-data" | \
benf --format json | \
jq --arg region "$region" '{region: $region, risk: .risk_level, timestamp: now}'
done
```

### Advanced GNU parallel Integration
GNU parallel provides powerful features for high-performance batch processing:
```bash
# High-performance parallel processing with load balancing
# (completions of truncated commands in this block are illustrative)
find /massive-dataset -name "*.xlsx" | \
    parallel -j+0 --bar 'benf {} --format json > results/{/.}.json'

# Dynamic load adjustment based on system resources
find ./financial-data -name "*.xlsx" | \
    parallel --load 80% benf {} --format csv

# Progress monitoring with ETA and throughput stats
find /audit-files -name "*.csv" | \
    parallel --progress --eta --joblog parallel-audit.log \
        benf {} --min-count 50

# Retry failed jobs automatically
find ./suspicious-files -name "*.xlsx" | \
    parallel --retries 3 --joblog failed-jobs.log \
        benf {} --threshold high

# Distribute work across multiple machines (SSH)
find /shared-storage -name "*.xlsx" | \
    parallel --sshloginfile servers.txt --transfer --return audit-{/}.json \
        'benf {} --format json > audit-{/}.json'

# Memory-conscious processing for large datasets
find /enterprise-data -name "*.xlsx" | \
    parallel --memfree 1G --delay 0.1 benf {} --format json

# Smart workload distribution with different file types
{
    find ./reports -name "*.xlsx" | sed 's/$/ xlsx/'
    find ./reports -name "*.pdf" | sed 's/$/ pdf/'
    find ./reports -name "*.csv" | sed 's/$/ csv/'
} | parallel --colsep ' ' benf {1} --format json

# Resource-aware batch processing with custom slots
find ./quarterly-data -name "*.xlsx" | \
    parallel --jobs 50% --max-replace-args 1 --max-chars 1000 \
        benf {} --format csv

# Complex pipeline with conditional processing (exit code 11 = critical risk)
find ./invoices -name "*.pdf" | \
    parallel 'benf {} --threshold critical >/dev/null 2>&1; if [ $? -eq 11 ]; then
        benf {} --verbose | mail -s "Critical Invoice Alert: $(basename {})" auditor@company.com
    fi'

# Benchmark and performance optimization
find ./test-data -name "*.xlsx" | head -100 | \
    parallel --joblog performance-test.log 'benf {} --format json' && \
    echo "Performance analysis:" && \
    awk 'NR>1 {sum+=$4; count++} END {print "Average time:", sum/count, "seconds"}' performance-test.log

# Advanced filtering and routing based on results
find ./mixed-data -name "*.xlsx" | \
    parallel 'benf {} --format json | jq -r "if .risk_level == \"Critical\" then \"critical/\" + .dataset
        elif .risk_level == \"High\" then \"high/\" + .dataset
        else \"normal/\" + .dataset
        end"' | \
    while read dest; do mkdir -p "$(dirname "$dest")"; done

# Cross-platform compatibility testing
find ./samples -name "*.xlsx" | \
    parallel --env PATH --sshlogin :,windows-server,mac-server \
        'benf {} --format json'
```
**Key GNU parallel Features Leveraged:**
- **`--eta`**: Shows estimated completion time
- **`--progress`**: Real-time progress bar
- **`--load 80%`**: CPU load-aware scheduling
- **`--memfree 1G`**: Memory-aware processing
- **`--retries 3`**: Automatic retry for failed jobs
- **`--sshloginfile`**: Distribute across multiple servers
- **`--joblog`**: Detailed execution logging
- **`--bar`**: Visual progress indicator
- **`-j+0`**: Use all CPU cores optimally
## Output
### Default Output
```
Benford's Law Analysis Results
Dataset: financial_data.csv
Numbers analyzed: 1,247
Risk Level: HIGH ⚠️
First Digit Distribution:
1: ████████████████████████████ 28.3% (expected: 30.1%)
2: ████████████████████ 20.1% (expected: 17.6%)
3: ██████████ 9.8% (expected: 12.5%)
...
Statistical Tests:
Chi-square: 23.45 (p-value: 0.003)
Mean Absolute Deviation: 2.1%
Verdict: SIGNIFICANT DEVIATION DETECTED
```
### JSON Output
```json
{
"dataset": "financial_data.csv",
"numbers_analyzed": 1247,
"risk_level": "HIGH",
"digits": {
"1": {"observed": 28.3, "expected": 30.1, "deviation": -1.8},
"2": {"observed": 20.1, "expected": 17.6, "deviation": 2.5}
},
"statistics": {
"chi_square": 23.45,
"p_value": 0.003,
"mad": 2.1
},
"verdict": "SIGNIFICANT_DEVIATION"
}
```
## Examples
### Fraud Detection
```bash
# Monitor sales data
benf sales_report.xlsx --threshold high
# Real-time log monitoring (snapshot of the latest entries; illustrative)
tail -n 10000 transactions.log | benf
# Batch analysis
find . -name "*.csv" -exec benf {} \; | grep "HIGH"
```
### Japanese Numerals
```bash
# Full-width digits (input is illustrative)
echo "１２３ ４５６ ７８９" | benf
# Kanji numerals
echo "一二三 四五六 七八九" | benf
# Mixed patterns, e.g. a PDF report mixing ASCII and kanji numerals
benf japanese_financial_report.pdf
```
### Web Analysis with curl
```bash
# Financial website analysis (URLs are illustrative)
curl -s https://example.com/financial-report | benf
# API endpoint with authentication
curl -s -H "Authorization: Bearer $API_TOKEN" https://api.example.com/data | benf
# Handle redirects and cookies
curl -sL --cookie-jar cookies.txt https://example.com/data | benf
# Proxy usage (curl handles all proxy scenarios)
curl -s --proxy http://proxy.company.com:8080 https://example.com/data | benf
# SSL/TLS options
curl -s --cacert corporate-ca.pem https://secure.example.com/data | benf
# Multiple endpoints processing
for url in api1.com/data api2.com/data api3.com/data; do
echo "Processing: $url"
curl -s "https://$url" | benf --format json | jq '.risk_level'
done
# Rate limiting and retries
curl -s --retry 3 --retry-delay 5 --limit-rate 1M https://api.example.com/data | benf
# POST requests with JSON data
curl -s -X POST -H "Content-Type: application/json" \
-d '{"query":"financial_data","format":"numbers"}' \
https://api.company.com/search | benf
```
### Automation with curl Integration
```bash
# Daily fraud check from web API (endpoint is illustrative)
#!/bin/bash
RESULT=$(curl -s https://api.company.com/daily-sales | benf --format json)
RISK=$(echo "$RESULT" | jq -r '.risk_level')
if [ "$RISK" = "HIGH" ]; then
    echo "$RESULT" | mail -s "Fraud Alert" admin@company.com
fi
# Multi-source monitoring script
#!/bin/bash
for endpoint in sales expenses payroll; do
echo "Checking $endpoint..."
curl -s "https://api.company.com/$endpoint" | \
benf --format json | \
jq -r 'select(.risk_level=="High" or .risk_level=="Critical") |
"ALERT: \(.dataset) shows \(.risk_level) risk"'
done
# Webhook integration (file name is illustrative; benf exits with code 11 on critical risk)
#!/bin/bash
WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
benf latest_report.csv >/dev/null 2>&1
if [ $? -eq 11 ]; then
    curl -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"🚨 Critical fraud pattern detected in latest report\"}" \
        "$WEBHOOK_URL"
fi
```
## Risk Levels
| Level | p-value | Interpretation |
|-------|---------|----------------|
| LOW | p > 0.1 | Normal distribution |
| MEDIUM | 0.05 < p ≤ 0.1 | Slight deviation |
| HIGH | 0.01 < p ≤ 0.05 | Significant deviation |
| CRITICAL | p ≤ 0.01 | Strong evidence of manipulation |
## Common Use Cases
- **Accounting audits**: Detect manipulated financial records
- **Tax investigations**: Identify suspicious income declarations
- **Election monitoring**: Verify vote count authenticity
- **Insurance claims**: Spot fraudulent claim patterns
- **Scientific data**: Validate research results
- **Quality control**: Monitor manufacturing data
## ⚠️ Important Limitations
**Benford's Law does NOT apply to:**
- **Constrained ranges**: Adult heights (1.5-2.0m), ages (0-100), temperatures
- **Sequential data**: Invoice numbers, employee IDs, zip codes
- **Assigned numbers**: Phone numbers, social security numbers, lottery numbers
- **Small datasets**: Less than 30-50 numbers (insufficient for statistical analysis)
- **Single-source data**: All from same process/machine with similar magnitudes
- **Rounded data**: Heavily rounded amounts (e.g., all ending in 00)
**Best suited for:**
- **Multi-scale natural data**: Financial transactions, populations, physical measurements
- **Diverse sources**: Mixed data from different processes/timeframes
- **Large datasets**: 100+ numbers for reliable analysis
- **Unmanipulated data**: Naturally occurring, not artificially constrained
## Historical Background
**Discovery and Development:**
- **1881**: Simon Newcomb first observed the phenomenon while studying logarithm tables
- **1938**: Physicist Frank Benford rediscovered and formalized the law with extensive research
- **1972**: First application to accounting and fraud detection in academic literature
- **1980s**: Major accounting firms began using Benford's Law as a standard audit tool
- **1990s**: Mark Nigrini popularized its use in forensic accounting and tax fraud detection
- **2000s+**: Expanded to election monitoring, scientific data validation, and financial crime investigation
**Modern Applications:**
- Used by IRS for tax audit screening
- Standard tool in Big Four accounting firms
- Applied in election fraud detection (notably 2009 Iran election analysis)
- Employed in anti-money laundering investigations
## Exit Codes
| Code | Meaning |
|------|---------|
| 0 | Success |
| 1 | General error |
| 2 | Invalid arguments |
| 3 | File/network error |
| 10 | High risk detected |
| 11 | Critical risk detected |
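Because high and critical findings map to dedicated exit codes, scripts can branch on `$?` directly. A minimal sketch (the file name is illustrative):
```bash
benf quarterly_report.csv
case $? in
    0)  echo "No significant deviation" ;;
    10) echo "High risk detected - schedule a manual review" ;;
    11) echo "Critical risk detected - escalate immediately" ;;
    *)  echo "Analysis failed" >&2 ;;
esac
```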
## Configuration
Benf is designed to work with standard Unix tools. For web data access, use curl with its rich configuration options:
```bash
# Proxy configuration (handled by curl)
export HTTP_PROXY=http://proxy.company.com:8080
export HTTPS_PROXY=https://proxy.company.com:8080
export NO_PROXY=localhost,*.internal.com
# Use curl's proxy options directly (URL is illustrative)
curl -s --proxy http://proxy.company.com:8080 https://api.example.com/data | benf
# SSL/TLS configuration
curl -s --cacert corporate-ca.pem https://internal.example.com/data | benf
```
Benf itself requires no configuration files or environment variables.
## Documentation
For comprehensive documentation, see:
- π **[English Documentation](docs/index.md)** - Complete user guide and API reference
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md) for development guidelines.
## License
MIT License - see [LICENSE](LICENSE) file.
## References
- [Benford's Law - Wikipedia](https://en.wikipedia.org/wiki/Benford%27s_law)