benf
A CLI tool for detecting anomalies using Benford's Law with support for international numerals (Japanese, Chinese, Hindi, Arabic).
Overview
benf analyzes numerical data to check if it follows Benford's Law, which states that in many naturally occurring datasets, the digit 1 appears as the first (leading) digit about 30.1% of the time, 2 appears 17.6% of the time, and so on. Deviations from this law can indicate data manipulation or fraud.
Note: This tool analyzes only the first digit of each number, not the entire number sequence.
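The percentages quoted above follow from the standard Benford formula P(d) = log10(1 + 1/d) for leading digits d = 1 to 9. The sketch below reproduces them in Python; the function name `benford_expected` is illustrative and not part of benf.

```python
import math


def benford_expected(digit: int) -> float:
    """Expected share of numbers whose leading digit is `digit` (1-9)."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


if __name__ == "__main__":
    # Print the expected first-digit distribution as percentages.
    for d in range(1, 10):
        print(f"{d}: {benford_expected(d) * 100:.1f}%")
```

The nine probabilities sum to 1, because the terms telescope to log10(10).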
Unique Features:
- 🌍 International numeral support: English, Japanese (全角・漢数字), Chinese (中文数字), Hindi (हिन्दी अंक), Arabic (الأرقام العربية)
- 📊 Multiple input formats (Microsoft Excel, Word, PowerPoint, PDF, etc.)
- 🌐 Direct URL analysis with HTML parsing
- 🔍 Fraud detection focus with risk level indicators
International Numeral Support
Supported Number Formats
1. Full-width Digits
2. Kanji Numerals (Basic)
3. Kanji Numerals (Positional)
4. Mixed Patterns
Conversion Rules
| Kanji | Number | Notes |
|---|---|---|
| 一 | 1 | Basic digit |
| 十 | 10 | Tens place |
| 百 | 100 | Hundreds place |
| 千 | 1000 | Thousands place |
| 万 | 10000 | Ten thousands place |
| 一千二百三十四 | 1234 | Positional notation |
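The conversion rules in the table can be sketched as a small accumulator, shown here in Python. This is an illustrative reading of the table, not benf's actual implementation; the names `kanji_to_int`, `DIGITS`, and `SMALL_UNITS` are hypothetical.

```python
DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
          "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
MAN = 10000  # value of 万


def kanji_to_int(text: str) -> int:
    """Convert a positional kanji numeral (up to the 万 range) to an int."""
    total = 0    # value of completed 万 groups
    section = 0  # value accumulated below the current 万 boundary
    digit = 0    # digit waiting for its unit marker
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare unit such as 十 stands for 1 x 10.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch == "万":
            group = section + digit
            total += (group or 1) * MAN
            section = 0
            digit = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit
```

For example, `kanji_to_int("一千二百三十四")` accumulates 1000, 200, 30, and 4 in turn.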
Decimal Numbers
Only numbers ≥ 1 are analyzed; numbers < 1 are excluded. For a three-number sample whose middle value is below 1, the result is: 1, (excluded), 1.
Negative Numbers
The first digit of the absolute value is used, so the sign is ignored. Three negative inputs can therefore yield the result: 1, 4, 7.
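The two rules above (exclude values below 1, ignore the sign) can be sketched as a single helper in Python. The function name `first_digit` and the sample inputs are illustrative assumptions, not part of benf.

```python
import math
from typing import Optional


def first_digit(value: float) -> Optional[int]:
    """Return the leading digit of |value|, or None if the value is excluded."""
    magnitude = abs(value)
    # Values below 1 (and non-finite values) are not analyzed.
    if not math.isfinite(magnitude) or magnitude < 1:
        return None
    # The leading digit of the integer part is the leading digit of the number.
    return int(str(int(magnitude))[0])


if __name__ == "__main__":
    # Illustrative inputs only.
    print([first_digit(v) for v in (12.34, 0.567, -456)])
```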
Chinese Numeral Compatibility
The current implementation supports the basic Chinese numerals that are identical to Japanese kanji:
Supported (Basic Forms)
- 一二三四五六七八九 (1-9) - Same as Japanese
- 十百千 (10, 100, 1000) - Positional markers
Planned Support
- Financial forms: 壹貳參肆伍陸柒捌玖 (anti-fraud variants)
- Traditional: 萬 (10,000) vs Japanese 万
- Regional variants: Traditional vs Simplified Chinese
Hindi Numerals (हिन्दी अंक)
Devanagari numerals are supported.
Arabic Numerals (الأرقام العربية)
Eastern Arabic-Indic numerals are supported.
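One way to handle Devanagari, Eastern Arabic-Indic, and full-width digits uniformly is to map every Unicode decimal digit to its ASCII equivalent before extracting the first digit. The Python sketch below uses the standard `unicodedata` module. It is an assumption about how such normalization could work, not a description of benf's internals, and `normalize_digits` is a hypothetical name.

```python
import unicodedata


def normalize_digits(text: str) -> str:
    """Replace every Unicode decimal digit with its ASCII counterpart."""
    out = []
    for ch in text:
        if ch.isdecimal():
            # unicodedata.decimal returns the digit's integer value (0-9).
            out.append(str(unicodedata.decimal(ch)))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(normalize_digits("१२३"))  # Devanagari digits
```

Kanji numerals such as 一 are not decimal digits in Unicode terms, so they pass through unchanged and need separate positional handling.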
Other Numeral Systems (Future Support)
Additional Scripts (Planned)
- Persian: ۰۱۲۳۴۵۶۷۸۹ (Iran, Afghanistan)
- Bengali: ০১২৩৪৫৬৭৮৯ (Bangladesh)
- Tamil: ௦௧௨௩௪௫௬௭௮௯ (Tamil Nadu)
- Thai: ๐๑๒๓๔๕๖๗๘๙ (Thailand)
- Myanmar: ၀၁၂၃၄၅၆၇၈၉ (Myanmar)
Note: International numeral support continues expanding based on user demand. Current priority: Japanese/Chinese/Hindi/Arabic financial document analysis.
Installation
From Source
Binary Releases
Download from releases page
Quick Start
- Analyze a CSV file
- Analyze website data with curl
- Pipe data
- JSON output for automation
Usage
Basic Syntax
Input Methods
- File path: `benf financial_data.xlsx`
- Web data with curl: `curl -s https://api.example.com/data | benf`
- String: `benf "123 456 789 101112"`
- Pipe: `cat data.txt | benf`
Priority: File > String > Pipe
Options
| Option | Description |
|---|---|
| `--format <FORMAT>` | Output format: text, csv, json, yaml, toml, xml |
| `--quiet` | Minimal output (numbers only) |
| `--verbose` | Detailed statistics |
| `--lang <LANGUAGE>` | Output language: en, ja, zh, hi, ar (default: auto) |
| `--filter <RANGE>` | Filter numbers (e.g., `--filter ">=100"`) |
| `--threshold <LEVEL>` | Alert threshold: low, medium, high, critical |
| `--min-count <NUMBER>` | Minimum number of data points required for analysis |
| `--help`, `-h` | Show help |
| `--version`, `-V` | Show version |
Supported File Formats
| Format | Extensions | Notes |
|---|---|---|
| Microsoft Excel | .xlsx, .xls | Spreadsheet data |
| Microsoft Word | .docx, .doc | Document analysis |
| Microsoft PowerPoint | .pptx, .ppt | Presentation data |
| OpenDocument | .ods, .odt | OpenOffice/LibreOffice files |
| PDF | .pdf | Text extraction |
| CSV/TSV | .csv, .tsv | Structured data |
| JSON/XML | .json, .xml | API responses |
| YAML/TOML | .yaml, .toml | Configuration files |
| HTML | .html | Web pages |
| Text | .txt | Plain text |
Real-World Usage Examples
Benf follows the Unix philosophy and combines well with standard Unix tools for processing multiple files:
Financial Audit Workflows
- Quarterly financial audit: check all Excel reports
- Monthly expense report validation
- Tax document verification (high-precision analysis)
Automated Monitoring & Alerts
- Daily monitoring script for accounting system exports (the example sends alerts to `ALERT_EMAIL="audit@company.com"`)
- Continuous integration fraud detection
- Real-time folder monitoring with inotify
Large-Scale Data Processing
- Process an entire corporate filesystem for a compliance audit
- Archive analysis: process historical data efficiently
- Network storage scanning with progress tracking
Advanced Reporting & Analytics
- Risk distribution analysis across departments
- Time-series risk analysis (requires date-sorted files)
- Statistical summary generation
- Comparative analysis between periods (the example compares Q3 and Q4 high-risk counts)
Integration with Other Tools
- Git pre-commit hook for data validation (`.git/hooks/pre-commit`)
- Database import validation
- Slack/Teams webhook integration
- Excel macro integration (save as a macro-enabled workbook). VBA code to call benf from Excel: `Shell "benf """ & ActiveWorkbook.FullName & """ --format json > benf-result.json"`
Specialized Use Cases
- Election audit (checking vote counts)
- Scientific data validation
- Supply chain invoice verification
Web & API Analysis Integration
- Financial API monitoring for real-time fraud detection (the example uses `API_BASE="https://api.company.com"` with the endpoints `daily-transactions`, `expense-reports`, and `payroll-data`)
- Stock market data validation
- Government data integrity checking
- Cryptocurrency exchange analysis
- Banking API compliance monitoring (the example iterates over `bank1.api.com`, `bank2.api.com`, and `bank3.api.com`)
- E-commerce fraud detection pipeline with an alert webhook
- International data source aggregation
Advanced GNU parallel Integration
GNU parallel provides powerful features for high-performance batch processing:
- High-performance parallel processing with load balancing
- Dynamic load adjustment based on system resources
- Progress monitoring with ETA and throughput stats
- Retry failed jobs automatically
- Distribute work across multiple machines (SSH)
- Memory-conscious processing for large datasets
- Smart workload distribution with different file types
- Resource-aware batch processing with custom slots
- Complex pipeline with conditional processing
- Benchmark and performance optimization
- Advanced filtering and routing based on results
- Cross-platform compatibility testing
Key GNU parallel Features Leveraged:
- `--eta`: Shows estimated completion time
- `--progress`: Real-time progress bar
- `--load 80%`: CPU load-aware scheduling
- `--memfree 1G`: Memory-aware processing
- `--retries 3`: Automatic retry for failed jobs
- `--sshloginfile`: Distribute across multiple servers
- `--joblog`: Detailed execution logging
- `--bar`: Visual progress indicator
- `-j+0`: Use all CPU cores optimally
Output
Default Output
Benford's Law Analysis Results
Dataset: financial_data.csv
Numbers analyzed: 1,247
Risk Level: CRITICAL ⚠️
First Digit Distribution:
1: ████████████████████████████ 28.3% (expected: 30.1%)
2: ████████████████████ 20.1% (expected: 17.6%)
3: ██████████ 9.8% (expected: 12.5%)
...
Statistical Tests:
Chi-square: 23.45 (p-value: 0.003)
Mean Absolute Deviation: 2.1%
Verdict: SIGNIFICANT DEVIATION DETECTED
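The statistics in the sample output can be reproduced with a short Python sketch. It assumes a Pearson chi-square test over the nine first-digit categories (8 degrees of freedom) and a mean absolute deviation averaged over those nine categories. Whether benf computes them in exactly this way is an assumption, and all function names are illustrative.

```python
import math

# Benford's expected proportions for first digits 1-9.
EXPECTED = [math.log10(1 + 1 / d) for d in range(1, 10)]


def chi_square(counts):
    """Pearson chi-square statistic for nine observed first-digit counts."""
    n = sum(counts)
    return sum((obs - n * p) ** 2 / (n * p) for obs, p in zip(counts, EXPECTED))


def p_value_df8(stat):
    """Upper-tail probability of a chi-square variable with 8 degrees of freedom.

    For even degrees of freedom 2m the tail has the closed form
    exp(-x/2) * sum_{k<m} (x/2)^k / k!, here with m = 4.
    """
    half = stat / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(4))


def mean_absolute_deviation(counts):
    """Average absolute gap between observed and expected proportions."""
    n = sum(counts)
    return sum(abs(obs / n - p) for obs, p in zip(counts, EXPECTED)) / 9


if __name__ == "__main__":
    print(f"p-value for chi-square 23.45: {p_value_df8(23.45):.3f}")
```

With the chi-square value of 23.45 from the sample output, this closed form gives a p-value that rounds to the 0.003 shown there.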
JSON Output
Examples
Fraud Detection
- Monitor sales data
- Real-time log monitoring
- Batch analysis
Japanese Numerals
- Full-width digits
- Kanji numerals
- Mixed patterns
Web Analysis with curl
- Financial website analysis
- API endpoint with authentication
- Handle redirects and cookies
- Proxy usage (curl handles all proxy scenarios)
- SSL/TLS options
- Multiple endpoints processing
- Rate limiting and retries
- POST requests with JSON data
Automation with curl Integration
- Daily fraud check from a web API
- Multi-source monitoring script
- Webhook integration (the example uses `WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"`)
Risk Levels
| Level | Chi-square p-value | Interpretation |
|---|---|---|
| LOW | p > 0.1 | Consistent with Benford's Law |
| MEDIUM | 0.05 < p ≤ 0.1 | Slight deviation |
| HIGH | 0.01 < p ≤ 0.05 | Significant deviation |
| CRITICAL | p ≤ 0.01 | Strong evidence of manipulation |
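The thresholds in this table translate directly into a small classifier. The Python sketch below mirrors the table's boundaries; the function name `risk_level` is illustrative and not part of benf.

```python
def risk_level(p_value: float) -> str:
    """Map a chi-square p-value to the risk level from the table."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must lie between 0 and 1")
    if p_value > 0.1:
        return "LOW"
    if p_value > 0.05:
        return "MEDIUM"
    if p_value > 0.01:
        return "HIGH"
    return "CRITICAL"


if __name__ == "__main__":
    for p in (0.5, 0.08, 0.03, 0.005):
        print(p, risk_level(p))
```

Each boundary value belongs to the more severe level, matching the `≤` signs in the table.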
Common Use Cases
- Accounting audits: Detect manipulated financial records
- Tax investigations: Identify suspicious income declarations
- Election monitoring: Verify vote count authenticity
- Insurance claims: Spot fraudulent claim patterns
- Scientific data: Validate research results
- Quality control: Monitor manufacturing data
⚠️ Important Limitations
Benford's Law does NOT apply to:
- Constrained ranges: Adult heights (1.5-2.0m), ages (0-100), temperatures
- Sequential data: Invoice numbers, employee IDs, zip codes
- Assigned numbers: Phone numbers, social security numbers, lottery numbers
- Small datasets: Fewer than 30-50 numbers (insufficient for statistical analysis)
- Single-source data: All from same process/machine with similar magnitudes
- Rounded data: Heavily rounded amounts (e.g., all ending in 00)
Best suited for:
- Multi-scale natural data: Financial transactions, populations, physical measurements
- Diverse sources: Mixed data from different processes/timeframes
- Large datasets: 100+ numbers for reliable analysis
- Unmanipulated data: Naturally occurring, not artificially constrained
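Some of these limitations can be screened for before running an analysis. The Python sketch below checks the sample size against the 30 and 100 figures mentioned above and checks how many orders of magnitude the data spans. The one-order-of-magnitude cutoff and the name `suitability_warnings` are assumptions made for illustration and do not describe benf's behavior.

```python
import math


def suitability_warnings(values, min_count=30, recommended_count=100):
    """Return a list of reasons why Benford analysis may be unreliable."""
    # Only magnitudes of at least 1 take part in the analysis.
    usable = [abs(v) for v in values if abs(v) >= 1]
    warnings = []
    if len(usable) < min_count:
        warnings.append("too few numbers for statistical analysis")
    elif len(usable) < recommended_count:
        warnings.append("fewer than the recommended number of values")
    if usable:
        # Constrained ranges span little on a logarithmic scale.
        span = math.log10(max(usable)) - math.log10(min(usable))
        if span < 1:  # assumed cutoff: one order of magnitude
            warnings.append("values span less than one order of magnitude")
    return warnings


if __name__ == "__main__":
    print(suitability_warnings([150, 160, 170]))
```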
Historical Background
Discovery and Development:
- 1881: Simon Newcomb first observed the phenomenon while studying logarithm tables
- 1938: Physicist Frank Benford rediscovered and formalized the law with extensive research
- 1972: First application to accounting and fraud detection in academic literature
- 1980s: Major accounting firms began using Benford's Law as a standard audit tool
- 1990s: Mark Nigrini popularized its use in forensic accounting and tax fraud detection
- 2000s+: Expanded to election monitoring, scientific data validation, and financial crime investigation
Modern Applications:
- Used by IRS for tax audit screening
- Standard tool in Big Four accounting firms
- Applied in election fraud detection (notably 2009 Iran election analysis)
- Employed in anti-money laundering investigations
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General error |
| 2 | Invalid arguments |
| 3 | File/network error |
| 10 | High risk detected |
| 11 | Critical risk detected |
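A calling script can branch on these exit codes. The Python sketch below encodes the table and wraps a benf invocation with the standard `subprocess` module. The helper names are hypothetical, and `run_benf` assumes that a `benf` binary is on the PATH.

```python
import subprocess

# Exit codes as listed in the table above.
EXIT_MEANINGS = {
    0: "Success",
    1: "General error",
    2: "Invalid arguments",
    3: "File/network error",
    10: "High risk detected",
    11: "Critical risk detected",
}


def describe_exit_code(code: int) -> str:
    """Translate a benf exit code into its documented meaning."""
    return EXIT_MEANINGS.get(code, "Unknown exit code")


def is_risk_alert(code: int) -> bool:
    """True when the run completed but flagged high or critical risk."""
    return code in (10, 11)


def run_benf(path: str) -> int:
    """Run benf on a file and return its exit code (requires benf on PATH)."""
    completed = subprocess.run(["benf", path], capture_output=True, text=True)
    return completed.returncode
```

Codes 10 and 11 signal a successful analysis with a risk finding, so a script should treat them differently from the error codes 1 to 3.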
Configuration
Benf is designed to work with standard Unix tools. For web data access, use curl with its rich configuration options:
- Proxy configuration is handled by curl; use curl's proxy options directly.
- SSL/TLS configuration is likewise set through curl's options.
Benf itself requires no configuration files or environment variables.
Contributing
See CONTRIBUTING.md for development guidelines.
License
MIT License - see LICENSE file.