Influence
Privacy-first local LLM inference - Download models from HuggingFace and run them entirely on your machine.
Why Influence?
The Problem: Most LLM tools require cloud APIs, expensive subscriptions, or complex Python setups. Your data leaves your machine, you pay per token, and you're locked into someone else's infrastructure.
The Solution: Influence gives you:
- Complete privacy - All inference happens locally on your machine
- No API costs - Pay only in local compute, never per token
- No vendor lock-in - Models are downloaded to your disk
- Simplicity - Single binary, no Python, no virtual environments
- GPU acceleration - Metal support for macOS (CUDA coming soon)
What Makes It Different?
| Feature | Influence | Cloud APIs (OpenAI, etc.) | Python Tools |
|---|---|---|---|
| Privacy | 100% local | Data sent to servers | Local but complex |
| Cost | Free (after download) | Pay per token | Free but complex setup |
| Setup | Single binary | API key required | Python, pip, venv |
| GPU Support | Metal (macOS) | Server-side | Hard to configure |
| Offline Use | Yes | No | Yes |
Quick Start
# Build from source
# Search for a model
# Download a model (~1GB for TinyLlama)
# Generate text locally (with Metal GPU on macOS)
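The steps above look roughly like this (a sketch: the repository URL, downloaded model directory name, and positional query/prompt arguments are assumptions; see the Command Reference below for the documented options):

```bash
# Build from source (the metal feature is enabled by default on macOS)
git clone <repository-url>        # replace with the actual repository URL
cd influence
cargo build --release

# Search, download, and generate with the resulting binary
./target/release/influence search tinyllama
./target/release/influence download -m TinyLlama/TinyLlama-1.1B-Chat-v1.0
./target/release/influence generate -m ./models/TinyLlama-1.1B-Chat-v1.0 \
    "What is the capital of France?"
```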
Usage Examples
Example 1: Quick Question Answering
# Ask a factual question
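For example (a sketch; assumes the influence binary is on your PATH, the model sits in the default ./models/ directory, and the prompt is passed as a positional argument):

```bash
influence generate -m ./models/TinyLlama-1.1B-Chat-v1.0 \
    "In what year did the Apollo 11 Moon landing take place?"
```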
Benefit: Get instant answers without:
- Opening a browser
- Waiting for cloud API responses
- Paying per token
- Sending your queries to third parties
Example 2: Code Generation
# Generate code with higher temperature for creativity
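For example (same assumptions as above: binary on PATH, illustrative model path, positional prompt):

```bash
influence generate -m ./models/TinyLlama-1.1B-Chat-v1.0 \
    --temperature 0.9 --max-tokens 256 \
    "Write a Rust function that reverses the words in a sentence"
```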
Benefit: Generate code locally with:
- No rate limits
- No API keys to manage
- Full context control
- Works offline
Example 3: Content Creation
# Generate blog post or documentation
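For example (same assumptions as the previous examples):

```bash
influence generate -m ./models/TinyLlama-1.1B-Chat-v1.0 \
    --max-tokens 512 \
    "Write a short blog post introducing local LLM inference"
```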
Benefit: Create content without:
- Using cloud services
- Exposing your ideas to third parties
- Worrying about content policies
Current Status
Version 0.1.0 - Core Features Working
- OK Model search on HuggingFace
- OK Model downloading with progress tracking
- OK Local Llama-architecture inference (Llama, Mistral, Phi, Granite)
- OK Token spacing and formatting
- OK Metal GPU acceleration on macOS (enabled by default)
- OK Streaming text generation
- OK Temperature-based sampling
- OK KV caching for performance
- OK Architecture detection with helpful error messages
Tested Models:
- `TinyLlama/TinyLlama-1.1B-Chat-v1.0` - Working perfectly
- Other Llama-architecture models - Supported
Installation
Build from Source
# Clone the repository
# Build release binary with Metal support (macOS)
# The binary will be at target/release/influence
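A sketch of that build (the repository URL is a placeholder):

```bash
git clone <repository-url>   # replace with the actual repository URL
cd influence
cargo build --release        # metal is a default feature on macOS

./target/release/influence --help
```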
Features:
- `metal` (default) - Metal GPU acceleration for macOS
- `accelerate` - CPU acceleration for macOS
- `cuda` - CUDA support for NVIDIA GPUs (placeholder)
Build without GPU:
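For a CPU-only binary, disabling the default features should work (a sketch using the feature names listed above):

```bash
cargo build --release --no-default-features
# Optionally re-enable CPU acceleration on macOS:
cargo build --release --no-default-features --features accelerate
```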
Command Reference
search - Find Models on HuggingFace
Examples:
# Search for llama models
# Search with filters
# Search for small models
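Illustrative invocations (a sketch: the search query is assumed to be a positional argument, and the binary is assumed to be on your PATH):

```bash
influence search llama
influence search llama -a meta-llama -l 10
influence search tinyllama -l 5
```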
Options:
- `-l, --limit <N>` - Max results (default: 20)
- `-a, --author <ORG>` - Filter by author
download - Download Model from HuggingFace
Examples:
# Download TinyLlama (recommended for testing)
# Download to custom location
# Use custom mirror
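Illustrative invocations using the documented options (the output path and mirror URL are examples):

```bash
influence download -m TinyLlama/TinyLlama-1.1B-Chat-v1.0
influence download -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 -o /data/models
influence download -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 -r https://hf-mirror.com
```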
Options:
- `-m, --model <MODEL>` - Model name (required)
- `-r, --mirror <URL>` - Mirror URL (default: hf-mirror.com)
- `-o, --output <PATH>` - Output directory (default: ./models/)
generate - Generate Text Locally
Examples:
# Basic generation
# With custom parameters
# Lower temperature for more focused output
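Illustrative invocations (the model directory name and positional prompt are assumptions):

```bash
influence generate -m ./models/TinyLlama-1.1B-Chat-v1.0 "Explain what a KV cache does"
influence generate -m ./models/TinyLlama-1.1B-Chat-v1.0 --max-tokens 256 --temperature 0.9 \
    "Brainstorm three names for a coffee shop"
influence generate -m ./models/TinyLlama-1.1B-Chat-v1.0 --temperature 0.2 \
    "Summarize what KV caching is in one sentence"
```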
Options:
- `-m, --model-path <PATH>` - Path to model directory (required)
- `--max-tokens <N>` - Max tokens to generate (default: 512)
- `--temperature <0.0-2.0>` - Sampling temperature (default: 0.7)
  - Lower (0.1-0.3): More focused, deterministic
  - Higher (0.7-1.0): More creative, diverse
Recommended Models
For Testing & Development
| Model | Size | Speed | Use Case |
|---|---|---|---|
| `TinyLlama/TinyLlama-1.1B-Chat-v1.0` | ~1GB | Fast | Testing, quick experiments |
| `microsoft/phi-2` | ~2GB | Medium | Quality vs speed balance |
| `mistralai/Mistral-7B-v0.1` | ~14GB | Slower | Production-quality output |
Why TinyLlama?
# Download and try TinyLlama first
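Something like (a sketch; paths and prompt syntax as in the earlier examples):

```bash
influence download -m TinyLlama/TinyLlama-1.1B-Chat-v1.0
influence generate -m ./models/TinyLlama-1.1B-Chat-v1.0 "Introduce yourself in two sentences"
```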
Benefits:
- Fast downloads (~1GB)
- Quick inference (even on CPU)
- Good quality for many tasks
- Great for learning and experimentation
Benefits Over Alternatives
vs Cloud APIs (OpenAI, Anthropic, etc.)
You Save:
- Money - No per-token costs
- Privacy - Data never leaves your machine
- Latency - No network round-trips
- Reliability - Works offline
- Control - No rate limits or content policies
vs Python Tools (llama.cpp, transformers, etc.)
You Get:
- Simplicity - Single binary, no dependencies
- Performance - Rust speed with GPU acceleration
- Stability - No version conflicts or dependency hell
- Integration - Easy to script and automate
How It Works
┌─────────────┐
│ Your Prompt│
└──────┬──────┘
│
▼
┌──────────────────────────────────┐
│ Tokenization (HuggingFace) │
└──────┬───────────────────────────┘
│
▼
┌──────────────────────────────────┐
│ Model Loading (.safetensors) │
│ - Memory-mapped for efficiency │
│ - GPU acceleration (Metal/CUDA) │
└──────┬───────────────────────────┘
│
▼
┌──────────────────────────────────┐
│ Inference (Candle) │
│ - Forward pass with KV cache │
│ - Temperature-based sampling │
│ - Token-by-token generation │
└──────┬───────────────────────────┘
│
▼
┌─────────────┐
│ Output Text│
└─────────────┘
Technical Details
Model Requirements
Each model directory must contain:
- `config.json` - Model architecture and parameters
- `tokenizer.json` or `tokenizer_config.json` - Tokenizer
- `*.safetensors` - Model weights (memory-mapped)
Supported Architectures
- OK Llama (meta-llama/Llama-2-7b-hf, TinyLlama)
- OK Mistral (mistralai/Mistral-7B-v0.1)
- OK Phi (microsoft/phi-2)
- OK Granite (pure transformer variants)
- X Mamba/Hybrid models (specialized implementation required)
- X MoE models (not yet supported)
- X Encoder-only models (BERT, etc. - not for generation)
Performance
Optimizations:
- KV Caching - Reuse computed tensors for faster generation
- Memory Mapping - Zero-copy model loading
- Streaming Output - Display tokens as they're generated
- GPU Acceleration - Metal support on macOS (enabled by default)
- Proper Token Spacing - Handles SentencePiece space markers correctly
Memory Usage:
- TinyLlama (1B): ~2GB RAM
- Phi-2 (2.7B): ~4GB RAM
- Mistral-7B: ~14GB RAM
- Add model size for total memory requirement
Performance Tips:
- On macOS: Metal GPU is enabled by default for faster inference
- On Linux/Windows: CUDA support planned (use CPU for now)
- Use smaller models (TinyLlama) for faster responses
- Reduce `--max-tokens` for quicker generation
Troubleshooting
Model Not Found Error
# Error: Model directory not found
# Solution: Check the model path exists
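For example (assuming the default ./models/ output directory):

```bash
ls ./models/                                               # confirm the model directory exists
influence download -m TinyLlama/TinyLlama-1.1B-Chat-v1.0   # re-download if it is missing
```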
Missing Tokenizer Error
# Error: Tokenizer file not found
# Solution: Ensure these files exist in model directory:
# - tokenizer.json (or tokenizer_config.json)
# - config.json
# - *.safetensors files
Unsupported Architecture Error
# Error: Unsupported model architecture (Mamba/MoE)
# Solution: Use a supported model like TinyLlama
Slow Generation on CPU
# CPU inference is slower. Options:
# 1. Use a smaller model (TinyLlama instead of Mistral-7B)
# 2. Reduce max-tokens
# 3. Build with Metal support (macOS):
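A sketch of that build step (metal is listed as a default feature, so a plain release build may already include it):

```bash
cargo build --release
# or explicitly:
cargo build --release --features metal
```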
Development
Build with Debug Logging
Set `RUST_LOG=influence=debug` to enable debug output.
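For example (the generate invocation, model path, and prompt below are illustrative):

```bash
RUST_LOG=influence=debug ./target/release/influence generate \
    -m ./models/TinyLlama-1.1B-Chat-v1.0 "test prompt"
```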
Run Tests
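The standard Cargo test runner should apply:

```bash
cargo test
```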
Roadmap
- CUDA support for NVIDIA GPUs
- Quantized model support (GGUF)
- Chat mode with conversation history
- Batch generation
- HTTP API server mode
- Top-k and nucleus sampling
License
MIT
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Acknowledgments
Built with:
- Candle - ML framework by HuggingFace
- Tokenizers - Fast tokenization
- Clap - CLI parsing