Influence
Privacy-first local LLM inference - Download models from HuggingFace and run them entirely on your machine.
Why Influence?
The Problem: Most LLM tools require cloud APIs, expensive subscriptions, or complex Python setups. Your data leaves your machine, you pay per token, and you're locked into someone else's infrastructure.
The Solution: Influence gives you:
- Complete privacy - All inference happens locally on your machine
- No API costs - Pay only in local compute, never per token
- No vendor lock-in - Models are downloaded to your disk
- Simplicity - Single binary, no Python, no virtual environments
- GPU acceleration - Metal support for macOS (CUDA coming soon)
What Makes It Different?
| Feature | Influence | Cloud APIs (OpenAI, etc.) | Python Tools |
|---|---|---|---|
| Privacy | 100% local | Data sent to servers | Local but complex |
| Cost | Free (after download) | Pay per token | Free but complex setup |
| Setup | Single binary | API key required | Python, pip, venv |
| GPU Support | Metal (macOS) | Server-side | Hard to configure |
| Offline Use | Yes | No | Yes |
Quick Start
# Build from source
# Search for a model
# Download a model (~1GB for TinyLlama)
# Generate text locally (with Metal GPU on macOS)
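The steps above look roughly like this (a sketch: the repository URL, downloaded model directory name, and positional query/prompt arguments are assumptions; see the Command Reference below for the documented options):

```bash
# Build from source (the metal feature is enabled by default on macOS)
git clone <repository-url>        # replace with the actual repository URL
cd influence
cargo build --release

# Search, download, and generate with the resulting binary
./target/release/influence search tinyllama
./target/release/influence download -m TinyLlama/TinyLlama-1.1B-Chat-v1.0
./target/release/influence generate -m ./models/TinyLlama-1.1B-Chat-v1.0 \
    "What is the capital of France?"
```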
Usage Examples
Example 1: Quick Question Answering
# Ask a factual question
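For example (a sketch; assumes the influence binary is on your PATH, the model sits in the default ./models/ directory, and the prompt is passed as a positional argument):

```bash
influence generate -m ./models/TinyLlama-1.1B-Chat-v1.0 \
    "In what year did the Apollo 11 Moon landing take place?"
```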
Benefit: Get instant answers without:
- Opening a browser
- Waiting for cloud API responses
- Paying per token
- Sending your queries to third parties
Example 2: Code Generation
# Generate code with higher temperature for creativity
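For example (same assumptions as above: binary on PATH, illustrative model path, positional prompt):

```bash
influence generate -m ./models/TinyLlama-1.1B-Chat-v1.0 \
    --temperature 0.9 --max-tokens 256 \
    "Write a Rust function that reverses the words in a sentence"
```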
Benefit: Generate code locally with:
- No rate limits
- No API keys to manage
- Full context control
- Works offline
Example 3: Content Creation
# Generate blog post or documentation
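For example (same assumptions as the previous examples):

```bash
influence generate -m ./models/TinyLlama-1.1B-Chat-v1.0 \
    --max-tokens 512 \
    "Write a short blog post introducing local LLM inference"
```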
Benefit: Create content without:
- Using cloud services
- Exposing your ideas to third parties
- Worrying about content policies
Current Status
Version 0.1.0 - Core Features Working
- OK Model search on HuggingFace
- OK Model downloading with progress tracking
- OK Local Llama-architecture inference (Llama, Mistral, Phi, Granite)
- OK Token spacing and formatting
- OK Metal GPU acceleration on macOS (enabled by default)
- OK Streaming text generation
- OK Temperature-based sampling
- OK KV caching for performance
- OK Architecture detection with helpful error messages
Tested Models:
- `TinyLlama/TinyLlama-1.1B-Chat-v1.0` - Working perfectly
- Other Llama-architecture models - Supported
Installation
Build from Source
# Clone the repository
# Build release binary with Metal support (macOS)
# The binary will be at target/release/influence
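A sketch of that build (the repository URL is a placeholder):

```bash
git clone <repository-url>   # replace with the actual repository URL
cd influence
cargo build --release        # metal is a default feature on macOS

./target/release/influence --help
```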
Features:
- `metal` (default) - Metal GPU acceleration for macOS
- `accelerate` - CPU acceleration for macOS
- `cuda` - CUDA support for NVIDIA GPUs (placeholder)
Build without GPU:
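For a CPU-only binary, disabling the default features should work (a sketch using the feature names listed above):

```bash
cargo build --release --no-default-features
# Optionally re-enable CPU acceleration on macOS:
cargo build --release --no-default-features --features accelerate
```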
Command Reference
search - Find Models on HuggingFace
Examples:
# Search for llama models
# Search with filters
# Search for small models
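Illustrative invocations (a sketch: the search query is assumed to be a positional argument, and the binary is assumed to be on your PATH):

```bash
influence search llama
influence search llama -a meta-llama -l 10
influence search tinyllama -l 5
```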
Options:
- `-l, --limit <N>` - Max results (default: 20)
- `-a, --author <ORG>` - Filter by author
download - Download Model from HuggingFace
Examples:
# Download TinyLlama (recommended for testing)
# Download to custom location
# Use custom mirror
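Illustrative invocations using the documented options (the output path and mirror URL are examples):

```bash
influence download -m TinyLlama/TinyLlama-1.1B-Chat-v1.0
influence download -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 -o /data/models
influence download -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 -r https://hf-mirror.com
```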
Options:
- `-m, --model <MODEL>` - Model name (required)
- `-r, --mirror <URL>` - Mirror URL (default: hf-mirror.com)
- `-o, --output <PATH>` - Output directory (default: ./models/)
generate - Generate Text Locally
Examples:
# Basic generation
# With custom parameters
# Lower temperature for more focused output
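Illustrative invocations (the model directory name and positional prompt are assumptions):

```bash
influence generate -m ./models/TinyLlama-1.1B-Chat-v1.0 "Explain what a KV cache does"
influence generate -m ./models/TinyLlama-1.1B-Chat-v1.0 --max-tokens 256 --temperature 0.9 \
    "Brainstorm three names for a coffee shop"
influence generate -m ./models/TinyLlama-1.1B-Chat-v1.0 --temperature 0.2 \
    "Summarize what KV caching is in one sentence"
```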
Options:
- `-m, --model-path <PATH>` - Path to model directory (required)
- `--max-tokens <N>` - Max tokens to generate (default: 512)
- `--temperature <0.0-2.0>` - Sampling temperature (default: 0.7)
  - Lower (0.1-0.3): More focused, deterministic
  - Higher (0.7-1.0): More creative, diverse
Recommended Models
For Testing & Development
| Model | Size | Speed | Use Case |
|---|---|---|---|
| `TinyLlama/TinyLlama-1.1B-Chat-v1.0` | ~1GB | Fast | Testing, quick experiments |
| `microsoft/phi-2` | ~2GB | Medium | Quality vs speed balance |
| `mistralai/Mistral-7B-v0.1` | ~14GB | Slower | Production-quality output |
Why TinyLlama?
# Download and try TinyLlama first
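Something like (a sketch; paths and prompt syntax as in the earlier examples):

```bash
influence download -m TinyLlama/TinyLlama-1.1B-Chat-v1.0
influence generate -m ./models/TinyLlama-1.1B-Chat-v1.0 "Introduce yourself in two sentences"
```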
Benefits:
- Fast downloads (~1GB)
- Quick inference (even on CPU)
- Good quality for many tasks
- Great for learning and experimentation
Benefits Over Alternatives
vs Cloud APIs (OpenAI, Anthropic, etc.)
You Save:
- Money - No per-token costs
- Privacy - Data never leaves your machine
- Latency - No network round-trips
- Reliability - Works offline
- Control - No rate limits or content policies
vs Python Tools (llama.cpp, transformers, etc.)
You Get:
- Simplicity - Single binary, no dependencies
- Performance - Rust speed with GPU acceleration
- Stability - No version conflicts or dependency hell
- Integration - Easy to script and automate
How It Works
┌─────────────┐
│ Your Prompt│
└──────┬──────┘
│
▼
┌──────────────────────────────────┐
│ Tokenization (HuggingFace) │
└──────┬───────────────────────────┘
│
▼
┌──────────────────────────────────┐
│ Model Loading (.safetensors) │
│ - Memory-mapped for efficiency │
│ - GPU acceleration (Metal/CUDA) │
└──────┬───────────────────────────┘
│
▼
┌──────────────────────────────────┐
│ Inference (Candle) │
│ - Forward pass with KV cache │
│ - Temperature-based sampling │
│ - Token-by-token generation │
└──────┬───────────────────────────┘
│
▼
┌─────────────┐
│ Output Text│
└─────────────┘
Technical Details
Model Requirements
Each model directory must contain:
- `config.json` - Model architecture and parameters
- `tokenizer.json` or `tokenizer_config.json` - Tokenizer
- `*.safetensors` - Model weights (memory-mapped)
Supported Architectures
- OK Llama (meta-llama/Llama-2-7b-hf, TinyLlama)
- OK Mistral (mistralai/Mistral-7B-v0.1)
- OK Phi (microsoft/phi-2)
- OK Granite (pure transformer variants)
- X Mamba/Hybrid models (specialized implementation required)
- X MoE models (not yet supported)
- X Encoder-only models (BERT, etc. - not for generation)
Performance
Optimizations:
- KV Caching - Reuse computed tensors for faster generation
- Memory Mapping - Zero-copy model loading
- Streaming Output - Display tokens as they're generated
- GPU Acceleration - Metal support on macOS (enabled by default)
- Proper Token Spacing - Handles SentencePiece space markers correctly
Memory Usage:
- TinyLlama (1B): ~2GB RAM
- Phi-2 (2.7B): ~4GB RAM
- Mistral-7B: ~14GB RAM
- Add model size for total memory requirement
Performance Tips:
- On macOS: Metal GPU is enabled by default for faster inference
- On Linux/Windows: CUDA support planned (use CPU for now)
- Use smaller models (TinyLlama) for faster responses
- Reduce `--max-tokens` for quicker generation
Troubleshooting
Model Not Found Error
# Error: Model directory not found
# Solution: Check the model path exists
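For example (assuming the default ./models/ output directory):

```bash
ls ./models/                                               # confirm the model directory exists
influence download -m TinyLlama/TinyLlama-1.1B-Chat-v1.0   # re-download if it is missing
```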
Missing Tokenizer Error
# Error: Tokenizer file not found
# Solution: Ensure these files exist in model directory:
# - tokenizer.json (or tokenizer_config.json)
# - config.json
# - *.safetensors files
Unsupported Architecture Error
# Error: Unsupported model architecture (Mamba/MoE)
# Solution: Use a supported model like TinyLlama
Slow Generation on CPU
# CPU inference is slower. Options:
# 1. Use a smaller model (TinyLlama instead of Mistral-7B)
# 2. Reduce max-tokens
# 3. Build with Metal support (macOS):
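A sketch of that build step (metal is listed as a default feature, so a plain release build may already include it):

```bash
cargo build --release
# or explicitly:
cargo build --release --features metal
```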
Development
Build with Debug Logging
Set `RUST_LOG=influence=debug` to enable debug output.
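For example (the generate invocation, model path, and prompt below are illustrative):

```bash
RUST_LOG=influence=debug ./target/release/influence generate \
    -m ./models/TinyLlama-1.1B-Chat-v1.0 "test prompt"
```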
Run Tests
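The standard Cargo test runner should apply:

```bash
cargo test
```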
Roadmap
- CUDA support for NVIDIA GPUs
- Quantized model support (GGUF)
- Chat mode with conversation history
- Batch generation
- HTTP API server mode
- Top-k and nucleus sampling
License
MIT
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Acknowledgments
Built with:
- Candle - ML framework by HuggingFace
- Tokenizers - Fast tokenization
- Clap - CLI parsing