influence 0.1.0

A Rust CLI tool for downloading HuggingFace models and running local LLM inference

Influence

Privacy-first local LLM inference - Download models from HuggingFace and run them entirely on your machine.

Why Influence?

The Problem: Most LLM tools require cloud APIs, expensive subscriptions, or complex Python setups. Your data leaves your machine, you pay per token, and you're locked into someone else's infrastructure.

The Solution: Influence gives you:

  • Complete privacy - All inference happens locally on your machine
  • No API costs - Pay once (in compute) and use forever
  • No vendor lock-in - Models are downloaded to your disk
  • Simplicity - Single binary, no Python, no virtual environments
  • GPU acceleration - Metal support for macOS (CUDA coming soon)

What Makes It Different?

Feature       Influence              Cloud APIs (OpenAI, etc.)   Python Tools
Privacy       100% local             Data sent to servers        Local but complex
Cost          Free (after download)  Pay per token               Free but complex setup
Setup         Single binary          API key required            Python, pip, venv
GPU Support   Metal (macOS)          Server-side                 Hard to configure
Offline Use   Yes                    No                          Yes

Quick Start

# Build from source
git clone https://github.com/yingkitw/influence.git
cd influence
cargo build --release

# Search for a model
./target/release/influence search "tinyllama" --limit 5

# Download a model (~1GB for TinyLlama)
./target/release/influence download -m TinyLlama/TinyLlama-1.1B-Chat-v1.0

# Generate text locally (with Metal GPU on macOS)
./target/release/influence generate "Explain quantum computing in simple terms" \
  --model-path ./models/TinyLlama_TinyLlama-1.1B-Chat-v1.0

Usage Examples

Example 1: Quick Question Answering

# Ask a factual question
influence generate "What are the main differences between Rust and C++?" \
  --model-path ./models/TinyLlama_TinyLlama-1.1B-Chat-v1.0 \
  --max-tokens 256

Benefit: Get instant answers without:

  • Opening a browser
  • Waiting for cloud API responses
  • Paying per token
  • Sending your queries to third parties

Example 2: Code Generation

# Generate code with higher temperature for creativity
influence generate "Write a Rust function to merge two sorted vectors" \
  --model-path ./models/TinyLlama_TinyLlama-1.1B-Chat-v1.0 \
  --temperature 0.8 \
  --max-tokens 512

Benefit: Generate code locally with:

  • No rate limits
  • No API keys to manage
  • Full context control
  • Works offline

Example 3: Content Creation

# Generate blog post or documentation
influence generate "Write a technical introduction to vector databases" \
  --model-path ./models/TinyLlama_TinyLlama-1.1B-Chat-v1.0 \
  --max-tokens 1024

Benefit: Create content without:

  • Using cloud services
  • Exposing your ideas to third parties
  • Worrying about content policies

Current Status

Version 0.1.0 - Core Features Working

  • ✓ Model search on HuggingFace
  • ✓ Model downloading with progress tracking
  • ✓ Local Llama-architecture inference (Llama, Mistral, Phi, Granite)
  • ✓ Token spacing and formatting
  • ✓ Metal GPU acceleration on macOS (enabled by default)
  • ✓ Streaming text generation
  • ✓ Temperature-based sampling
  • ✓ KV caching for performance
  • ✓ Architecture detection with helpful error messages

Tested Models:

  • TinyLlama/TinyLlama-1.1B-Chat-v1.0 - Verified end to end
  • Other Llama-architecture models - Supported, but not individually tested

Installation

Build from Source

# Clone the repository
git clone https://github.com/yingkitw/influence.git
cd influence

# Build release binary with Metal support (macOS)
cargo build --release

# The binary will be at target/release/influence
./target/release/influence --help

Features:

  • metal (default) - Metal GPU acceleration for macOS
  • accelerate - CPU acceleration for macOS
  • cuda - CUDA support for NVIDIA GPUs (placeholder)

Build without GPU:

cargo build --release --no-default-features

Command Reference

search - Find Models on HuggingFace

influence search <query> [options]

Examples:

# Search for llama models
influence search "llama"

# Search with filters
influence search "text-generation" --limit 10 --author meta-llama

# Search for small models
influence search "1b" --limit 5

Options:

  • -l, --limit <N> - Max results (default: 20)
  • -a, --author <ORG> - Filter by author

download - Download Model from HuggingFace

influence download -m <model> [options]

Examples:

# Download TinyLlama (recommended for testing)
influence download -m TinyLlama/TinyLlama-1.1B-Chat-v1.0

# Download to custom location
influence download -m microsoft/phi-2 -o ~/models

# Use custom mirror
influence download -m ibm/granite-4-h-small -r https://hf-mirror.com

Options:

  • -m, --model <MODEL> - Model name (required)
  • -r, --mirror <URL> - Mirror URL (default: hf-mirror.com)
  • -o, --output <PATH> - Output directory (default: ./models/)

generate - Generate Text Locally

influence generate <prompt> [options]

Examples:

# Basic generation
influence generate "What is machine learning?" \
  --model-path ./models/TinyLlama_TinyLlama-1.1B-Chat-v1.0

# With custom parameters
influence generate "Explain async/await" \
  --model-path ./models/TinyLlama_TinyLlama-1.1B-Chat-v1.0 \
  --max-tokens 512 \
  --temperature 0.7

# Lower temperature for more focused output
influence generate "Summarize: Rust is a systems programming language" \
  --model-path ./models/TinyLlama_TinyLlama-1.1B-Chat-v1.0 \
  --temperature 0.3 \
  --max-tokens 100

Options:

  • -m, --model-path <PATH> - Path to model directory (required)
  • --max-tokens <N> - Max tokens to generate (default: 512)
  • --temperature <0.0-2.0> - Sampling temperature (default: 0.7); see the sketch after this list
    • Lower (0.1-0.3): More focused, deterministic
    • Higher (0.7-1.0): More creative, diverse
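
To make the temperature knob concrete, here is a minimal, dependency-free Rust sketch of temperature-scaled softmax sampling. It only illustrates the idea and is not the crate's actual sampling code.

// Dividing the logits by the temperature before softmax sharpens the
// distribution for low values and flattens it for high values.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    let scaled: Vec<f32> = logits.iter().map(|l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    let logits = [2.0_f32, 1.0, 0.5];
    println!("t=0.3 -> {:?}", softmax_with_temperature(&logits, 0.3)); // near-deterministic
    println!("t=1.0 -> {:?}", softmax_with_temperature(&logits, 1.0)); // more diverse
}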

Recommended Models

For Testing & Development

Model                                 Size    Speed    Use Case
TinyLlama/TinyLlama-1.1B-Chat-v1.0    ~1GB    Fast     Testing, quick experiments
microsoft/phi-2                       ~2GB    Medium   Quality vs speed balance
mistralai/Mistral-7B-v0.1             ~14GB   Slower   Production-quality output

Why TinyLlama?

# Download and try TinyLlama first
influence download -m TinyLlama/TinyLlama-1.1B-Chat-v1.0
influence generate "Hello, world!" \
  --model-path ./models/TinyLlama_TinyLlama-1.1B-Chat-v1.0

Benefits:

  • Fast downloads (~1GB)
  • Quick inference (even on CPU)
  • Good quality for many tasks
  • Great for learning and experimentation

Benefits Over Alternatives

vs Cloud APIs (OpenAI, Anthropic, etc.)

You Save:

  • Money - No per-token costs
  • Privacy - Data never leaves your machine
  • Latency - No network round-trips
  • Reliability - Works offline
  • Control - No rate limits or content policies

vs Python Tools (transformers, llama-cpp-python, etc.)

You Get:

  • Simplicity - Single binary, no dependencies
  • Performance - Rust speed with GPU acceleration
  • Stability - No version conflicts or dependency hell
  • Integration - Easy to script and automate (see the sketch after this list)
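
As a sketch of the scripting point above, the binary can be driven from another Rust program like any other CLI. The prompt and model path below are placeholders; the flags are the documented generate options.

use std::process::Command;

fn main() -> std::io::Result<()> {
    // Invoke the influence binary as a subprocess; the flags mirror the
    // generate command reference above. Paths and prompt are placeholders.
    let output = Command::new("./target/release/influence")
        .arg("generate")
        .arg("Summarize the main ideas behind vector databases")
        .args(["--model-path", "./models/TinyLlama_TinyLlama-1.1B-Chat-v1.0"])
        .args(["--max-tokens", "128"])
        .output()?;
    print!("{}", String::from_utf8_lossy(&output.stdout));
    Ok(())
}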

How It Works

┌─────────────┐
│  Your Prompt│
└──────┬──────┘
       │
       ▼
┌──────────────────────────────────┐
│  Tokenization (HuggingFace)      │
└──────┬───────────────────────────┘
       │
       ▼
┌──────────────────────────────────┐
│  Model Loading (.safetensors)    │
│  - Memory-mapped for efficiency  │
│  - GPU acceleration (Metal/CUDA) │
└──────┬───────────────────────────┘
       │
       ▼
┌──────────────────────────────────┐
│  Inference (Candle)              │
│  - Forward pass with KV cache    │
│  - Temperature-based sampling    │
│  - Token-by-token generation     │
└──────┬───────────────────────────┘
       │
       ▼
┌─────────────┐
│  Output Text│
└─────────────┘
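
One detail the diagram glosses over is how the KV cache is used: the first forward pass processes the whole prompt and fills the cache, and every later pass only processes the newest token. The skeleton below uses hypothetical stub types purely to show that loop shape; the real inference runs on Candle and looks different.

// Stub types standing in for the real tokenizer, model, and KV cache.
struct KvCache { cached_len: usize }
struct Model;

impl Model {
    // Hypothetical forward pass: consumes only the tokens not yet in the
    // cache and returns the next token id (dummy logic for illustration).
    fn forward(&self, new_tokens: &[u32], cache: &mut KvCache) -> u32 {
        cache.cached_len += new_tokens.len();
        new_tokens.last().copied().unwrap_or(0) + 1
    }
}

fn main() {
    let model = Model;
    let mut cache = KvCache { cached_len: 0 };
    let mut tokens: Vec<u32> = vec![1, 15043, 3186]; // pretend-tokenized prompt
    let mut processed = 0;

    for _ in 0..4 {
        // Only feed tokens the cache has not seen yet: the full prompt on the
        // first pass, a single token on every pass after that.
        let next = model.forward(&tokens[processed..], &mut cache);
        processed = tokens.len();
        tokens.push(next);
        print!("{} ", next); // stream each token as soon as it is produced
    }
    println!("\n(cache now holds {} tokens)", cache.cached_len);
}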

Technical Details

Model Requirements

Each model directory must contain the following files (a quick check is sketched after this list):

  • config.json - Model architecture and parameters
  • tokenizer.json or tokenizer_config.json - Tokenizer
  • *.safetensors - Model weights (memory-mapped)
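
A quick way to check a downloaded directory before pointing generate at it (a standalone sketch; the path is a placeholder for wherever you downloaded the model):

use std::path::Path;

fn main() {
    // Placeholder path; substitute your own model directory.
    let dir = Path::new("./models/TinyLlama_TinyLlama-1.1B-Chat-v1.0");
    let has_config = dir.join("config.json").exists();
    let has_tokenizer =
        dir.join("tokenizer.json").exists() || dir.join("tokenizer_config.json").exists();
    let has_weights = std::fs::read_dir(dir)
        .map(|entries| {
            entries.flatten().any(|e| {
                let path = e.path();
                path.extension().and_then(|x| x.to_str()) == Some("safetensors")
            })
        })
        .unwrap_or(false);
    println!("config: {has_config}, tokenizer: {has_tokenizer}, weights: {has_weights}");
}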

Supported Architectures

  • ✓ Llama (meta-llama/Llama-2-7b-hf, TinyLlama)
  • ✓ Mistral (mistralai/Mistral-7B-v0.1)
  • ✓ Phi (microsoft/phi-2)
  • ✓ Granite (pure transformer variants)
  • ✗ Mamba/Hybrid models (specialized implementation required)
  • ✗ MoE models (not yet supported)
  • ✗ Encoder-only models (BERT, etc. - not for generation)
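
Detection is driven by config.json, which records the model architecture. The sketch below shows one way to inspect it yourself; it assumes serde_json as a dependency, the path is a placeholder, and the matched strings are illustrative rather than the crate's exact list.

// Not the crate's actual detection code; HuggingFace config.json files
// carry a "model_type" field (e.g. "llama", "mistral").
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path; point it at any downloaded model directory.
    let text = fs::read_to_string(
        "./models/TinyLlama_TinyLlama-1.1B-Chat-v1.0/config.json",
    )?;
    let config: serde_json::Value = serde_json::from_str(&text)?;
    match config.get("model_type").and_then(|m| m.as_str()) {
        Some("llama" | "mistral" | "phi" | "granite") => {
            println!("Llama-style architecture: should work")
        }
        Some(other) => println!("other architecture ({other}): may not be supported"),
        None => println!("config.json has no model_type field"),
    }
    Ok(())
}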

Performance

Optimizations:

  • KV Caching - Reuse computed tensors for faster generation
  • Memory Mapping - Zero-copy model loading
  • Streaming Output - Display tokens as they're generated
  • GPU Acceleration - Metal support on macOS (enabled by default)
  • Proper Token Spacing - Handles SentencePiece space markers correctly

Memory Usage:

  • TinyLlama (1B): ~2GB RAM
  • Phi-2 (2.7B): ~4GB RAM
  • Mistral-7B: ~14GB RAM
  • Figures are approximate; leave extra headroom beyond the weights for the KV cache and activations

Performance Tips:

  • On macOS: Metal GPU is enabled by default for faster inference
  • On Linux/Windows: CUDA support planned (use CPU for now)
  • Use smaller models (TinyLlama) for faster responses
  • Reduce --max-tokens for quicker generation

Troubleshooting

Model Not Found Error

# Error: Model directory not found
# Solution: Check the model path exists
ls ./models/TinyLlama_TinyLlama-1.1B-Chat-v1.0

Missing Tokenizer Error

# Error: Tokenizer file not found
# Solution: Ensure these files exist in model directory:
# - tokenizer.json (or tokenizer_config.json)
# - config.json
# - *.safetensors files

Unsupported Architecture Error

# Error: Unsupported model architecture (Mamba/MoE)
# Solution: Use a supported model like TinyLlama
influence download -m TinyLlama/TinyLlama-1.1B-Chat-v1.0

Slow Generation on CPU

# CPU inference is slower. Options:
# 1. Use a smaller model (TinyLlama instead of Mistral-7B)
# 2. Reduce max-tokens
# 3. Build with Metal support (macOS):
cargo build --release --features metal

Development

Build with Debug Logging

RUST_LOG=influence=debug cargo run -- generate "Hello" \
  --model-path ./models/TinyLlama_TinyLlama-1.1B-Chat-v1.0

Run Tests

cargo test

Roadmap

  • CUDA support for NVIDIA GPUs
  • Quantized model support (GGUF)
  • Chat mode with conversation history
  • Batch generation
  • HTTP API server mode
  • Top-k and nucleus sampling

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Acknowledgments

Built with: