# Influence

> Privacy-first local LLM inference - Download models from HuggingFace and run them entirely on your machine.

## Why Influence?

**The Problem:** Most LLM tools require cloud APIs, expensive subscriptions, or complex Python setups. Your data leaves your machine, you pay per token, and you're locked into someone else's infrastructure.

**The Solution:** Influence gives you:
- **Complete privacy** - All inference happens locally on your machine
- **No API costs** - Pay once (in compute) and use forever
- **No vendor lock-in** - Models are downloaded to your disk
- **Simplicity** - Single binary, no Python, no virtual environments
- **GPU acceleration** - Metal support for macOS (CUDA coming soon)

## What Makes It Different?

| Feature | Influence | Cloud APIs (OpenAI, etc.) | Python Tools |
|---------|-----------|---------------------------|--------------|
| Privacy | 100% local | Data sent to servers | Local but complex |
| Cost | Free (after download) | Pay per token | Free but complex setup |
| Setup | Single binary | API key required | Python, pip, venv |
| GPU Support | Metal (macOS) | Server-side | Hard to configure |
| Offline Use | Yes | No | Yes |

## Quick Start

```bash
# Build from source
git clone https://github.com/yingkitw/influence.git
cd influence
cargo build --release

# Search for a model
./target/release/influence search "tinyllama" --limit 5

# Download a model (~1GB for TinyLlama)
./target/release/influence download -m TinyLlama/TinyLlama-1.1B-Chat-v1.0

# Generate text locally (with Metal GPU on macOS)
./target/release/influence generate "Explain quantum computing in simple terms" \
  --model-path ./models/TinyLlama_TinyLlama-1.1B-Chat-v1.0 \
  --device metal
```

## Usage Examples

### Example 1: Quick Question Answering

```bash
# Ask a factual question
influence generate "What are the main differences between Rust and C++?" \
  --model-path ./models/TinyLlama_TinyLlama-1.1B-Chat-v1.0 \
  --max-tokens 256
```

**Benefit:** Get instant answers without:
- Opening a browser
- Waiting for cloud API responses
- Paying per token
- Sending your queries to third parties

### Example 2: Code Generation

```bash
# Generate code with higher temperature for creativity
influence generate "Write a Rust function to merge two sorted vectors" \
  --model-path ./models/TinyLlama_TinyLlama-1.1B-Chat-v1.0 \
  --temperature 0.8 \
  --max-tokens 512
```

**Benefit:** Generate code locally with:
- No rate limits
- No API keys to manage
- Full context control
- Works offline

### Example 3: Content Creation

```bash
# Generate blog post or documentation
influence generate "Write a technical introduction to vector databases" \
  --model-path ./models/TinyLlama_TinyLlama-1.1B-Chat-v1.0 \
  --max-tokens 1024
```

**Benefit:** Create content without:
- Using cloud services
- Exposing your ideas to third parties
- Worrying about content policies

## Current Status

**Version 0.1.0** - Core Features Working

- ✅ Model search on HuggingFace
- ✅ Model downloading with progress tracking
- ✅ Local Llama-architecture inference (Llama, Mistral, Phi, Granite)
- ✅ Token spacing and formatting
- ✅ Metal GPU acceleration on macOS (enabled by default)
- ✅ Streaming text generation
- ✅ Temperature-based sampling
- ✅ KV caching for performance
- ✅ Architecture detection with helpful error messages

**Tested Models:**
- `TinyLlama/TinyLlama-1.1B-Chat-v1.0` - Working perfectly
- Other Llama-architecture models - Supported

## Installation

### Build from Source

```bash
# Clone the repository
git clone https://github.com/yingkitw/influence.git
cd influence

# Build release binary with Metal support (macOS)
cargo build --release

# The binary will be at target/release/influence
./target/release/influence --help
```

**Features:**
- `metal` (default) - Metal GPU acceleration for macOS
- `accelerate` - CPU acceleration for macOS
- `cuda` - CUDA support for NVIDIA GPUs (placeholder)

**Build without GPU:**
```bash
cargo build --release --no-default-features
```

## Command Reference

### `search` - Find Models on HuggingFace

```bash
influence search <query> [options]
```

**Examples:**

```bash
# Search for llama models
influence search "llama"

# Search with filters
influence search "text-generation" --limit 10 --author meta-llama

# Search for small models
influence search "1b" --limit 5
```

**Options:**
- `-l, --limit <N>` - Max results (default: 20)
- `-a, --author <ORG>` - Filter by author

### `download` - Download Model from HuggingFace

```bash
influence download -m <model> [options]
```

**Examples:**

```bash
# Download TinyLlama (recommended for testing)
influence download -m TinyLlama/TinyLlama-1.1B-Chat-v1.0

# Download to custom location
influence download -m microsoft/phi-2 -o ~/models

# Use custom mirror
influence download -m ibm/granite-4-h-small -r https://hf-mirror.com
```

**Options:**
- `-m, --model <MODEL>` - Model name (required)
- `-r, --mirror <URL>` - Mirror URL (default: hf-mirror.com)
- `-o, --output <PATH>` - Output directory (default: ./models/)

### `generate` - Generate Text Locally

```bash
influence generate <prompt> [options]
```

**Examples:**

```bash
# Basic generation
influence generate "What is machine learning?" \
  --model-path ./models/TinyLlama_TinyLlama-1.1B-Chat-v1.0

# With custom parameters
influence generate "Explain async/await" \
  --model-path ./models/TinyLlama_TinyLlama-1.1B-Chat-v1.0 \
  --max-tokens 512 \
  --temperature 0.7

# Lower temperature for more focused output
influence generate "Summarize: Rust is a systems programming language" \
  --model-path ./models/TinyLlama_TinyLlama-1.1B-Chat-v1.0 \
  --temperature 0.3 \
  --max-tokens 100
```

**Options:**
- `-m, --model-path <PATH>` - Path to model directory (required)
- `--max-tokens <N>` - Max tokens to generate (default: 512)
- `--temperature <0.0-2.0>` - Sampling temperature (default: 0.7)
  - Lower (0.1-0.3): More focused, deterministic
  - Higher (0.7-1.0): More creative, diverse
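Under the hood, the temperature simply divides the logits before they are turned into a probability distribution, so lower values sharpen the distribution and higher values flatten it. The sketch below illustrates the idea in plain Rust with no external crates; the function and the fixed random draw are illustrative, not Influence's actual sampling code.

```rust
// Illustrative temperature-based sampling: scale logits by 1/temperature,
// softmax, then sample from the cumulative distribution. Not the actual
// Influence internals, just the idea behind --temperature.
fn sample_with_temperature(logits: &[f32], temperature: f32, rand_uniform: f32) -> usize {
    // Lower temperature -> sharper distribution (more deterministic);
    // higher temperature -> flatter distribution (more diverse).
    let scaled: Vec<f32> = logits.iter().map(|&l| l / temperature).collect();

    // Numerically stable softmax.
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();

    // Walk the cumulative distribution with a uniform draw in [0, 1).
    let mut cumulative = 0.0;
    for (i, &e) in exps.iter().enumerate() {
        cumulative += e / sum;
        if rand_uniform < cumulative {
            return i;
        }
    }
    logits.len() - 1
}

fn main() {
    let logits: [f32; 3] = [2.0, 1.0, 0.1];
    // A fixed draw stands in for a real RNG. At temperature 0.3 the
    // distribution is sharp enough that the draw lands on token 0;
    // at 1.5 the same draw falls into token 1's probability mass.
    println!("{}", sample_with_temperature(&logits, 0.3, 0.75)); // 0
    println!("{}", sample_with_temperature(&logits, 1.5, 0.75)); // 1
}
```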

## Recommended Models

### For Testing & Development

| Model | Size | Speed | Use Case |
|-------|------|-------|----------|
| `TinyLlama/TinyLlama-1.1B-Chat-v1.0` | ~1GB | Fast | Testing, quick experiments |
| `microsoft/phi-2` | ~2GB | Medium | Quality vs speed balance |
| `mistralai/Mistral-7B-v0.1` | ~14GB | Slower | Production-quality output |

### Why TinyLlama?

```bash
# Download and try TinyLlama first
influence download -m TinyLlama/TinyLlama-1.1B-Chat-v1.0
influence generate "Hello, world!" \
  --model-path ./models/TinyLlama_TinyLlama-1.1B-Chat-v1.0
```

**Benefits:**
- Fast downloads (~1GB)
- Quick inference (even on CPU)
- Good quality for many tasks
- Great for learning and experimentation

## Benefits Over Alternatives

### vs Cloud APIs (OpenAI, Anthropic, etc.)

**You Save:**
- Money - No per-token costs
- Privacy - Data never leaves your machine
- Latency - No network round-trips
- Reliability - Works offline
- Control - No rate limits or content policies

### vs Python Tools (llama.cpp, transformers, etc.)

**You Get:**
- Simplicity - Single binary, no dependencies
- Performance - Rust speed with GPU acceleration
- Stability - No version conflicts or dependency hell
- Integration - Easy to script and automate
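
Because Influence is a single binary with a plain CLI, calling it from another program is straightforward. A minimal Rust sketch, assuming the `influence` binary is on your `PATH` and using only the flags documented in the Command Reference above (the prompt and model path are just examples):

```rust
use std::process::Command;

fn main() -> std::io::Result<()> {
    // Drive the influence CLI from another program. The subcommand and
    // flags are the documented ones; the prompt and model path are
    // illustrative.
    let output = Command::new("influence")
        .args([
            "generate",
            "Summarize the Rust ownership model in two sentences",
            "--model-path",
            "./models/TinyLlama_TinyLlama-1.1B-Chat-v1.0",
            "--max-tokens",
            "128",
        ])
        .output()?;

    // The generated text arrives on stdout.
    println!("{}", String::from_utf8_lossy(&output.stdout));
    Ok(())
}
```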

## How It Works

```
┌─────────────┐
│ Your Prompt │
└──────┬──────┘
       ▼
┌──────────────────────────────────┐
│  Tokenization (HuggingFace)      │
└──────┬───────────────────────────┘
       ▼
┌──────────────────────────────────┐
│  Model Loading (.safetensors)    │
│  - Memory-mapped for efficiency  │
│  - GPU acceleration (Metal/CUDA) │
└──────┬───────────────────────────┘
       ▼
┌──────────────────────────────────┐
│  Inference (Candle)              │
│  - Forward pass with KV cache    │
│  - Temperature-based sampling    │
│  - Token-by-token generation     │
└──────┬───────────────────────────┘
       ▼
┌─────────────┐
│ Output Text │
└─────────────┘
```
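
In code, this pipeline boils down to a tokenize / forward / sample / decode loop. The sketch below shows the shape of that loop with toy stand-ins for the tokenizer and model so it runs on its own; the real tool uses the HuggingFace `tokenizers` crate and Candle's model implementations, and its actual types and signatures differ.

```rust
// Shape of the generation loop. `Tok` and `Model` are illustrative
// stand-ins, not Influence's actual types.
trait Tok {
    fn encode(&self, text: &str) -> Vec<u32>;
    fn decode(&self, ids: &[u32]) -> String;
}

trait Model {
    // Returns logits over the vocabulary for the next token. With a KV
    // cache inside the model, each step only needs to see new tokens.
    fn forward(&mut self, new_tokens: &[u32]) -> Vec<f32>;
}

fn generate(tok: &dyn Tok, model: &mut dyn Model, prompt: &str, max_tokens: usize) -> String {
    let mut ids = tok.encode(prompt);
    // Prime the KV cache with the full prompt.
    let mut logits = model.forward(&ids);

    for _ in 0..max_tokens {
        // Greedy pick for brevity; Influence samples with temperature.
        let next = logits
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .map(|(i, _)| i as u32)
            .unwrap();
        ids.push(next);
        // Thanks to the KV cache, only the newest token is fed back in.
        logits = model.forward(&[next]);
    }
    tok.decode(&ids)
}

// Toy implementations so the sketch compiles and runs end to end.
struct ByteTok;
impl Tok for ByteTok {
    fn encode(&self, text: &str) -> Vec<u32> {
        text.bytes().map(|b| b as u32).collect()
    }
    fn decode(&self, ids: &[u32]) -> String {
        ids.iter().map(|&i| i as u8 as char).collect()
    }
}

struct AlwaysBang;
impl Model for AlwaysBang {
    fn forward(&mut self, _new_tokens: &[u32]) -> Vec<f32> {
        // Pretend '!' (byte 33) is always the most likely next token.
        let mut logits = vec![0.0; 256];
        logits[b'!' as usize] = 1.0;
        logits
    }
}

fn main() {
    let text = generate(&ByteTok, &mut AlwaysBang, "Hello", 3);
    println!("{text}"); // prints "Hello!!!"
}
```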

## Technical Details

### Model Requirements

Each model directory must contain:
- `config.json` - Model architecture and parameters
- `tokenizer.json` or `tokenizer_config.json` - Tokenizer
- `*.safetensors` - Model weights (memory-mapped)
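
A quick way to sanity-check a downloaded directory before pointing `--model-path` at it is to look for exactly those files. This is a standalone sketch using only the standard library, not code from Influence itself:

```rust
use std::fs;
use std::path::Path;

// Check that a model directory has the files the loader expects:
// a config, a tokenizer, and at least one .safetensors weight shard.
fn looks_like_model_dir(dir: &Path) -> bool {
    let has_config = dir.join("config.json").is_file();
    let has_tokenizer = dir.join("tokenizer.json").is_file()
        || dir.join("tokenizer_config.json").is_file();
    let has_weights = fs::read_dir(dir)
        .map(|entries| {
            entries.flatten().any(|entry| {
                entry
                    .path()
                    .extension()
                    .map(|ext| ext == "safetensors")
                    .unwrap_or(false)
            })
        })
        .unwrap_or(false);
    has_config && has_tokenizer && has_weights
}

fn main() {
    let dir = Path::new("./models/TinyLlama_TinyLlama-1.1B-Chat-v1.0");
    println!("{}: complete = {}", dir.display(), looks_like_model_dir(dir));
}
```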

### Supported Architectures

- ✅ Llama (meta-llama/Llama-2-7b-hf, TinyLlama)
- ✅ Mistral (mistralai/Mistral-7B-v0.1)
- ✅ Phi (microsoft/phi-2)
- ✅ Granite (pure transformer variants)
- ❌ Mamba/Hybrid models (specialized implementation required)
- ❌ MoE models (not yet supported)
- ❌ Encoder-only models (BERT, etc. - not for generation)

### Performance

**Optimizations:**
- KV Caching - Reuse computed tensors for faster generation
- Memory Mapping - Zero-copy model loading
- Streaming Output - Display tokens as they're generated
- GPU Acceleration - Metal support on macOS (enabled by default)
- Proper Token Spacing - Handles SentencePiece space markers correctly
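
The token-spacing point refers to how SentencePiece tokenizers mark a leading space: word-initial pieces carry the '▁' (U+2581) character instead of an actual space, so decoded pieces have to be stitched back together with that marker translated. A minimal illustration (not the exact code Influence uses):

```rust
// SentencePiece marks word-initial pieces with '\u{2581}' ("▁") rather
// than a space; turning the marker back into a space keeps words from
// running together when pieces are concatenated.
fn join_pieces(pieces: &[&str]) -> String {
    let mut out = String::new();
    for piece in pieces {
        out.push_str(&piece.replace('\u{2581}', " "));
    }
    // The first piece usually carries a leading marker too.
    out.trim_start().to_string()
}

fn main() {
    let pieces = ["\u{2581}Hello", ",", "\u{2581}world", "!"];
    assert_eq!(join_pieces(&pieces), "Hello, world!");
    println!("{}", join_pieces(&pieces));
}
```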

**Memory Usage:**
- TinyLlama (1B): ~2GB RAM
- Phi-2 (2.7B): ~4GB RAM
- Mistral-7B: ~14GB RAM
- Add model size for total memory requirement
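
These figures follow the usual rule of thumb that float16 weights take about 2 bytes per parameter, with the KV cache and activations adding headroom on top; treat the estimate below as a back-of-the-envelope assumption rather than a measurement from Influence:

```rust
// Rough weight-memory estimate: parameters x bytes per parameter.
// f16/bf16 weights are ~2 bytes per parameter; the KV cache and
// activations need extra headroom on top of this.
fn weight_gb(params_billions: f64, bytes_per_param: f64) -> f64 {
    params_billions * bytes_per_param
}

fn main() {
    println!("TinyLlama (1.1B, f16) ~ {:.1} GB of weights", weight_gb(1.1, 2.0));
    println!("Mistral-7B (7B, f16)  ~ {:.1} GB of weights", weight_gb(7.0, 2.0));
}
```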

**Performance Tips:**
- On macOS: Metal GPU is enabled by default for faster inference
- On Linux/Windows: CUDA support planned (use CPU for now)
- Use smaller models (TinyLlama) for faster responses
- Reduce `--max-tokens` for quicker generation

## Troubleshooting

### Model Not Found Error

```bash
# Error: Model directory not found
# Solution: Check the model path exists
ls ./models/TinyLlama_TinyLlama-1.1B-Chat-v1.0
```

### Missing Tokenizer Error

```bash
# Error: Tokenizer file not found
# Solution: Ensure these files exist in model directory:
# - tokenizer.json (or tokenizer_config.json)
# - config.json
# - *.safetensors files
```

### Unsupported Architecture Error

```bash
# Error: Unsupported model architecture (Mamba/MoE)
# Solution: Use a supported model like TinyLlama
influence download -m TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

### Slow Generation on CPU

```bash
# CPU inference is slower. Options:
# 1. Use a smaller model (TinyLlama instead of Mistral-7B)
# 2. Reduce max-tokens
# 3. Build with Metal support (macOS):
cargo build --release --features metal
```

## Development

### Build with Debug Logging

```bash
RUST_LOG=influence=debug cargo run -- generate "Hello" \
  --model-path ./models/TinyLlama_TinyLlama-1.1B-Chat-v1.0
```

### Run Tests

```bash
cargo test
```

## Roadmap

- [ ] CUDA support for NVIDIA GPUs
- [ ] Quantized model support (GGUF)
- [ ] Chat mode with conversation history
- [ ] Batch generation
- [ ] HTTP API server mode
- [ ] Top-k and nucleus sampling

## License

MIT

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Acknowledgments

Built with:
- [Candle](https://github.com/huggingface/candle) - ML framework by HuggingFace
- [Tokenizers](https://github.com/huggingface/tokenizers) - Fast tokenization
- [Clap](https://github.com/clap-rs/clap) - CLI parsing