# candle-quantized-glm4
Candle implementation of various quantized GLM4-0414 models.
## Running an example
Run a local GGUF file with a local `tokenizer.json`:
```bash
$ cargo run --example quantized-glm4 --release --features cuda -- --tokenizer /home/data/GLM-4-9B-0414/tokenizer.json --model /home/data/GLM-4-9B-0414-Q4_K_M.gguf --prompt "How are you today?"
```
Run a local GGUF file with `tokenizer.json` downloaded from Hugging Face:
```bash
$ cargo run --example quantized-glm4 --release --features cuda -- --which q4k9b --model /home/data/GLM-4-9B-0414-Q4_K_M.gguf --prompt "How are you today?"
```
Run with a model ID (the model is downloaded from Hugging Face):
```bash
$ cargo run --example quantized-glm4 --release --features cuda -- --which q4k9b --prompt "How are you today?"
```
Options for `--which`: `q2k9b`, `q2k32b`, `q4k9b`, `q4k32b`
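The option names appear to encode a quantization level followed by a model size; for instance, `q4k9b` is paired above with `GLM-4-9B-0414-Q4_K_M.gguf`. The sketch below shows one way to split a name into those two parts. `WhichInfo` and `decode_which` are hypothetical helpers written for illustration and are not part of the candle example; the naming pattern is an assumption read off the four listed options.

```rust
/// Decoded form of a `--which` value such as "q4k9b".
/// Hypothetical type for illustration; not part of the candle example.
#[derive(Debug, PartialEq)]
struct WhichInfo {
    /// Quantization label as written in the option name, e.g. "q4k".
    quant: String,
    /// Model-size label as written in the option name, e.g. "9b".
    size: String,
}

/// Split an option name into its quantization prefix and size suffix.
/// Assumes the pattern `q<bits>k<size>b` seen in the four listed options.
fn decode_which(name: &str) -> Option<WhichInfo> {
    // Only the options listed in this README are accepted.
    const KNOWN: [&str; 4] = ["q2k9b", "q2k32b", "q4k9b", "q4k32b"];
    if !KNOWN.contains(&name) {
        return None;
    }
    // The quantization label ends at the first 'k'; the rest is the size.
    let split = name.find('k')? + 1;
    let (quant, size) = name.split_at(split);
    Some(WhichInfo {
        quant: quant.to_string(),
        size: size.to_string(),
    })
}
```

Under these assumptions, `decode_which("q2k32b")` yields the quantization label `q2k` and the size label `32b`, while a name outside the list returns `None`.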
Example output:
```
avx: true, neon: false, simd128: false, f16c: true
temp: 0.80 repeat-penalty: 1.10 repeat-last-n: 64
loaded 523 tensors (6.16GB) in 0.86s
model built
I'm just a computer program, so I don't have feelings or emotions. However, I'm functioning well and ready to assist you with any questions or tasks you might have. How can I help you today?
10 prompt tokens processed: 67.12 token/s
44 tokens generated: 45.28 token/s
```
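The two throughput lines can be turned back into approximate wall-clock times. The sketch below assumes each rate is simply the token count divided by elapsed seconds; `elapsed_secs` is a hypothetical helper, not something the example binary exposes.

```rust
/// Approximate seconds implied by an "N tokens ...: R token/s" line.
/// Assumes the reported rate is tokens divided by elapsed seconds.
fn elapsed_secs(tokens: u32, tokens_per_sec: f64) -> f64 {
    f64::from(tokens) / tokens_per_sec
}
```

With the figures shown, `elapsed_secs(44, 45.28)` comes to just under one second of generation, and `elapsed_secs(10, 67.12)` to roughly 0.15 seconds of prompt processing.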