# Ollama thinking models exhaust token budget before producing output
## The Problem
Models with thinking/reasoning enabled (e.g., `qwen3:4b` with default settings) prepend `<think>...</think>` blocks before their actual JSON response. With the default `num_predict: 256` token budget, the thinking tokens consume most or all of the output budget, leaving insufficient tokens for the commit message JSON.
What happens under the hood: the model spends all 256 tokens writing its internal reasoning inside `<think>...</think>` tags, and the token budget runs out before it ever gets to the actual JSON commit message. The result is either an empty response or a truncated one that can't be parsed.
In practice, this is what you'll see:
```txt
WARN empty response from LLM, skipping candidate=1
commitbee::provider::error
× Provider 'ollama' error: No valid commit messages generated
```
You can verify this by calling the Ollama API directly with the same 256-token budget:
```bash
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen3:4b",
  "prompt": "Respond with ONLY this JSON: {\"type\": \"docs\", \"scope\": null, \"subject\": \"update README badges\", \"body\": null, \"breaking_change\": null}",
  "stream": false,
  "options": { "num_predict": 256 }
}'
```
The response shows exactly what's happening:
```json
{
  "response": "",
  "thinking": "We are given a git diff for README.md. The summary says: 1 file modified... The change is adding new badges (and removing old ones? but the diff shows old",
  "done": true,
  "done_reason": "length"
}
```
- **`"response": ""`** — completely empty, no JSON produced
- **`"thinking"`** — the entire 256-token budget was spent on internal reasoning
- **`"done_reason": "length"`** — the model hit the token limit mid-thought, before it ever started writing the actual output
## Who Is Affected
Anyone using `qwen3:4b` (the model listed in the README's quick start section) with its default thinking mode enabled.
## The Recommended Model
The default model is now **`qwen3.5:4b`**, which does not use thinking mode and works reliably with the default 256-token budget. It's smaller (3.4GB vs 4.3GB for qwen3:4b), produces clean JSON output, and has a simpler tag.
To use it:
```bash
ollama pull qwen3.5:4b
```
Then in your config (`commitbee init` to create one):
```toml
model = "qwen3.5:4b"
```
## Fix
The default `num_predict` has been bumped from 256 to 1024, giving thinking models enough room for both the `<think>` block and the JSON response. The sanitizer also strips `<think>` and `<thought>` blocks from LLM output before parsing, so the thinking content doesn't interfere with JSON extraction.
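The stripping step can be sketched roughly as follows. This is a minimal illustration of the idea, not CommitBee's actual sanitizer; the function name and exact behavior are assumptions for the example.

```rust
// Hypothetical sketch: remove <think>...</think> and <thought>...</thought>
// blocks from raw LLM output so only the JSON payload remains.
// The real sanitizer in CommitBee may differ.
fn strip_thinking_blocks(input: &str) -> String {
    let mut out = input.to_string();
    for tag in ["think", "thought"] {
        let open = format!("<{tag}>");
        let close = format!("</{tag}>");
        // Repeatedly remove the first well-formed open/close pair.
        while let (Some(start), Some(end)) = (out.find(&open), out.find(&close)) {
            if end > start {
                out.replace_range(start..end + close.len(), "");
            } else {
                break; // malformed nesting; stop rather than loop forever
            }
        }
    }
    out.trim().to_string()
}

fn main() {
    let raw = "<think>reasoning about the diff...</think>{\"type\": \"docs\"}";
    println!("{}", strip_thinking_blocks(raw)); // {"type": "docs"}
}
```

Note that a truncated response (hitting `num_predict` mid-`<think>` block, as in the curl example above) has no closing tag, so stripping alone can't rescue it; that's why the budget increase is the primary fix.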
If you're on v0.3.0, you can work around the issue by setting `num_predict = 1024` in your config file (`commitbee init` to create one).
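As a sketch, the workaround config might look like this (the key name comes from the text above; the surrounding layout of the config file is an assumption):

```toml
model = "qwen3:4b"
num_predict = 1024
```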
## Notes on Cloud Providers
I've also tested CommitBee with the **Anthropic API** (Claude Sonnet 4.6), which generally works and produces higher-quality commit messages than local Ollama models. However, there are some provider-specific edge cases I'm still investigating. Every provider and model has its own quirks — I'd encourage users to try different combinations and find what works best for their workflow. Feedback on provider/model experiences is welcome.