llama-crab-server 0.1.8

HTTP server for llama-crab
llama-crab-server-0.1.8 is not a library.

llama-crab-server

OpenAI-compatible HTTP server for local llama-crab inference.

Built on top of axum and exposes a worker thread that owns the model and context.

Installation

cargo install llama-crab-server --features mtmd --force

For development against a workspace checkout:

cargo run -p llama-crab-server -- \
  --model models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
  --host 127.0.0.1 \
  --port 8080

Routes

Route Description
GET /health Liveness probe.
GET /v1/models List the loaded model.
POST /v1/completions OpenAI legacy text completions.
POST /v1/chat/completions OpenAI chat completions with streaming.
POST /v1/embeddings Embeddings (float or base64).
POST /v1/rerank, POST /v1/reranking Rerank.
POST /extras/tokenize, /extras/tokenize/count, /extras/detokenize Tokenizer helpers.

Multimodal chat is available when the binary is built with --features mtmd and started with --mmproj <projector.gguf>.

Hugging Face support

To enable loading models directly from Hugging Face (e.g. --model TheBloke/...), install the server with the hf-hub feature:

cargo install llama-crab-server --features hf-hub --force

When the feature is enabled, the server accepts Hugging Face repository ids via --model and disambiguates multi-.gguf repos via --hf-filename:

llama-crab-server \
  --model TheBloke/Llama-2-7B-Chat-GGUF \
  --hf-filename llama-2-7b-chat.Q4_K_M.gguf

For the full request schema, sampling fields and structured-output options, see the server guide.

Resources

License

Licensed under the MIT License.