llama-crab-server-0.1.6 is not a library.

`llama-crab-server`

OpenAI-compatible HTTP server for local llama-crab inference.

Built on top of axum and exposes a worker thread that owns the model and context.

Installation

cargo install llama-crab-server --features mtmd --force

For development against a workspace checkout:

cargo run -p llama-crab-server -- \
  --model models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
  --host 127.0.0.1 \
  --port 8080

Route	Description
`GET /health`	Liveness probe.
`GET /v1/models`	List the loaded model.
`POST /v1/completions`	OpenAI legacy text completions.
`POST /v1/chat/completions`	OpenAI chat completions with streaming.
`POST /v1/embeddings`	Embeddings (`float` or `base64`).
`POST /v1/rerank`, `POST /v1/reranking`	Rerank.
`POST /extras/tokenize`, `/extras/tokenize/count`, `/extras/detokenize`	Tokenizer helpers.

Multimodal chat is available when the binary is built with --features mtmd and started with --mmproj <projector.gguf>.

To enable loading models directly from Hugging Face (e.g. --model TheBloke/...), install the server with the hf-hub feature:

cargo install llama-crab-server --features hf-hub --force

When the feature is enabled, the server accepts Hugging Face repository ids via --model and disambiguates multi-.gguf repos via --hf-filename:

llama-crab-server \
  --model TheBloke/Llama-2-7B-Chat-GGUF \
  --hf-filename llama-2-7b-chat.Q4_K_M.gguf

For the full request schema, sampling fields and structured-output options, see the server guide.

Licensed under the MIT License.