drama_llama 0.5.2

A library for language modeling and text generation.


llama with drama mask logo

drama_llama is yet another Rust wrapper for llama.cpp. It is a work in progress and not intended for production use. The API will change.

For examples, see the bin folder. There are two example binaries.

  • Dittomancer - Chat with personalities well represented in the training data.
  • Regurgitater - Test local language models for memorized content.

Supported Features

  • LLaMA 3 Support.
  • Iterators yielding candidates, tokens, and pieces.
  • Stop criteria supporting regex, token sequences, and/or string sequences.
  • Metal support. CUDA may be enabled with the cuda and cuda_f16 features.
  • Rust-native sampling code. All sampling methods from llama.cpp have been translated.
  • N-gram based repetition penalties with custom exclusions for n-grams that should not be penalized.
  • Support for N-gram blocking with a default, hardcoded blocklist.
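The n-gram repetition penalty above can be sketched in plain Rust. This is an illustration of the technique only, not the crate's actual implementation; the function name and signature are assumed. The idea: find every token that would complete an n-gram already present in the history, unless that n-gram is on the exclusion list, and penalize (or block) those tokens during sampling.

```rust
use std::collections::HashSet;

/// Tokens that would complete an already-seen n-gram if emitted next.
/// `exclusions` holds n-grams that should never be penalized.
fn penalized_tokens(
    history: &[u32],
    n: usize,
    exclusions: &HashSet<Vec<u32>>,
) -> HashSet<u32> {
    let mut out = HashSet::new();
    if n == 0 || history.len() < n {
        return out;
    }
    // The last n-1 tokens form the prefix the next token would extend.
    let prefix = &history[history.len() - (n - 1)..];
    // Any earlier n-gram starting with that prefix means its final token
    // would repeat the n-gram if generated now.
    for window in history.windows(n) {
        let (head, tail) = window.split_at(n - 1);
        if head == prefix && !exclusions.contains(window) {
            out.insert(tail[0]);
        }
    }
    out
}
```

A sampler would then subtract a penalty from (or zero out) the logits of the returned tokens before picking the next token.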


Contributing

  • Code is poetry. Make it pretty.
  • Respect is universal.
  • Use rustfmt.


Roadmap

  • Candidate iterator with fine-grained control over sampling.
  • Examples for new Candidate API.
  • Support for chaining sampling methods using SampleOptions. mode will become modes, with the modes applied one after another until only a single Candidate token remains.
  • Common command line options for sampling. Currently these are not exposed.
  • API closer to Ollama. Potentially support for something like Modelfile.
  • Logging (non-blocking) and benchmark support.
  • Better chat and instruct model support.
  • Web server. Tokenization in the browser.
  • Tiktoken as the tokenizer for some models instead of llama.cpp's internal one.
  • Reworked, functional, public candidate API.
  • Grammar constraints (maybe or maybe not llama.cpp style).
  • Async streams, better parallelism with automatic batch scheduling.
  • Better cache management. llama.cpp does not seem to manage a longest prefix cache automatically, so one will have to be written.
  • Backends other than llama.cpp (e.g. MLC, TensorRT-LLM, Ollama).
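The longest-prefix cache item above boils down to computing how many leading tokens of a new prompt match the tokens already evaluated into the KV cache: only the tokens past that point need to be re-evaluated. A minimal sketch, with a hypothetical function name (not the crate's API):

```rust
/// Length of the shared token prefix between the cached context and a
/// new prompt; only tokens past this index need to be evaluated again.
fn common_prefix_len(cached: &[u32], prompt: &[u32]) -> usize {
    cached
        .iter()
        .zip(prompt.iter())
        .take_while(|(a, b)| a == b)
        .count()
}
```

A cache manager built on this would truncate the KV cache to the returned length, then feed the remaining `prompt[len..]` tokens through the model.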

Known issues

  • With LLaMA 3, safe vocabulary is not working yet, so --vocab unsafe must be passed as a command line argument, or VocabKind::Unsafe used in an Engine constructor.
  • The model doesn't load until generation starts, so there can be a long pause on the first generation. However, because mmap is used, the model should already be cached by the OS on subsequent process launches.
  • Documentation is broken on docs.rs because llama.cpp's CMakeLists.txt generates code, and writing to the filesystem is not supported there. For the moment, use cargo doc --open instead. Others have fixed this by patching llama.cpp in their bindings, but I'm not sure I want to do that for now.

Generative AI Disclosure

  • Generative AI, specifically Microsoft's Bing Copilot, GitHub Copilot, and Dall-E 3, was used for portions of this project. See inline comments for sections where generative AI was used. Completion was also used for getters, setters, and some tests. Logos were generated with Dall-E and post-processed in Inkscape.