# llm_utils

Utilities for Llama.cpp, OpenAI, Anthropic, and Mistral-rs. Made for the llm_client crate, but split into its own crate because some of these utilities are useful on their own!
## Installation

```toml
[dependencies]
llm_utils = "*"
```
## Model presets 🛤️

- Presets for open source LLMs from Hugging Face, and for API models from OpenAI and Anthropic.
- Load and/or download a model with metadata, tokenizer, and local path (for local LLMs running via llama.cpp, vllm, or mistral.rs).
- Auto-select the largest quantized GGUF that will fit in your VRAM!

Supported Open Source models:

- ⚪ Llama 3
- ⚪ Mistral and Mixtral
- ⚪ Phi 3
```rust
// Load the largest quantized Mistral-7B-Instruct model that will fit in your vram
//
// (Builder and preset names follow the crate's builder pattern; argument values are illustrative.)
let model: OsLlm = PresetModelBuilder::new()
    .mistral_7b_instruct()
    .vram(48)
    .ctx_size(9001) // ctx_size impacts vram usage!
    .load()
    .await?;

not_a_real_assert_eq!(model, OsLlm { .. }); // returns the model with metadata, chat template, local path, and tokenizer

// Or OpenAI
//
let model: OpenAiLlm = gpt_4_o();
not_a_real_assert_eq!(model, OpenAiLlm { .. });

// Or Anthropic
//
let model: AnthropicLlm = claude_3_opus();
```
## GGUF models from Hugging Face or local path 🚤
```rust
// From HF
//
// (The builder name and argument wiring are illustrative; the method names match the crate.)
let model_url = "https://huggingface.co/MaziyarPanahi/Meta-Llama-3-8B-Instruct-GGUF/blob/main/Meta-Llama-3-8B-Instruct.Q6_K.gguf";
let model: OsLlm = GGUFModelBuilder::new()
    .hf_quant_file_url(model_url)
    .load()
    .await?;

// Note: because we can't instantiate a tokenizer from a GGUF file, the returned model will not have a tokenizer!
// However, if we provide the base model's repo, we load the tokenizer from there.
let repo_id = "meta-llama/Meta-Llama-3-8B-Instruct";
let model: OsLlm = GGUFModelBuilder::new()
    .hf_quant_file_url(model_url)
    .hf_config_repo_id(repo_id)
    .load()
    .await?;

// From a local path
//
let local_path = "/root/.cache/huggingface/hub/models--MaziyarPanahi--Meta-Llama-3-8B-Instruct-GGUF/blobs/c2ca99d853de276fb25a13e369a0db2fd3782eff8d28973404ffa5ffca0b9267";
let model: OsLlm = GGUFModelBuilder::new()
    .local_quant_file_path(local_path)
    .load()
    .await?;

// Again, a tokenizer.json is required. It can also be loaded from a local path.
let local_config_path = "/llm_utils/src/models/open_source/llama/llama_3_8b_instruct";
let model: OsLlm = GGUFModelBuilder::new()
    .local_quant_file_path(local_path)
    .local_config_path(local_config_path)
    .load()
    .await?;
```
## Tokenizer 🧮

- Hugging Face's Tokenizers library for local models and tiktoken-rs for OpenAI and Anthropic (Anthropic doesn't have a publicly available tokenizer).
- A simple, abstract API for encoding and decoding allows for generic LLM consumption across multiple architectures.
- Safely set the `max_token` param for LLMs to ensure requests don't fail due to exceeding token limits!
```rust
// Get a Tiktoken tokenizer
//
// (Argument values below are illustrative.)
let tokenizer: LlmTokenizer = LlmTokenizer::new_tiktoken("gpt-4o");

// Get a Hugging Face tokenizer from a local path
//
let tokenizer: LlmTokenizer = LlmTokenizer::new_from_tokenizer_json("path/to/tokenizer.json");
// Or load from a repo
//
let tokenizer: LlmTokenizer = LlmTokenizer::new_from_hf_repo(hf_token, "meta-llama/Meta-Llama-3-8B-Instruct");

// Get tokenizan'
//
let token_ids: Vec<u32> = tokenizer.tokenize(text);
let count: u32 = tokenizer.count_tokens(text);
let word_probably: String = tokenizer.detokenize_one(token_id)?;
let words_probably: String = tokenizer.detokenize_many(&token_ids)?;

// These functions are used for generating logit bias
let token_id: u32 = tokenizer.try_into_single_token("hello")?;
let word_probably: String = tokenizer.try_from_single_token_id(1234)?;
```
## Prompting 🎶

- Generate properly formatted prompts for GGUF models, OpenAI, and Anthropic.
- Uses the GGUF's chat template and Jinja templates to format the prompt to the model's spec.
- Create prompts from a combination of dynamic inputs and/or static inputs from file.
```rust
// Default formatted prompt (OpenAI and Anthropic format)
//
// (Arguments in this block are illustrative.)
let default_formatted_prompt = default_formatted_prompt(&system_prompt, &user_prompt)?;

// Get total tokens in prompt
//
let total_prompt_tokens: u32 = model.openai_token_count_of_prompt(&tokenizer, &default_formatted_prompt);

// Then convert it to be used for a GGUF model
//
let gguf_formatted_prompt: String = convert_default_prompt_to_model_format(&default_formatted_prompt, &model.chat_template)?;

// Since the GGUF formatted prompt is just a string, we can use the generic count_tokens function
//
let total_prompt_tokens: u32 = tokenizer.count_tokens(&gguf_formatted_prompt);

// Validate the requested max_tokens for a generation. If it exceeds the model's limits, reduce max_tokens to a safe value.
//
let safe_max_tokens = get_and_check_max_tokens_for_response(model.context_length, total_prompt_tokens, requested_max_tokens)?;
```
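To make the conversion concrete, here is a rough sketch (not the crate's exact output) of the two prompt representations for a Llama 3 chat template:

```rust
// Illustrative only: the default prompt is role/content pairs, while the GGUF
// prompt is one string rendered from the model's own chat template (Llama 3 shown).
let default_formatted_prompt = vec![
    ("system", "You are a helpful assistant."),
    ("user", "Hello!"),
];

let gguf_formatted_prompt = "\
<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n\
You are a helpful assistant.<|eot_id|>\
<|start_header_id|>user<|end_header_id|>\n\n\
Hello!<|eot_id|>\
<|start_header_id|>assistant<|end_header_id|>\n\n";
```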
## Grammar 🤓

- Grammars are the most capable method for structuring the output of an LLM. This was designed for use with LlamaCpp, but support for other backends is planned.
- Create lists of N items, or restrict character types.
- More to be added (JSON, classification, restricting characters, words, and phrases).
```rust
// Return a list of between 1 and 4 items
//
// (Function arguments and the RestrictedCharacterSet variants are illustrative.)
let grammar = create_list_grammar(1, 4);

// The list will be formatted: `- <list text>\n`
//
let response: String = text_generation_request(&req_config, Some(&grammar), &prompt).await?;

// So you can easily split like:
//
let response_items: Vec<String> = response
    .lines()
    .map(|line| line.trim_start_matches('-').trim().to_string())
    .collect();

// Exclude numbers from text generation
//
let grammar = create_text_structured_grammar(vec![RestrictedCharacterSet::Numbers]);
let response: String = text_generation_request(&req_config, Some(&grammar), &prompt).await?;
assert!(!response.contains('1'));
assert!(!response.contains("234"));

// Exclude a list of common, and commonly unwanted, characters from text generation
//
let grammar = create_text_structured_grammar(vec![
    RestrictedCharacterSet::PunctuationExtended,
    RestrictedCharacterSet::Numbers,
]);
let response: String = text_generation_request(&req_config, Some(&grammar), &prompt).await?;
assert!(!response.contains('&'));
assert!(!response.contains('('));
assert!(!response.contains("13"));
```
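Under the hood, "grammar" here means a llama.cpp GBNF grammar. As a rough sketch (not the crate's exact output), the kind of GBNF a 1-to-4-item list constraint compiles down to looks like this:

```rust
// Illustrative only: roughly the shape of GBNF (llama.cpp's grammar format)
// that a "list of 1 to 4 items" constraint produces.
let gbnf = r#"
root ::= item item? item? item?
item ::= "- " [^\n]+ "\n"
"#;
```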
## Logit bias #️⃣

- Create properly formatted logit bias requests for LlamaCpp and OpenAI.
- Functionality to add logit bias from a variety of sources, along with validation.
```rust
// Exclude some tokens from text generation
//
// (Words and bias values are illustrative.)
let mut words = HashMap::new();
words.entry("delve").or_insert(-100.0);
words.entry("as an ai").or_insert(-100.0);

// Build and validate
//
let logit_bias = logit_bias_from_words(&tokenizer, &words);
let validated_logit_bias = validate_logit_bias_values(&logit_bias)?;

// Convert
//
let openai_logit_bias = convert_logit_bias_to_openai_format(&validated_logit_bias)?;
let llama_logit_bias = convert_logit_bias_to_llama_format(&validated_logit_bias)?;
```
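For reference, the two backends expect different shapes for the same data. A rough sketch of the JSON each one takes (token ids here are made up, and the crate's actual return types may differ):

```rust
// Illustrative only: OpenAI expects an object mapping token-id strings to a
// bias in [-100, 100]; the llama.cpp server expects an array of [token_id, bias] pairs.
let openai_shape = serde_json::json!({ "2764": -100, "9906": -100 });
let llama_shape = serde_json::json!([[2764, -100.0], [9906, -100.0]]);
```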
## Text utils 📝

- Generic utils for cleaning text. Mostly useful for RAG. A standalone sketch of this kind of cleaning is shown below.
- Text splitting will be added in the future.
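As an illustration of the sort of normalization this covers (a generic sketch, not llm_utils' API), cleaning scraped text before chunking and embedding it might look like:

```rust
// Generic illustration (not llm_utils' API): normalize scraped text before
// chunking/embedding it for RAG by mapping control characters to spaces and
// collapsing runs of whitespace.
fn clean_for_rag(input: &str) -> String {
    input
        .chars()
        .map(|c| if c.is_control() { ' ' } else { c })
        .collect::<String>()
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    let raw = "Some   text\u{0} scraped\tfrom\r\n a web page.";
    assert_eq!(clean_for_rag(raw), "Some text scraped from a web page.");
}
```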
## License

This project is licensed under the MIT License.

## Contributing

My motivation for publishing is for someone to point out if I'm doing something wrong!