llm_utils
A Swiss army knife for working with LLMs. Features supporting Llama.cpp, OpenAI, Anthropic, and mistral.rs. Originally made for the llm_client crate, but split into its own crate just for you.
- Estimate GGUF VRAM usage.
- Clean and chunk HTML and text.
- Build grammars for Llama.cpp.
- Ensure your prompts are within LLM token limits.
Installation
[dependencies]
llm_utils = "*"
Tokenizer 🧮
- Hugging Face's Tokenizer library for local models and Tiktoken-rs for OpenAI and Anthropic (Anthropic doesn't have a publicly available tokenizer).
- A simple, abstract API for encoding and decoding lets you use the same code across multiple LLM architectures.
- Safely set the max_tokens param for LLMs to ensure requests don't fail due to exceeding token limits (a sketch of this check follows the code example below).
// Get a Tiktoken tokenizer
//
let tokenizer: LlmTokenizer = LlmTokenizer::new_tiktoken("gpt-4o");
// Get a Hugging Face tokenizer from a local tokenizer.json path
//
let tokenizer: LlmTokenizer = LlmTokenizer::new_from_tokenizer_json("path/to/tokenizer.json");
// Or load from a repo
//
let tokenizer: LlmTokenizer = LlmTokenizer::new_from_hf_repo(hf_token, "meta-llama/Meta-Llama-3-8B-Instruct");
// Tokenize, count, and detokenize
//
let token_ids: Vec<u32> = tokenizer.tokenize(text);
let count: u32 = tokenizer.count_tokens(text);
let word_probably: String = tokenizer.detokenize_one(token_id)?;
let words_probably: String = tokenizer.detokenize_many(&token_ids)?;
// These functions are used for generating logit bias
let token_id: u32 = tokenizer.try_into_single_token("hello")?;
let word_probably: String = tokenizer.try_from_single_token_id(1234)?;
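For example, here is a minimal sketch of the max_tokens safety check mentioned above; it assumes count_tokens takes the prompt text, and the context window and requested value are illustrative numbers, not part of the crate's API:

// Minimal sketch (not the crate's API): clamp the requested max_tokens so that
// prompt tokens + generated tokens stay within an assumed context window.
let prompt = "Some prompt text";
let context_window: u32 = 8192;       // assumed model context limit
let requested_max_tokens: u32 = 4096; // what the caller asked for
let prompt_tokens: u32 = tokenizer.count_tokens(prompt);
let safe_max_tokens = requested_max_tokens.min(context_window.saturating_sub(prompt_tokens));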
Text utils 📝
// Normalize whitespace chars to " " and "\n".
// Reduce the number of newlines to singles or doubles (paragraphs) or convert them to " ".
// Optionally, remove all characters besides alphabetic, numbers, and punctuation.
//
let mut text_cleaner = TextCleaner::new();
let cleaned_text: String = text_cleaner
    .reduce_newlines_to_single_space()
    .remove_non_basic_ascii()
    .run(&dirty_text);
// Convert HTML to cleaned text.
// Uses an implementation of Mozilla's readability mode and HTML2Text.
//
let cleaned_text: String = clean_html(raw_html);
// Rule-based text segmentation for sentences.
// Better than the unicode-segmentation crate or any other crate I tested,
// but still not as good as a model-based approach like spaCy or other NLP libraries.
//
let sentence_splits: Vec<String> =
    split_text_into_sentences_regex(&text);
// Split text into balanced chunks as close to the given size as possible.
// Unlike other implementations this method attempts to keep the chunks the same size.
// This means you won't end up with an orphan final chunk that is too small to be useful.
// Implemented with a DFS algo, recursion, memoization, and heuristic pre-filters.
// Attempts to split semantically in the following order:
// Paragraphs, newlines, sentences, words, and finally graphemes.
// Please note: this is much slower than other methods. It needs optimizations!
//
let chunked_text: Vec<String> =
    chunk_text(&text, 512); // illustrative arguments: the text and a target chunk size
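To illustrate the balanced-chunk idea the comments above describe, here is a standalone, word-based sketch of the sizing logic; it is an illustration of the approach, not the crate's implementation:

// Standalone sketch of balanced chunk sizing (not the crate's implementation).
// Splits `words` into the fewest chunks of at most `target` words each,
// then evens out the sizes so no tiny orphan chunk is left at the end.
fn balanced_chunks(words: &[&str], target: usize) -> Vec<String> {
    if words.is_empty() || target == 0 {
        return Vec::new();
    }
    let chunk_count = (words.len() + target - 1) / target; // ceil division
    let base = words.len() / chunk_count;
    let remainder = words.len() % chunk_count; // first `remainder` chunks get one extra word
    let mut chunks = Vec::with_capacity(chunk_count);
    let mut start = 0;
    for i in 0..chunk_count {
        let len = base + if i < remainder { 1 } else { 0 };
        chunks.push(words[start..start + len].join(" "));
        start += len;
    }
    chunks
}

For example, 10 words with a target of 4 produce chunks of 4, 3, and 3 words rather than 4, 4, and an orphan of 2.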
Model presets 🛤️
- Presets for open source LLMs from Hugging Face, and for API models like OpenAI and Anthropic.
- Load and/or download a model with metadata, tokenizer, and local path (for local LLMs like llama.cpp, vllm, mistral.rs).
- Auto-select the largest quantized GGUF that will fit in your VRAM (a rough back-of-the-envelope VRAM estimate is sketched after the examples below)!
Supported Open Source models:
⚪ Llama 3
⚪ Mistral and Mixtral
⚪ Phi 3
// Load the largest quantized Mistral-7B-Instruct model that will fit in your VRAM
//
let model: OsLlm = PresetModelBuilder::new() // builder and argument values shown for illustration
    .mistral_7b_instruct()
    .vram(48)       // available VRAM in GB
    .ctx_size(9001) // ctx_size impacts VRAM usage!
    .load()
    .await?;
not_a_real_assert_eq!(model, OsLlm { /* metadata, local path, chat template, tokenizer */ });
// Or OpenAI
//
let model: OpenAiLlm = OpenAiLlm::gpt_4_o();
not_a_real_assert_eq!(model, OpenAiLlm { /* model metadata and tokenizer */ });
// Or Anthropic
//
let model: AnthropicLlm = AnthropicLlm::claude_3_opus();
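As a rough guide to what "fits in your VRAM" means for a quantized GGUF, here is a standalone back-of-the-envelope estimate; the KV-cache formula and overhead figure are assumptions for illustration, not the crate's estimator:

// Rough rule of thumb (not the crate's estimator): a quantized GGUF needs
// roughly its file size in VRAM, plus the KV cache, plus some runtime overhead.
fn rough_gguf_vram_bytes(gguf_file_bytes: u64, ctx_size: u64, n_layers: u64, hidden_size: u64) -> u64 {
    // KV cache: 2 (K and V) * ctx_size * n_layers * hidden_size * 2 bytes (f16)
    let kv_cache = 2 * ctx_size * n_layers * hidden_size * 2;
    let overhead: u64 = 500 * 1024 * 1024; // assumed ~500 MB runtime overhead
    gguf_file_bytes + kv_cache + overhead
}

// e.g. a ~5.9 GB Q6_K 7B model with ctx_size 4096, 32 layers, hidden size 4096:
// KV cache ≈ 2 GB, so the total lands around 8.5 GB.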
GGUF models from Hugging Face or local path 🚤
// From HF
//
let model_url = "https://huggingface.co/MaziyarPanahi/Meta-Llama-3-8B-Instruct-GGUF/blob/main/Meta-Llama-3-8B-Instruct.Q6_K.gguf";
let model: OsLlm = GGUFModelBuilder::new() // builder shown for illustration
    .hf_quant_file_url(model_url)
    .load()
    .await?;
// Note: because we can't instantiate a tokenizer from a GGUF file, the returned model will not have a tokenizer!
// However, if we provide the base model's repo, the tokenizer is loaded from there.
let repo_id = "meta-llama/Meta-Llama-3-8B-Instruct";
let model: OsLlm = GGUFModelBuilder::new()
    .hf_quant_file_url(model_url)
    .hf_config_repo_id(repo_id)
    .load()
    .await?;
// From Local
//
let local_path = "/root/.cache/huggingface/hub/models--MaziyarPanahi--Meta-Llama-3-8B-Instruct-GGUF/blobs/c2ca99d853de276fb25a13e369a0db2fd3782eff8d28973404ffa5ffca0b9267";
let model: OsLlm = GGUFModelBuilder::new()
    .local_quant_file_path(local_path)
    .load()
    .await?;
// Again, we require a tokenizer.json. This can also be loaded from a local path.
let local_config_path = "/llm_utils/src/models/open_source/llama/llama_3_8b_instruct";
let model: OsLlm = GGUFModelBuilder::new()
    .local_quant_file_path(local_path)
    .local_config_path(local_config_path)
    .load()
    .await?;
Prompting 🎶
- Generate properly formatted prompts for GGUF models, OpenAI, and Anthropic.
- Uses the GGUF's chat template and Jinja templates to format the prompt to model spec.
- Create prompts from a combination of dynamic inputs and/or static inputs from file.
// Default formatted prompt (OpenAI and Anthropic format)
// (function arguments in this example are shown for illustration)
//
let default_formatted_prompt = default_formatted_prompt(&system_message, &user_message)?;
// Get total tokens in prompt
//
let total_prompt_tokens: u32 = model.openai_token_count_of_prompt(&tokenizer, &default_formatted_prompt);
// Then convert it to be used for a GGUF model
//
let gguf_formatted_prompt: String = convert_default_prompt_to_model_format(&default_formatted_prompt, &model.chat_template)?;
// Since the GGUF formatted prompt is just a string, we can use the generic count_tokens function
//
let total_prompt_tokens: u32 = tokenizer.count_tokens(&gguf_formatted_prompt);
// Validate the requested max_tokens for a generation. If it exceeds the model's limits, reduce max_tokens to a safe value.
//
let safe_max_tokens = get_and_check_max_tokens_for_response(total_prompt_tokens, requested_max_tokens)?;
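For reference, the "default formatted prompt" used above refers to the OpenAI-style chat message list; here is a minimal illustration of its shape (example roles and contents, not the crate's internal type):

// Illustration of an OpenAI-style chat message list (not the crate's internal type).
let messages: Vec<(&str, &str)> = vec![
    ("system", "You are a nice robot."),
    ("user", "Where do robots come from?"),
];
// A GGUF chat template renders the same messages into a single prompt string; for example,
// Llama 3 wraps each turn in <|start_header_id|>role<|end_header_id|> ... <|eot_id|> markers.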
Grammar 🤓
- Grammars are the most capable method for structuring the output of an LLM. This was designed for use with Llama.cpp, but support for other backends is planned.
- Create lists of N items, or restrict character types.
- More to be added (JSON, classification, restricting characters, words, and phrases).
// Return a list of between 1 and 4 items
//
let grammar = create_list_grammar(1, 4);
// The list will be formatted: `- <list text>\n`
//
let response: String = text_generation_request(&request).await?; // the request includes the grammar
// So you can easily split the response into items:
//
let response_items: Vec<String> = response
    .lines()
    .map(|line| line.trim_start_matches("- ").to_string())
    .collect();
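For context, a Llama.cpp GBNF grammar that constrains output to a short dash-prefixed list looks roughly like the following; this is an illustrative grammar, not necessarily what create_list_grammar emits:

// Illustrative GBNF: each item is a dash-prefixed line, and between 1 and 4 items are allowed.
let example_gbnf = r#"
root ::= item item? item? item?
item ::= "- " text "\n"
text ::= [^\n]+
"#;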
// Exclude numbers from text generation
//
let grammar = create_text_structured_grammar(restricted_character_sets); // argument shown for illustration
let response: String = text_generation_request(&request).await?;
assert!(!response.is_empty());
assert!(response.chars().all(|c| !c.is_numeric()));
// Exclude a list of commonly unwanted characters from text generation
//
let grammar = create_text_structured_grammar(restricted_character_sets); // argument shown for illustration
let response: String = text_generation_request(&request).await?;
assert!(!response.contains('"'));
assert!(!response.contains('*'));
assert!(!response.contains('['));
Logit bias #️⃣
- Create properly formatted logit bias requests for Llama.cpp and OpenAI.
- Functionality to add logit bias from a variety of sources, along with validation.
// Exclude some tokens from text generation
//
let mut words = HashMap::new();
words.entry("delve".to_string()).or_insert(-100.0); // example words and bias values
words.entry("as an ai".to_string()).or_insert(-100.0);
// Build and validate
//
let logit_bias = logit_bias_from_words(&tokenizer, &words);
let validated_logit_bias = validate_logit_bias_values(&logit_bias)?;
// Convert
//
let openai_logit_bias = convert_logit_bias_to_openai_format(&validated_logit_bias)?;
let llama_logit_bias = convert_logit_bias_to_llama_format(&validated_logit_bias)?;
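For reference, the two output formats differ in shape. Roughly, OpenAI expects a map of token-id strings to a bias in [-100, 100], while the llama.cpp server expects an array of [token_id, bias] pairs; the token ids and values below are examples, not output from the crate:

use std::collections::HashMap;

// Illustrative shapes of the two formats (example token ids and values).
let openai_format: HashMap<String, f32> = HashMap::from([
    ("9906".to_string(), -100.0),
    ("15339".to_string(), -100.0),
]);
let llama_format: Vec<(u32, f32)> = vec![(9906, -100.0), (15339, -100.0)];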
License
This project is licensed under the MIT License.
Contributing
My motivation for publishing is for someone to point out if I'm doing something wrong!