llm_utils: Basic LLM tools, best practices, and minimal abstraction.
llm_utils is not a 'framework'. There are no chains, agents, or buzzwords. Abstraction is minimized as much as possible, and individual components are easily accessible. Rather than a multitude of minimal viable implementations, the focus is on comprehensive, best-practice implementations.
For real world examples of how this crate is used, check out the llm_client crate.
Model Loading
models::open_source_model::OsLlm
models::open_source_model::preset::LlmPresetLoader
- Presets for popular models that include HF repos for the models, along with local copies of tokenizers and chat templates
- Loads the best quantized model by calculating the largest quant that will fit in your VRAM
- Supports Llama 3, Phi, Mistral/Mixtral, and more
models::open_source_model::LlmGgufLoader
- Loads any GGUF model from HF repo or locally
models::api_model::ApiLlm
- Supports OpenAI, Anthropic, and Perplexity
- Supports prompting, tokenization, and price estimation
Essentials
tokenizer::LlmTokenizer
- A simple abstraction over HF's tokenizers and tiktoken-rs
- Load from local or HF.
- Included in OsLlm and ApiLlm
prompting::LlmPrompt
- Build System/User/Assistant prompt messages into formatted prompts
- Supports chat template strings/tokens and OpenAI hashmaps
- Count prompt tokens
- Integrated with OsLlm and ApiLlm
- Assemble messages from multiple text inputs
- Build with generation prefixes on all chat template models, even those that don't explicitly support it
Constraints
grammar::Grammar
- Pre-built configurable grammars for fine grained control of open source LLM outputs
- Currently supports Llama.cpp style grammars, but intended to scale to support other grammars in the future.
logit_bias
- Supports all LLMs that can use logit bias constraints
Text Processing and NLP
text_utils::chunking::TextChunker
- A novel balanced text chunker that creates chunks of approximately equal length
- More accurate than unbalanced implementations that create orphaned final chunks
- Optimized with parallelization
text_utils::splitting::TextSplitter
- Unicode text segmentation on paragraphs, sentences, words, graphemes
- The only semantic sentence segmentation implementation in Rust (please ping me if I'm wrong!) - mostly works
text_utils::clean_text::TextCleaner
- Clean raw text into unicode format
- Reduce duplicate whitespace
- Remove unwanted chars and graphemes
text_utils::clean_html
- Clean raw HTML into clean strings of content
- Uses an implementation of Mozilla's Readability to remove unwanted HTML
text_utils::test_text
- Macro generated test content
- Used for internal testing, but can be used for general LLM test cases
Setter Traits
- All setter traits are public, so you can integrate them into your own projects if you wish.
- For example: models::api_model_openai::OpenAiModelTrait or models::open_source_model::hf_loader::HfTokenTrait
Installation
[dependencies]
llm_utils = "*"
Model Loading 🛤️
LlmPresetLoader
- Presets for Open Source LLMs from Hugging Face, or from local storage
- Load and/or download a model with metadata, tokenizer, and local path (for local LLMs like llama.cpp, vllm, mistral.rs)
- Auto-select the largest quantized GGUF that will fit in your VRAM!
// Load the largest quantized Meta-Llama-3-8B-Instruct model that will fit in your vram
let model: OsLlm = LlmPresetLoader::new()
    .llama3_8b_instruct()
    .vram(48)            // available VRAM in GB (example value)
    .use_ctx_size(4096)  // example value; ctx_size impacts VRAM usage!
    .load()?;
See example.
LlmGgufLoader
- GGUF models from Hugging Face or local path
// From HF
let model_url = "https://huggingface.co/MaziyarPanahi/Meta-Llama-3-8B-Instruct-GGUF/blob/main/Meta-Llama-3-8B-Instruct.Q6_K.gguf";
let model: OsLlm = LlmGgufLoader::new()
    .hf_quant_file_url(model_url)
    .load()?;
// Note: because we can't instantiate a tokenizer from a GGUF file, the returned model will not have a tokenizer!
// However, if we provide the base model's repo, we load from there.
let repo_id = "meta-llama/Meta-Llama-3-8B-Instruct";
let model: OsLlm = LlmGgufLoader::new()
    .hf_quant_file_url(model_url)
    .hf_config_repo_id(repo_id)
    .load()?;
// From Local
let local_path = "/root/.cache/huggingface/hub/models--MaziyarPanahi--Meta-Llama-3-8B-Instruct-GGUF/blobs/c2ca99d853de276fb25a13e369a0db2fd3782eff8d28973404ffa5ffca0b9267";
let model: OsLlm = LlmGgufLoader::new()
    .local_quant_file_path(local_path)
    .load()?;
// Again, we require a tokenizer.json. This can also be loaded from a local path.
let local_config_path = "/llm_utils/src/models/open_source/llama/llama3_8b_instruct";
let model: OsLlm = LlmGgufLoader::new()
    .local_quant_file_path(local_path)
    .local_config_path(local_config_path)
    .load()?;
ApiLlm
let model: ApiLlm = ApiLlm::gpt_4_o();
assert_eq!(model.context_length, 128000); // field name assumed; gpt-4o has a 128k-token context window
// Or Anthropic
let model: ApiLlm = ApiLlm::claude_3_opus();
Essentials 🧮
Prompting
- Generate properly formatted prompts for GGUF models, OpenAI, and Anthropic.
- Uses the GGUF's chat template and Jinja templates to format the prompt to model spec.
let model: OsLlm = LlmPresetLoader::new().llama3_8b_instruct().load()?;
let prompt: LlmPrompt = LlmPrompt::new_from_os_llm(&model);
// or
let model: ApiLlm = ApiLlm::gpt_4_o();
let prompt: LlmPrompt = LlmPrompt::new_from_openai_llm(&model);
// Add system messages
prompt.add_system_message().set_content("You are a nice robot."); // message strings are illustrative
// User messages
prompt.add_user_message().set_content("Hello");
// LLM responses
prompt.add_assistant_message().set_content("Hello to you!");
// Messages all share the same functions; see prompting::PromptMessage for more
prompt.add_system_message().append_content("Be formal.");
prompt.add_system_message().prepend_content("Please.");
// Build the prompt to set the built prompt fields to be sent to the LLM
prompt.build();
// Build with a generation prefix. The LLM will complete the response: 'Don't you think that is... cool?'.
prompt.build_with_generation_prefix("Don't you think that is...");
// Build without safety checks (allows building with an assistant message as the final message) for debugging and printing
prompt.build_final();
// Chat template formatted
let chat_template_prompt: String = prompt.built_chat_template_prompt.clone();
let chat_template_prompt_as_tokens: Vec<u32> = prompt.built_prompt_as_tokens.clone();
// OpenAI formatted prompt (OpenAI and Anthropic format)
let openai_prompt: Vec<HashMap<String, String>> = prompt.built_openai_prompt.clone(); // type assumed from the OpenAI message format
// Get total tokens in prompt
let total_prompt_tokens: u32 = prompt.total_prompt_tokens;
// Validate a requested max_tokens value for a generation. If it exceeds the model's limits, reduce max_tokens to a safe value.
let actual_request_tokens = check_and_get_max_tokens(/* arguments elided */)?;
Tokenizer
- Hugging Face's tokenizers library for local models and tiktoken-rs for OpenAI and Anthropic (Anthropic doesn't have a publicly available tokenizer).
- A simple, abstract API for encoding and decoding allows for abstract LLM consumption across multiple architectures.
- Safely set the max_token param for LLMs to ensure requests don't fail due to exceeding token limits!
// Get a Tiktoken tokenizer
let tokenizer: LlmTokenizer = LlmTokenizer::new_tiktoken("gpt-4o");
// Get a Hugging Face tokenizer from a local path
let tokenizer: LlmTokenizer = LlmTokenizer::new_from_tokenizer_json("path/to/tokenizer.json");
// Or load from a repo
let tokenizer: LlmTokenizer = LlmTokenizer::new_from_hf_repo("meta-llama/Meta-Llama-3-8B-Instruct");
// (The arguments above are illustrative.)
// Get tokenizan' (arguments below are illustrative)
let token_ids: Vec<u32> = tokenizer.tokenize("Hello there");
let count: u32 = tokenizer.count_tokens("Hello there");
let word_probably: String = tokenizer.detokenize_one(token_ids[0])?;
let words_probably: String = tokenizer.detokenize_many(&token_ids)?;
// These functions are used for generating logit bias
let token_id: u32 = tokenizer.try_into_single_token("hello");
let word_probably: String = tokenizer.try_from_single_token_id(token_id);
Text Processing and NLP 🪓
Text cleaning
// Normalizes all whitespace chars.
// Reduces the number of newlines to singles or doubles (paragraphs), or converts them to " ".
// Optionally, removes all characters besides alphabetics, numbers, and punctuation.
let mut text_cleaner = TextCleaner::new();
let cleaned_text: String = text_cleaner
    .reduce_newlines_to_single_space()
    .remove_non_basic_ascii()
    .run(&raw_text); // `raw_text` is your input string (illustrative)
// Convert HTML to cleaned text.
// Uses an implementation of Mozilla's readability mode and HTML2Text.
let cleaned_text: String = clean_html(&raw_html); // `raw_html` is your input string (illustrative)
Text segmentation
Split text by paragraphs, sentences, words, and graphemes.
// `text` below is your input string (illustrative)
let paragraph_splits: Vec<String> = TextSplitter::new()
    .on_two_plus_newline()
    .split_text(&text)?;

let newline_splits: Vec<String> = TextSplitter::new()
    .on_single_newline()
    .split_text(&text)?;
// There is no good implementation of sentence splitting in Rust!
// This implementation is better than the unicode-segmentation crate or any other crate I tested,
// but still not as good as a model-based approach like spaCy or other NLP libraries.
let sentence_splits: Vec<String> = TextSplitter::new()
    .on_sentences_rule_based()
    .split_text(&text)?;
// Unicode
let sentence_splits: Vec<String> = TextSplitter::new()
    .on_sentences_unicode()
    .split_text(&text)?;

let word_splits: Vec<String> = TextSplitter::new()
    .on_words_unicode()
    .split_text(&text)?;

let graphemes_splits: Vec<String> = TextSplitter::new()
    .on_graphemes_unicode()
    .split_text(&text)?;
// If the split separator produces fewer than two splits,
// this mode tries the next separator.
// It does this until it produces more than one split.
let paragraph_splits: Vec<String> = TextSplitter::new()
    .on_two_plus_newline()
    .recursive()
    .split_text(&text)?;
Text chunking
Balanced text chunking means that all chunks are approximately the same size.
See my blog post on text chunking for implementation details.
let text = "one, two, three, four, five, six, seven, eight, nine";
// Given a max token count of four, other text chunkers would split this into three chunks,
// orphaning the final chunk: e.g. ["one, two, three, four", "five, six, seven, eight", "nine"].
// A balanced text chunker, however, also splits the text into three chunks, but of even sizes:
// e.g. ["one, two, three", "four, five, six", "seven, eight, nine"].
As long as the total token length of the incoming text is not evenly divisible by the max token count, the final chunk will be smaller than the others. In some cases it will be so small that it is "orphaned" and rendered useless. If you asked your RAG implementation "What did seven eat?", the final chunk that answers the question would not be retrievable.
The TextChunker first attempts to split semantically in the following order: paragraphs, newlines, sentences. If that fails, it builds chunks linearly by using the largest available splits, and splitting where needed.
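For orientation, a chunking call might look roughly like the sketch below; the builder methods shown (max_chunk_token_size, run) and the token count are assumptions for illustration, not a confirmed API.

// Hypothetical sketch only: method names and values are assumed.
let chunks: Vec<String> = TextChunker::new()
    .max_chunk_token_size(512) // approximate upper bound per chunk, in tokens
    .run(&text)?;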
Constraints #️⃣
Grammar
- Grammars are the most capable method for structuring the output of an LLM. This was designed for use with LlamaCpp, but the plan is to support others.
- Current implementations include booleans, integers, sentences, words, exact strings, and more. Open an issue if you'd like to suggest more.
// A grammar that constrains the output to a number between 1 and 4
let mut integer_grammar: IntegerGrammar = Grammar::integer();
integer_grammar.lower_bound(1).upper_bound(4);
// Sets a stop word to be appended to the end of generation
integer_grammar.set_stop_word_done();
// Sets the primitive as optional; a stop word can be generated rather than the primitive
integer_grammar.set_stop_word_null_result();
// Returns the string to feed into the LLM call
let grammar_string: String = integer_grammar.grammar_string();
// Cleans the response and checks if it's valid (`llm_response` is the text returned by the LLM)
let string_response = integer_grammar.validate_clean(&llm_response);
// Parses the response to the grammar's primitive
let integer_response = integer_grammar.grammar_parse(&llm_response);
// Enum for dynamic abstraction
let mut grammar: Grammar = integer_grammar.wrap();
// The enum implements the same functions that are generic across all grammars
grammar.set_stop_word_done();
let grammar_string: String = grammar.grammar_string();
let string_response = grammar.validate_clean(&llm_response);
See the grammar module for all implemented types.
Logit bias
- Create properly formatted logit bias requests for LlamaCpp and OpenAI.
- Functionality to add logit bias from a variety of sources, along with validation.
// Exclude some tokens from text generation
let mut words = HashMap::new(); // the word strings and bias values below are illustrative
words.entry("delve".to_string()).or_insert(-100.0);
words.entry("tapestry".to_string()).or_insert(-100.0);
// Build and validate
let logit_bias = logit_bias_from_words(&tokenizer, &words); // argument lists are illustrative
let validated_logit_bias = validate_logit_bias_values(&logit_bias)?;
// Convert
let openai_logit_bias = convert_logit_bias_to_openai_format(&validated_logit_bias)?;
let llama_logit_bias = convert_logit_bias_to_llama_format(&validated_logit_bias)?;
License
This project is licensed under the MIT License.
Contributing
My motivation for publishing is for someone to point out if I'm doing something wrong!