llm_utils 0.0.10

Basic LLM tools, best practices, and minimal abstraction.

llm_utils: Tools for LLMs with minimal abstraction.

llm_utils is not a 'framework'. There are no chains, agents, or buzzwords. Abstraction is minimized as much as possible, and individual components are easily accessible. For real-world examples of how this crate is used, check out the llm_client crate.

Cargo Install
llm_utils = "*"

LocalLlmModel

Everything you need for GGUF models. The GgufLoader wraps the loaders for convenience. All loaders return a LocalLlmModel which contains the tokenizer, metadata, chat template, and anything else that can be extracted from the GGUF.

GgufPresetLoader

  • Presets for popular models like Llama 3, Phi, Mistral/Mixtral, and more
  • Loads the best quantized model by calculating the largest quant that will fit in your VRAM
let model: LocalLlmModel = GgufLoader::default()
    .llama3_1_8b_instruct()
    .preset_with_available_vram_gb(48) // Load the largest quant that will fit in your VRAM
    .load()?;

GgufHfLoader

GGUF models from Hugging Face.

let model: LocalLlmModel = GgufLoader::default()
    .hf_quant_file_url("https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf")
    .load()?;

GgufLocalLoader

GGUF models from local storage.

let model: LocalLlmModel = GgufLoader::default()
    .local_quant_file_path("/root/.cache/huggingface/hub/models--bartowski--Meta-Llama-3.1-8B-Instruct-GGUF/blobs/9da71c45c90a821809821244d4971e5e5dfad7eb091f0b8ff0546392393b6283")
    .load()?;
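
Whichever loader you use, the pieces listed above live on the returned LocalLlmModel. For example, the tokenizer is reachable through the model (the same access pattern shown in the LlmTokenizer section further down):

// The loaded model carries its tokenizer (see the LlmTokenizer section below).
let tokenizer = &model.model_base.tokenizer;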

ApiLlmModel

  • Supports OpenAI, Anthropic, Perplexity, and adding your own API models
  • Supports prompting, tokenization, and price estimation
    assert_eq!(ApiLlmModel::gpt_4_o(), ApiLlmModel {
        model_id: "gpt-4o".to_string(),
        context_length: 128000,
        cost_per_m_in_tokens: 5.00,
        max_tokens_output: 4096,
        cost_per_m_out_tokens: 15.00,
        tokens_per_message: 3,
        tokens_per_name: 1,
        tokenizer: Arc<LlmTokenizer>, // field type shown for illustration
    })
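
Because the struct exposes per-million token rates, a rough price estimate is plain arithmetic. The snippet below is only a sketch using the gpt-4o values shown above; it is not a crate API:

// Sketch only: rough request cost from the per-million token rates above.
let prompt_tokens = 2_000_u32;
let output_tokens = 500_u32;
let cost = (prompt_tokens as f64 / 1_000_000.0) * 5.00 // $5.00 per 1M input tokens
    + (output_tokens as f64 / 1_000_000.0) * 15.00; // $15.00 per 1M output tokens
assert!((cost - 0.0175).abs() < 1e-9);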

LlmTokenizer

    let tok = LlmTokenizer::new_tiktoken("gpt-4o"); // Get a Tiktoken tokenizer
    let tok = LlmTokenizer::new_from_tokenizer_json("path/to/tokenizer.json"); // From local path
    let tok = LlmTokenizer::new_from_hf_repo(hf_token, "meta-llama/Meta-Llama-3-8B-Instruct"); // From repo
    // From LocalLlmModel or ApiLlmModel
    let tok = model.model_base.tokenizer;

LlmPrompt

  • Generate properly formatted prompts for GGUF models, OpenAI, and Anthropic. Supports chat template strings/tokens and OpenAI-style hashmaps
  • Count prompt tokens and check that the prompt is within model limits
  • Uses the GGUF's chat template and Jinja templates to format the prompt to model spec. Builds with generation prefixes on all chat template models, even those that don't explicitly support it.
// From LocalLlmModel or ApiLlmModel
let prompt: LlmPrompt = LlmPrompt::new_chat_template_prompt(&model);
let prompt: LlmPrompt = LlmPrompt::new_openai_prompt(&model);

// Add system messages
prompt.add_system_message().set_content("You are a nice robot");

// User messages
prompt.add_user_message().set_content("Hello");

// LLM responses
prompt.add_assistant_message().set_content("Well how do you do?");

// Messages all share the same functions see prompting::PromptMessage for more
prompt.add_system_message().append_content(final_rule_set);
prompt.add_system_message().prepend_content(starting_rule_set);

// Builds with a generation prefix. The LLM will complete the response: 'Don't you think that is... cool?'
prompt.set_generation_prefix("Don't you think that is...");

// Get total tokens in prompt
let total_prompt_tokens: u32 = prompt.get_total_prompt_tokens();

// Get chat template formatted prompt
let chat_template_prompt: String = prompt.get_built_prompt_string();
let chat_template_prompt_as_tokens: Vec<u32> = prompt.get_built_prompt_as_tokens();

// OpenAI formatted prompt (OpenAI and Anthropic format)
let openai_prompt: Vec<HashMap<String, String>> = prompt.get_built_prompt_hashmap();


// Validate requested max_tokens for a generation. If it exceeds the model's limits, reduce max_tokens to a safe value.
let actual_request_tokens = check_and_get_max_tokens(
        model.context_length,
        model.max_tokens_output, // If using a GGUF model use either model.context_length or the ctx_size of the server.
        total_prompt_tokens,
        10,
        requested_max_tokens,
    )?;
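
The idea behind the check is clamping: a generation can never exceed the model's output cap, nor what remains of the context window after the prompt and a small safety buffer (the literal 10 above appears to be such a buffer, but check the docs). A minimal sketch of that logic, not the crate's exact implementation:

// Sketch only: not the crate's implementation of check_and_get_max_tokens.
fn clamp_max_tokens(
    context_length: u32,
    max_tokens_output: u32,
    total_prompt_tokens: u32,
    safety_buffer: u32,
    requested_max_tokens: u32,
) -> u32 {
    let remaining_ctx = context_length.saturating_sub(total_prompt_tokens + safety_buffer);
    requested_max_tokens.min(max_tokens_output).min(remaining_ctx)
}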

Text Processing and NLP

TextChunker

Balanced text chunking means that all chunks are approximately the same size. See my blog post on text chunking for implementation details.

  • A novel balanced text chunker that creates chunks of approximately equal length
  • More accurate than unbalanced implementations that create orphaned final chunks
  • Optimized with parallelization
let text = "one, two, three, four, five, six, seven, eight, nine";

// Given a max token count of four, other text chunkers would split this into three chunks.
assert_eq!(["one, two, three, four", "five, six, seven, eight", "nine"], // "nine" is orphaned!
    OtherChunkers::new()
    .max_chunk_token_size(4)
    .chunk(text));

// A balanced text chunker, however, would also split the text into three chunks, but of even sizes.
assert_eq!(["one, two, three", "four, five, six", "seven, eight, nine"], 
    TextChunker::new()
    .max_chunk_token_size(4)
    .run(&text)?);
       

As long as the total token length of the incoming text is not evenly divisible by the max token count, the final chunk will be smaller than the others. In some cases it will be so small it will be "orphaned" and rendered useless. If you asked your RAG implementation, "What did seven eat?", the final chunk that answers the question would not be retrievable.

The TextChunker first attempts to split semantically in the following order: paragraphs, newlines, sentences. If that fails, it builds chunks linearly by using the largest available splits, and splitting where needed.
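
The "balanced" part is just sizing arithmetic: work out how many chunks the max size forces, then spread the tokens evenly across that many chunks. A sketch of the idea (not the crate's actual algorithm, which splits on the semantic boundaries described above):

// Sketch only: the sizing idea behind balanced chunking, not the crate's algorithm.
fn balanced_chunk_size(total_tokens: usize, max_chunk_tokens: usize) -> usize {
    // How many chunks are needed if each may reach the max size.
    let chunk_count = total_tokens.div_ceil(max_chunk_tokens);
    // Spread tokens evenly so no tiny "orphan" chunk is left at the end.
    total_tokens.div_ceil(chunk_count)
}

// 9 tokens with a max of 4: three chunks of ~3 tokens instead of 4 + 4 + 1.
assert_eq!(balanced_chunk_size(9, 4), 3);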

TextSplitter

  • Unicode text segmentation on paragraphs, sentences, words, graphemes
  • The only semantic sentence segmentation implementation in Rust (please ping me if I'm wrong!) - mostly works
let paragraph_splits: Vec<String> =  TextSplitter::new()
    .on_two_plus_newline()
    .split_text(&text)?;

let newline_splits: Vec<String> =  TextSplitter::new()
    .on_single_newline()
    .split_text(&text)?;

// There is no good implementation of sentence splitting in Rust!
// This implementation is better than the unicode-segmentation crate or any other crate I tested,
// but still not as good as a model-based approach like spaCy or other NLP libraries.
//
let sentence_splits: Vec<String> =  TextSplitter::new()
    .on_sentences_rule_based()
    .split_text(&text)?;

// Unicode

let sentence_splits: Vec<String> =  TextSplitter::new()
    .on_sentences_unicode()
    .split_text(&text)?;

let word_splits: Vec<String> =  TextSplitter::new()
    .on_words_unicode()
    .split_text(&text)?;


let graphemes_splits: Vec<String> =  TextSplitter::new()
    .on_graphemes_unicode()
    .split_text(&text)?;

// If the split separator produces less than two splits,
// this mode tries the next separator.
// It does this until it produces more than one split.
//
let paragraph_splits: Vec<String> =  TextSplitter::new()
    .on_two_plus_newline()
    .recursive(true)
    .split_text(&text)?;

TextCleaner

  • Clean raw text into Unicode format
  • Reduce duplicate whitespace
  • Remove unwanted chars and graphemes
// Normalizes all whitespace chars.
// Reduces the number of newlines to singles or doubles (paragraphs) or converts them to " ".
// Optionally, removes all characters besides alphabetic, numbers, and punctuation.
//
let mut text_cleaner = llm_utils::text_utils::clean_text::TextCleaner::new();
let cleaned_text: String = text_cleaner
    .reduce_newlines_to_single_space()
    .remove_non_basic_ascii()
    .run(some_dirty_text);

// Convert HTML to cleaned text.
// Uses an implementation of Mozilla's readability mode and HTML2Text.
//
let cleaned_text: String = llm_utils::text_utils::clean_html::clean_html(raw_html);

clean_html

  • Clean raw HTML into clean strings of content
  • Uses an implementation of Mozilla's Readability to remove unwanted HTML

test_text

  • Macro generated test content
  • Used for internal testing, but can be used for general LLM test cases

Grammar Constraints

// A grammar that constrains the output to a number between 1 and 4
let mut integer_grammar: IntegerGrammar = Grammar::integer();
integer_grammar.lower_bound(1).upper_bound(4);
// Sets a stop word to be appended to the end of generation
integer_grammar.set_stop_word_done("Done.");
// Sets the primitive as optional; a stop word can be generated rather than the primitive
integer_grammar.set_stop_word_no_result("None.");

// Returns the string to feed into LLM call
let grammar_string: String = integer_grammar.grammar_string();

// Cleans the response and checks if it's valid
let string_response: Result<String, GrammarError> = integer_grammar.validate_clean(llm_response);
// Parses the response to the grammar's primitive
let integer_response: Result<u32, GrammarError> = integer_grammar.grammar_parse(llm_response);

// Enum for dynamic abstraction
let mut grammar: Grammar = integer_grammar.wrap();
// The enum implements the same functions that are generic across all grammars
grammar.set_stop_word_done("Done.");
let grammar_string: String = grammar.grammar_string();
let string_response: Result<String, GrammarError> = grammar.validate_clean(llm_response);

Setter Traits

  • All setter traits are public, so you can integrate into your own projects if you wish.
  • For example: OpenAiModelTrait, GgufLoaderTrait, AnthropicModelTrait, and HfTokenTrait for loading models

Blog Posts

License

This project is licensed under the MIT License.

Contributing

My motivation for publishing is for someone to point out if I'm doing something wrong!