Awful Dataset Builder: Turn Reference Text/Exam Question mappings into Question/Answer pairs!
"Turn your study notes into interrogation scripts for robots."
__
_____....--' .'
___...---'._ o -`(
___...---' \ .--. `\
___...---' | \ \ `|
| |o o | | |
| \___'.-`. '.
| | `---'
'^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^' LGB
λ awful_dataset_builder --help
Generate final exam questions from YAML book chunks
Usage: awful_dataset_builder --dir <DIR> --config <CONFIG> --start <START> --source-type <SOURCE_TYPE>
Options:
-d, --dir <DIR> Path to directory of .yaml book files
-c, --config <CONFIG> Configuration file
-s, --start <START> Start processing file from this chunk
--source-type <SOURCE_TYPE> Source type [possible values: book, manpage, mdbook, tealdeer, code]
-h, --help Print help
What It Does
`awful_dataset_builder` is a command-line tool that takes structured YAML files (from `awful_knowledge_synthesizer`) and generates question-answer pairs using Large Language Models (LLMs). It's your go-to tool for building datasets for finetuning LLMs.
Features
- Multi-source support: Books, manpages, mdbooks, tealdeer (command-line snippets), and code files can be turned into exam questions by `awful_knowledge_synthesizer`.
- LLM-powered QA pairs: Fetches answers for final exam questions using `awful_aj`.
- YAML output: Saves results as structured YAML files (e.g., `math_questions.yaml`).
- Chunked processing: Splits text into chunks for robust LLM queries.
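
As a rough illustration of the chunked-processing idea, here is a minimal Rust sketch that splits text into groups of at most `max_words` words. The function name `chunk_by_words` and the word-count strategy are assumptions made for this example; the tool's actual chunking rules are not described in this README.

```rust
/// Splits `text` into chunks of at most `max_words` whitespace-separated words.
/// Hypothetical helper: the real tool may chunk by tokens, sentences, or bytes.
pub fn chunk_by_words(text: &str, max_words: usize) -> Vec<String> {
    assert!(max_words > 0, "max_words must be positive");
    // Collapse all whitespace runs so each chunk is a clean, single-spaced string.
    let words: Vec<&str> = text.split_whitespace().collect();
    words
        .chunks(max_words)
        .map(|group| group.join(" "))
        .collect()
}
```

Smaller chunks keep each LLM request short, at the cost of more requests per document.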
How to Use
Sample Command
- `--dir`: Path to the directory of YAML files (e.g., `books/`).
- `--config`: Configuration file for the LLM API (OpenAI, etc.).
- `--source-type`: One of `book`, `manpage`, `mdbook`, `tealdeer`, or `code`.
- `--start`: Chunk to start processing from; earlier chunks are skipped (useful for parallel processing).
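
The accepted source type values can be modelled as a small enum. The sketch below uses only the standard library and is an assumption about shape, not the tool's actual code (which uses `clap`). The type name `SourceType` and the error message are illustrative, and matching is case-sensitive to mirror the lowercase values listed in the help output.

```rust
use std::str::FromStr;

/// The five source types listed in the `--help` output.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceType {
    Book,
    Manpage,
    Mdbook,
    Tealdeer,
    Code,
}

impl FromStr for SourceType {
    type Err = String;

    /// Parses the lowercase names shown under "possible values".
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "book" => Ok(SourceType::Book),
            "manpage" => Ok(SourceType::Manpage),
            "mdbook" => Ok(SourceType::Mdbook),
            "tealdeer" => Ok(SourceType::Tealdeer),
            "code" => Ok(SourceType::Code),
            other => Err(format!("unknown source type: {}", other)),
        }
    }
}
```

With this in place, `"mdbook".parse::<SourceType>()` yields `Ok(SourceType::Mdbook)`, while an unlisted value returns an error instead of silently defaulting.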
Example Output
For a book YAML file:
title: "Calculus for Dummies"
chunks:
  - "What is the derivative of f(x) = x²?"
Output:
- prompt: "Here is some reference text:\n\nWhat is the derivative of f(x) = x²?"
  answer: "The derivative of $ f(x) = x^2 $ is $ 2x $."
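
Judging only from the example above, the prompt appears to be the chunk text with a fixed prefix. A minimal Rust sketch of that step follows; `PROMPT_PREFIX` and `build_prompt` are hypothetical names, and the real templates may differ per source type.

```rust
/// Prefix copied from the example output; assumed, not confirmed, to be fixed.
const PROMPT_PREFIX: &str = "Here is some reference text:\n\n";

/// Builds the prompt string for one chunk by prepending the prefix.
pub fn build_prompt(chunk: &str) -> String {
    format!("{}{}", PROMPT_PREFIX, chunk)
}
```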
How It Works
- Parse YAML: Extracts structured Reference Text to Final Exam Question mappings.
- LLM Query: Uses templates to generate questions and fetch answers via `awful_aj`.
- Output: Saves QA pairs in YAML format (e.g., `math_questions.yaml`).
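
The steps above can be sketched as a small loop. This is an illustration under stated assumptions, not the tool's implementation: the `AnswerSource` trait stands in for the `awful_aj` call, `QaPair`, `build_pairs`, and `CannedAnswers` are hypothetical names, and YAML parsing and writing are left out so the example needs only the standard library.

```rust
/// Hypothetical stand-in for whatever fetches an answer from the LLM.
pub trait AnswerSource {
    fn answer(&self, prompt: &str) -> Result<String, String>;
}

/// One generated question/answer record.
#[derive(Debug, PartialEq)]
pub struct QaPair {
    pub prompt: String,
    pub answer: String,
}

/// Walks the chunks from index `start`, asks for an answer to each,
/// and collects the pairs. Stops at the first error.
pub fn build_pairs<A: AnswerSource>(
    chunks: &[String],
    start: usize,
    source: &A,
) -> Result<Vec<QaPair>, String> {
    let mut pairs = Vec::new();
    for chunk in chunks.iter().skip(start) {
        // Prompt shape assumed from the example output in this README.
        let prompt = format!("Here is some reference text:\n\n{}", chunk);
        let answer = source.answer(&prompt)?;
        pairs.push(QaPair { prompt, answer });
    }
    Ok(pairs)
}

/// Test double that returns a fixed answer instead of calling an LLM.
pub struct CannedAnswers(pub String);

impl AnswerSource for CannedAnswers {
    fn answer(&self, _prompt: &str) -> Result<String, String> {
        Ok(self.0.clone())
    }
}
```

Keeping the LLM call behind a trait lets the loop be exercised offline with canned answers.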
Supported Sources
| Source Type | Description |
|---|---|
| Book | YAML files with questions generated from book excerpts (e.g., "Title: Math for Dummies") |
| Manpage | YAML files with questions generated from manpage excerpts |
| Mdbook | YAML files with questions generated from Markdown excerpts (mdbook-built documentation) |
| Tealdeer | YAML files with questions generated from command-line snippets (tldr commands) |
| Code | YAML files with questions generated from C, Rust, or Assembly source code repositories (language-aware tokenization) |
Implementation Notes
- Uses `clap` for CLI parsing.
- Relies on `serde`, `tokio`, and `regex`.
- LLM queries are handled with exponential backoff (`MAX_RETRIES = 5`).
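
For readers unfamiliar with exponential backoff, here is a minimal synchronous Rust sketch. Only `MAX_RETRIES = 5` comes from the notes above. The doubling schedule, the `base` delay parameter, the reading of the limit as "one initial attempt plus up to five retries", and the function names are all assumptions; the real tool runs on `tokio` and may differ.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retry limit quoted in the implementation notes.
const MAX_RETRIES: u32 = 5;

/// Delay before retry number `retry` (0-based): base, 2*base, 4*base, ...
pub fn backoff_delay(base: Duration, retry: u32) -> Duration {
    base * 2u32.pow(retry)
}

/// Runs `op` once, then retries up to MAX_RETRIES times on error,
/// sleeping for a doubling delay between attempts.
/// Returns the first success or the last error.
pub fn retry_with_backoff<T, E, F>(base: Duration, mut op: F) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
{
    let mut retry = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // All retries used up: surface the last error.
            Err(err) if retry >= MAX_RETRIES => return Err(err),
            Err(_) => {
                sleep(backoff_delay(base, retry));
                retry += 1;
            }
        }
    }
}
```

With a base delay of 100 ms, the waits between attempts would be 100, 200, 400, 800, and 1600 ms before giving up.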
Contributing
- Report bugs or suggest improvements via GitHub Issues.
- Fork and extend to support new source types!
Final Thoughts
Building datasets is the most difficult, time-consuming labor involved in the Synthetic Finetuning of LLMs. A well-thought-out workflow using Awful Book Sanitizer, Awful Knowledge Synthesizer, and Awful Dataset Builder will let you experiment with your wildest curiosities about human language, on the cutting edge of technological advancement, for as long as written language exists.
You can find Open Source datasets I've generated using these tools on Hugging Face.