# **Awful Dataset Builder: Turn Reference Text/Exam Question mappings into Question/Answer pairs!**
> *"Turn your study notes into interrogation scripts for robots."*
```
__
_____....--' .'
___...---'._ o -`(
___...---' \ .--. `\
___...---' | \ \ `|
| |o o | | |
| \___'.-`. '.
| | `---'
'^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^' LGB
```
```
λ awful_dataset_builder --help
Generate final exam questions from YAML book chunks

Usage: awful_dataset_builder --dir <DIR> --config <CONFIG> --start <START> --source-type <SOURCE_TYPE>

Options:
  -d, --dir <DIR>                  Path to directory of .yaml book files
  -c, --config <CONFIG>            Configuration file
  -s, --start <START>              Start processing file from this chunk
      --source-type <SOURCE_TYPE>  Source type [possible values: book, manpage, mdbook, tealdeer, code]
  -h, --help                       Print help
```
---
## **What It Does**
`awful_dataset_builder` is a command-line tool that takes structured YAML files (from `awful_knowledge_synthesizer`) and generates **question-answer pairs** using Large Language Models (LLMs). It's your go-to tool for building datasets for fine-tuning LLMs.
---
## **Features**
- **Multi-source support**: Books, manpages, mdbooks, tealdeer pages (command-line snippets), and code files can be turned into exam questions by [awful_knowledge_synthesizer](https://github.com/graves/awful_knowledge_synthesizer).
- **LLM-powered QA pairs**: Fetches answers for final exam questions using `awful_aj`.
- **YAML output**: Saves results as structured YAML files (e.g., `math_questions.yaml`).
- **Chunked processing**: Splits text into chunks for robust LLM queries.
---
## **How to Use**
### Sample Command
```bash
λ awful_dataset_builder --dir ./books --config config.yaml --source-type book --start 1
```
- `--dir`: Path to the directory of YAML files (e.g., `books/`).
- `--config`: Configuration file for the LLM API (OpenAI, etc.).
- `--source-type`: One of `book`, `manpage`, `mdbook`, `tealdeer`, or `code`.
- `--start`: Chunk to start processing from; earlier chunks are skipped (useful for parallel processing).
### Example Output
For a book YAML file:
```yaml
title: "Calculus for Dummies"
chunks:
  - "What is the derivative of f(x) = x²?"
```
Output:
```yaml
- prompt: "Here is some reference text:\n\nWhat is the derivative of f(x) = x²?"
  answer: "The derivative of $ f(x) = x^2 $ is $ 2x $."
```
---
## **How It Works**
1. **Parse YAML**: Extracts structured Reference Text to Final Exam Question mappings.
2. **LLM Query**: Uses templates to build a prompt for each question and fetches the answer via `awful_aj`.
3. **Output**: Saves QA pairs in YAML format (e.g., `math_questions.yaml`).
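
The three steps above can be sketched roughly as follows. This is a minimal, std-only illustration and not the tool's actual code: `QaPair`, `build_prompt`, `build_pairs`, and the `ask` closure (a stand-in for the call to `awful_aj`) are hypothetical names, the prompt prefix is copied from the example output in this README, and treating `start` as a zero-based index is an assumption.

```rust
/// One question/answer record, mirroring the `prompt`/`answer` keys of the
/// example output. Hypothetical type, not taken from the tool's source.
#[derive(Debug, PartialEq)]
struct QaPair {
    prompt: String,
    answer: String,
}

/// Wrap a chunk in the prompt prefix shown in the example output.
fn build_prompt(chunk: &str) -> String {
    format!("Here is some reference text:\n\n{chunk}")
}

/// Build QA pairs for every chunk from index `start` onward (zero-based,
/// which is an assumption). `ask` stands in for the LLM query.
fn build_pairs<F>(chunks: &[&str], start: usize, mut ask: F) -> Vec<QaPair>
where
    F: FnMut(&str) -> String,
{
    chunks
        .iter()
        .skip(start)
        .map(|chunk| {
            let prompt = build_prompt(chunk);
            let answer = ask(&prompt);
            QaPair { prompt, answer }
        })
        .collect()
}

fn main() {
    let chunks = ["What is the derivative of f(x) = x²?", "What is 2 + 2?"];
    // A canned answer replaces the real LLM query in this sketch.
    let pairs = build_pairs(&chunks, 1, |_prompt| String::from("4"));
    for pair in &pairs {
        println!("- prompt: {:?}\n  answer: {:?}", pair.prompt, pair.answer);
    }
}
```

Passing the LLM call in as a closure keeps the sketch testable without network access; a real implementation would also serialize the pairs to YAML rather than print them.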
---
## **Supported Sources**
| Source Type | Description |
|-------------|-------------|
| `book` | YAML files with questions generated from book excerpts (e.g., `"Title: Math for Dummies"`) |
| `manpage` | YAML files with questions generated from manpage excerpts |
| `mdbook` | YAML files with questions generated from Markdown excerpts (`mdbook`-built documentation) |
| `tealdeer` | YAML files with questions generated from command-line snippets (`tldr` commands) |
| `code` | YAML files with questions generated from C, Rust, or Assembly source code repositories (language-aware tokenization) |
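
One way to model the accepted `--source-type` values in Rust is an enum with a `FromStr` implementation. This is a sketch under assumptions: the `SourceType` name, its variants, and the error message are illustrative, and the real tool most likely gets this parsing from `clap` rather than writing it by hand.

```rust
use std::str::FromStr;

/// The five source types listed in the table above (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SourceType {
    Book,
    Manpage,
    Mdbook,
    Tealdeer,
    Code,
}

impl FromStr for SourceType {
    type Err = String;

    /// Accept exactly the lowercase values shown in the `--help` output.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "book" => Ok(Self::Book),
            "manpage" => Ok(Self::Manpage),
            "mdbook" => Ok(Self::Mdbook),
            "tealdeer" => Ok(Self::Tealdeer),
            "code" => Ok(Self::Code),
            other => Err(format!("unknown source type: {other}")),
        }
    }
}

fn main() {
    // "Book" is included to show that matching is case-sensitive here.
    for raw in ["book", "tealdeer", "Book"] {
        match raw.parse::<SourceType>() {
            Ok(kind) => println!("{raw} -> {kind:?}"),
            Err(err) => println!("{raw} -> error: {err}"),
        }
    }
}
```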
---
## **Implementation Notes**
- Uses `clap` for CLI parsing.
- Relies on `serde`, `tokio`, and `regex`.
- LLM queries are handled with exponential backoff (`MAX_RETRIES = 5`).
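
Retrying with exponential backoff can be sketched as below. Only `MAX_RETRIES = 5` comes from the notes above. The base delay, the doubling schedule, the `backoff_delay` and `retry_with_backoff` names, and the blocking `std::thread::sleep` are assumptions made for illustration; since the tool relies on `tokio`, it presumably sleeps asynchronously instead.

```rust
use std::thread::sleep;
use std::time::Duration;

const MAX_RETRIES: u32 = 5;
/// Assumed base delay; the tool's real value is not documented here.
const BASE_DELAY: Duration = Duration::from_millis(10);

/// Delay before retry number `attempt` (zero-based): base * 2^attempt.
fn backoff_delay(attempt: u32) -> Duration {
    BASE_DELAY * 2u32.pow(attempt)
}

/// Call `op` until it succeeds or `MAX_RETRIES` retries have been used.
/// Returns the last error if every attempt fails.
fn retry_with_backoff<T, E, F>(mut op: F) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
{
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if attempt >= MAX_RETRIES => return Err(err),
            Err(_) => {
                sleep(backoff_delay(attempt));
                attempt += 1;
            }
        }
    }
}

fn main() {
    let mut calls = 0;
    // Simulated flaky query: fails twice, then succeeds.
    let result: Result<&str, &str> = retry_with_backoff(|| {
        calls += 1;
        if calls < 3 { Err("timeout") } else { Ok("answer") }
    });
    println!("{result:?} after {calls} calls");
}
```

With these assumed values, one initial attempt plus five retries gives at most six calls, with waits of 10, 20, 40, 80, and 160 ms between them.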
---
## **Contributing**
- Report bugs or suggest improvements via GitHub Issues.
- Fork and extend to support new source types!
---
## **Final Thoughts**
Building datasets is the most difficult, time-consuming labor involved in the synthetic fine-tuning of LLMs. A well-thought-out workflow using [Awful Book Sanitizer](https://github.com/graves/awful_book_sanitizer), [Awful Knowledge Synthesizer](https://github.com/graves/awful_knowledge_synthesizer), and [Awful Dataset Builder](https://github.com/graves/awful_dataset_builder) will let you explore your wildest curiosities about human language, on the cutting edge of technological advancement, for as long as written language exists.
You can find open-source datasets I've generated using these tools on [Hugging Face](https://huggingface.co/dougiefresh/datasets).