awful_dataset_builder 0.1.3

Build LLM-ready Q/A datasets from reference text-to-question mappings produced by Awful Knowledge Synthesizer.
# ๐Ÿ—๏ธ **Awful Dataset Builder: Turn Reference Text/Exam Question mappings into Question/Answer pairs!** ๐Ÿ“š

> *“Turn your study notes into interrogation scripts for robots.”*

```
                                           __
                               _____....--' .'
                     ___...---'._ o      -`(
           ___...---'            \   .--.  `\
 ___...---'                      |   \   \ `|
|                                |o o |  |  |
|                                 \___'.-`.  '.
|                                      |   `---'
'^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^' LGB
```

```
λ awful_dataset_builder --help
Generate final exam questions from YAML book chunks

Usage: awful_dataset_builder --dir <DIR> --config <CONFIG> --start <START> --source-type <SOURCE_TYPE>

Options:
  -d, --dir <DIR>                  Path to directory of .yaml book files
  -c, --config <CONFIG>            Configuration file
  -s, --start <START>              Start processing file from this chunk
      --source-type <SOURCE_TYPE>  Source type [possible values: book, manpage, mdbook, tealdeer, code]
  -h, --help                       Print help
```

---

## 🧠 **What It Does**
`awful_dataset_builder` is a command-line tool that takes structured YAML files (from `awful_knowledge_synthesizer`) and generates **question-answer pairs** using Large Language Models (LLMs). It's your go-to tool for building datasets for finetuning LLMs.

---

## 🎯 **Features**
- ✅ **Multi-source support**: Books, manpages, mdbooks, tealdeer (command-line snippets), and code files can be turned into exam questions by [awful_knowledge_synthesizer](https://github.com/graves/awful_knowledge_synthesizer).
- 🧠 **LLM-powered QA pairs**: Fetches answers for final exam questions using `awful_aj`.
- 📄 **YAML output**: Saves results as structured YAML files (e.g., `math_questions.yaml`).
- 🔄 **Chunked processing**: Splits text into chunks for robust LLM queries.

---

## 📦 **How to Use**
### 🔧 Sample Command
```bash
λ awful_dataset_builder --dir ./books --config config.yaml --source-type book --start 1
```
- `--dir`: Path to YAML files (e.g., `books/`).  
- `--config`: Configuration file for LLM API (OpenAI, etc.).  
- `--source-type`: Choose from: `book`, `manpage`, `mdbook`, `tealdeer`, or `code`.
- `--start`: Start processing from this chunk, skipping the ones before it (useful for parallel processing).
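
As a rough illustration of the values listed in the `--help` output, the sketch below parses a source type by hand using only the standard library. The `SourceType` enum name and the case-sensitive matching are assumptions made for illustration; the tool itself parses its arguments with `clap`, and this is not its code.

```rust
use std::str::FromStr;

/// Hypothetical enum mirroring the values shown by `--help`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SourceType {
    Book,
    Manpage,
    Mdbook,
    Tealdeer,
    Code,
}

impl FromStr for SourceType {
    type Err = String;

    /// Accepts exactly the lowercase spellings from the help text.
    /// Case-sensitive matching is an assumption of this sketch.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "book" => Ok(Self::Book),
            "manpage" => Ok(Self::Manpage),
            "mdbook" => Ok(Self::Mdbook),
            "tealdeer" => Ok(Self::Tealdeer),
            "code" => Ok(Self::Code),
            other => Err(format!(
                "invalid source type '{}': expected book, manpage, mdbook, tealdeer, or code",
                other
            )),
        }
    }
}
```

In this sketch, `"book".parse::<SourceType>()` yields `Ok(SourceType::Book)`, while an unknown value yields an error message listing the accepted spellings.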

### 📄 Example Output
For a book YAML file:
```yaml
title: "Calculus for Dummies"
chunks:
  - "What is the derivative of f(x) = x²?"
```
Output:
```yaml
- prompt: "Here is some reference text:\n\nWhat is the derivative of f(x) = x²?"
  answer: "The derivative of $ f(x) = x^2 $ is $ 2x $."
```

---

## 🤓 **How It Works**
1. **Parse YAML**: Extracts structured Reference Text to Final Exam Question mappings.  
2. **LLM Query**: Uses templates to generate questions and fetch answers via `awful_aj`.  
3. **Output**: Saves QA pairs in YAML format (e.g., `math_questions.yaml`).  
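
The three steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the tool's actual code: the `Mapping` and `QaPair` types, their field names, the `build_prompt` and `build_pairs` functions, and the exact prompt layout are all assumptions, and the `ask` closure merely stands in for the LLM call that the real tool makes through `awful_aj`.

```rust
/// One reference passage and the exam questions generated for it.
/// Hypothetical shape; the field names are not the tool's real schema.
struct Mapping {
    reference: String,
    questions: Vec<String>,
}

/// One finished dataset row, mirroring the `prompt` / `answer` keys
/// of the example output.
#[derive(Debug, PartialEq)]
struct QaPair {
    prompt: String,
    answer: String,
}

/// Combine the reference text and one question into a single prompt.
/// The template wording is an assumption based on the example output.
fn build_prompt(reference: &str, question: &str) -> String {
    format!("Here is some reference text:\n\n{}\n\n{}", reference, question)
}

/// Build one Q/A pair per question. `ask` stands in for the LLM query.
fn build_pairs<F>(mappings: &[Mapping], mut ask: F) -> Vec<QaPair>
where
    F: FnMut(&str) -> String,
{
    let mut pairs = Vec::new();
    for mapping in mappings {
        for question in &mapping.questions {
            let prompt = build_prompt(&mapping.reference, question);
            let answer = ask(&prompt);
            pairs.push(QaPair { prompt, answer });
        }
    }
    pairs
}
```

Passing the model call in as a closure keeps the sketch testable without network access; serializing the resulting pairs to YAML is left out here.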

---

## 📚 **Supported Sources**
| Source Type | Description |
|-------------|-------------|
| `book`      | YAML files with questions generated from book excerpts (e.g., `"Title: Math for Dummies"`) |
| `manpage`   | YAML files with questions generated from manpage excerpts |
| `mdbook`    | YAML files with questions generated from Markdown excerpts (`mdbook` built documentation) |
| `tealdeer`  | YAML files with questions generated from command-line snippets (`tldr` commands) |
| `code`      | YAML files with questions generated from C, Rust, or Assembly source code repositories (language-aware tokenization) |

---

## 🧪 **Implementation Notes**
- Uses `clap` for CLI parsing.  
- Relies on `serde`, `tokio`, and `regex`.  
- LLM queries are handled with exponential backoff (`MAX_RETRIES = 5`).  
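
For readers unfamiliar with the retry strategy mentioned above, here is a minimal sketch of exponential backoff. Only `MAX_RETRIES = 5` comes from this README. The function names, the doubling schedule, the caller-supplied base delay, and the reading of `MAX_RETRIES` as "retries after the first attempt" are assumptions, and the blocking `std::thread::sleep` stands in for whatever async timer the `tokio`-based tool actually uses.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retry limit quoted in the implementation notes.
const MAX_RETRIES: u32 = 5;

/// Delay before retry number `attempt` (0-based): base * 2^attempt.
/// Saturating arithmetic avoids overflow for large attempt counts.
fn backoff_delay(base: Duration, attempt: u32) -> Duration {
    base.saturating_mul(2u32.saturating_pow(attempt))
}

/// Call `op` until it succeeds, retrying at most `MAX_RETRIES` times
/// after the first attempt. Returns the last error if all calls fail.
fn retry_with_backoff<T, E>(
    base: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of retries: surface the most recent error.
            Err(err) if attempt >= MAX_RETRIES => return Err(err),
            Err(_) => {
                sleep(backoff_delay(base, attempt));
                attempt += 1;
            }
        }
    }
}
```

With a base delay of 500 ms (a hypothetical value), this sketch would wait 0.5 s, 1 s, 2 s, 4 s, and 8 s between calls before giving up and returning the last error.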

---

## โค๏ธ **Contributing**  
- Report bugs or suggest improvements via GitHub Issues.  
- Fork and extend to support new source types!  

---

## ✨ **Final Thoughts**
Building datasets is the most difficult, time-consuming labor involved in the Synthetic Finetuning of LLMs. A well-thought-out workflow using [Awful Book Sanitizer](https://github.com/graves/awful_book_sanitizer), [Awful Knowledge Synthesizer](https://github.com/graves/awful_knowledge_synthesizer), and [Awful Dataset Builder](https://github.com/graves/awful_dataset_builder) will allow you to experiment with your wildest curiosities about human language, on the cutting edge of technological advancement, for as long as written language exists. 🎉

You can find open-source datasets I've generated using these tools on [Hugging Face](https://huggingface.co/dougiefresh/datasets).