llm-bucket 0.2.0

Open source core logic and pipelines for synchronising a bucket with content for LLMs to consume with RAG.
Documentation

llm-bucket

A fast, structured CLI utility for aggregating knowledge snapshots from Git repositories (and soon other sources) into ready-to-ingest local outputs and/or uploading them to an API for knowledge base workflows, LLM training, and auditing. Designed for automation, repeatability, and clean output.


Features

  • Sync multiple Git repositories (public or private) in a single run via YAML config.
  • Processing options: Flatten all files or (optionally) convert repository README.md to PDF.
  • Deterministic, isolated output folders for easy downstream use.
  • Single-command CLI—no interactive/manual steps required.
  • Uploads to API (if configured): integrates with external knowledge stores (requires environment config).
  • Extensible pipeline: Designed for additional source types (Confluence, Slack, etc.) in future.

Quick Start

1. Install Prerequisites

  • Rust (stable, edition 2021)
  • Git (must be on your PATH)

2. Build the CLI

git clone https://github.com/kasbuunk/llm-bucket.git
cd llm-bucket
cargo build --release

The executable will be at ./target/release/llm-bucket.


Configuration

All actions are configured in a YAML file. No command-line flags for input sources.

Example minimal config (config.yaml):

download:
  output_dir: ./output
  sources:
    - type: git
      repo_url: "https://github.com/youruser/yourrepo.git"
      reference: main         # optional: branch/tag/commit

process:
  kind: FlattenFiles          # or ReadmeToPDF
  • output_dir: Root directory for clones & processed data (recommended: gitignore this in production).
  • sources: List of source blocks. Currently only type: git is supported.
    • repo_url: HTTPS or SSH URL for the git repo.
    • reference: Optional; branch/tag/commit (default: main).
  • process.kind: Currently accepts:
    • FlattenFiles: Flatten all files for upload.
    • ReadmeToPDF: Convert repository README.md to PDF (if implemented for your repo).

Usage

After configuring config.yaml, run:

./target/release/llm-bucket sync --config config.yaml
  • The CLI clones each repo and processes it as specified.
  • Output is placed under output_dir (subdirectories per repo, deterministic naming).

Note: Only subcommand available is sync (see below).


Upload/API Integration

Uploading to a remote knowledge base/API requires these environment variables:

  • BUCKET_ID — Integer bucket/project ID (provided by backend/admin)
  • OCP_APIM_SUBSCRIPTION_KEY — API key/token for upload

You can use a .env file (auto-loaded by the CLI) or set variables in your environment:

export BUCKET_ID=1234
export OCP_APIM_SUBSCRIPTION_KEY=your-token-here

Output & Structure

  • All results are placed within the configured output_dir.
    • Each source is mapped to a uniquely named subfolder (sanitized from repo URL and branch).
    • Data format and structure matches your selected process.kind.
  • Example: For a single repo, with FlattenFiles, output might look like:
    ./output/
      git_github_com_youruser_yourrepo_git_main/
        src/
        README.md
        ...
    

Logging & Diagnostics

  • All actions are logged via tracing.
  • Log summaries printed to stdout/stderr, including errors and high-level summaries.

Development & Testing

Run all checks and tests:

cargo test

Tests cover:

  • End-to-end sync with test repos
  • CLI invocation with configs
  • Processor output validation

Contribution & Roadmap

  • All contributions welcome via PR or issues.
  • Please keep PRs small and focused.
  • Future: support for Confluence, Slack, Google Drive, robust error reports, scheduling, more processors.

FAQ

Q: How do I add a new source type?
A: See the src/ directory for modular structure; implement new SourceAction and expand the YAML loader, then add download, processing, and (optional) upload logic.

Q: Is interactive usage supported?
A: No—llm-bucket is for declarative, repeatable workflows. Only config files.

Q: Is output safe for public commit?
A: No. Output is meant for ingestion/upload, not VCS; it should be gitignored.


License

MIT (see LICENSE).


For design notes and future directions, see notes.md.