headson 0.7.15 - Docs.rs

<h1 align="center">
  <img src="https://raw.githubusercontent.com/kantord/headson/main/docs/assets/logo.svg" alt="headson" width="221" />
</h1>
<p align="center">
  <img src="https://raw.githubusercontent.com/kantord/headson/main/docs/assets/tapes/demo.gif" alt="Terminal demo" width="1560" height="900" />
  <br/>
</p>

`head`/`tail` for JSON, YAML — but structure‑aware. Get a compact preview that shows both the shape and representative values of your data, all within a strict byte budget. (Just like `head`/`tail`, `hson` can also work with unstructured text files.)

Available as:
- CLI (see [Usage](#usage))
- Python library (see [Python Bindings](#python-bindings))

![Codecov](https://img.shields.io/codecov/c/github/kantord/headson?style=flat-square) ![Crates.io Version](https://img.shields.io/crates/v/headson?style=flat-square) ![PyPI - Version](https://img.shields.io/pypi/v/headson?style=flat-square)


## Install


Using Cargo:

    cargo install headson

> Note: the CLI installs as `hson`. All examples below use `hson ...`.

From source:

    cargo build --release
    target/release/hson --help


## Features

- Budgeted output: specify exactly how much you want to see
- Output formats: `auto | json | yaml | text`
  - Styles: `strict | default | detailed`
    - JSON family: `strict` → strict JSON, `default` → human‑friendly Pseudo, `detailed` → JS with inline comments
    - YAML: always YAML; `strict` has no comments, `default` uses “# …”, `detailed` uses “# N more …”
    - Text: prints raw lines. In `default` style, omissions are shown as a single line `…`; in `detailed`, as `… N more lines …`. `strict` omits array‑level summaries.
- Multiple inputs: preview many files at once with a shared or per‑file budget
- Fast: processes gigabyte‑scale files in seconds (mostly disk‑bound)
- Available as a CLI app and as a Python library

## Fits into command line workflows

If you’re comfortable with tools like `head` and `tail`, use `hson` when you want a quick, structured peek into a JSON file without dumping the entire thing.

- `head`/`tail` operate on bytes/lines - their output is not optimized for tree structures
- `jq`: you need to craft filters to preview large JSON files
- `hson`: head/tail for trees—zero‑config by default; force text with `-i text` when you want raw lines

## Usage

    hson [FLAGS] [INPUT...]

- INPUT (optional, repeatable): file path(s). If omitted, reads from stdin. Multiple input files are supported.
- Prints the preview to stdout. On parse errors, exits non‑zero and prints an error to stderr.

Common flags:

- `-c, --bytes <BYTES>`: per‑file output budget (bytes). For multiple inputs, default total budget is `<BYTES> * number_of_inputs`.
- `-u, --chars <CHARS>`: per‑file output budget (Unicode code points). Behaves like `--bytes` but counts characters instead of bytes.
- `-C, --global-bytes <BYTES>`: total output budget across all inputs. With `--bytes`, the effective total is the smaller of the two.
- `-f, --format <auto|json|yaml|text>`: output format (default: `auto`).
  - Auto: stdin → JSON family; filesets → per‑file based on extension (`.json` → JSON family, `.yaml`/`.yml` → YAML, unknown → Text).
- `-t, --template <strict|default|detailed>`: output style (default: `default`).
  - JSON family: `strict` → strict JSON; `default` → Pseudo; `detailed` → JS with inline comments.
  - YAML: always YAML; style only affects comments (`strict` none, `default` “# …”, `detailed` “# N more …”).
- `-i, --input-format <json|yaml|text>`: ingestion format (default: `json`). For filesets in `auto` format, ingestion is chosen by extensions.
- `-m, --compact`: no indentation, no spaces, no newlines
- `--no-newline`: single line output
- `--no-header`: suppress fileset section headers (useful when embedding output in scripts)
- `--no-space`: no space after `:` in objects
- `--indent <STR>`: indentation unit (default: two spaces)
- `--string-cap <N>`: max graphemes to consider per string (default: 500)
- `--head`: prefer the beginning of arrays when truncating (keep first N). Strings are unaffected. Display styles place omission markers accordingly; strict JSON remains unannotated. Mutually exclusive with `--tail`.
- `--tail`: prefer the end of arrays when truncating (keep last N). Strings are unaffected. Display styles place omission markers accordingly; strict JSON remains unannotated. Mutually exclusive with `--head`.

Notes:

- Multiple inputs:
  - With newlines enabled, file sections are rendered with human‑readable headers (pass `--no-header` to suppress them). In compact/single‑line modes, headers are omitted.
  - Order: inputs are sorted by git frecency (via frecenfile) when available, then by mtime; pass `--no-sort` to keep the original input order without repo scanning.
- In `--format auto`, each file uses its own best format: JSON family for `.json`, YAML for `.yaml`/`.yml`.
  - Unknown extensions are treated as Text (raw lines) — safe for logs and `.txt` files.
  - `--global-bytes` may truncate or omit entire files to respect the total budget.
  - The tool finds the largest preview that fits the budget; even if extremely tight, you still get a minimal, valid preview.
  - Directories and binary files are ignored; a notice is printed to stderr for each. Stdin reads the stream as‑is.
  - Head vs Tail sampling: these options bias which part of arrays are kept before rendering. Display styles may still insert internal gap markers to honor very small budgets; strict JSON stays unannotated.

### Working with multiple files

- Budgets: per-file caps (`--bytes`/`--chars`/`--lines`) apply to each input; global caps (`--global-*`) constrain the combined output. Default byte budget scales by input count when no globals are set.
- Sorting: inputs are pre-sorted by git frecency (frecenfile) with last-modified-time fallback so recently touched files appear first. Pass `--no-sort` to preserve the order you provided and skip repo scanning.
- Headers: fileset sections get `==>` headers when newlines are enabled; hide them with `--no-header`. Compact and single-line modes omit headers automatically.
- Formats: in `--format auto`, each file picks JSON/YAML/Text based on extension; unknowns fall back to Text so mixed filesets “just work.”

## Budget Modes

- Bytes (`-c/--bytes`, `-C/--global-bytes`)
  - Measures UTF‑8 bytes in the output.
  - Default per‑file budget is 500 bytes when neither `--lines` nor `--chars` is provided.
  - Multiple inputs: total default budget is `<BYTES> * number_of_inputs`; `--global-bytes` caps the total.

- Characters (`-u/--chars`)
  - Measures Unicode code points (not grapheme clusters).

- Lines (`-n/--lines`, `-N/--global-lines`)
  - Caps the number of lines in the output.
  - Incompatible with `--no-newline`.
  - Multiple inputs: defaults to `<LINES> * number_of_inputs`; `--global-lines` caps the total.
  - Fileset headers, blank separators, and summary lines do not count toward the line cap; only actual content lines are considered.

- Interactions and precedence
  - All active budgets are enforced simultaneously. The render must satisfy all of: bytes (if set), chars (if set), and lines (if set). The strictest cap wins.
  - When only lines are specified, no implicit byte cap applies. When neither lines nor chars are specified, a 500‑byte default applies.

Quick one‑liners:

- Peek a big JSON stream (keeps structure):

      zstdcat huge.json.zst | hson -c 800 -f json -t default

- Many files with a fixed overall size:

      hson -C 1200 -f json -t strict logs/*.json

- Glance at a file, JavaScript‑style comments for omissions:

      hson -c 400 -f json -t detailed data.json

- YAML with detailed comments:

      hson -c 400 -f yaml -t detailed config.yaml

### Text mode

- Single file (auto):

      hson -c 200 notes.txt

- Force Text ingest/output (useful when mixing with other extensions, or when the extension suggests JSON/YAML):

      hson -c 200 -i text -f text notes.txt
      # Force text ingest even if the file looks like JSON
      hson -i text notes.json

- Styles on Text:
  - default: omission as a standalone `…` line.
  - detailed: omission as `… N more lines …`.
  - strict: no array‑level omission line (individual long lines may still truncate with `…`).

> **Note:** Filesets always render with per-file auto templates. When you need to preview a directory of mixed formats, skip `-f text` and let `-f auto` pick the right renderer for each entry.

Show help:

    hson --help

Note: flags align with head/tail conventions (`-c/--bytes`, `-C/--global-bytes`).

## Examples: head vs hson

Input:

```json
{"users":[{"id":1,"name":"Ana","roles":["admin","dev"]},{"id":2,"name":"Bo"}],"meta":{"count":2,"source":"db"}}
```

Naive cut (can break mid‑token):

```bash
jq -c . users.json | head -c 80
# {"users":[{"id":1,"name":"Ana","roles":["admin","dev"]},{"id":2,"name":"Bo"}],"me
```

Structured preview with hson (JSON family, default style → Pseudo):

```bash
hson -c 120 -f json -t default users.json
# {
#   users: [
#     { id: 1, name: "Ana", roles: [ "admin", … ] },
#     …
#   ]
#   meta: { count: 2, … }
# }
```

Machine‑readable preview (JSON family, strict style → strict JSON):

```bash
hson -c 120 -f json -t strict users.json
# {"users":[{"id":1,"name":"Ana","roles":["admin"]}],"meta":{"count":2}}
```

## Terminal Demos

Regenerate locally:

- Place tapes under docs/tapes (e.g., docs/tapes/demo.tape)
- Run: cargo make tapes
- Outputs are written to docs/assets/tapes


## Python Bindings

A thin Python extension module is available on PyPI as `headson`.

- Install: `pip install headson` (ABI3 wheels for Python 3.10+ on Linux/macOS/Windows).
- API:
  - `headson.summarize(text: str, *, format: str = "auto", style: str = "default", input_format: str = "json", byte_budget: int | None = None, skew: str = "balanced") -> str`
    - `format`: `"auto" | "json" | "yaml"` (auto maps to JSON family for single inputs)
    - `style`: `"strict" | "default" | "detailed"`
    - `input_format`: `"json" | "yaml"` (ingestion)
    - `byte_budget`: maximum output size in bytes (default: 500)
    - `skew`: `"balanced" | "head" | "tail"` (affects display styles; strict JSON remains unannotated)

Examples:

```python
import json
import headson

data = {"foo": [1, 2, 3], "bar": {"x": "y"}}
preview = headson.summarize(json.dumps(data), format="json", style="strict", byte_budget=200)
print(preview)

# Prefer the tail of arrays (annotations show with style="default"/"detailed")
print(
    headson.summarize(
        json.dumps(list(range(100))),
        format="json",
        style="detailed",
        byte_budget=80,
        skew="tail",
    )
)

# YAML support
doc = "root:\n  items: [1,2,3,4,5,6,7,8,9,10]\n"
print(headson.summarize(doc, format="yaml", style="default", input_format="yaml", byte_budget=60))
```

## Source Code Support

Source code support is a challenging area. While `headson`'s algorithm and code structure would allow for the use of
completely accurate parsing using language-specific parsers using `tree-sitter`, this would increase the complexity
of the application and its number of dependencies.

Instead of attempting a deep parse of source code files, we convert them into nested arrays based on a heuristic that
understands indentation patterns in the file.

When `headson` detects a code-like file, it uses a set of additional heuristics:
- **Atomic line ingest**: each line is treated as an atomic string so omission markers never split a code line.
- **Depth-aware sampling**:
  - We attempt to include more of the top level of the source code in order to give a good overview of classes, function and constants at the top level.
  - Nested blocks (function bodies, loops) prefer to omit lines in the middle to attempt to preserve natural "block" boundaries
- **Header priority**: lines that introduce a nested block (e.g., `def foo():`) get a small priority boost to ensure they survive tight budgets.

# Algorithm

![Algorithm overview](https://raw.githubusercontent.com/kantord/headson/main/docs/assets/algorithm.svg)

## Footnotes
 - <sup><b>[1]</b></sup> <b>Optimized tree representation</b>: An arena‑style tree stored in flat, contiguous buffers. Each node records its kind and value plus index ranges into shared child and key arrays. Arrays are ingested in a single pass and may be deterministically pre‑sampled: the first element is always kept; additional elements are selected via a fixed per‑index inclusion test; for kept elements, original indices are stored and full lengths are counted. This enables accurate omission info and internal gap markers later, while minimizing pointer chasing.
 - <sup><b>[2]</b></sup> <b>Priority order</b>: Nodes are scored so previews surface representative structure and values first. Arrays can favor head/mid/tail coverage (default) or strictly the head; tail preference flips head/tail when configured. Object properties are ordered by key, and strings expand by grapheme with early characters prioritized over very deep expansions.
 - <sup><b>[3]</b></sup> <b>Choose top N nodes (binary search)</b>: Iteratively picks N so that the rendered preview fits within the byte budget, looping between “choose N” and a render attempt to converge quickly.
 - <sup><b>[4]</b></sup> <b>Render attempt</b>: Serializes the currently included nodes using the selected template. Omission summaries and per-file section headers appear in display templates (pseudo/js); json remains strict. For arrays, display templates may insert internal gap markers between non‑contiguous kept items using original indices.
 - <sup><b>[5]</b></sup> <b>Diagram source</b>: The Algorithm diagram is generated from `docs/diagrams/algorithm.mmd`. Regenerate the SVG with `cargo make diagrams` before releasing.

## License

MIT