scan_json 2.0.0

# React to elements in a JSON stream

Parse JSON and execute callbacks based on patterns, even before the entire document is available. [llms.txt](https://raw.githubusercontent.com/olpa/streaming_json/refs/heads/master/scan_json/llms.txt).

For a fast start,

- first look at the concepts and examples in this `README`,
- then learn about [`crate::scan()`], and
- about the context stack and matching by [`crate::iter_match()`].

`scan_json` is designed to support zero-allocation and `no_std` environments, but a transitive dependency through `jiter` currently requires `std`.

## Concepts

The library uses the streaming JSON parser [`RJiter`](https://crates.io/crates/rjiter). While parsing, it maintains context, which is the path of element names from the root to the current nesting level.

The workflow for each key:

- First, call `find_action` and execute if found
- If the key value is an object or array, update the context and parse the next level
- Afterwards, call `find_end_action` and execute if found

An action receives two arguments:

- `rjiter`: A mutable reference to the `RJiter` parser object. An action can modify JSON parsing behavior by consuming the current key's value
- `baton`: This can be either:
  - A simple `Copy` type (like `i32`, `bool`, `()`) passed by value for read-only or stateless operations
  - `&RefCell<B>` for mutable state that needs to be shared across action calls

## Example of an action

`find_action` uses the library helper [`iter_match`] to detect the `content` key and return the `on_content` function.

The action peeks the value and writes it to the output. Because the value is consumed, the action returns the `ValueIsConsumed` flag to `scan` so it can update its internal state.

```rust
use scan_json::{scan, iter_match, Action, StreamOp, Options};
use scan_json::matcher::StructuralPseudoname;
use scan_json::stack::ContextIter;
use rjiter::RJiter;
use std::cell::RefCell;
use embedded_io::Write;
use u8pool::U8Pool;

fn on_content(rjiter: &mut RJiter<&[u8]>, writer_cell: &RefCell<Vec<u8>>) -> StreamOp {
    let mut writer = writer_cell.borrow_mut();
    let result = rjiter
        .peek()
        .and_then(|_| rjiter.write_long_bytes(&mut *writer));
    match result {
        Ok(_) => StreamOp::ValueIsConsumed,
        // This example discards detailed error info for simplicity.
        // See [`crate::idtransform()`] for production-grade error handling.
        Err(_e) => StreamOp::Error("RJiter error"),
    }
}

// Find action function that matches "content" key
let find_action = |structural_pseudoname: StructuralPseudoname, context: ContextIter, _baton: &RefCell<Vec<u8>>| -> Option<Action<&RefCell<Vec<u8>>, &[u8]>> {
    if iter_match(|| ["content".as_bytes()], structural_pseudoname, context) {
        Some(on_content)
    } else {
        None
    }
};
```

## Complete example: Identity transformation

The identity transformation copies JSON input to output, retaining the original structure.

The function [`crate::idtransform::idtransform()`] is not just a library function,
but also an example of advanced `scan` use. Read the source code for details.

Additionally, the function [`crate::idtransform::copy_atom()`] can be useful.


## Complete example: converting an LLM stream

Summary:

- Initialize the parser
- Create the black box with a `Vec`, which is used as `dyn Write` in actions
- Create handlers for `message`, `content`, and a handler for the end of `message`
- Combine all together in the `scan` function

The example demonstrates that `scan` can be used to handle LLM streaming output:

- The input consists of several top-level JSON objects not wrapped in an array
- The server-side-events tokens are ignored

```rust
use std::cell::RefCell;
use embedded_io::Write;
use scan_json::{scan, iter_match, Action, EndAction, StreamOp, Options};
use scan_json::matcher::StructuralPseudoname;
use scan_json::stack::ContextIter;
use rjiter::RJiter;
use u8pool::U8Pool;

fn on_begin_message(_: &mut RJiter<&[u8]>, writer: &RefCell<Vec<u8>>) -> StreamOp {
    writer.borrow_mut().write_all(b"(new message)\n").unwrap();
    StreamOp::None
}

fn on_content(rjiter: &mut RJiter<&[u8]>, writer_cell: &RefCell<Vec<u8>>) -> StreamOp {
    let mut writer = writer_cell.borrow_mut();
    let result = rjiter
        .peek()
        .and_then(|_| rjiter.write_long_bytes(&mut *writer));
    match result {
        Ok(_) => StreamOp::ValueIsConsumed,
        // This example discards detailed error info for simplicity.
        // See [`crate::idtransform()`] for production-grade error handling.
        Err(_e) => StreamOp::Error("RJiter error"),
    }
}

fn on_end_message(writer: &RefCell<Vec<u8>>) -> Result<(), &'static str> {
    writer.borrow_mut().write_all(b"\n").unwrap();
    Ok(())
}

fn scan_llm_output(json: &str) -> RefCell<Vec<u8>> {
    let mut reader = json.as_bytes();
    let mut buffer = vec![0u8; 32];
    let mut rjiter = RJiter::new(&mut reader, &mut buffer);
    let writer_cell = RefCell::new(Vec::new());

    let find_action = |structural_pseudoname: StructuralPseudoname, context: ContextIter, _baton: &RefCell<Vec<u8>>| -> Option<Action<&RefCell<Vec<u8>>, &[u8]>> {
        if iter_match(|| ["content".as_bytes()], structural_pseudoname, context.clone()) {
            Some(on_content)
        } else if iter_match(|| ["message".as_bytes()], structural_pseudoname, context.clone()) {
            Some(on_begin_message)
        } else {
            None
        }
    };
    let find_end_action = |structural_pseudoname: StructuralPseudoname, context: ContextIter, _baton: &RefCell<Vec<u8>>| -> Option<EndAction<&RefCell<Vec<u8>>>> {
        if iter_match(|| ["message".as_bytes()], structural_pseudoname, context.clone()) {
            Some(on_end_message)
        } else {
            None
        }
    };

    // Create working buffer for context stack (512 bytes, up to 20 nesting levels)
    // Based on estimation: 16 bytes per JSON key, plus 8 bytes per frame for state tracking
    let mut working_buffer = [0u8; 512];
    let mut context = U8Pool::new(&mut working_buffer, 20).unwrap();

    scan(
        find_action,
        find_end_action,
        &mut rjiter,
        &writer_cell,
        &mut context,
        {
            let sse_tokens: &[&[u8]] = &[b"data:", b"DONE"];
            &Options::with_sse_tokens(sse_tokens)
        },
    )
    .unwrap();

    writer_cell
}

// ---------------- Sample LLM output as `scan_llm_output` input

let json = r#"{
  "id": "chatcmpl-Ahpq4nZeP9mESaKsCVdmZdK96IrUH",
  "object": "chat.completion",
  "created": 1735010736,
  "model": "gpt-4o-mini-2024-07-18",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I assist you today?",
        "refusal": null
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 10,
    "total_tokens": 19,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "audio_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0,
      "audio_tokens": 0,
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0
    }
  },
  "system_fingerprint": "fp_0aa8d3e20b"
}"#;

let writer_cell = scan_llm_output(json);
let message = String::from_utf8(writer_cell.borrow().to_vec()).unwrap();
assert_eq!(message, "(new message)\nHello! How can I assist you today?\n");

// ---------------- Another sample of LLM output, the streaming version
let json = r#"
data: {"choices":[{"index":0,"delta":{"role":"assistant","content":"","refusal":null},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}

data: {"choices":[{"index":0,"delta":{"content":"Hello"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}

data: {"choices":[{"index":0,"delta":{"content":"!"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}

data: {"choices":[{"index":0,"delta":{"content":" How"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}

data: {"choices":[{"index":0,"delta":{"content":" can"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}

data: {"choices":[{"index":0,"delta":{"content":" I"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}

data: {"choices":[{"index":0,"delta":{"content":" assist"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}

data: {"choices":[{"index":0,"delta":{"content":" you"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}

data: {"choices":[{"index":0,"delta":{"content":" today"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}

data: {"choices":[{"index":0,"delta":{"content":"?"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}

data: {"choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"stop"}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}

data: [DONE]
"#;

let writer_cell = scan_llm_output(json);
let message = String::from_utf8(writer_cell.borrow().to_vec()).unwrap();
assert_eq!(message, "Hello! How can I assist you today?");
```


# Colophon

License: MIT

Author: Oleg Parashchenko, olpa@ <https://uucode.com/>

Contact: via email or [Ailets Discord](https://discord.gg/HEBE3gv2)

`scan_json` is a part of the [ailets.org](https://ailets.org) project.