Crate scan_json

Crate scan_json 

Source
Expand description

§React to elements in a JSON stream

Parse JSON and execute callbacks based on patterns, even before the entire document is available. llms.txt.

For a fast start,

scan_json is designed to support zero-allocation and no_std environments, but a transitive dependency through jiter currently requires std.

§Concepts

The library uses the streaming JSON parser RJiter. While parsing, it maintains context, which is the path of element names from the root to the current nesting level.

The workflow for each key:

  • First, call find_action and execute if found
  • If the key value is an object or array, update the context and parse the next level
  • Afterwards, call find_end_action and execute if found

An action receives two arguments:

  • rjiter: A mutable reference to the RJiter parser object. An action can modify JSON parsing behavior by consuming the current key’s value
  • baton: This can be either:
    • A simple Copy type (like i32, bool, ()) passed by value for read-only or stateless operations
    • &RefCell<B> for mutable state that needs to be shared across action calls

§Example of an action

find_action uses the library helper iter_match to detect the content key and return the on_content function.

The action peeks the value and writes it to the output. Because the value is consumed, the action returns the ValueIsConsumed flag to scan so it can update its internal state.

use scan_json::{scan, iter_match, Action, StreamOp, Options};
use scan_json::matcher::StructuralPseudoname;
use scan_json::stack::ContextIter;
use rjiter::RJiter;
use std::cell::RefCell;
use embedded_io::Write;
use u8pool::U8Pool;

fn on_content(rjiter: &mut RJiter<&[u8]>, writer_cell: &RefCell<Vec<u8>>) -> StreamOp {
    let mut writer = writer_cell.borrow_mut();
    let result = rjiter
        .peek()
        .and_then(|_| rjiter.write_long_bytes(&mut *writer));
    match result {
        Ok(_) => StreamOp::ValueIsConsumed,
        // This example discards detailed error info for simplicity.
        // See [`crate::idtransform()`] for production-grade error handling.
        Err(_e) => StreamOp::Error("RJiter error"),
    }
}

// Find action function that matches "content" key
let find_action = |structural_pseudoname: StructuralPseudoname, context: ContextIter, _baton: &RefCell<Vec<u8>>| -> Option<Action<&RefCell<Vec<u8>>, &[u8]>> {
    if iter_match(|| ["content".as_bytes()], structural_pseudoname, context) {
        Some(on_content)
    } else {
        None
    }
};

§Complete example: Identity transformation

The identity transformation copies JSON input to output, retaining the original structure.

The function crate::idtransform::idtransform() is not just a library function, but also an example of advanced scan use. Read the source code for details.

Additionally, the function crate::idtransform::copy_atom() can be useful.

§Complete example: converting an LLM stream

Summary:

  • Initialize the parser
  • Create the black box with a Vec, which is used as dyn Write in actions
  • Create handlers for message, content, and a handler for the end of message
  • Combine all together in the scan function

The example demonstrates that scan can be used to handle LLM streaming output:

  • The input consists of several top-level JSON objects not wrapped in an array
  • The server-side-events tokens are ignored
use std::cell::RefCell;
use embedded_io::Write;
use scan_json::{scan, iter_match, Action, EndAction, StreamOp, Options};
use scan_json::matcher::StructuralPseudoname;
use scan_json::stack::ContextIter;
use rjiter::RJiter;
use u8pool::U8Pool;

fn on_begin_message(_: &mut RJiter<&[u8]>, writer: &RefCell<Vec<u8>>) -> StreamOp {
    writer.borrow_mut().write_all(b"(new message)\n").unwrap();
    StreamOp::None
}

fn on_content(rjiter: &mut RJiter<&[u8]>, writer_cell: &RefCell<Vec<u8>>) -> StreamOp {
    let mut writer = writer_cell.borrow_mut();
    let result = rjiter
        .peek()
        .and_then(|_| rjiter.write_long_bytes(&mut *writer));
    match result {
        Ok(_) => StreamOp::ValueIsConsumed,
        // This example discards detailed error info for simplicity.
        // See [`crate::idtransform()`] for production-grade error handling.
        Err(_e) => StreamOp::Error("RJiter error"),
    }
}

fn on_end_message(writer: &RefCell<Vec<u8>>) -> Result<(), &'static str> {
    writer.borrow_mut().write_all(b"\n").unwrap();
    Ok(())
}

fn scan_llm_output(json: &str) -> RefCell<Vec<u8>> {
    let mut reader = json.as_bytes();
    let mut buffer = vec![0u8; 32];
    let mut rjiter = RJiter::new(&mut reader, &mut buffer);
    let writer_cell = RefCell::new(Vec::new());

    let find_action = |structural_pseudoname: StructuralPseudoname, context: ContextIter, _baton: &RefCell<Vec<u8>>| -> Option<Action<&RefCell<Vec<u8>>, &[u8]>> {
        if iter_match(|| ["content".as_bytes()], structural_pseudoname, context.clone()) {
            Some(on_content)
        } else if iter_match(|| ["message".as_bytes()], structural_pseudoname, context.clone()) {
            Some(on_begin_message)
        } else {
            None
        }
    };
    let find_end_action = |structural_pseudoname: StructuralPseudoname, context: ContextIter, _baton: &RefCell<Vec<u8>>| -> Option<EndAction<&RefCell<Vec<u8>>>> {
        if iter_match(|| ["message".as_bytes()], structural_pseudoname, context.clone()) {
            Some(on_end_message)
        } else {
            None
        }
    };

    // Create working buffer for context stack (512 bytes, up to 20 nesting levels)
    // Based on estimation: 16 bytes per JSON key, plus 8 bytes per frame for state tracking
    let mut working_buffer = [0u8; 512];
    let mut context = U8Pool::new(&mut working_buffer, 20).unwrap();

    scan(
        find_action,
        find_end_action,
        &mut rjiter,
        &writer_cell,
        &mut context,
        {
            let sse_tokens: &[&[u8]] = &[b"data:", b"DONE"];
            &Options::with_sse_tokens(sse_tokens)
        },
    )
    .unwrap();

    writer_cell
}

// ---------------- Sample LLM output as `scan_llm_output` input

let json = r#"{
  "id": "chatcmpl-Ahpq4nZeP9mESaKsCVdmZdK96IrUH",
  "object": "chat.completion",
  "created": 1735010736,
  "model": "gpt-4o-mini-2024-07-18",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I assist you today?",
        "refusal": null
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 10,
    "total_tokens": 19,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "audio_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0,
      "audio_tokens": 0,
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0
    }
  },
  "system_fingerprint": "fp_0aa8d3e20b"
}"#;

let writer_cell = scan_llm_output(json);
let message = String::from_utf8(writer_cell.borrow().to_vec()).unwrap();
assert_eq!(message, "(new message)\nHello! How can I assist you today?\n");

// ---------------- Another sample of LLM output, the streaming version
let json = r#"
data: {"choices":[{"index":0,"delta":{"role":"assistant","content":"","refusal":null},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}

data: {"choices":[{"index":0,"delta":{"content":"Hello"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}

data: {"choices":[{"index":0,"delta":{"content":"!"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}

data: {"choices":[{"index":0,"delta":{"content":" How"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}

data: {"choices":[{"index":0,"delta":{"content":" can"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}

data: {"choices":[{"index":0,"delta":{"content":" I"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}

data: {"choices":[{"index":0,"delta":{"content":" assist"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}

data: {"choices":[{"index":0,"delta":{"content":" you"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}

data: {"choices":[{"index":0,"delta":{"content":" today"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}

data: {"choices":[{"index":0,"delta":{"content":"?"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}

data: {"choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"stop"}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}

data: [DONE]
"#;

let writer_cell = scan_llm_output(json);
let message = String::from_utf8(writer_cell.borrow().to_vec()).unwrap();
assert_eq!(message, "Hello! How can I assist you today?");

§Colophon

License: MIT

Author: Oleg Parashchenko, olpa@ https://uucode.com/

Contact: via email or Ailets Discord

scan_json is a part of the ailets.org project.

Re-exports§

pub use error::Error;
pub use error::Result;
pub use idtransform::idtransform;
pub use matcher::iter_match;
pub use matcher::Action;
pub use matcher::EndAction;
pub use matcher::StreamOp;
pub use scan::scan;
pub use scan::Options;
pub use rjiter;
pub use rjiter::jiter;

Modules§

error
Error types for JSON stream processing.
idtransform
Copy JSON input to output, retaining the original structure and collapsing whitespace. The implementation of idtransform is an example of advanced use of the scan function.
matcher
This module contains functions for matching JSON nodes based on their name and context.
scan
Implementation of the scan function to scan a JSON stream.
stack
Stack management for JSON parsing context

Structs§

RJiter
Streaming JSON parser, a wrapper around Jiter.