§React to elements in a JSON stream
Parse JSON and execute callbacks based on patterns, even before the entire document is available. llms.txt.
For a fast start,
- first look at the concepts and examples in this README,
- then learn about crate::scan(), and
- about the context stack and matching by crate::iter_match().
scan_json is designed to support zero-allocation and no_std environments, but a transitive dependency through jiter currently requires std.
§Concepts
The library uses the streaming JSON parser RJiter. While parsing, it maintains context, which is the path of element names from the root to the current nesting level.
The workflow for each key:
- First, call find_action and execute the returned action, if any
- If the key value is an object or array, update the context and parse the next level
- Afterwards, call find_end_action and execute the returned end action, if any
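The workflow above can be modeled without the library. The following sketch is an illustration under stated assumptions, not the crate's implementation: Node, walk, and the log format are hypothetical names, and only nested objects (not arrays) are modeled. It shows the order in which the begin hook, the context update, and the end hook would run for each key.

```rust
/// A tiny in-memory stand-in for parsed JSON (hypothetical, for illustration only).
pub enum Node {
    Atom,
    Object(Vec<(&'static str, Node)>),
}

/// Walk `node` and record which hook would fire for each key, and with which context.
pub fn walk(node: &Node, context: &mut Vec<&'static str>, log: &mut Vec<String>) {
    if let Node::Object(fields) = node {
        for (key, value) in fields {
            // Step 1: the "begin" hook sees the key together with the path of its parents.
            log.push(format!("begin {} in /{}", key, context.join("/")));
            // Step 2: for a nested object, extend the context and descend one level.
            if let Node::Object(_) = value {
                context.push(*key);
                walk(value, context, log);
                context.pop();
            }
            // Step 3: the "end" hook fires after the value has been fully processed.
            log.push(format!("end {}", key));
        }
    }
}
```

For a document shaped like {"message": {"content": ...}}, the log reads: begin message in /, begin content in /message, end content, end message.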
An action receives two arguments:
- rjiter: A mutable reference to the RJiter parser object. An action can modify JSON parsing behavior by consuming the current key’s value
- baton: This can be either:
  - A simple Copy type (like i32, bool, ()) passed by value for read-only or stateless operations
  - &RefCell<B> for mutable state that needs to be shared across action calls
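The two baton flavors can be illustrated with the standard library alone. This is a minimal sketch, not the crate's API: for_each_value, log_long, and append are hypothetical names, and the driver only assumes that the baton type is Copy. A shared reference such as &RefCell<Vec<u8>> is itself Copy, which is why the same baton can be handed to every call while the data behind it stays mutable.

```rust
use std::cell::RefCell;

/// Hypothetical driver: hands the same baton to the action for every value.
pub fn for_each_value<B: Copy>(values: &[&str], baton: B, action: fn(&str, B)) {
    for &value in values {
        action(value, baton);
    }
}

/// Stateless action: the baton is a plain `Copy` value (a minimum length).
pub fn log_long(value: &str, min_len: usize) {
    if value.len() >= min_len {
        println!("long value: {value}");
    }
}

/// Stateful action: the baton is `&RefCell<Vec<u8>>`, so every call
/// appends to the same buffer through interior mutability.
pub fn append(value: &str, baton: &RefCell<Vec<u8>>) {
    baton.borrow_mut().extend_from_slice(value.as_bytes());
}
```

Calling for_each_value with the values "Hello" and "!", a &RefCell<Vec<u8>> baton, and append leaves the bytes of "Hello!" in the buffer. Calling it with a usize baton and log_long keeps no state between calls.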
§Example of an action
find_action uses the library helper iter_match to detect the content key and return the on_content function.
The action peeks the value and writes it to the output. Because the value is consumed, the action returns the ValueIsConsumed flag to scan so it can update its internal state.
use scan_json::{scan, iter_match, Action, StreamOp, Options};
use scan_json::matcher::StructuralPseudoname;
use scan_json::stack::ContextIter;
use rjiter::RJiter;
use std::cell::RefCell;
use embedded_io::Write;
use u8pool::U8Pool;

fn on_content(rjiter: &mut RJiter<&[u8]>, writer_cell: &RefCell<Vec<u8>>) -> StreamOp {
    let mut writer = writer_cell.borrow_mut();
    let result = rjiter
        .peek()
        .and_then(|_| rjiter.write_long_bytes(&mut *writer));
    match result {
        Ok(_) => StreamOp::ValueIsConsumed,
        // This example discards detailed error info for simplicity.
        // See [`crate::idtransform()`] for production-grade error handling.
        Err(_e) => StreamOp::Error("RJiter error"),
    }
}

// Find action function that matches "content" key
let find_action = |structural_pseudoname: StructuralPseudoname, context: ContextIter, _baton: &RefCell<Vec<u8>>| -> Option<Action<&RefCell<Vec<u8>>, &[u8]>> {
    if iter_match(|| ["content".as_bytes()], structural_pseudoname, context) {
        Some(on_content)
    } else {
        None
    }
};
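To build intuition for why a one-name pattern such as content can match a deeply nested key, here is a hypothetical model of context matching. It is an assumption about the general idea, not the documented behavior of iter_match: path_ends_with is an invented name, structural pseudonames are left out, and the multi-name case (listing the key first, then its parents) is a guess at a plausible convention.

```rust
/// Hypothetical model of context matching; this is not the library's `iter_match`.
/// `path` is the context from the root to the current key (outermost first).
/// `pattern` lists names starting at the current key and moving toward the root.
pub fn path_ends_with(path: &[&str], pattern: &[&str]) -> bool {
    pattern.len() <= path.len()
        && pattern
            .iter()
            .zip(path.iter().rev())
            .all(|(want, have)| want == have)
}
```

Under this model, the path choices / message / content matches the pattern [content] and the pattern [content, message], but not [message], because the innermost name must be matched first.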
§Complete example: Identity transformation
The identity transformation copies JSON input to output, retaining the original structure.
The function crate::idtransform::idtransform() is not just a library function, but also an example of advanced scan use. Read the source code for details.
Additionally, the function crate::idtransform::copy_atom() can be useful.
§Complete example: converting an LLM stream
Summary:
- Initialize the parser
- Create the black box with a Vec, which is used as dyn Write in actions
- Create handlers for message, content, and a handler for the end of message
- Combine all together in the scan function
The example demonstrates that scan can be used to handle LLM streaming output:
- The input consists of several top-level JSON objects not wrapped in an array
- The server-sent events (SSE) tokens are ignored
use std::cell::RefCell;
use embedded_io::Write;
use scan_json::{scan, iter_match, Action, EndAction, StreamOp, Options};
use scan_json::matcher::StructuralPseudoname;
use scan_json::stack::ContextIter;
use rjiter::RJiter;
use u8pool::U8Pool;

fn on_begin_message(_: &mut RJiter<&[u8]>, writer: &RefCell<Vec<u8>>) -> StreamOp {
    writer.borrow_mut().write_all(b"(new message)\n").unwrap();
    StreamOp::None
}

fn on_content(rjiter: &mut RJiter<&[u8]>, writer_cell: &RefCell<Vec<u8>>) -> StreamOp {
    let mut writer = writer_cell.borrow_mut();
    let result = rjiter
        .peek()
        .and_then(|_| rjiter.write_long_bytes(&mut *writer));
    match result {
        Ok(_) => StreamOp::ValueIsConsumed,
        // This example discards detailed error info for simplicity.
        // See [`crate::idtransform()`] for production-grade error handling.
        Err(_e) => StreamOp::Error("RJiter error"),
    }
}

fn on_end_message(writer: &RefCell<Vec<u8>>) -> Result<(), &'static str> {
    writer.borrow_mut().write_all(b"\n").unwrap();
    Ok(())
}

fn scan_llm_output(json: &str) -> RefCell<Vec<u8>> {
    let mut reader = json.as_bytes();
    let mut buffer = vec![0u8; 32];
    let mut rjiter = RJiter::new(&mut reader, &mut buffer);
    let writer_cell = RefCell::new(Vec::new());

    // Begin actions: "content" writes the value, "message" writes a header line
    let find_action = |structural_pseudoname: StructuralPseudoname, context: ContextIter, _baton: &RefCell<Vec<u8>>| -> Option<Action<&RefCell<Vec<u8>>, &[u8]>> {
        if iter_match(|| ["content".as_bytes()], structural_pseudoname, context.clone()) {
            Some(on_content)
        } else if iter_match(|| ["message".as_bytes()], structural_pseudoname, context.clone()) {
            Some(on_begin_message)
        } else {
            None
        }
    };

    // End actions: the end of "message" writes a trailing newline
    let find_end_action = |structural_pseudoname: StructuralPseudoname, context: ContextIter, _baton: &RefCell<Vec<u8>>| -> Option<EndAction<&RefCell<Vec<u8>>>> {
        if iter_match(|| ["message".as_bytes()], structural_pseudoname, context.clone()) {
            Some(on_end_message)
        } else {
            None
        }
    };

    // Create working buffer for context stack (512 bytes, up to 20 nesting levels)
    // Based on estimation: 16 bytes per JSON key, plus 8 bytes per frame for state tracking,
    // so 20 * (16 + 8) = 480 bytes, which fits into 512
    let mut working_buffer = [0u8; 512];
    let mut context = U8Pool::new(&mut working_buffer, 20).unwrap();

    scan(
        find_action,
        find_end_action,
        &mut rjiter,
        &writer_cell,
        &mut context,
        {
            let sse_tokens: &[&[u8]] = &[b"data:", b"DONE"];
            &Options::with_sse_tokens(sse_tokens)
        },
    )
    .unwrap();

    writer_cell
}
// ---------------- Sample LLM output as `scan_llm_output` input
let json = r#"{
  "id": "chatcmpl-Ahpq4nZeP9mESaKsCVdmZdK96IrUH",
  "object": "chat.completion",
  "created": 1735010736,
  "model": "gpt-4o-mini-2024-07-18",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I assist you today?",
        "refusal": null
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 10,
    "total_tokens": 19,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "audio_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0,
      "audio_tokens": 0,
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0
    }
  },
  "system_fingerprint": "fp_0aa8d3e20b"
}"#;
let writer_cell = scan_llm_output(json);
let message = String::from_utf8(writer_cell.borrow().to_vec()).unwrap();
assert_eq!(message, "(new message)\nHello! How can I assist you today?\n");
// ---------------- Another sample of LLM output, the streaming version
let json = r#"
data: {"choices":[{"index":0,"delta":{"role":"assistant","content":"","refusal":null},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}
data: {"choices":[{"index":0,"delta":{"content":"Hello"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}
data: {"choices":[{"index":0,"delta":{"content":"!"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}
data: {"choices":[{"index":0,"delta":{"content":" How"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}
data: {"choices":[{"index":0,"delta":{"content":" can"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}
data: {"choices":[{"index":0,"delta":{"content":" I"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}
data: {"choices":[{"index":0,"delta":{"content":" assist"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}
data: {"choices":[{"index":0,"delta":{"content":" you"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}
data: {"choices":[{"index":0,"delta":{"content":" today"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}
data: {"choices":[{"index":0,"delta":{"content":"?"},"logprobs":null,"finish_reason":null}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}
data: {"choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"stop"}],"id":"chatcmpl-AgMB1khICnwswjgqIl2X2jr587Nep","object":"chat.completion.chunk","created":1734658387,"model":"gpt-4o-mini-2024-07-18","system_fingerprint":"fp_d02d531b47"}
data: [DONE]
"#;
let writer_cell = scan_llm_output(json);
let message = String::from_utf8(writer_cell.borrow().to_vec()).unwrap();
assert_eq!(message, "Hello! How can I assist you today?");
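For comparison, the streaming input above could also be cleaned up before parsing. The sketch below is a different technique from the in-stream token skipping configured through Options::with_sse_tokens: it is a line-based pre-filter that assumes every event carries exactly one complete JSON payload on a single data: line. The helper name sse_payloads is hypothetical and not part of the crate.

```rust
/// Hypothetical helper: extract the JSON payloads from SSE-style input.
/// Assumes one complete payload per `data:` line and drops the `[DONE]` marker.
pub fn sse_payloads(input: &str) -> Vec<&str> {
    input
        .lines()
        .map(str::trim)
        // Keep only lines that carry a `data:` field and strip the field name.
        .filter_map(|line| line.strip_prefix("data:"))
        .map(str::trim)
        // Skip empty payloads and the end-of-stream marker.
        .filter(|payload| !payload.is_empty() && *payload != "[DONE]")
        .collect()
}
```

The pre-filter needs each line to be available in full before it can be examined, whereas the scan-based example above reads the stream through a 32-byte buffer and does not depend on line boundaries.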
§Colophon
License: MIT
Author: Oleg Parashchenko, olpa@ https://uucode.com/
Contact: via email or Ailets Discord
scan_json is a part of the ailets.org project.
§Re-exports
pub use error::Error;
pub use error::Result;
pub use idtransform::idtransform;
pub use matcher::iter_match;
pub use matcher::Action;
pub use matcher::EndAction;
pub use matcher::StreamOp;
pub use scan::scan;
pub use scan::Options;
pub use rjiter;
pub use rjiter::jiter;
§Modules
- error: Error types for JSON stream processing.
- idtransform: Copy JSON input to output, retaining the original structure and collapsing whitespace. The implementation of idtransform is an example of advanced use of the scan function.
- matcher: This module contains functions for matching JSON nodes based on their name and context.
- scan: Implementation of the scan function to scan a JSON stream.
- stack: Stack management for JSON parsing context.
§Structs
- RJiter: Streaming JSON parser, a wrapper around Jiter.