deser_incomplete/lib.rs
#![cfg_attr(docsrs, feature(doc_auto_cfg))]
#![cfg_attr(
    not(all(feature = "rand", feature = "tracing")),
    allow(unused_variables, unused_imports, dead_code, unused_mut)
)]

//! # deser-incomplete: Deserialize incomplete or broken data with Serde
//!
//! Parse incomplete or broken data with existing Serde data formats.
//!
//! This is nice for ingesting streaming JSON, which is technically invalid until
//! the stream is done. By tolerating premature end of input, we can immediately make use
//! of the streaming input.
//!
//! <img src="https://raw.githubusercontent.com/bgeron/deser-incomplete/rendered/assets/live-travel-modes.gif" alt='Someone is slowly
//! typing JSON into a terminal program. The JSON is an array of objects.
//! The program gradually renders the JSON input as Rust debug output, and as a table.
//! The fields of the Rust struct are printed even though they are missing in the JSON input.
//! The example program is called "live".' title="Demo that shows parsing JSON as it is typed by the user"
//! width="60%" height="60%">
//!
//! Here, we wrapped [`serde_json`] with `deser-incomplete`, and printed the Rust
//! debug representation of the result. We also reserialized to JSON and
//! let nushell do its beautiful table formatting.
//!
//! The JSON can also come from an external program. Here is a demo program that
//! computes disk usage of directories and outputs the results as JSON.
//! In true Unix style, displaying for the user is a separate concern,
//! implemented by a separate program.
//!
//! <img src="https://raw.githubusercontent.com/bgeron/deser-incomplete/rendered/assets/du-live.gif" alt='A Unix pipeline with
//! two programs is shown. The source program computes the disk size
//! of a bunch of directories and outputs a JSON array of objects. The sink program
//! pretty-prints the JSON table. Computing the disk size takes a while, and you can
//! see which directory is being analyzed because the result for that directory is empty
//! while it is computing.' title='Demo that shows parsing JSON as it is generated live from another program that mimics du'
//! width="60%" height="60%">
//!
//! `deser-incomplete` sits between `#[derive(Deserialize)]` and the data format. When a parse
//! error is detected (presumably because the input ended), it safely halts parsing and
//! returns what has been parsed so far.
//!
//! <img src="https://raw.githubusercontent.com/bgeron/deser-incomplete/rendered/assets/deser-incomplete-blocks-errors.png" alt='This library sits
//! in between Deserialize and Deserializer. Information about the parsed data is successfully
//! sent from Deserializer through deser-incomplete to Deserialize. But errors from Deserializer are
//! blocked.' width="60%" height="60%">
//!
//! ## How to use: JSON and YAML
//!
//! ```
//! let result: Result<Vec<u32>, deser_incomplete::Error<serde_json::Error>>
//!     = deser_incomplete::from_json_str("[3, 4, ");
//!
//! assert_eq!(result.unwrap(), vec![3, 4]);
//!
//! let result: Result<Vec<bool>, deser_incomplete::Error<serde_yaml::Error>>
//!     = deser_incomplete::from_yaml_str("- true\n- false\n- ");
//!
//! assert_eq!(result.unwrap(), vec![true, false]);
//! ```
//!
//! Command line:
//!
//! ```sh
//! $ cargo install deser-incomplete --example repair-deser
//!
//! $ echo '[3, 4' | repair-deser # JSON by default
//! [3,4]
//! ```
//!
//! ## How to use: other data formats
//!
//! - You need to explain how to create the [`Deserializer`] by implementing [`Source`].
//!
//!   - If your format has `&mut T: Deserializer` then mimic [`source::JsonStr`].
//!   - If your format has `T: Deserializer` then mimic [`source::YamlStr`].
//!
//! - Some formats need a trailer for best results. For example, [`from_json_str`] appends
//!   a short trailer ending in a double-quote to the input before parsing; this lets
//!   [`serde_json`] see strings that weren't actually complete.
//!
//!   We also preprocess the input in [`from_yaml_str`]; there it is even more important
//!   for good results.
//!
//!   _Add preprocessing with [`Options::set_random_trailer`], or turn off such preprocessing
//!   with [`Options::disable_random_tag`]. You can see its effect with
//!   `cargo run --example live -- --use-random-trailer false`._
//!
//!   I expect that binary formats don't need this preprocessing.
//!
//!
//! ## How this works internally
//!
//! The implementation sits in between [`Deserialize`], [`Deserializer`], and [`Visitor`],
//! gathers metadata during the parse, and saves successful sub-parses. It also "backtracks":
//! if a parse fails, then we retry, but just before the failure point we swap out the real
//! [`Deserializer`] for a decoy which can bring deserialization to a safe end.
//!
//!
//! We apply multiple techniques. Suppose we want to parse `Vec<u32>` with [`serde_json`].
//! Here are the main techniques.
//!
//! 1. **(Example: parse empty JSON as `[]` .)** At the top level, if parsing fails immediately
//!    (e.g. empty input) but a sequence is expected, then return `[]`.
//!
//!    _\[setting name: fallback_seq_empty_at_root]_
//!
//! 2. **(Example: parse JSON `"[3"` as `[3]` .)** When there are no more elements in a sequence,
//!    let the [`Visitor`] construct the `Vec<u32>` and put it somewhere safe. Now
//!    `serde_json::Deserializer::deserialize_seq` notices the missing close bracket and
//!    returns `Err` to us. We ignore the `Err`, retrieve the saved value, and return it
//!    wrapped in `Ok`.
//!
//!    This happens for every `deserialize_*` method, not just sequences.
//!
//!    _\[setting name: tolerate_deserializer_fail_after_visit_success]_
//!
//! 3. **(Example: parse JSON `"[3,"` as `[3]` .)** Inside a sequence, if parsing the next
//!    element will fail, then don't even try.
//!
//!    This works using backtracking.
//!
//!    _\[setting name: backtrack_seq_skip_item]_
//!
//! 4. Before deserializing, we append a random trailer.
//!
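//! The outcomes of the first three techniques can be illustrated with a toy, hand-written
//! parser for a prefix of a JSON array of `u32`. This is only a sketch of the observable
//! behaviour: the name `parse_u32_prefix` is made up for illustration, and the real
//! implementation wraps a Serde [`Deserializer`] rather than scanning text itself.
//!
//! ```rust
//! /// Toy stand-in: parse as many `u32` elements as a prefix of a JSON array provides.
//! fn parse_u32_prefix(input: &str) -> Vec<u32> {
//!     let mut items = Vec::new();
//!     // Technique 1: if the root is not even an opened array, fall back to `[]`.
//!     let Some(mut rest) = input.trim_start().strip_prefix('[') else {
//!         return items;
//!     };
//!     loop {
//!         rest = rest.trim_start();
//!         let len = rest.bytes().take_while(u8::is_ascii_digit).count();
//!         // Technique 3: the next element would fail to parse, so do not try.
//!         let Ok(value) = rest[..len].parse::<u32>() else {
//!             return items;
//!         };
//!         items.push(value);
//!         // Technique 2: a missing `]` (or anything other than `,`) ends the
//!         // sequence, but the elements collected so far are kept.
//!         match rest[len..].trim_start().strip_prefix(',') {
//!             Some(after_comma) => rest = after_comma,
//!             None => return items,
//!         }
//!     }
//! }
//! ```
//!
//! With this sketch, `""` yields `[]`, and both `"[3"` and `"[3,"` yield `[3]`.
//!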
//! #### Random trailer
//!
//! Additionally we have a "random trailer" technique to get incomplete strings to parse.
//! Unfortunately this technique is specific to the data format. This library implements
//! it for JSON and YAML.
//!
//! This technique is not applied by default for other data formats. Even with JSON/YAML, this
//! technique can be turned off with [`Options::disable_random_tag`].
//!
//! #### Random trailer for JSON
//!
//! We actually [append][append-impl] `tRANDOM"` to every JSON input, where `RANDOM` is a string
//! of randomly chosen letters. It turns out that [`serde_json`] can parse any prefix of valid
//! JSON, as long as we concatenate `tRANDOM"` to it. Some examples:
//!
//! 1. **(Example: `"hello` .)** The concatenation is `"hellotRANDOM"` and we actually get
//!    this back from [`serde_json`] through `fn visit_borrowed_str` --- after [`serde_json`]
//!    removed the double-quotes.
//!
//!    In `fn visit_borrowed_str`, we notice that the string ends in `RANDOM`. Because this
//!    is a random string of letters, it cannot have been part of the incomplete JSON input.
//!    We remove the `tRANDOM` suffix and get back just `"hello"`.
//!
//! 2. **(Example: `"hello\` --- perhaps breaking in the middle of `\n` .)** The concatenation
//!    is `"hello\tRANDOM"`; the `\t` parses to a tab character. We strip off `<TAB>RANDOM`
//!    and again return `"hello"`.
//!
//! 3. **(Example: `"hello"` .)** The concatenation is `"hello"tRANDOM"`. Now [`serde_json`]
//!    visits the `hello` string as it would normally do, and if there should be any error
//!    after the visit, we can recover from it anyway as
//!    per _tolerate_deserializer_fail_after_visit_success_.
//!
//! [append-impl]: https://github.com/bgeron/deser-incomplete/blob/main/src/random_trailer/json.rs
//!
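//! The append-and-strip idea can be sketched with the standard library alone. The function
//! names below are hypothetical, and the random letters are passed in as a parameter instead
//! of being generated, so this is an illustration of the three cases above rather than the
//! code in the linked module.
//!
//! ```rust
//! /// Build the text handed to the JSON parser: input, then `t`, random letters, and `"`.
//! fn append_trailer(input: &str, random: &str) -> String {
//!     format!("{input}t{random}\"")
//! }
//!
//! /// Undo the trailer on a string value that the parser visited.
//! fn strip_trailer<'a>(visited: &'a str, random: &str) -> &'a str {
//!     // Only strings that were cut off by the end of input end in the random letters.
//!     let Some(head) = visited.strip_suffix(random) else {
//!         return visited;
//!     };
//!     // The `t` before the letters arrives either as a literal `t`, or as a tab
//!     // character when the input stopped right after a backslash.
//!     head.strip_suffix('t')
//!         .or_else(|| head.strip_suffix('\t'))
//!         .unwrap_or(visited)
//! }
//! ```
//!
//! Assuming the letters `QXZJ`, `strip_trailer` maps both `hellotQXZJ` and `hello<TAB>QXZJ`
//! back to `hello`, and leaves a complete string such as `hello` untouched.
//!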
//! #### Inspecting at runtime
//!
//! There is extensive logging through the [`tracing`] library, which becomes visible if you
//! initialize [`tracing`] (for example, by installing a subscriber).
//!
//! #### Guiding principles
//!
//! The logic was hand-tweaked to meet the following criteria:
//!
//! 1. ("soundness") For any complete and valid JSON/YAML, if you call `deser-incomplete`
//!    on a prefix, then its output should not contain data that doesn't exist in the
//!    complete JSON/YAML.
//!
//! 2. ("monotone") A larger prefix should not parse to a shorter output.
//!
//! 3. ("prompt") Ideally, the output for each prefix contains as much data as we can be
//!    certain of.
//!
//! The implementation of [`Deserializer`] (data format) may influence the quality of the output,
//! but the default ruleset generally does very well with [`serde_json`] and [`serde_yaml`].
//!
//! There are [extensive snapshot tests][snapshot-tests] that validate the quality of the output
//! on these criteria.
//!
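//! As a rough illustration of the "monotone" criterion, one could check every prefix of a
//! complete input in order. The helper name `first_monotonicity_violation` is hypothetical,
//! and using the length of a `Vec` as the measure of "shorter output" is a simplifying
//! assumption; the snapshot tests in the repository may work quite differently.
//!
//! ```rust
//! /// Returns the end offset of the first prefix whose output is shorter than the
//! /// output of the previous (smaller) prefix, or `None` if there is no such prefix.
//! fn first_monotonicity_violation<T>(
//!     complete: &str,
//!     parse: impl Fn(&str) -> Vec<T>,
//! ) -> Option<usize> {
//!     let mut previous_len = 0;
//!     // Visit every prefix that ends on a character boundary, shortest first.
//!     let ends = complete
//!         .char_indices()
//!         .map(|(index, _)| index)
//!         .chain(std::iter::once(complete.len()));
//!     for end in ends {
//!         let len = parse(&complete[..end]).len();
//!         if len < previous_len {
//!             return Some(end);
//!         }
//!         previous_len = len;
//!     }
//!     None
//! }
//! ```
//!
//! In a real check, the `parse` closure could wrap this crate, for instance
//! `|s| deser_incomplete::from_json_str::<Vec<u32>>(s).unwrap_or_default()`.
//!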
//! If you are curious, then it is possible to tweak the ruleset
//! with `unstable::UnstableCustomBehavior`. We also have snapshot tests for some alternative
//! parsing configurations.
//!
//! [snapshot-tests]: https://github.com/bgeron/deser-incomplete/blob/main/tests/output/json_output/seq.rs
//!
//! ## Notes and limitations
//!
//! - Ideally, your data format should be relatively greedy, in the sense that it
//!   generates information quickly and does not need to look ahead in the serialized
//!   stream too much.
//!
//! - This approach lets us safely abort parsing and get a value, but
//!   we cannot skip over invalid segments of input. (For that you need
//!   an approach like [tree-sitter](https://tree-sitter.github.io/).)
//!
//! - We cannot distinguish end of input (EOF) from invalid input.
//!
//! - YAML works well in general, but it is a bit less exhaustively tested than JSON.
//!   The randomized trailer is really important for YAML.
//!
//! - JSON: when parsing a floating-point number, if the end of input happens to fall
//!   directly after the decimal point, then the number is missing from the output.
//!
//! - For YAML, the randomized trailer uses a heuristic to see if we are currently in
//!   an escape sequence in a string --- but this heuristic can fail. In this case,
//!   the incomplete string will be missing from the output.
//!
//! Have fun!
//!
//! ## Acknowledgements
//!
//! Thanks to Annisa Chand and @XAMPPRocky for useful feedback.

macro_rules! error {
    ($($arg:tt)*) => {
        #[cfg(feature = "tracing")]
        ::tracing::error!($($arg)*)
    };
}
macro_rules! debug {
    ($($arg:tt)*) => {
        #[cfg(feature = "tracing")]
        ::tracing::debug!($($arg)*)
    };
}
macro_rules! trace {
    ($($arg:tt)*) => {
        #[cfg(feature = "tracing")]
        ::tracing::trace!($($arg)*)
    };
}

mod attempt;
pub mod error;
mod fallback;
mod options_impl;
#[cfg(feature = "rand")]
pub mod random_trailer;
#[cfg(not(feature = "rand"))]
mod random_trailer;
mod reporter;
pub mod source;
mod state;
mod util;

/// Types and traits that have to be public to satisfy rustc/rustdoc.
///
/// Instead of looking here, look at the methods of [`crate::Options`].
pub mod options {
    #[cfg(all(feature = "rand", feature = "serde_json"))]
    pub use crate::options_impl::JsonExtraOptions;
    #[cfg(all(feature = "rand", feature = "serde_yaml"))]
    pub use crate::options_impl::YamlExtraOptions;
    pub use crate::options_impl::{
        DefaultExtraOptions, ExtraOptions, MakeDefaultFallbacks, MakeDefaultReporter,
    };
}

/// Within this library, import from this module. That way, doc links work properly.
#[cfg(not(feature = "unstable"))]
mod unstable {
    pub use crate::collection_of_unstable_stuff::*;
}
/// Stuff that is unpolished or likely to change.
#[cfg(feature = "unstable")]
pub mod unstable {
    pub use crate::collection_of_unstable_stuff::*;
}

#[allow(unused_imports)]
mod collection_of_unstable_stuff {
    pub use crate::fallback::Fallbacks;
    pub use crate::options_impl::{
        ExtraOptions, ExtraOptionsStruct, MakeFallbackProvider, MakeReporter,
        UnstableCustomBehavior,
    };
    pub use crate::reporter::{DefaultReporter, Reporter};
    pub(crate) trait ExtraOptionsIsUnstable {}
}

use std::borrow::Cow;

pub use error::Error;
pub use options_impl::Options;
use options_impl::UnstableCustomBehavior;
#[cfg(doc)]
use serde::{de::Visitor, Deserialize, Deserializer};
pub use source::Source;

/// Main function. Robustly deserialize incomplete input with [`serde_json`].
///
/// See methods on [`Options`] for more generic APIs.
#[cfg(all(feature = "rand", feature = "serde_json"))]
pub fn from_json_str<T>(json: &str) -> Result<T, Error<serde_json::Error>>
where
    T: for<'de> serde::Deserialize<'de>,
{
    Options::new_json().deserialize_from_json_str(Cow::Borrowed(json))
}

/// Like [`from_json_str`], but for bytes.
///
/// See methods on [`Options`] for more generic APIs.
#[cfg(all(feature = "rand", feature = "serde_json"))]
pub fn from_json_slice<T>(json: &[u8]) -> Result<T, Error<serde_json::Error>>
where
    T: for<'de> serde::Deserialize<'de>,
{
    Options::new_json().deserialize_from_json_slice(Cow::Borrowed(json))
}

/// Robustly deserialize incomplete input with [`serde_yaml`].
///
/// See methods on [`Options`] for more generic APIs.
#[cfg(all(feature = "rand", feature = "serde_yaml"))]
pub fn from_yaml_str<T>(yaml: &str) -> Result<T, Error<serde_yaml::Error>>
where
    T: for<'de> serde::Deserialize<'de>,
{
    Options::new_yaml().deserialize_from_yaml_str(Cow::Borrowed(yaml))
}

/// Like [`from_yaml_str`], but for bytes.
///
/// See methods on [`Options`] for more generic APIs.
#[cfg(all(feature = "rand", feature = "serde_yaml"))]
pub fn from_yaml_slice<T>(yaml: &[u8]) -> Result<T, Error<serde_yaml::Error>>
where
    T: for<'de> serde::Deserialize<'de>,
{
    Options::new_yaml().deserialize_from_yaml_slice(Cow::Borrowed(yaml))
}
335}