ion-schema 0.15.0

Implementation of Amazon Ion Schema
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
//! Implementation of a Non-deterministic Finite-state Automaton for the `ordered_elements` constraint.
//!
//! The general idea is this:
//!
//! > The NFA consumes a sequence of input events, one by one. In each step, whenever two or more
//! > transitions are applicable, it "clones" itself into appropriately many copies, each one
//! > following a different transition. If exactly one transition is applicable, it follows that
//! > transition without making any copies. If no transition is applicable, the current copy is in a
//! > dead end, and it "dies". If, after consuming the complete input, any of the copies is in an
//! > accept state, the input is accepted, else, it is rejected. [...]
//! >
//! > Keep a set data structure of all states which the NFA might currently be in. On the
//! > consumption of an input event, unite the results of the transition function applied to all
//! > current states to get the set of next states. [...]
//! >
//! > On the consumption of the last input event, if one of the current states is a final state,
//! > the machine accepts the sequence.
//!
//! (Paraphrased from [Nondeterministic_finite_automaton](https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton))
//!
//! Strictly speaking, our implementation might not be properly finite. For an `ordered_elements`
//! member such as `{ type: int, occurs: range::[0, max] }`, there are infinite possible states
//! since it could accept a sequence of integers that is any length. We avoid having an
//! infinite-sized graph by constructing one state for each member in the `ordered_elements`
//! constraint and storing the min and max possible visits for each state.
//!
//! As we are traversing the state machine graph, we track the current state of the machine as a set
//! of `(state_id, visit_count)` pairs. For something that accepts a theoretically infinite input
//! sequence (as above), this could result in an infinite number of parallel states, but in practice
//! the number of possible states is bounded by the length of the input.
//!
//! Our state machine has the following rules that specify the edges between states.
//! * Every state (type ref in `ordered_elements`) must have an edge to the next state in the list
//! * If a state has a max occurs >1, then it has an edge to itself
//! * If a state `n` has a min occurs of 0, then any states with an edge to state `n` also have
//!   an edge to state `n+1` (to be able to skip the optional state).
//!
//! For example, suppose we have the following `ordered_elements` constraint:
//! ```ion
//! {
//!   ordered_elements: [
//!     /*A*/ bool,
//!     /*B*/ { occurs: optional, type: int },
//!     /*C*/ float,
//!     /*D*/ { occurs: range::[1, 2], type: string }
//!     /*E*/ decimal,
//!   ]
//! }
//! ```
//! * `A` is required, so `Initial` only has a single edge to `A`.
//! * `A` has an edge to `B`, but because `B` is optional, we could skip it and go straight from
//!   `A` to `C`. `C` is required, so we cannot skip it and there are no edges from `A` to
//!   `D` or anything else past `C`.
//! * `B` has a single edge to `C`.
//! * `C` has a single edge to `D`.
//! * `D` has an edge to `E`, but because `D` has a max occurs that is >1, `D` also has an edge
//!   back to itself.
//! * `E` has a single edge to `Final`.
//!
//! Thus, the adjacency list ends up being:
//! ```text
//! Initial -> A
//!       A -> B
//!       A -> C
//!       B -> C
//!       C -> D
//!       D -> D
//!       D -> E
//!       E -> Final
//! ```
//!
//! A corollary of these rules is that set of edges out of state `i` must have destinations that
//! are a contiguous range from either `i` or `i+1` to `j` where `j > i`. This ends up being useful
//! because we can model the set of all edges from a particular state as a `Range` rather than a
//! list of destinations.

use crate::ion_path::{IonPath, IonPathElement};
use crate::result::ValidationResult;
use crate::system::TypeStore;
use crate::type_reference::{TypeReference, VariablyOccurringTypeRef};
use crate::types::TypeValidator;
use crate::violation::{Violation, ViolationCode};
use crate::IonSchemaElement;
use ion_rs::Element;
use std::cmp::Ordering;
use std::collections::HashSet;
use std::fmt::{Debug, Display, Formatter};
use std::hash::{Hash, Hasher};
use std::mem::swap;
use std::ops::RangeInclusive;
use std::vec;

/// Unique identifier for a particular node/state in the NFA.
///
/// Each [State] has an integer identifier ([StateId]) which must be unique within any
/// [OrderedElementsNfa] instance. This identifier is used for the implementation of traits such as
/// [Ord], [Eq], and  [Hash] for types such as [TraversalError].
type StateId = usize;

/// In the evaluation of the NFA, used for tracking which states are in the set of possible current states.
type StateVisitCount = (StateId, usize);

/// Represents an event in the input sequence—it is either some [Element] or the end-of-sequence
/// marker (i.e. [Option::None]).
type ElementOrEndOfSequence<'a> = Option<&'a Element>;

/// The compiled state machine for an `ordered_elements` constraint.
///
/// This is represented as a directed, almost a-cyclical graph. The only cycles allowed are
/// loops—i.e. an edge that is connected to the same vertex at both ends.
#[derive(Debug, Clone, PartialEq)]
pub struct OrderedElementsNfa {
    /// The vertices/nodes of the state machine.
    states: Vec<State>,
    /// An adjacency list describing the directed edges between `states`.
    ///
    /// Because all outgoing edges must have consecutively numbered destinations, we can model the
    /// edges as a map of `from_id` (the index in the vec) to a `Range` of `StateId` destinations.
    edges: Vec<RangeInclusive<usize>>,
    /// Stores the [StateId] of the final state (packaged into a [StateVisitCount]) for convenience.
    /// When running the state machine, if the set of current states contains this `StateVisitCount`,
    /// we are done evaluating, and the input was accepted by the state machine.
    terminal_state: StateVisitCount,
}

impl OrderedElementsNfa {
    /// Constructs an [OrderedElementsNfa] from a [Vec] of pairs of [VariablyOccurringTypeRef] and
    /// an optional string description.
    ///
    /// The description is a human friendly string that goes into the violation messages to describe
    /// which entry in the `ordered_elements` constraint is producing the violation. If a
    /// description is provided, it should be something that is recognizable to users, such as a
    /// row/col in the source ISL or a snippet of the source ISL.
    /// If no description is provided, the default is `<ELEMENT[i]>`, where `i` is the index in the
    /// `ordered_element` constraint's list of variably occurring type references.
    pub fn new(intermediate_states: Vec<(VariablyOccurringTypeRef, Option<String>)>) -> Self {
        // "Initial" state is always first—no surprise there.
        let mut states = vec![State::Initial];

        // Construct intermediate states and add to our vec of states.
        intermediate_states
            .into_iter()
            .enumerate()
            .for_each(|(i, (var_type_ref, description))| {
                let description =
                    description.unwrap_or_else(|| format!("<ORDERED_ELEMENT[{}]>", i));
                let (min_visits, max_visits) = var_type_ref.occurs_range().inclusive_endpoints();
                let state = IntermediateState {
                    type_ref: var_type_ref.type_ref(),
                    min_visits,
                    max_visits,
                    description,
                };

                states.push(State::Intermediate(i + 1, state))
            });

        // This will become the ID of the "Final" state, but it's convenient to wait to add the
        // "Final" state until we've determined the edges of the graph.
        let max_id = states.len();

        // Construct an adjacency list to represent the edges of the graph.
        let mut edges = vec![];
        for (i, s) in states.iter().enumerate() {
            // Loop back to self, if max is > 1
            let min_transition = if s.can_reenter(1) { i } else { i + 1 };

            // Add transitions forward up to (including) the first type with a min occurs that is greater than 0
            let mut j = i + 1;
            while j < max_id {
                if !states[j].can_exit(0) {
                    break;
                }
                j += 1;
            }
            edges.push(min_transition..=j)
        }

        states.push(State::Final(max_id));

        // Terminal state is the Final state with a visit count of 1.
        let terminal_state: StateVisitCount = (max_id, 1usize);

        OrderedElementsNfa {
            states,
            edges,
            terminal_state,
        }
    }

    /// Tests an input sequence of [Element]
    pub fn matches<'a, I: Iterator<Item = &'a Element>>(
        &self,
        mut iter: I,
        type_store: &'a TypeStore,
        ion_path: &mut IonPath,
    ) -> ValidationResult {
        let mut current_state_set: HashSet<StateVisitCount> = HashSet::new();
        let mut new_states: HashSet<StateVisitCount> = HashSet::new();

        let mut input_index = 0;
        current_state_set.insert((0usize, 1usize));

        // Essentially, for-each input, but we want to capture the `Option::None` at the end of the iterator.
        loop {
            let element: ElementOrEndOfSequence = iter.next();
            let mut invalid_transitions: HashSet<TraversalError> = HashSet::new();

            ion_path.push(IonPathElement::Index(input_index));

            // For each state in the set of current states...
            for &(from_state_id, num_visits) in &current_state_set {
                let from_state = &self.states[from_state_id];

                let edges = if let Some(edges) = self.edges.get(from_state_id) {
                    // Need to clone the range because strangely &RangeInclusive doesn't
                    // implement Copy or IntoIterator.
                    edges.clone()
                } else {
                    // The only state without edges is `Final`, which cannot be exited.
                    invalid_transitions.insert(TraversalError::CannotExitState(from_state_id));
                    break;
                };

                // For each edge out of the current state we are inspecting...
                for to_state_id in edges {
                    let to_state: &State = &self.states[to_state_id];

                    let can_reenter = from_state.can_reenter(num_visits);
                    let can_exit = from_state.can_exit(num_visits);
                    let is_loop = to_state_id == from_state_id;

                    if !is_loop && !can_exit {
                        invalid_transitions.insert(TraversalError::CannotExitState(from_state_id));
                        // We haven't reached the min_occurs of the current state. Any further
                        // transitions will also suffer from the same problem. No need to report
                        // this same problem repeatedly, so we break here.
                        break;
                    }

                    // TODO: Consider caching the result of this so that *if* there are multiple
                    //       current states that could transition to the same state, we don't end
                    //       up doing the same work multiple times.
                    let can_enter = to_state.can_enter(element, type_store, ion_path);

                    if let Err(violation) = can_enter {
                        invalid_transitions
                            .insert(TraversalError::CannotEnterState(to_state_id, violation));
                    } else if is_loop && !can_reenter {
                        invalid_transitions.insert(TraversalError::CannotReEnterState(to_state_id));
                    } else {
                        let new_num_visits = if is_loop { num_visits + 1 } else { 1 };
                        new_states.insert((to_state_id, new_num_visits));
                    }
                }
            }

            // There are no valid paths to continue through the graph.
            if new_states.is_empty() {
                return Err(self.build_violation(element, ion_path, invalid_transitions));
            }

            ion_path.pop();

            if new_states.contains(&self.terminal_state) {
                return Ok(());
            }

            current_state_set.clear();
            swap(&mut current_state_set, &mut new_states);
            input_index += 1;
        }
    }

    /// Build a [Violation] out of the set of [TraversalError]s.
    fn build_violation(
        &self,
        event: ElementOrEndOfSequence,
        ion_path: &mut IonPath,
        invalid_transitions: HashSet<TraversalError>,
    ) -> Violation {
        let mut reasons: Vec<_> = invalid_transitions.into_iter().collect();
        reasons.sort();
        let reasons = reasons
            .into_iter()
            .map(|it| match it {
                TraversalError::CannotExitState(s) => Violation::new(
                    "ordered_elements",
                    ViolationCode::ElementMismatched,
                    format!("{}: min occurs not reached", &self.states[s]),
                    ion_path,
                ),
                TraversalError::CannotReEnterState(s) => Violation::new(
                    "ordered_elements",
                    ViolationCode::ElementMismatched,
                    format!("{}: max occurs already reached", &self.states[s],),
                    ion_path,
                ),
                TraversalError::CannotEnterState(s, v) => Violation::with_violations(
                    "ordered_elements",
                    ViolationCode::ElementMismatched,
                    format!("{}: does not match type", &self.states[s]),
                    ion_path,
                    vec![v],
                ),
            })
            .collect();

        let index = ion_path.pop().unwrap();

        Violation::with_violations(
            "ordered_elements",
            ViolationCode::ElementMismatched,
            format!(
                "input does not match ordered_elements at index {}: {}",
                index,
                event
                    .map(Element::to_string)
                    .unwrap_or_else(|| "<END_OF_INPUT>".to_string())
            ),
            ion_path,
            reasons,
        )
    }
}

/// Details for a state that represents one of the [VariablyOccurringTypeReferences] in an
/// `ordered_elements` constraint.
#[derive(Debug, Clone, PartialEq)]
struct IntermediateState {
    type_ref: TypeReference,
    min_visits: usize,
    max_visits: usize,
    description: String,
}

/// Represents a state in the compiled nondeterministic finite automaton.
#[derive(Debug, Clone, PartialEq)]
enum State {
    Initial,
    Intermediate(StateId, IntermediateState),
    Final(StateId),
}

impl State {
    /// The unique integer identifier for this state.
    fn id(&self) -> StateId {
        match self {
            State::Initial => 0usize,
            State::Intermediate(id, _) => *id,
            State::Final(id) => *id,
        }
    }

    /// Checks whether the state can be visited more times or if the current path must move to a
    /// different state.
    fn can_reenter(&self, num_visits: usize) -> bool {
        match self {
            State::Initial => false,
            State::Intermediate(_, s) => num_visits < s.max_visits,
            State::Final(_) => false,
        }
    }

    /// Checks whether the state has been visited enough times to allow exiting that state (as
    /// opposed to either looping or dying out).
    fn can_exit(&self, num_visits: usize) -> bool {
        match self {
            State::Initial => true,
            State::Intermediate(_, s) => num_visits >= s.min_visits,
            State::Final(_) => false,
        }
    }

    /// Tests whether the `element` is valid for the [TypeReference] of this state.
    fn can_enter(
        &self,
        element: Option<&Element>,
        type_store: &TypeStore,
        ion_path: &mut IonPath,
    ) -> ValidationResult {
        match self {
            State::Initial => unreachable!("There are no transitions to the initial state."),
            State::Intermediate(_, s) => {
                if let Some(el) = element {
                    let t = s.type_ref;
                    t.validate(&IonSchemaElement::from(el), type_store, ion_path)
                } else {
                    Err(Violation::new(
                        "ordered_elements",
                        ViolationCode::ElementMismatched,
                        "expected another element; found <END OF SEQUENCE>",
                        ion_path,
                    ))
                }
            }
            State::Final(_) => {
                if element.is_some() {
                    Err(Violation::new(
                        "ordered_elements",
                        ViolationCode::ElementMismatched,
                        format!("expected <END OF SEQUENCE>; found: {}", element.unwrap()),
                        ion_path,
                    ))
                } else {
                    Ok(())
                }
            }
        }
    }
}

impl Display for State {
    fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
        let string = match self {
            State::Initial => "<START OF SEQUENCE>",
            State::Intermediate(i, s) => &s.description,
            State::Final(_) => "<END OF SEQUENCE>",
        };
        f.write_str(string)
    }
}

/// The reason why a transition (or edge) in the state machine graph cannot be traversed.
#[derive(Debug)]
enum TraversalError {
    CannotEnterState(StateId, Violation),
    CannotExitState(StateId),
    CannotReEnterState(StateId),
}

impl PartialOrd for TraversalError {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for TraversalError {
    fn cmp(&self, other: &Self) -> Ordering {
        let self_id = match self {
            TraversalError::CannotEnterState(id, _)
            | TraversalError::CannotExitState(id)
            | TraversalError::CannotReEnterState(id) => id,
        };
        let other_id = match other {
            TraversalError::CannotEnterState(id, _)
            | TraversalError::CannotExitState(id)
            | TraversalError::CannotReEnterState(id) => id,
        };
        self_id.cmp(other_id)
    }
}

impl Eq for TraversalError {}

impl PartialEq for TraversalError {
    fn eq(&self, other: &Self) -> bool {
        match (self, other) {
            (
                TraversalError::CannotExitState(self_id),
                TraversalError::CannotExitState(other_id),
            ) => self_id == other_id,
            (
                TraversalError::CannotReEnterState(self_id),
                TraversalError::CannotReEnterState(other_id),
            ) => self_id == other_id,
            // It is okay to ignore the violation here because we only consider one event/element at
            // any given point in the state machine. Since that is the case, if the IDs are the same,
            // then they must represent the same destination state (type reference), and so the
            // violations must be equal.
            (
                TraversalError::CannotEnterState(self_id, _),
                TraversalError::CannotEnterState(other_id, _),
            ) => self_id == other_id,
            (_, _) => false,
        }
    }
}

impl Hash for TraversalError {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // By using unique primes, we cannot get a hash collision unless there's at least as many
        // states as the smallest of the prime numbers. Furthermore, the relatively large spacing
        // between the prime numbers makes it even more unlikely that a collision would occur since
        // the first IDs that could have a collision with each other would be 107 and 307.
        state.write_usize(match self {
            TraversalError::CannotEnterState(id, _) => id * 503,
            TraversalError::CannotExitState(id) => id * 307,
            TraversalError::CannotReEnterState(id) => id * 107,
        })
    }
}