pub struct RedundantParenCollapser<S>
where
    S: ScopeData,
{ /* private fields */ }
Streaming middleware that collapses an explicit parenthetical reading annotation into the converted hanja word it duplicates.
Mixed-script input sometimes spells a word together with a parenthetical
gloss, either hanja-first (庫間(곳간)) or hangul-first (곳간(庫間)). Left
alone, the converter would render the hanja and keep the parenthetical,
producing a redundant 곳간(곳간). An author who wrote such a gloss meant
“annotate this word fully”, so this middleware detects the two patterns,
removes the now-redundant parenthetical text, and sets both
Annotation::require_hanja and Annotation::require_hangul on the
surviving annotation. Setting both flags reproduces the author’s intent in
every render mode: RenderMode::HangulOnly honours require_hanja
(곳간(庫間)) while RenderMode::Original honours require_hangul
(庫間(곳간)).
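The effect of setting both flags can be sketched with a small stand-in model. Everything below is hypothetical except the names `require_hanja`, `require_hangul`, `RenderMode::HangulOnly`, and `RenderMode::Original`, which come from the text above. The `Annotation` fields `hanja` and `reading` and the `render` function are assumptions made for illustration, not the crate's actual types.

```rust
/// Stand-in for the crate's render modes (only the two named above).
#[derive(Debug, Clone, Copy, PartialEq)]
enum RenderMode {
    HangulOnly,
    Original,
}

/// Hypothetical annotation model; the real `Annotation` likely has more fields.
struct Annotation {
    hanja: String,
    reading: String,
    require_hanja: bool,
    require_hangul: bool,
}

/// Assumed rendering rule: each mode shows its own script and adds the
/// other script in parentheses only when the matching flag is set.
fn render(a: &Annotation, mode: RenderMode) -> String {
    match mode {
        RenderMode::HangulOnly if a.require_hanja => format!("{}({})", a.reading, a.hanja),
        RenderMode::HangulOnly => a.reading.clone(),
        RenderMode::Original if a.require_hangul => format!("{}({})", a.hanja, a.reading),
        RenderMode::Original => a.hanja.clone(),
    }
}

/// The annotation as the collapser would leave it: both flags set.
fn collapsed_example() -> Annotation {
    Annotation {
        hanja: "庫間".to_string(),
        reading: "곳간".to_string(),
        require_hanja: true,
        require_hangul: true,
    }
}
```

Under these assumptions, rendering `collapsed_example()` gives 곳간(庫間) in `HangulOnly` mode and 庫間(곳간) in `Original` mode, so the gloss survives in both without being duplicated.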
A parenthetical may also pin an alternative reading. 數字 is normally
read 숫자, but in the sense “a few characters” it reads 수자; writing
數字(수자) fixes the reading for that occurrence. Such a reading
annotation is told apart from a definition gloss like
庫間(물건을 간직하여 두는 곳) with a two-tier test against the candidate
hangul R:
- Exact match: R equals the annotation’s reading. Collapse and keep the reading.
- Valid alternative reading: R has exactly one hangul syllable per hanja character, and every syllable is a recorded Unihan reading of its character (or the initial-sound-law variant of one). Collapse and override the reading with R.
Anything else (definition glosses, foreign transliterations such as
蔣介石(장제스), or a syllable-count mismatch) is left untouched.
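The two-tier test can be sketched as follows. This is a minimal illustration, not the crate's implementation: `GlossKind`, `classify`, and `sample_table` are hypothetical names, the readings table is a tiny hand-written stand-in for Unihan data, and the initial-sound-law variant check is omitted for brevity.

```rust
use std::collections::HashMap;

/// Outcome of testing a parenthetical candidate R against an annotation.
#[derive(Debug, PartialEq)]
enum GlossKind {
    ExactMatch,
    AlternativeReading,
    NotAReading,
}

/// True for precomposed hangul syllables (U+AC00..=U+D7A3).
fn is_hangul_syllable(c: char) -> bool {
    ('\u{AC00}'..='\u{D7A3}').contains(&c)
}

/// Tier 1: exact match. Tier 2: one recorded reading per hanja character.
fn classify(
    hanja: &str,
    reading: &str,
    candidate: &str,
    readings: &HashMap<char, Vec<char>>,
) -> GlossKind {
    if candidate == reading {
        return GlossKind::ExactMatch;
    }
    let hanja_chars: Vec<char> = hanja.chars().collect();
    let syllables: Vec<char> = candidate.chars().collect();
    // A syllable-count mismatch or any non-hangul character rules out a reading.
    if hanja_chars.len() != syllables.len() || !syllables.iter().all(|c| is_hangul_syllable(*c)) {
        return GlossKind::NotAReading;
    }
    let all_recorded = hanja_chars
        .iter()
        .zip(&syllables)
        .all(|(h, s)| readings.get(h).map_or(false, |known| known.contains(s)));
    if all_recorded {
        GlossKind::AlternativeReading
    } else {
        GlossKind::NotAReading
    }
}

/// Illustrative sample data only; a real table would be derived from Unihan.
fn sample_table() -> HashMap<char, Vec<char>> {
    HashMap::from([
        ('數', vec!['수', '삭']),
        ('字', vec!['자']),
        ('蔣', vec!['장']),
        ('介', vec!['개']),
        ('石', vec!['석']),
    ])
}
```

With this table, 數字(수자) is classified as an alternative reading, while 蔣介石(장제스) is rejected because 제 and 스 are not recorded readings of 介 and 石.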
The middleware runs immediately after the engine, before
HomophoneMarker and FirstOccurrenceFilter, so later stages observe
the corrected reading and flags. It coalesces adjacent
OutputToken::Text tokens (the streaming engine flushes non-hanja text at
safe points, so "(곳간)" can arrive split as "(곳간" then ")") and buffers
only a bounded amount: a held annotation, the trailing matchable suffix of
the preceding text, and the following parenthetical until it can be
classified. This keeps the streaming result identical to a one-shot
conversion while staying responsive on long hanja-free runs.
OutputToken::Open, OutputToken::Close, and OutputToken::Verbatim
flush the buffer and pass through, so a match never crosses a scope
boundary. When enabled is false the middleware is an exact
pass-through.
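The coalescing behaviour described above can be illustrated with a toy model. It is a simplified sketch under stated assumptions: `Token`, `ToyCollapser`, and `run` are hypothetical, only the hanja-first exact-match pattern is handled, scope tokens are left out, and a single `both_flags` field stands in for the two `require_*` flags.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token {
    /// A converted hanja word; `both_flags` stands in for the two require_* flags.
    Annotation { hanja: String, reading: String, both_flags: bool },
    Text(String),
}

#[derive(Default)]
struct ToyCollapser {
    /// Annotation held back until the text after it can be classified.
    held: Option<(String, String)>,
    /// Coalesced text received since the held annotation.
    pending: String,
}

impl ToyCollapser {
    fn push_token(&mut self, token: Token) -> Vec<Token> {
        match token {
            Token::Annotation { hanja, reading, .. } => {
                let out = self.flush();
                self.held = Some((hanja, reading));
                out
            }
            Token::Text(text) => {
                if self.held.is_none() {
                    return vec![Token::Text(text)];
                }
                self.pending.push_str(&text);
                self.resolve()
            }
        }
    }

    fn resolve(&mut self) -> Vec<Token> {
        let gloss = match &self.held {
            Some((_, reading)) => format!("({})", reading),
            None => return Vec::new(),
        };
        if self.pending.starts_with(&gloss) {
            // Full gloss seen: drop it and emit the flagged annotation.
            let rest = self.pending[gloss.len()..].to_string();
            self.pending.clear();
            let (hanja, reading) = self.held.take().expect("held was checked above");
            let mut out = vec![Token::Annotation { hanja, reading, both_flags: true }];
            if !rest.is_empty() {
                out.push(Token::Text(rest));
            }
            out
        } else if gloss.starts_with(&self.pending) {
            // Still a prefix of the gloss: keep buffering.
            Vec::new()
        } else {
            self.flush()
        }
    }

    fn flush(&mut self) -> Vec<Token> {
        let mut out = Vec::new();
        if let Some((hanja, reading)) = self.held.take() {
            out.push(Token::Annotation { hanja, reading, both_flags: false });
        }
        if !self.pending.is_empty() {
            out.push(Token::Text(std::mem::take(&mut self.pending)));
        }
        out
    }

    fn finish(mut self) -> Vec<Token> {
        self.flush()
    }
}

/// Feeds every token through a fresh collapser and collects the output.
fn run(tokens: Vec<Token>) -> Vec<Token> {
    let mut stage = ToyCollapser::default();
    let mut out = Vec::new();
    for token in tokens {
        out.extend(stage.push_token(token));
    }
    out.extend(stage.finish());
    out
}
```

Feeding the gloss as one text token or split across two yields the same output in this model, which is the property the buffering is meant to preserve.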
Limitation
The collapser runs after the engine and never re-derives readings, so a
hanja-first gloss immediately followed (with no space) by an initial-sound-law
(頭音法則) character keeps the reading the engine chose with the parenthetical
acting as a word boundary. For example 學(학)率 collapses to 학(學)율
rather than 학률: the engine read 率 as word-initial 율 because )
separated it from 學, and removing the gloss cannot recover the
non-word-initial 률. This is narrow in practice; an intended compound is
normally written 學率(학률). Insert a space (學(학) 率) or gloss the whole
compound to control the reading.
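For readers unfamiliar with the initial-sound law, the 률/율 alternation in the example can be reproduced with plain Unicode arithmetic on precomposed hangul syllables. The function `word_initial_form` below is a hypothetical helper, not part of this crate, and it implements only the basic rule (word-initial ㄹ and ㄴ before ㅣ or a y-vowel become ㅇ, and ㄹ before other vowels becomes ㄴ). It ignores the orthographic exceptions, such as 률 being written 율 after a vowel or ㄴ.

```rust
const BASE: u32 = 0xAC00; // first precomposed hangul syllable, 가
const MEDIALS: u32 = 21; // number of medial vowels
const FINALS: u32 = 28; // number of final slots, including "no final"
const NIEUN: u32 = 2; // initial ㄴ
const RIEUL: u32 = 5; // initial ㄹ
const IEUNG: u32 = 11; // initial ㅇ

/// Returns the word-initial spelling of a syllable under the basic rule.
fn word_initial_form(syllable: char) -> char {
    let code = syllable as u32;
    if !(0xAC00..=0xD7A3).contains(&code) {
        return syllable; // not a precomposed hangul syllable
    }
    let index = code - BASE;
    let initial = index / (MEDIALS * FINALS);
    let medial = (index / FINALS) % MEDIALS;
    let final_slot = index % FINALS;
    // Medial indices for ㅑ, ㅕ, ㅖ, ㅛ, ㅠ, ㅣ.
    let i_or_y_vowel = matches!(medial, 2 | 6 | 7 | 12 | 17 | 20);
    let new_initial = match (initial, i_or_y_vowel) {
        (RIEUL, true) | (NIEUN, true) => IEUNG,
        (RIEUL, false) => NIEUN,
        _ => initial,
    };
    let composed = BASE + (new_initial * MEDIALS + medial) * FINALS + final_slot;
    char::from_u32(composed).unwrap_or(syllable)
}
```

Applied to 률 this yields 율, the form the engine picks for 率 when the closing parenthesis makes it look word-initial.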
Implementations
impl<S> RedundantParenCollapser<S>
where
    S: ScopeData,
pub fn new(enabled: bool) -> RedundantParenCollapser<S>
Creates a collapser. When enabled is false every token passes
through unchanged.
pub fn push_token(&mut self, token: OutputToken<S>) -> Vec<OutputToken<S>>
Pushes one output token and returns tokens ready for downstream stages.
pub fn finish(self) -> Vec<OutputToken<S>>
Flushes buffered tokens and returns them.
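The three methods suggest a simple driving loop: push each token, forward whatever comes back, then call `finish` once at end of input. The sketch below shows that loop against a hypothetical `TokenStage` trait that mirrors the signatures above. The trait, the `PassThrough` stage (modelled on the documented `enabled == false` behaviour), and the `drive` helper are assumptions for illustration and are not part of this crate.

```rust
/// Hypothetical trait mirroring the `push_token` / `finish` shape above.
trait TokenStage<T> {
    fn push_token(&mut self, token: T) -> Vec<T>;
    fn finish(self) -> Vec<T>;
}

/// Stand-in for a disabled collapser: every token passes through unchanged.
struct PassThrough;

impl<T> TokenStage<T> for PassThrough {
    fn push_token(&mut self, token: T) -> Vec<T> {
        vec![token]
    }

    fn finish(self) -> Vec<T> {
        Vec::new() // nothing is ever buffered
    }
}

/// Pushes every input token, then flushes the stage exactly once.
fn drive<T, S: TokenStage<T>>(mut stage: S, input: impl IntoIterator<Item = T>) -> Vec<T> {
    let mut out = Vec::new();
    for token in input {
        out.extend(stage.push_token(token));
    }
    out.extend(stage.finish());
    out
}
```

Because `finish` takes `self` by value, the stage cannot be pushed to again after it has been flushed, and skipping `finish` would drop any tokens still held in the buffer.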