Documentation comments are taken in part from the SSML specification which can be found here. All copied sections will be marked with:
“Speech Synthesis Markup Language (SSML) Version 1.1” Copyright © 2010 W3C® (MIT, ERCIM, Keio), All Rights Reserved.
If any sections aren’t marked, please submit a PR. For types, this copyright notice is placed on the top-level type rather than on each field for conciseness, but keep in mind that the field documentation is taken from the same section of the standard.
Structs
- AudioAttributes
- The audio element supports the insertion of recorded audio files and the insertion of other audio formats in conjunction with synthesized speech output. The audio element may be empty. If the audio element is not empty then the contents should be the marked-up text to be spoken if the audio document is not available. The alternate content may include text, speech markup, desc elements, or other audio elements. The alternate content may also be used when rendering the document to non-audible output and for accessibility (see the desc element).
- BreakAttributes
- The break element is an empty element that controls the pausing or other prosodic boundaries between tokens. The use of the break element between any pair of tokens is optional. If the element is not present between tokens, the synthesis processor is expected to automatically determine a break based on the linguistic context. In practice, the break element is most often used to override the typical automatic behavior of a synthesis processor.
- EmphasisAttributes
- “Speech Synthesis Markup Language (SSML) Version 1.1” Copyright © 2010 W3C® (MIT, ERCIM, Keio), All Rights Reserved.
- LangAttributes
- The lang element is used to specify the natural language of the content. This element MAY be used when there is a change in the natural language.
- LanguageAccentPair
- A language accent pair: a required language and an optional accent in which to speak the language.
- LexiconAttributes
- An SSML document MAY reference one or more lexicon documents. A lexicon document is located by a URI with an OPTIONAL media type and is assigned a name that is unique in the SSML document. Any number of lexicon elements MAY occur as immediate children of the speak element.
- LookupAttributes
- The lookup element MUST have a ref attribute. The ref attribute specifies a name that references a lexicon document as assigned by the xml:id attribute of the lexicon element.
- MarkAttributes
- A mark element is an empty element that places a marker into the text/tag sequence. It has one REQUIRED attribute, name, which is of type xsd:token [SCHEMA2 §3.3.2]. The mark element can be used to reference a specific location in the text/tag sequence, and can additionally be used to insert a marker into an output stream for asynchronous notification. When processing a mark element, a synthesis processor MUST do one or both of the following:
- MetaAttributes
- The metadata and meta elements are containers in which information about the document can be placed. The metadata element provides more general and powerful treatment of metadata information than meta by using a metadata schema.
- PhonemeAttributes
- The phoneme element provides a phonemic/phonetic pronunciation for the contained text. The phoneme element may be empty. However, it is recommended that the element contain human-readable text that can be used for non-spoken rendering of the document. For example, the content may be displayed visually for users with hearing impairments.
- ProsodyAttributes
- “Speech Synthesis Markup Language (SSML) Version 1.1” Copyright © 2010 W3C® (MIT, ERCIM, Keio), All Rights Reserved.
- SayAsAttributes
- The say-as element allows the author to indicate information on the type of text construct contained within the element and to help specify the level of detail for rendering the contained text. The say-as element has three attributes: interpret-as, format, and detail. The interpret-as attribute is always required; the other two attributes are optional. The legal values for the format attribute depend on the value of the interpret-as attribute. The say-as element can only contain text to be rendered.
- SpeakAttributes
- The Speech Synthesis Markup Language is an XML application. The root element is speak.
- SubAttributes
- The sub element is employed to indicate that the text in the alias attribute value replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form. The REQUIRED alias attribute specifies the string to be spoken instead of the enclosed string. The processor SHOULD apply text normalization to the alias value.
- TokenAttributes
- The token element allows the author to indicate its content is a token and to eliminate token (word) segmentation ambiguities of the synthesis processor.
- VoiceAttributes
- The voice element is a production element that requests a change in speaking voice. There are two kinds of attributes for the voice element: those that indicate desired features of a voice and those that control behavior. The voice feature attributes are:
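As an illustration of the required and optional attributes described for the say-as element, here is a minimal sketch. The `SayAsSketch` type and its `open_tag` method are hypothetical names invented for this example and are not this crate's API; the sketch also assumes attribute values need no XML escaping.

```rust
/// Hypothetical stand-in for the attributes of a `say-as` element.
/// This is NOT the crate's `SayAsAttributes` type, only an illustration.
#[derive(Debug, Clone, PartialEq)]
struct SayAsSketch {
    /// `interpret-as` is always required, so it is a plain `String`.
    interpret_as: String,
    /// `format` is optional; its legal values depend on `interpret_as`.
    format: Option<String>,
    /// `detail` is optional.
    detail: Option<String>,
}

impl SayAsSketch {
    /// Render the opening tag, emitting optional attributes only when set.
    /// Assumption: values contain no characters that need XML escaping.
    fn open_tag(&self) -> String {
        let mut tag = format!("<say-as interpret-as=\"{}\"", self.interpret_as);
        if let Some(format) = &self.format {
            tag.push_str(&format!(" format=\"{}\"", format));
        }
        if let Some(detail) = &self.detail {
            tag.push_str(&format!(" detail=\"{}\"", detail));
        }
        tag.push('>');
        tag
    }
}

fn main() {
    let date = SayAsSketch {
        interpret_as: "date".to_string(),
        format: Some("mdy".to_string()),
        detail: None,
    };
    // prints: <say-as interpret-as="date" format="mdy">
    println!("{}", date.open_tag());
}
```

Modelling the optional attributes as `Option<String>` keeps the "always required" rule for interpret-as visible in the type itself.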
Enums
- ContourElement
- The pitch contour is defined as a set of white space-separated targets at specified time positions in the speech output. The algorithm for interpolating between the targets is processor-specific. In each pair of the form (time position,target), the first value is a percentage of the period of the contained text (a number followed by “%”) and the second value is the value of the pitch attribute (a number followed by “Hz”, a relative change, or a label value). Time position values outside 0% to 100% are ignored. If a pitch value is not defined for 0% or 100% then the nearest pitch target is copied. All relative values for the pitch are relative to the pitch value just before the contained text.
- EmphasisLevel
- “Speech Synthesis Markup Language (SSML) Version 1.1” Copyright © 2010 W3C® (MIT, ERCIM, Keio), All Rights Reserved.
- FetchHint
- This tells the synthesis processor whether or not it can attempt to optimize rendering by pre-fetching audio. The value is either safe to say that audio is only fetched when it is needed, never before; or prefetch to permit, but not require the processor to pre-fetch the audio.
- Gender
- Attribute indicating the preferred gender of the voice to speak the contained text.
- OnLanguageFailure
- The onlangfailure attribute is an optional attribute that contains one value from the following enumerated list describing the desired behavior of the synthesis processor upon language speaking failure. A conforming synthesis processor must report a language speaking failure in addition to taking the action(s) below.
- ParsedElement
- Enum representing the parsed element. Each element that allows attributes also contains an object for its attributes.
- PhonemeAlphabet
- The phonemic/phonetic pronunciation alphabet. A pronunciation alphabet in this context refers to a collection of symbols to represent the sounds of one or more human languages.
- PitchContour
- The pitch contour is defined as a set of white space-separated targets at specified time positions in the speech output. The algorithm for interpolating between the targets is processor-specific. In each pair of the form (time position,target), the first value is a percentage of the period of the contained text (a number followed by “%”) and the second value is the value of the pitch attribute (a number followed by “Hz”, a relative change, or a label value). Time position values outside 0% to 100% are ignored. If a pitch value is not defined for 0% or 100% then the nearest pitch target is copied. All relative values for the pitch are relative to the pitch value just before the contained text.
- PitchRange
- Although the exact meaning of “pitch range” will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the dynamic range of the output pitch.
- PitchStrength
- “Speech Synthesis Markup Language (SSML) Version 1.1” Copyright © 2010 W3C® (MIT, ERCIM, Keio), All Rights Reserved.
- PositiveNumber
- Representation of positive numbers in SSML tags. We distinguish float from integral values so that numeric errors are minimised when re-serializing.
- RateRange
- A change in the speaking rate for the contained text. Legal values are: a non-negative percentage or “x-slow”, “slow”, “medium”, “fast”, “x-fast”, or “default”. Labels “x-slow” through “x-fast” represent a sequence of monotonically non-decreasing speaking rates. When the value is a non-negative percentage it acts as a multiplier of the default rate. For example, a value of 100% means no change in speaking rate, a value of 200% means a speaking rate twice the default rate, and a value of 50% means a speaking rate of half the default rate. The default rate for a voice depends on the language and dialect and on the personality of the voice. The default rate for a voice SHOULD be such that it is experienced as a normal speaking rate for the voice when reading aloud text. Since voices are processor-specific, the default rate will be as well.
- RateStrength
- “Speech Synthesis Markup Language (SSML) Version 1.1” Copyright © 2010 W3C® (MIT, ERCIM, Keio), All Rights Reserved.
- Sign
- Sign for relative values (positive or negative).
- SsmlElement
- Type of the SSML element.
- Strength
- The strength attribute is an optional attribute having one of the following values: “none”, “x-weak”, “weak”, “medium” (default value), “strong”, or “x-strong”. This attribute is used to indicate the strength of the prosodic break in the speech output. The value “none” indicates that no prosodic break boundary should be outputted, which can be used to prevent a prosodic break which the processor would otherwise produce. The other values indicate monotonically non-decreasing (conceptually increasing) break strength between tokens. The stronger boundaries are typically accompanied by pauses. “x-weak” and “x-strong” are mnemonics for “extra weak” and “extra strong”, respectively.
- TimeDesignation
- SSML times use only seconds or milliseconds, in the form “%fs” or “%fms”. This type handles parsing these times.
- Unit
- Unit used to measure relative changes in values. This is either a percentage or, for pitches, semitones or Hertz.
- VolumeRange
- The volume for the contained text. Legal values are: a number preceded by “+” or “-” and immediately followed by “dB”; or “silent”, “x-soft”, “soft”, “medium”, “loud”, “x-loud”, or “default”. The default is +0.0dB. Specifying a value of “silent” amounts to specifying minus infinity decibels (dB).
- VolumeStrength
- “Speech Synthesis Markup Language (SSML) Version 1.1” Copyright © 2010 W3C® (MIT, ERCIM, Keio), All Rights Reserved.
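To make the time designation format above concrete, the following is a rough sketch of parsing a value given in seconds or milliseconds. The function `parse_time_sketch` is a hypothetical helper written for this example, not the crate's parser, and it is slightly more lenient than the SSML grammar because it relies on Rust's `f64` parsing.

```rust
use std::time::Duration;

/// Hypothetical helper: parse an SSML-style time such as "250ms" or "1.5s".
/// Returns `None` for a missing unit, a malformed number, or a negative value.
/// Assumption: `f64::from_str` is used for the number, which accepts forms
/// (for example exponents) that the SSML grammar itself may not allow.
fn parse_time_sketch(input: &str) -> Option<Duration> {
    let input = input.trim();
    // Check the "ms" suffix first, because "250ms" also ends in "s".
    let (number, divisor) = if let Some(n) = input.strip_suffix("ms") {
        (n, 1000.0)
    } else if let Some(n) = input.strip_suffix('s') {
        (n, 1.0)
    } else {
        return None;
    };
    let value: f64 = number.parse().ok()?;
    // `try_from_secs_f64` rejects negative, NaN, and infinite inputs.
    Duration::try_from_secs_f64(value / divisor).ok()
}

fn main() {
    println!("{:?}", parse_time_sketch("250ms"));
    println!("{:?}", parse_time_sketch("1.5s"));
    println!("{:?}", parse_time_sketch("10"));
}
```

Checking the longer `ms` suffix before `s` is the one ordering detail that matters here; reversing the two checks would leave "250m" as the number and fail to parse.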