What is Speech Synthesis: 3 Important Factors Related To It


Speech Synthesis

The method of generating human-like speech artificially with the help of machines is called speech synthesis. A computer system used to carry out this procedure is termed a speech synthesizer, and it can be implemented in either software or hardware. One common application is the Text-to-Speech (TTS) system, which accepts everyday human language in text form as input and converts it into speech as output.

Speech synthesis is done by concatenating units of recorded speech that are stored in a database. Systems vary in the size of the stored voice units: a system that stores phones or diphones provides the most extensive output range, but possibly with some loss of clarity.

The storage of whole words or sentences allows for high-quality output in particular usage domains. Alternatively, a synthesizer can incorporate a model of the vocal tract and other characteristics of the human voice to generate an entirely artificial voice output.

Overview of a TTS system

A speech synthesizer’s output quality is judged by its closeness to the real human voice and by how easily it can be understood. Speech synthesis devices have been in use since the early 1990s and have been developed extensively to help people with specific disabilities and impairments.

Overview of Text-to-Speech System

There are two significant parts to a text-to-speech system:

  • Front End– It is responsible for converting the input text, which contains various symbols, numbers, and abbreviations, into an equivalent form of understandable and convertible data. This process is termed text normalization, or pre-processing, of the data. The front end then assigns phonetic transcriptions to each word and divides and tags the text into prosodic units, such as sentences, clauses, and phrases, through a process called text-to-phoneme or grapheme-to-phoneme conversion. The two results are combined to generate output data containing the symbolic linguistic representation (a minimal sketch of these steps follows this list).
  • Back End– Generally referred to as the “synthesizer”, this part is responsible for converting the symbolic linguistic representation into sound. In advanced systems, this step also includes computation of the target prosody (pitch contour, phoneme durations), which is then applied to the output speech.
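
To make these front-end steps concrete, here is a minimal Python sketch under stated assumptions: the abbreviation table and mini-lexicon are hypothetical stand-ins for the large linguistic resources a real front end would use, and unknown words fall through to a naive spell-it-out fallback rather than genuine letter-to-sound rules.

```python
import re

# Toy TTS front end: text normalization followed by dictionary-based
# grapheme-to-phoneme lookup. All entries below are illustrative only.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "no.": "number"}
LEXICON = {  # word -> ARPAbet-style phoneme list (hypothetical mini-lexicon)
    "doctor": ["D", "AA1", "K", "T", "ER0"],
    "smith": ["S", "M", "IH1", "TH"],
    "lives": ["L", "IH1", "V", "Z"],
    "here": ["HH", "IH1", "R"],
}

def normalize(text):
    """Text normalization: lowercase, expand known abbreviations, strip punctuation."""
    tokens = [ABBREVIATIONS.get(t, t) for t in text.lower().split()]
    cleaned = [re.sub(r"[^a-z']", "", t) for t in tokens]
    return [t for t in cleaned if t]

def to_phonemes(words):
    """Grapheme-to-phoneme: dictionary lookup; a real system would back off
    to letter-to-sound rules instead of naively spelling the word out."""
    return [LEXICON.get(w, list(w.upper())) for w in words]

print(to_phonemes(normalize("Dr. Smith lives here.")))
```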
Speech synthesizer used by Stephen Hawking; Image Source: Science Museum London / Science and Society Picture Library, “Computer and speech synthesiser housing” (9663804888), CC BY-SA 2.0

Technologies involved in Speech Synthesis

Naturalness and intelligibility are the most significant attributes determining the quality of a speech synthesis device. Naturalness describes how closely the output resembles the human voice, while intelligibility measures how easily the output can be understood by a listener. Speech synthesizers strive to produce optimal results in both these aspects.

Concatenative synthesis and formant synthesis are the two primary technologies for generating synthetic speech waveforms. Each has its strengths and weaknesses, and the intended use of a synthesis system usually dictates which approach is chosen.

Concatenative Synthesis

Concatenative synthesis sequences fragments of recorded speech to form new utterances. This process typically produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the behavior of the automated waveform segmentation techniques often result in audible glitches in the output.

There are three important sub-types of concatenative synthesis:

  1. Unit selection synthesis– The input for this technique is an extensive database of recorded speech. Segmentation of the database is carried out using a speech recognizer set to forced alignment mode, producing units such as phones, diphones, syllables, morphemes, words, phrases, and sentences. These units are indexed on parameters such as pitch, duration, position in the syllable, and the neighboring phones. At run time, a decision-tree-based search selects the most suitable chain of candidate units from the database (a cost-minimization sketch follows this list). The more extensive the database, the more natural the output speech; this technique offers the greatest naturalness for output speech based on recorded data.
  2. Diphone synthesis– The database for this technique contains only diphones, which makes it relatively small. The phonotactics of the chosen language determine the set of all unique diphones to be considered, and the speech database contains a single recording of each diphone. Digital signal processing techniques such as PSOLA, MBROLA, and linear predictive coding are used to superimpose the prosody of the target sentence on these diphone units. Usage of diphone synthesis is mostly limited to research because the speech lacks naturalness, sounds very robotic, and contains sonic glitches.
  3. Domain-specific synthesis– The database for this technique is confined to pre-recorded words and phrases, so its applicability is limited to the domain for which the database was generated, for example, railway station announcements, weather reports, or talking clocks. Implementation of this technology is straightforward, and at the same time a high level of naturalness can be achieved because the variety of output sentences is limited. To achieve a smooth blending of words into natural speech, however, many variations within the language must be accounted for.
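
Unit selection is commonly framed as a search for the candidate sequence that minimizes a weighted sum of target costs (how well a candidate matches the desired unit) and join costs (how smoothly consecutive candidates concatenate). The sketch below, using made-up feature vectors and weights, runs this search as a simple Viterbi-style dynamic program; real systems use far richer cost functions.

```python
import numpy as np

def select_units(targets, candidates, w_target=1.0, w_join=0.5):
    """Viterbi search over candidate units: minimize the sum of target costs
    (distance between desired and candidate features) and join costs
    (feature discontinuity between consecutive chosen units).

    targets:    list of desired feature vectors, one per position
    candidates: list of arrays; candidates[i][j] is candidate j at position i
    """
    n = len(targets)
    best = [w_target * np.linalg.norm(candidates[0] - targets[0], axis=1)]
    back = []
    for i in range(1, n):
        tc = w_target * np.linalg.norm(candidates[i] - targets[i], axis=1)
        # join cost between every candidate at i-1 and every candidate at i
        jc = w_join * np.linalg.norm(
            candidates[i - 1][:, None, :] - candidates[i][None, :, :], axis=2)
        total = best[-1][:, None] + jc          # shape (prev, cur)
        back.append(total.argmin(axis=0))
        best.append(total.min(axis=0) + tc)
    path = [int(best[-1].argmin())]             # backtrack the cheapest path
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]

# Example: 3 positions, 5 random 4-dim candidates each
rng = np.random.default_rng(0)
targets = [rng.normal(size=4) for _ in range(3)]
candidates = [rng.normal(size=(5, 4)) for _ in range(3)]
print(select_units(targets, candidates))        # indices of the chosen units
```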

Formant Synthesis

For many applications, naturalness of speech is not the goal; rather, reliability, intelligibility, and speed are more important. These can be achieved using formant synthesis, which creates synthesized speech employing additive synthesis and acoustic modeling. This method, also called rule-based synthesis, creates an artificial speech waveform by varying parameters such as formant frequencies, noise levels, and voicing.

The artificial, robotic-sounding speech created by formant synthesis is highly unlikely to be mistaken for human speech, but the acoustic glitches that are common in concatenative systems are largely eliminated in this technique. Because formant synthesizers do not need an extensive database of speech recordings, these programs are relatively tiny, which makes them well suited to embedded systems where processing power and memory are limited.

It is possible to convey a variety of voice tones and emotions, apart from standard questions and statements, because formant-based systems exhibit complete control over all aspects of the output. For instance, many notable video games have made use of formant synthesis technology for interactive speech.
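
As a rough illustration of the source-filter idea behind formant synthesis, the Python sketch below passes an impulse-train glottal source through a cascade of second-order resonators, one per formant. The formant frequencies approximate the vowel /a/; the sample rate, bandwidths, and gain handling are simplifying assumptions, not a reference implementation.

```python
import numpy as np
from scipy.signal import lfilter

FS = 16000  # sample rate in Hz (chosen arbitrarily for this sketch)

def resonator(signal, freq, bw):
    """Second-order IIR resonator: one formant at `freq` Hz, bandwidth `bw` Hz."""
    r = np.exp(-np.pi * bw / FS)
    theta = 2 * np.pi * freq / FS
    a = [1.0, -2 * r * np.cos(theta), r * r]   # pole pair at the formant
    b = [1.0 - r]                              # rough gain normalization
    return lfilter(b, a, signal)

def synthesize_vowel(f0=120.0, formants=((730, 90), (1090, 110), (2440, 170)),
                     dur=0.5):
    """Cascade formant synthesis of a static vowel (formant values near /a/).
    The voiced source is a simple impulse train at the fundamental f0."""
    n = int(dur * FS)
    source = np.zeros(n)
    source[::int(FS / f0)] = 1.0               # glottal excitation
    speech = source
    for freq, bw in formants:                  # one resonator per formant
        speech = resonator(speech, freq, bw)
    return speech / np.max(np.abs(speech))

audio = synthesize_vowel()                     # 0.5 s of a vowel-like sound
```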

Articulatory Synthesis

The method of generating speech sounds based on a model of the human vocal tract is called articulatory synthesis. It aims to simulate the speech articulators in one or more ways, and it offers a means to understand speech production and to conduct research in phonetics.

Coarticulation occurs naturally in such a model, and, in theory, it should be possible to correctly handle the properties of the glottal source, the interaction of the vocal tract with the vocal folds, and the influence of the subglottal system, the nasal tract, and the sinus cavities on the generation of human-like speech.

Articulatory synthesis typically comprises two distinct components: a model of the vocal tract, divided into several sub-sections whose cross-sectional areas serve as parameters, and a model of the vocal cord characteristics. In the acoustic model, each cross-sectional section is approximated by an electrical analog of a transmission line.

Simulation of the vocal tract is driven by changes in these area functions over time. The target configuration assigned to each sound determines the pace of vocal tract movement. If properly constructed, an articulatory synthesizer can reproduce every relevant effect in the development of fricatives and plosives and can model coarticulation transitions, replicating the processes involved in real speech production.
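
One classical way to turn an area function into sound is the lossless tube model: reflection coefficients between adjacent tube sections define an all-pole filter that is excited by a glottal source. The sketch below assumes a static, made-up eight-section area function and ignores losses, the nasal tract, and lip radiation; sign conventions for the reflection coefficients vary between texts.

```python
import numpy as np
from scipy.signal import lfilter

def tube_to_filter(areas):
    """Convert a vocal-tract area function (lossless tube model) into an
    all-pole filter. Reflection coefficients come from adjacent sections
    (Kelly-Lochbaum); the step-up recursion turns them into coefficients."""
    areas = np.asarray(areas, dtype=float)
    ks = (areas[1:] - areas[:-1]) / (areas[1:] + areas[:-1])
    a = np.array([1.0])
    for k in ks:                        # Levinson step-up recursion
        a = np.append(a, 0.0)
        a = a + k * a[::-1]
    return a                            # denominator of H(z) = 1 / A(z)

# Hypothetical 8-section area function (cm^2): narrow at the glottis,
# open at the lips; the section-length/sample-rate relation is glossed over.
areas = [0.6, 1.0, 1.4, 2.0, 3.0, 4.5, 6.0, 8.0]
a = tube_to_filter(areas)

fs = 8000
source = np.zeros(fs // 2)              # 0.5 s of excitation
source[::fs // 100] = 1.0               # glottal impulse train at 100 Hz
speech = lfilter([1.0], a, source)      # vowel-like output
```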

In the mid-1970s, at Haskins Laboratories, Philip Rubin, Tom Baer, and Paul Mermelstein created the first articulatory synthesizer commonly used for laboratory experiments.

HMM-based Synthesis

This is a statistical parametric synthesis method based on hidden Markov models (HMMs). In this method, HMMs simultaneously model the frequency spectrum, fundamental frequency, and duration of speech. Speech waveforms are then generated from the HMMs themselves based on the maximum likelihood criterion.
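
To give a flavor of the parametric idea, the toy sketch below walks a left-to-right HMM for a single phone: each state carries a Gaussian emission model for one acoustic parameter plus an explicit duration model, and the static maximum-likelihood trajectory is simply the sequence of state means. All numbers are invented; real systems also model variances and delta features to obtain smooth trajectories.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy left-to-right HMM for one phone. Each state has a Gaussian emission
# model of one acoustic parameter (only its mean is used below) and an
# explicit duration model. All values are invented for illustration.
state_means = np.array([0.2, 0.9, 0.4])   # per-state emission means
dur_means = np.array([5, 12, 6])          # expected frames per state

def generate_trajectory():
    """Walk the states left to right, sampling a duration for each state.
    Ignoring variances and delta features, the maximum-likelihood static
    trajectory is piecewise constant at the state means."""
    frames = []
    for mu, d in zip(state_means, dur_means):
        n = max(1, rng.poisson(d))        # sampled state duration in frames
        frames.extend([mu] * n)
    return np.array(frames)

print(generate_trajectory())
```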

A hidden Markov model (HMM) is, more generally, a mathematical technique widely used in computational biology for modeling biological sequences. There, a sequence is modeled as the output of a discrete stochastic process that advances through a series of states that are ‘hidden’ from the observer.

Sinewave Synthesis

Sinewave synthesis, or sinewave voice, is a method of synthesizing speech by substituting pure tone whistles for the formants (prominent energy bands). Philip Rubin created the first sinewave synthesis software (SWS) for the automated production of stimuli for perceptual experiments at Haskins Laboratories in the 1970s.

Sinewave speech is a peculiar phenomenon in which a small number of sinusoids, combined, take on some of the features of speech, even though individually they do not resemble speech at all in most respects. High intelligibility can be achieved using three sinusoids that track the frequency and amplitude of the first three speech formants.
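
The sketch below illustrates the idea: each formant is replaced by a single sinusoid whose frequency and amplitude follow a formant track, with time-varying frequency handled by integrating phase sample by sample. The tracks here are invented glides, not measurements from real speech.

```python
import numpy as np

FS = 16000  # sample rate in Hz

def sinewave_speech(formant_tracks, amp_tracks):
    """Sum sinusoids whose frequency and amplitude follow the given tracks
    (one value per sample). Frequency modulation is realized by integrating
    instantaneous frequency into phase."""
    out = np.zeros(len(formant_tracks[0]))
    for freqs, amps in zip(formant_tracks, amp_tracks):
        phase = 2 * np.pi * np.cumsum(freqs) / FS
        out += amps * np.sin(phase)
    return out / np.max(np.abs(out))

# Hypothetical half-second gesture: F1 glides 300->700 Hz, F2 2200->1100 Hz,
# F3 stays near 2500 Hz.
n = FS // 2
tracks = [np.linspace(300, 700, n), np.linspace(2200, 1100, n),
          np.full(n, 2500.0)]
amps = [np.full(n, 1.0), np.full(n, 0.5), np.full(n, 0.25)]
audio = sinewave_speech(tracks, amps)
```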

Deep Learning-based Synthesis

Unlike the HMM-based approach, the deep learning-based method directly maps linguistic features to acoustic features using deep neural networks, which have proven extremely successful at learning the inherent characteristics of data. Various models have been proposed in the long line of studies on deep learning-based speech synthesis.

Deep learning, which is capable of exploiting vast quantities of training data, has become a useful tool for speech synthesis. Recently, more and more research on deep learning techniques and even end-to-end systems has been performed, and state-of-the-art results have been achieved.

Image Source: Original file: Avimanyu786; SVG version: Tukijaaliwa, AI-ML-DL, CC BY-SA 4.0

September 2016 marked the release of WaveNet by DeepMind, a deep generative model of raw audio waveforms. It made evident that deep learning-based models can model raw waveforms directly and perform well at generating speech from acoustic features such as spectrograms or from pre-processed linguistic features.
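
The core building block behind WaveNet-style models is the causal dilated convolution, which lets the receptive field grow exponentially with depth while keeping each output dependent only on past samples. The PyTorch sketch below shows just this block and a small stack of them; the full architecture (gated activations, residual and skip connections, a categorical output layer) is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv(nn.Module):
    """One causal dilated 1-D convolution: the output at time t depends only
    on inputs at times <= t, enforced by left-only padding."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                          # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))  # pad the past, not the future

# Stack with dilations 1, 2, 4, 8: receptive field = 1 + 1 + 2 + 4 + 8 = 16
stack = nn.Sequential(*[CausalDilatedConv(16, dilation=2 ** i) for i in range(4)])
x = torch.randn(1, 16, 1000)                       # toy waveform features
y = stack(x)                                       # same length, still causal
```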

Advantages of end-to-end systems-

  • Reduced reliance on a separate text-analysis front end, since a single system handles the whole pipeline.
  • Less manual feature engineering required.
  • Easy conditioning on rich attributes (such as speaker or language) and easy adaptation to new ones.
  • Increased naturalness and intelligibility.
  • More robustness compared to multi-stage models.

Disadvantages of end-to-end systems-

  • Slow inference, since samples are typically generated sequentially.
  • Less robust output speech when training data are limited.
  • Less controllability than the concatenative approach.
  • Flat prosody, a result of averaging over the training data.

Challenges involved in Speech Synthesis

  1. Handling words that share a spelling but are pronounced differently depending on context (heteronyms).
  2. Inferring how to expand a number based on the surrounding words, numbers, and punctuation. For example, ‘1465’ can be read as ‘one thousand four hundred sixty-five’, ‘one four six five’, ‘fourteen sixty-five’, or ‘fourteen hundred and sixty-five’ (a toy expansion heuristic is sketched after this list).
  3. Ambiguity in abbreviations. For example, ‘in’ for ‘inches’ must be differentiated from the word ‘in’.
  4. The dictionary-based approach to the text-to-phoneme process (looking up each word in a dictionary and replacing the spelling with the pronunciation detailed there) completely fails for any word that cannot be found in the dictionary.
  5. The rule-based approach to the text-to-phoneme process (applying pronunciation rules to words to derive their pronunciations from their spellings, akin to ‘learning how to read’) struggles as it is extended to unusual spellings or pronunciations, because the sophistication of the rules then increases considerably.
  6. Difficulty in the reliable evaluation of speech synthesis systems due to a lack of generally accepted objective performance standards.
  7. Shifting the sentence’s pitch contour depending on whether it is an affirmative, interrogative, or exclamatory utterance.
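
As a toy illustration of challenge 2, the sketch below expands a four-digit token either as a cardinal or in year style, deciding with an invented context rule (the preceding word); production text-normalization systems use far richer models and handle many more patterns.

```python
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()

def cardinal(n):
    """Read 0-9999 as a cardinal: 1465 -> 'one thousand four hundred sixty-five'."""
    if n < 20: return ONES[n]
    if n < 100: return TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else "")
    if n < 1000:
        rest = " " + cardinal(n % 100) if n % 100 else ""
        return ONES[n // 100] + " hundred" + rest
    rest = " " + cardinal(n % 1000) if n % 1000 else ""
    return cardinal(n // 1000) + " thousand" + rest

def year(n):
    """Read a four-digit number in year style: 1465 -> 'fourteen sixty-five'."""
    hi, lo = divmod(n, 100)
    if lo == 0: return cardinal(hi) + " hundred"
    return cardinal(hi) + " " + ("oh-" + ONES[lo] if lo < 10 else cardinal(lo))

def expand(token, prev_word):
    """Hypothetical heuristic: after 'in'/'since'/'year', read as a year."""
    n = int(token)
    if 1000 <= n <= 2999 and prev_word in {"in", "since", "year"}:
        return year(n)
    return cardinal(n)

print(expand("1465", "in"))    # fourteen sixty-five
print(expand("1465", "cost"))  # one thousand four hundred sixty-five
```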
