Speech synthesis

Stephen Hawking was one of the most famous people to use a voice synthesizer to communicate

Speech synthesis is the artificial production of speech. The computerized system used for this purpose is called a speech computer or speech synthesizer and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations, such as phonetic transcriptions, into speech.

Synthesized speech can be created by concatenating recorded speech fragments that are stored in a database. Systems differ in the size of the stored speech units: a system that stores phones or diphones provides the largest output range but may lack clarity, while storing entire words or sentences allows for higher-quality audio in specific domains. Alternatively, a synthesizer can incorporate a model of the vocal tract and other characteristics of the human voice to create a fully "synthetic" voice.

The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be understood clearly. A text-to-speech program allows people with visual impairments or reading difficulties to listen to text on a computer. Many operating systems have included speech synthesizers since the early 1990s.

Diagram of a typical TTS system

A system or "engine" Text-to-speech (TTS) is made up of two parts: a front-end and a back-end. The front-end has two main tasks. First, convert text with characters, numbers, symbols, and abbreviations into their written word equivalent. This process is called as "text normalization", "pre-processing" or "tokenization", the front-end then assigns a phonetic transcription to each word, marks and divides the text into prosodic units, such as phrases, clauses and sentences. The process of assigning phonetic transcriptions to words is called "text-to-phoneme" or "grapheme to phoneme". The information from phonetic or prosodic transcriptions prepare the information of the linguistic symbolic representation that is the result of the front-end. The back-end, commonly referred to as the 'synthesizer', converts the linguistic symbolic representation into sound. In some systems, this part includes the computation of "prosodic intent" (profile pitch, phoneme duration), which is implemented in the output voice.

History

Before electronic signal processing was invented, there were attempts to build machines that could imitate human speech. Some early legends of speaking "brazen heads" involved Pope Sylvester II (d. 1003 AD), Albertus Magnus (1198–1280), and Roger Bacon (1214–1294).

In 1779 the Danish scientist Christian Kratzenstein, working at the Russian Academy of Sciences, built models of the human vocal tract that could reproduce the sounds of the five long vowels (in International Phonetic Alphabet notation, [aː], [eː], [iː], [oː] and [uː]). This was followed by the bellows-operated "speaking machine" of Wolfgang von Kempelen of Pressburg (present-day Bratislava), then part of Hungary, described in a 1791 paper. This machine added models of the lips and tongue, allowing it to produce consonants as well as vowels. In 1837 Charles Wheatstone produced a "speaking machine" based on von Kempelen's design, and in 1857 M. Faber built the "Euphonia" machine. Wheatstone's design was revived by Paget in 1923.

In the 1930s, Bell Laboratories developed the vocoder, which automatically analyzed speech into its fundamental tone and resonances. From his work on the vocoder, Homer Dudley developed a keyboard-operated voice synthesizer called the Voder, which was exhibited at the 1939 New York World's Fair.

The "Pattern playback" it was built by Dr. Franklin S. Cooper and his colleagues at Haskins Laboratories in the late 1940s and completed in the 1950s. There have been several versions of this hardware device, but only one exists. The machine converts the images of acoustic speech patterns (spectrograms) into sound. Using this device, Alvin Liberman and his colleagues were able to discover acoustic indicators for the perception of phonetic segments (vowels and consonants).

The dominant systems in the 1980s and 1990s were the DECtalk system, based largely on the work of Dennis Klatt at MIT, and the Bell Labs system, which later became one of the first multilingual language-independent systems, making extensive use of natural language processing methods.

Early speech synthesizers sounded robotic and were often barely intelligible. The quality of synthesized speech has steadily improved, but the output of contemporary speech synthesis systems is still clearly distinguishable from actual human speech.

As the cost-performance ratio has improved, speech synthesizers have become cheaper and more accessible, allowing more people to benefit from text-to-speech programs.

Electronic devices

The computer and voice synthesizer used by Stephen Hawking in 1999

The first computer-based speech synthesis systems were created in the late 1950s. The first general English text-to-speech system was developed by Noriko Umeda et al. in 1968 at the Electrotechnical Laboratory in Japan. In 1961, physicist John Larry Kelly, Jr and his colleague Louis Gerstman used an IBM 704 computer to synthesize speech, an important event in the history of Bell Laboratories. Kelly's voice synthesizer (vocoder) recreated the song "Daisy Bell", with musical accompaniment by Max Mathews. Coincidentally, Arthur C. Clarke was visiting his friend and colleague John Pierce at Bell Laboratories in Murray Hill. Clarke was so impressed by the demonstration that he used it in the climactic scene of his novel 2001: A Space Odyssey, where the HAL 9000 computer sings the same song as astronaut David Bowman puts it to sleep. Despite the success of purely electronic speech synthesis, research on mechanical speech synthesizers is still ongoing.

Handheld electronic devices featuring speech synthesis began to appear in the 1970s. One of the first was the Speech+ portable calculator for the blind from Telesensory Systems Inc. (TSI) in 1976. Other devices were produced primarily for educational purposes, such as the Speak & Spell, created by Texas Instruments in 1978. Fidelity released a speaking version of its electronic chess computer in 1979. The first video game to include speech synthesis was the 1980 arcade space shooter Stratovox, by Sun Electronics (Sunsoft). Another early example is the arcade version of Berzerk from the same year. The first multi-player electronic game to use speech synthesis was Milton, from the Milton Bradley Company, which produced the device in 1980.

Synthesizer technologies

The most important qualities of speech synthesis systems are "naturalness" and "intelligibility". Naturalness describes how closely the output resembles the human voice, while intelligibility is the ease with which the output is understood. The ideal speech synthesizer is both natural and intelligible, and speech synthesis systems usually try to maximize both characteristics.

The two primary technologies for generating synthetic speech waveforms are "concatenative synthesis" and "formant synthesis". Each technology has strengths and weaknesses, and the intended use of a system typically determines which approach is chosen.

Concatenative synthesis

Concatenative synthesis is based on the concatenation (stringing together) of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. There are three main sub-types of concatenative synthesis.

Unit selection synthesis

Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance is segmented into units such as phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences. Usually the segmentation is done with the help of a modified speech recognizer, using visual representations such as the waveform and spectrogram. An index of the units in the speech database is then created based on the segmentation and on acoustic parameters such as the fundamental frequency (pitch), duration, position in the syllable, and neighboring phonemes. At runtime, the desired utterance is created by determining the best chain of candidate units from the database (unit selection). This process is typically carried out using a specially weighted decision tree.
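
The selection step can be pictured as a search that balances how well each candidate unit matches the desired targets against how smoothly adjacent units join. The dynamic-programming sketch below is only a simplified illustration of that idea; the cost functions and the "pitch"/"dur" features are invented for the example, and real systems use much richer acoustic features and weighted decision trees.

```python
# Simplified unit selection: choose one candidate unit per target position so
# that the sum of target costs and concatenation (join) costs is minimal.

def target_cost(cand, target):
    # Mismatch against the requested prosody (illustrative features only).
    return abs(cand["pitch"] - target["pitch"]) + abs(cand["dur"] - target["dur"])

def join_cost(prev, cand):
    # Penalize pitch discontinuities at the concatenation point.
    return abs(prev["pitch"] - cand["pitch"])

def select_units(candidates, targets):
    """candidates[i] is the list of database units matching targets[i]."""
    best = [(target_cost(c, targets[0]), [c]) for c in candidates[0]]
    for i in range(1, len(targets)):
        new_best = []
        for c in candidates[i]:
            cost, path = min(((b_cost + join_cost(b_path[-1], c), b_path)
                              for b_cost, b_path in best), key=lambda t: t[0])
            new_best.append((cost + target_cost(c, targets[i]), path + [c]))
        best = new_best
    return min(best, key=lambda t: t[0])[1]

targets = [{"pitch": 120, "dur": 80}, {"pitch": 110, "dur": 60}]
candidates = [[{"pitch": 118, "dur": 85}, {"pitch": 140, "dur": 70}],
              [{"pitch": 112, "dur": 55}, {"pitch": 95, "dur": 65}]]
print(select_units(candidates, targets))
```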

Unit selection provides the greatest naturalness because it applies only a small amount of digital signal processing (DSP) to the recorded speech. DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the concatenation point to smooth the waveform. The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned. However, maximum naturalness typically requires very large unit-selection databases, in some systems ranging into gigabytes of recorded data and representing dozens of hours of speech. Unit selection algorithms have also been known to select segments from a less-than-ideal location (e.g., minor words become unclear) even when a better choice exists in the database. Recently, researchers have proposed various automated methods for detecting unnatural segments in unit-selection speech synthesis systems.

Diphone synthesis

Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) that occur in a language. The number of diphones depends on the phonotactics of the language: for example, Spanish has around 800 diphones and German around 2500. In diphone synthesis, only one example of each diphone is stored in the speech database. At runtime, the target prosody of a sentence is superimposed on these minimal units by means of digital signal processing techniques such as linear predictive coding, PSOLA or MBROLA, or more recent techniques such as pitch modification in the source domain using the discrete cosine transform. Diphone synthesis suffers from the sonic glitches of concatenative synthesis and the robotic sound of formant synthesis, and has few of the advantages of either approach other than its small size. Its use in commercial applications has declined, although it continues to be investigated because of the number of freely available software implementations.
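
As a rough illustration of the concatenation step itself (without the prosody-modification stage that PSOLA or LPC would perform), the following sketch joins stored diphone waveforms with a short linear crossfade. The diphone inventory, sample rate and overlap length are placeholders; the "recordings" are noise bursts purely so the example runs.

```python
import numpy as np

SAMPLE_RATE = 16000  # placeholder sample rate

def crossfade_concat(units, overlap=160):
    """Concatenate waveform units (1-D numpy arrays) with a linear crossfade."""
    out = units[0].astype(float)
    fade_in = np.linspace(0.0, 1.0, overlap)
    fade_out = 1.0 - fade_in
    for u in units[1:]:
        u = u.astype(float)
        out[-overlap:] = out[-overlap:] * fade_out + u[:overlap] * fade_in
        out = np.concatenate([out, u[overlap:]])
    return out

# A real diphone database maps names such as "s-a" to recorded waveforms;
# here random noise stands in for the recordings.
diphone_db = {name: np.random.randn(2400) for name in ["#-s", "s-a", "a-#"]}
waveform = crossfade_concat([diphone_db[d] for d in ["#-s", "s-a", "a-#"]])
```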

Domain-Specific Synthesis

Domain-specific synthesis concatenates pre-recorded words and phrases to create complete utterances. It is used in applications where the variety of texts the system will output is limited to a particular domain, such as transit schedule announcements or weather reports. The technology is very simple to implement and has been in commercial use for a long time, in devices such as talking clocks and calculators. The level of naturalness of these systems can be very high because the variety of sentence types is limited and closely matches the prosody and intonation of the original recordings.
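
A talking clock is the classic case: the sentence template is fixed and only a few slots vary, so the system simply plays prerecorded clips in order. The sketch below only assembles a playlist of hypothetical clip filenames; it performs no signal processing.

```python
def time_announcement(hour, minute):
    """Return the ordered list of prerecorded clips for 'It is HH MM'."""
    clips = ["it_is.wav", f"{hour}.wav"]
    if minute == 0:
        clips.append("oclock.wav")
    else:
        clips.append(f"{minute:02d}.wav")
    return clips

print(time_announcement(9, 5))   # ['it_is.wav', '9.wav', '05.wav']
```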

Because these systems are limited by the words and phrases in their databases, they are not general-purpose and can only synthesize the combinations of words and phrases with which they have been pre-programmed. The blending of words within naturally spoken language can still cause problems unless the many variations are taken into account. For example, in non-rhotic dialects of English the "r" in words like "clear" /ˈklɪə/ is usually only pronounced when the following word begins with a vowel (e.g. "clear out" is realized as /ˌklɪəɹˈʌʊt/). Likewise in French, many final consonants are no longer silent if followed by a word beginning with a vowel, an effect called liaison. This alternation cannot be reproduced by a simple word-concatenation system, which would require additional context-sensitive grammar to handle it.

Formant synthesis

Formant synthesis does not use human speech samples at runtime. Instead, the output is created using additive synthesis and an acoustic model (physical modelling synthesis). Parameters such as fundamental frequency, voicing, and noise levels are varied over time to create a waveform of artificial speech. This approach is sometimes called rule-based synthesis; however, many concatenative systems also have rule-based components.
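
A bare-bones sketch of the source-filter idea behind formant synthesis follows: a periodic pulse train (the "voice source") is passed through second-order resonators tuned to formant frequencies. The formant values and bandwidths are illustrative numbers for a vowel-like sound, not those of any particular synthesizer, and the gain normalization is deliberately rough.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000                      # sample rate (Hz)
f0 = 120                        # fundamental frequency (Hz)
dur = 0.5                       # seconds

# Voice source: a simple impulse train at the fundamental frequency.
n = int(fs * dur)
source = np.zeros(n)
source[::fs // f0] = 1.0

def resonator(x, freq, bw, fs):
    """Second-order digital resonator modelling one formant."""
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    a = [1.0, -2.0 * r * np.cos(theta), r * r]
    return lfilter([1.0 - r], a, x)   # rough gain normalization

# Cascade three formants, roughly an /a/-like vowel (illustrative values).
speech = source
for freq, bw in [(700, 80), (1200, 90), (2600, 120)]:
    speech = resonator(speech, freq, bw, fs)
```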

Many systems based on formant synthesis technology generate artificial, robotic-sounding speech that would never be mistaken for a human voice. However, maximum naturalness is not always the goal of a speech synthesis system, and formant synthesis systems have advantages over concatenative systems. Formant-synthesized speech can be reliably intelligible, even at very high speeds, avoiding the acoustic glitches that are common in concatenative systems. High-speed synthesized speech is used by visually impaired people to navigate computers fluidly with a screen reader. Formant synthesizers are usually smaller programs than concatenative systems because they do not have a database of speech samples. They can therefore be used in embedded systems, where memory and microprocessor power are limited. Because formant-based systems have complete control over all aspects of the output, a wide variety of prosodies and intonations can be generated, conveying not just questions and statements, but a variety of emotions and tones of voice.

Examples of non-real-time formant synthesis with very precise pitch control include the work done in the late 1970s for the Texas Instruments Speak & Spell, and in the 1980s in Sega arcade machines and other Atari arcade games using Texas Instruments' TMS5220 LPC chips. Creating the proper intonation for these projects was painstaking, and the results have yet to be matched by real-time text-to-speech interfaces.

Articulatory synthesis

Articulatory synthesis refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there. The first articulatory synthesizer regularly used in laboratory experiments was developed at Haskins Laboratories in the mid-1970s by Philip Rubin, Tom Baer, and Paul Mermelstein. This synthesizer, known as ASY, was based on vocal tract models developed at Bell Laboratories in the 1960s and 1970s by Paul Mermelstein, Cecil Coker, and their colleagues.

Until recently, articulatory synthesis models had not been incorporated into commercial speech synthesis systems. A notable exception is the NeXT-based system originally developed and marketed by Trillium Sound Research, a spin-off company of the University of Calgary, where much of the original research was conducted. Following the demise of NeXT (started by Steve Jobs in the late 1980s and merged with Apple Computer in 1997), the Trillium software was released under the GNU General Public License, with work continuing as gnuspeech. The system, first marketed in 1994, provides full articulatory-based text-to-speech conversion using a waveguide or transmission-line analogue of the human oral and nasal tracts controlled by Carré's "distinctive region model".

Synthesis based on HMM models

HMM-based synthesis is a synthesis method based on hidden Markov models, also called statistical parametric synthesis. In this system, the frequency spectrum (vocal tract), the fundamental frequency (voice source), and the duration (prosody) of speech are modeled simultaneously by HMMs. Speech waveforms are generated from the HMMs themselves based on a maximum likelihood criterion.
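
In very rough terms, each HMM state stores a Gaussian model of the acoustic parameters plus a duration model, and synthesis generates the most likely parameter trajectory, which a vocoder then turns into a waveform. The sketch below shows only the crudest version of this idea, repeating each state mean for its expected duration with invented numbers; real systems use dynamic features and maximum-likelihood parameter generation, and the vocoder stage is omitted entirely.

```python
import numpy as np

# Hypothetical per-state models: mean F0 (Hz), mean spectral vector, mean duration (frames).
states = [
    {"f0": 110.0, "spec": np.array([1.0, 0.2, 0.1]), "dur": 8},
    {"f0": 130.0, "spec": np.array([0.8, 0.4, 0.2]), "dur": 12},
    {"f0": 120.0, "spec": np.array([0.6, 0.5, 0.3]), "dur": 10},
]

def generate_trajectory(states):
    """Frame-level parameter tracks: each state's mean repeated for its duration."""
    f0_track, spec_track = [], []
    for s in states:
        f0_track += [s["f0"]] * s["dur"]
        spec_track += [s["spec"]] * s["dur"]
    return np.array(f0_track), np.vstack(spec_track)

f0, spectrum = generate_trajectory(states)   # these tracks would be fed to a vocoder
```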

Sine wave synthesis

Sine wave synthesis is a technique for speech synthesis through the replacement of formants (main energy bands) with pure tones.
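
A minimal sketch of the idea: three pure tones follow made-up formant tracks and are summed, producing the intelligible-but-unnatural "sine wave speech" effect. The track values and amplitudes are illustrative only.

```python
import numpy as np

fs = 16000
dur = 0.4
t = np.arange(int(fs * dur)) / fs

# Illustrative formant tracks (Hz), e.g. a transition between two vowel-like targets.
f1 = np.linspace(700, 300, t.size)
f2 = np.linspace(1200, 2300, t.size)
f3 = np.linspace(2600, 3000, t.size)

def tone(freq_track):
    # Integrate the instantaneous frequency to get the phase, then take the sine.
    phase = 2 * np.pi * np.cumsum(freq_track) / fs
    return np.sin(phase)

sine_wave_speech = 0.5 * tone(f1) + 0.3 * tone(f2) + 0.2 * tone(f3)
```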

Challenges

Challenges of text normalization

The text normalization process is rarely straightforward. Texts are full of heteronyms, numbers and abbreviations that require expansion into a phonetic representation. There are many words in English that are pronounced differently depending on context. For example, in "My latest project is to learn how to better project my voice", the word "project" is pronounced differently in each of its two occurrences.

Most text-to-speech (TTS) systems do not generate semantic representations of their input texts, as the processes for doing so are unreliable, poorly understood, and computationally ineffective. As a result, various heuristic techniques are used to guess the proper way to disambiguate homographs, such as examining neighboring words and using statistics about frequency of occurrence.

Recently, TTS systems have begun to use HMMs to generate part-of-speech tags, which help to disambiguate homographs. This technique is quite successful in many cases, for example deciding whether "read" should be pronounced as "red", implying the past tense. Typical error rates when using HMMs in this way are below five percent. These techniques also work well for most European languages, although training corpora are often difficult to obtain for these languages.
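
A toy version of the neighboring-word heuristic might look like the sketch below; the cue words and the two pronunciations are illustrative only, and real systems rely on statistical part-of-speech taggers trained on large corpora.

```python
# Disambiguating the homograph "read": past tense ("red") vs. present ("reed").
PAST_CUES = {"have", "has", "had", "was", "were", "already", "yesterday"}

def pronounce_read(tokens, i):
    """Choose a pronunciation for tokens[i] == 'read' from nearby words."""
    left = {w.lower() for w in tokens[max(0, i - 2):i]}
    return "R EH D" if left & PAST_CUES else "R IY D"

sentence = "I have read that book".split()
print(pronounce_read(sentence, sentence.index("read")))   # R EH D
```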

Deciding how to convert numbers is another problem that TTS systems have to address. It is a simple programming challenge to convert a number into words (at least in English), such as "1325" becoming "one thousand three hundred twenty-five". However, numbers occur in many different contexts: "1325" may also be read as "one three two five", "thirteen twenty-five" or "thirteen hundred and twenty-five". A TTS system can often infer how to expand a number based on surrounding words, numbers, and punctuation, and sometimes the system provides a way to specify the context if it is ambiguous. Roman numerals can likewise be read in different ways depending on context.
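
A sketch of the kind of context rule a front-end might apply to "1325" is shown below; the rules and the context labels are deliberately crude and purely illustrative.

```python
def small(n):
    """Cardinal words for 0-99."""
    ones = ("zero one two three four five six seven eight nine ten eleven twelve "
            "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
    tens = "zero ten twenty thirty forty fifty sixty seventy eighty ninety".split()
    return ones[n] if n < 20 else tens[n // 10] + ("" if n % 10 == 0 else "-" + ones[n % 10])

def cardinal(n):
    """Cardinal words up to 9999, enough for the example."""
    if n < 100:
        return small(n)
    if n < 1000:
        return small(n // 100) + " hundred" + ("" if n % 100 == 0 else " " + small(n % 100))
    return small(n // 1000) + " thousand" + ("" if n % 1000 == 0 else " " + cardinal(n % 1000))

def expand_number(token, context):
    """Expand a digit string differently depending on a guessed context."""
    if context == "year" and len(token) == 4:
        return small(int(token[:2])) + " " + small(int(token[2:]))   # "thirteen twenty-five"
    if context == "digits":
        return " ".join(small(int(d)) for d in token)                # "one three two five"
    return cardinal(int(token))                                      # default cardinal reading

print(expand_number("1325", "year"), "|", expand_number("1325", None))
```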

Similarly, abbreviations can be ambiguous. For example, the abbreviation "in" for "inches" must be differentiated from the word "in", and in the address "12 St John St." the same abbreviation is used for both "Saint" and "Street". TTS systems with intelligent front-ends can make educated guesses about ambiguous abbreviations, while others give the same result in all cases, producing nonsensical (and sometimes comical) output such as "co-operation" being rendered as "company operation".

Text-to-phoneme challenges

Speech synthesis systems use two basic approaches to determine the pronunciation of a word based on its spelling, a process often called text-to-phoneme or grapheme-to-phoneme conversion (phoneme is the term used by linguists to describe distinctive sounds in a language). The simplest approach to text-to-phoneme conversion is dictionary-based, where a large dictionary containing all the words of a language and their correct pronunciations is stored by the program. Determining the correct pronunciation of each word is then a matter of looking up each word in the dictionary and replacing the spelling with the pronunciation specified there. The other approach is rule-based, in which pronunciation rules are applied to words to determine their pronunciations based on their spellings.

Each approach has advantages and drawbacks. The dictionary-based approach is quick and accurate, but fails completely when given a word that is not in its dictionary. As the dictionary grows, so does the memory required by the synthesis system. On the other hand, the rule-based approach works on any input text, but the complexity of the rules grows substantially as the system takes irregular spellings or pronunciations into account. (Consider that the word "of" is very common in English, yet it is the only word in which the letter "f" is pronounced [v].) As a result, nearly all speech synthesis systems use a combination of these approaches.
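
A sketch of the combined approach is shown below: dictionary lookup first, with a rule fallback. The two-entry dictionary and the letter-to-sound table are toy stand-ins for the large resources real systems use.

```python
# Toy pronunciation dictionary and letter-to-sound rules; real systems use
# dictionaries with hundreds of thousands of entries and trained rules.
DICT = {"of": "AH V", "cat": "K AE T"}
LETTER_RULES = {"a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F",
                "g": "G", "o": "AA", "t": "T", "s": "S"}

def g2p(word):
    """Dictionary first; fall back to naive letter-to-sound rules."""
    word = word.lower()
    if word in DICT:
        return DICT[word]
    return " ".join(LETTER_RULES.get(ch, ch.upper()) for ch in word)

print(g2p("of"))    # AH V     (irregular: handled by the dictionary)
print(g2p("cots"))  # K AA T S (regular: handled by the rules)
```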

Languages with phonemic spelling have a very regular writing system, and the prediction of the pronunciation of words based on their spelling is quite successful. Speech synthesis systems for such languages often use the rule-based method extensively, resorting to dictionaries only for a few words, such as foreign names and loanwords, whose pronunciations are not obvious from their spellings. On the other hand, speech synthesis systems for languages like English, which have extremely irregular writing systems, tend to rely on dictionaries and to use rule-based methods only for unusual words or words that are not in their dictionaries.

Evaluation challenges

Consistent evaluation of speech synthesis systems can be difficult because of the lack of universally accepted evaluation criteria. Different organizations commonly use different speech data. The quality of a speech synthesis system also depends on the quality of the production technique (which may involve analogue or digital recording) and on the facilities used to replay the speech. Evaluating speech synthesis systems has therefore often been compromised by differences between production techniques and replay facilities.

Since 2005, however, some researchers have begun to evaluate speech synthesis systems using a common speech dataset.

Prosody and emotional content

A study in the journal Speech Communication by Amy Drahota and colleagues at the University of Portsmouth, UK, reported that listeners to voice recordings could determine, at better-than-chance levels, whether or not the speaker was smiling. It has been suggested that identifying the vocal features that signal emotional content may help make synthesized speech sound more natural. A related issue is the pitch contour of the sentence, depending on whether it is an affirmative, interrogative or exclamatory sentence. One technique for pitch modification uses the discrete cosine transform in the source domain (linear prediction residual). Such pitch-synchronous modification techniques require prior pitch marking of the synthesis speech database, using techniques such as epoch extraction with a dynamic plosion index applied to the integrated linear prediction residual of the voiced regions of speech.

Dedicated hardware

Early technologies (no longer available)

  • Icofono
  • Votrax
    • SC-01A
    • SC-02 / SSI-263 / "Artic 263"
  • General Instrument SP0256-AL2 (CTS256A-AL2)
  • National Semiconductor DT1050 Digitalker (Mozer - Forrest Mozer)
  • Silicon Systems SSI 263
  • Texas Instruments LPC speech chips
    • TMS5110A
    • TMS5200
    • MSP50C6XX - Sold to Sensory, Inc. in 2001

Current (as of 2013)

  • Magnevation SpeakJet (www.speechchips.com, TTS256), aimed at hobbyists and experimenters.
  • Epson S1V30120F01A100 (www.epson.com), a DECtalk-based voice IC with a robotic sound, for English and Spanish.
  • Textspeak TTS-EM (www.textspeak.com)

Mattel

Mattel's Intellivision video game console, which is a computer without a keyboard, offered the Intellivoice speech synthesis module in 1982. It included the SP0256 Narrator speech synthesizer chip on a removable cartridge. The Narrator had 2 KB of read-only memory (ROM), which was used to store a database of generic words that could be combined to make phrases in Intellivision games. Since the Narrator chip could also accept speech data from external memory, any additional words or phrases needed could be stored inside the cartridge itself. The data consist of strings of analog-filter coefficients that modify the behavior of the chip's vocal tract model, rather than digitized samples.

SAM

Also released in 1982, Software Automatic Mouth (SAM) was the first commercial all-software speech synthesis program. It later formed the basis of MacinTalk. The program was available for non-Macintosh Apple computers (including the Apple II and the Lisa), various Atari models, and the Commodore 64. The Apple version required additional hardware for digital-to-analog conversion, although it was possible to use the computer's audio output (with added distortion) if the card was not present. The Atari version made use of the POKEY audio chip. Speech playback on the Atari normally disabled interrupt requests and shut down the ANTIC chip during audio output, so the output was grossly distorted when the display was on. The Commodore 64 used the SID audio chip.

Atari

The first speech synthesis system built into an operating system was that of the 1400XL/1450XL computers designed by Atari, using the Votrax SC01 chip, in 1983. The 1400XL/1450XL computers used a finite state machine to perform English text-to-speech synthesis. However, the 1400XL/1450XL computers were never produced in quantity.

Atari ST computers were sold with "stspeech.tos" on a floppy disk.

Apple

The first speech synthesis system integrated into an operating system that shipped in quantity was Apple's MacInTalk. The software was licensed from third-party developers Joseph Katz and Mark Barton (later SoftVoice, Inc.) and was featured during the introduction of the Macintosh computer in 1984. The January demo, which used speech synthesis derived from the SAM software, required 512 KB of RAM; as a result, it could not run in the 128 KB of RAM that early Mac computers shipped with. The demo was therefore carried out on a 512 KB prototype, although this was not revealed to the audience, which created high expectations for the Macintosh. In the early 1990s, Apple expanded its capabilities, offering system-wide text-to-speech support with the introduction of faster PowerPC-based computers, including higher-quality voice output. Apple also introduced speech recognition into its systems, which provided a fluent command set. More recently, Apple has added sample-based voices. Starting out as a curiosity, Apple's Macintosh speech system has evolved into a fully fledged program, PlainTalk, for people with visual impairments. VoiceOver was introduced in Mac OS X Tiger (10.4). During 10.4 (Tiger) and early releases of 10.5 (Leopard) there was only one standard voice on Mac OS X. Since 10.6 (Snow Leopard), the user can choose from a wide range of voices. VoiceOver voices feature realistic-sounding breaths between sentences, as well as improved clarity at high reading speeds compared with PlainTalk. Mac OS X also includes "say", a command-line application that converts text to speech. The standard AppleScript additions include a "say" verb that allows a script to use the installed voices and to control the pitch, speaking rate and modulation of the spoken text.
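
For example, the command-line tool can be driven from a script; the snippet below simply shells out to `say` from Python. The chosen voice and rate are just examples, and the command is only available on macOS.

```python
import subprocess

# Requires macOS; "Alex" is one of the bundled voices and -r sets words per minute.
subprocess.run(["say", "-v", "Alex", "-r", "180", "Hello from the speech synthesizer."])
```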

Apple's iOS operating system, used on the iPhone, iPad, and iPod Touch, uses VoiceOver speech synthesis for accessibility. Some applications also use speech synthesis to make it easier to navigate, read web pages, or translate text.

AmigaOS

The second operating system to feature advanced speech synthesis capabilities was AmigaOS, introduced in 1985. The speech synthesis was licensed by Commodore International from SoftVoice, Inc., which also developed the MacinTalk text-to-speech system. It featured a complete voice emulation system for American English, with male and female voices and "stress" markers, made possible by the Amiga's audio chipset. The synthesis system was divided into a narrator device, which was responsible for modulating and concatenating phonemes, and a translator library, which translated English text into phonemes through a set of rules. AmigaOS also included a high-level speech handler that allowed users to redirect text output to speech from the command line. Speech synthesis was occasionally used by third-party programs, particularly word processors and educational software. The synthesis software remained largely unchanged from the first AmigaOS release, and Commodore eventually removed speech synthesis support starting with AmigaOS 2.1.

Despite the limitation to American English phonemes, an unofficial version with multilingual speech synthesis was developed. It made use of an extended version of the translator library that could translate a number of languages, given a set of rules for each language.

Microsoft Windows

Modern Windows desktop systems can use SAPI 1–4 and SAPI 5 components to support speech synthesis and speech recognition. SAPI 4.0 was available as an optional add-on for Windows 95 and Windows 98. Windows 2000 added Microsoft Narrator, a text-to-speech utility for the visually impaired. Third-party programs such as CoolSpeech, Textaloud and Ultra Hal can perform various text-to-speech tasks, such as reading text aloud from a specified website, e-mail account, text document, or text entered by the user. Not all programs can use speech synthesis directly; some may use extensions or plug-ins to read text aloud.
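
On Windows, SAPI 5 voices can also be reached from scripting languages through the SpVoice COM automation object; the minimal sketch below uses the pywin32 package and the default installed voice.

```python
# Requires Windows and the pywin32 package (pip install pywin32).
import win32com.client

voice = win32com.client.Dispatch("SAPI.SpVoice")  # SAPI 5 COM automation object
voice.Rate = 0                                    # speaking rate from -10 (slow) to 10 (fast)
voice.Speak("This text is spoken by the default SAPI five voice.")
```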

Microsoft Speech Server is a server-based speech synthesis and recognition package. It is designed for network use with web applications and call centers.

Text-to-speech (TTS) refers to the ability of computers to read text aloud. A TTS engine converts written text into a phonetic representation, then converts that representation into waveforms that can be output as sound. TTS engines with different languages, dialects and specialized vocabularies are available through third parties.

Android

Android version 1.6 added support for speech synthesizers (TTS).

Internet

There are currently a number of applications, plug-ins and gadgets that can read messages directly from an email client and web pages from a web browser or Google Toolbar, such as Text to Voice, a Firefox add-on. Some specialized software can narrate RSS feeds. RSS narrators simplify information delivery by allowing users to listen to their favourite news sources and convert them into podcasts, and RSS readers are available on almost any PC connected to the Internet. Users can download the generated audio files to portable devices, e.g. with the help of a podcast receiver, and listen to them while walking, jogging or commuting.

A growing field in Internet-based TTS is web-based assistive technology, such as Browsealoud, from a UK company, and Readspeaker. These services can deliver TTS functionality to anyone (for reasons of accessibility, convenience, entertainment or information) with access to a web browser. The Pediaphon project was created in 2006 to provide a similar web-based TTS interface to Wikipedia.

Other work is underway in the context of the W3C through the W3C Audio Incubator Group with support from the BBC and Google Inc.

Others

  • Following the commercial failure of the Intellivoice hardware, game developers used software speech synthesis sparingly in later games. A famous example is the introductory narration of Nintendo's Super Metroid for the Super Nintendo Entertainment System. Other early systems to use software speech synthesis include the Atari 5200 (Baseball) and the Atari 2600 (Quadrun and Open Sesame).
  • Some e-book readers, such as Amazon Kindle, Samsung E6, PocketBook eReader Pro, enTourage eDGe and Bebook Neo.
  • The BBC Micro incorporated the Texas Instruments TMS5220 voice synthesis chip.
  • Some models of Texas Instruments home computers produced in 1979 and 1981 (the Texas Instruments TI-99/4 and TI-99/4A) were capable of text-to-phoneme synthesis or of reciting complete words and phrases (text-to-dictionary), using the popular Speech Synthesizer peripheral. TI used a proprietary codec to embed complete spoken phrases into applications, primarily video games.
  • IBM OS/2 Warp 4 included the VoiceType, a precursor of IBM ViaVoice.
  • Systems that run on free and open-source software, including Linux, are varied and include open-source programs such as the Festival Speech Synthesis System, which uses diphone-based synthesis (and can use a limited number of MBROLA voices), and gnuspeech, which uses articulatory synthesis, from the Free Software Foundation.
  • GPS navigation units produced by Garmin, Magellan, TomTom and others use speech synthesis for automobile navigation.
  • Yamaha produced a music synthesizer in 1999, the Yamaha FS1R, which included formant synthesis capability. Sequences of up to 512 individual vowel and consonant formants could be stored and replayed, allowing short vocal phrases to be synthesized.

Speech synthesis markup languages

A number of markup languages have been established for interpreting text as speech in a compiled XML format. The most recent is Speech Synthesis Markup Language (SSML), which became a W3C recommendation in 2004. Older speech synthesis markup language systems include Java Speech Markup Language (JSML) and SABLE. Although each of these was proposed as a standard, none of them have been widely adopted.
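
A small SSML document, built here as a Python string so it can be fed to any SSML-capable synthesizer, illustrates the kind of markup involved; the break time, prosody settings and interpret-as value are arbitrary examples, and support for them varies by synthesizer.

```python
ssml = """<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  The meeting starts at <say-as interpret-as="time">10:30</say-as>.
  <break time="500ms"/>
  <prosody rate="slow" pitch="+10%">Please do not be late.</prosody>
</speak>"""
```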

Speech synthesis markup languages are distinguished from dialog markup languages. VoiceXML, for example, includes tags related to speech recognition, dialog handling, and markup, as well as speech synthesis markup.

Applications

Speech synthesis has long been a vital assistive technology tool, and its application in this area is significant and widespread. It allows environmental barriers to be removed for people with a wide range of disabilities. The most widespread application has been screen readers for the visually impaired, but text-to-speech systems are now commonly used by people with dyslexia and other reading difficulties, as well as by children. They are also frequently employed to aid those with communication disabilities, usually through a dedicated voice output aid.

Speech synthesis techniques are also used in entertainment products such as games and animations. In 2007, Animo Limited announced the development of a software application based on its FineSpeech speech synthesis software, explicitly aimed at customers in the entertainment industry and able to generate narration and lines of dialogue according to user specifications. The application reached maturity in 2008, when NEC Biglobe announced a web service that allows users to create phrases from the voices of characters from Code Geass: Lelouch of the Rebellion R2.

Text-to-speech has found new applications outside the disability assistance market. For example, speech synthesis combined with speech recognition enables interaction with mobile devices through natural language processing interfaces. It is also used in second language acquisition. Voki, for instance, is an educational tool created by Oddcast that allows users to create their own talking avatar, using different accents. These avatars can be emailed, embedded on websites or shared on social media.

API

Multiple companies offer TTS APIs to their customers to accelerate the development of new applications using TTS technology. Companies offering TTS APIs include AT&T, IVONA, Neospeech, Readspeaker, and YAKiToMe!. For mobile app development, the Android operating system has offered a TTS API for a long time. More recently, with iOS 7, Apple has begun offering a TTS API as well.
