Virtual voices: Azure’s neural text-to-speech service




The days of the keyboard and screen as our sole method of interacting with a computer are long gone. Now we're surrounded by more natural user interfaces, adding touch and speech recognition to our repertoire of interactions. The same goes for the way computers respond to us, using haptics and speech synthesis.

SEE: Alexa Skills: A guide for business professionals (free PDF) (TechRepublic)

Speech is increasingly important, as it offers a hands-free, at-a-distance way of working with devices. There's no need to touch them or look at them: all that's needed are a handful of trigger words and a good speech recognition system. We're perhaps most familiar with digital assistants like Cortana, Alexa, Siri, and Google Assistant, but speech technologies are appearing in assistive systems, in in-car applications, and in other environments where manual operations are difficult, distracting or downright dangerous.

Artificial voices for our code

The other side of the speech recognition story is, of course, speech synthesis. Computers are good at displaying text, but not very good at reading it to us. What's needed is an easy way of taking text content and turning it into recognisable human-quality speech, not the eerie monotone of a sci-fi robot. We're all familiar with the speech synthesis tools in automated telephony systems, or in GPS apps that fail basic pronunciation tests, getting names and addresses amusingly wrong.

High-quality speech synthesis isn't easy. If you take the standard approach, mapping text to strings of phonemes, the result is often stilted and prone to mispronunciation. What's more disconcerting is that there's little or no inflection. Even using SSML (Speech Synthesis Markup Language) to add emphasis and inflection doesn't make much difference, and only adds to developer workloads, requiring every utterance to be tagged in advance with the appropriate speech structures.

Part of the problem is the way traditional speech synthesis works, with separate models for analysing the text and for predicting the required audio. As they're separate steps, the result is clearly artificial. What's needed is an approach that takes these separate steps and brings them together, into a single speech synthesis engine.


Microsoft's text-to-speech service uses deep neural networks to improve on the way traditional text-to-speech systems match patterns of stress and intonation in spoken language (prosody) and synthesise speech units into a computer voice.

Image: Microsoft

Using neural networks for more convincing speech

Microsoft Research has been working on solving this problem for some time, and the resulting neural network-based speech synthesis technique is now available as part of the Azure Cognitive Services suite of Speech tools. Using its new neural text-to-speech service, hosted in Azure Kubernetes Service for scalability, generated speech is streamed to end users. Instead of multiple separate steps, input text is first passed through a neural acoustic generator to determine intonation, before being rendered using a neural voice model in a neural vocoder.

The underlying voice model is generated via deep learning techniques, using a large set of sampled speech as the training data. The original Microsoft Research paper on the subject goes into detail on the training methods used, initially applying frame error minimisation before refining the resulting model with sequence error minimisation.

Using the neural TTS engine is easy enough. As with all the Cognitive Services, you start with a subscription key and then use this to create a class that calls the text-to-speech APIs. All you need to do to use the new service is choose one of the neural voices; the underlying APIs are the same for neural and standard TTS. Speech responses are streamed from the service to your device, so you can either direct them straight to your default audio output or save them as a file to be played back later.
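As a rough sketch of what a call looks like at the REST level, the helper below builds the headers and SSML body for a synthesis request. The endpoint shape, the output format string, and the voice name `en-US-JessaNeural` are assumptions for illustration; check the Speech service documentation for your region, and note that actually sending the request needs a valid subscription key.

```python
# Minimal sketch of an Azure text-to-speech REST request, assuming the
# regional endpoint follows the usual
# https://<region>.tts.speech.microsoft.com/cognitiveservices/v1 shape.
def build_tts_request(region, subscription_key, voice, text):
    """Return (url, headers, ssml_body) for a synthesis request.

    Pass the result to any HTTP client as a POST; the response body
    is the synthesised audio, ready to play or save to a file.
    """
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        # Key-based auth shown for brevity; a bearer token from the
        # token endpoint is the documented production approach.
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/ssml+xml",
        # Ask for WAV output so the result can be saved and replayed.
        "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",
    }
    body = (
        "<speak version='1.0' xml:lang='en-US'>"
        f"<voice name='{voice}'>{text}</voice>"
        "</speak>"
    )
    return url, headers, body


url, headers, body = build_tts_request(
    "westus", "YOUR_KEY", "en-US-JessaNeural", "Hello from a neural voice")
```

Switching between a standard and a neural voice is then just a matter of changing the voice name in the SSML body.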

SEE: Artificial intelligence: A business leader's guide (free PDF) (TechRepublic)

Neural voices still support SSML, so you can add your own adjustments to the default voices. That's in addition to their specific optimisations for particular speech types. If you don't want to use SSML, pick a neural voice by characteristic: a neutral voice or a cheerful voice, for example. SSML can be used to speed up playback or change the pitch of a speech segment without altering the synthesised voice. That way you can allow users to adjust output to suit their working environment, letting them choose the voice settings they find appropriate.
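For example, an SSML document along these lines (the voice name is illustrative) slows one segment down and lowers its pitch without switching to a different voice:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-JessaNeural">
    This sentence is spoken with the voice's default settings.
    <prosody rate="-20%" pitch="-2st">
      This segment is slower and lower, but it's still the same voice.
    </prosody>
  </voice>
</speak>
```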

Microsoft has made neural voices available in a handful of regions, although for wider language coverage you'll need to step back to using the older, standard speech synthesis models. Neural voices are available in English, German, Italian and Chinese, with five different voices. Most are female, but there's one male English voice.

Adding neural voices to your apps

So where would you use neural voices? The obvious choice is in any application that requires a long set of voice interactions, as traditional speech synthesis can be tiring to listen to for long periods. You'd also want to use neural voices where you don't want to add to cognitive load, a risk that's reduced by using a more natural set of voices. Digital personal assistants and in-car systems are a logical first step for these new techniques, but you can also use them to quickly create audio versions of existing documents, reducing the cost of producing audiobooks and helping users with auditory learning styles.

If you want to get started using neural voices in your applications, Microsoft offers a free subscription that gives you 500,000 characters of recognised text per month. As neural voices do require more compute than traditional sample-based methods, they're more expensive to use, but at $16 per million characters once you move out of the free tier, it's not going to break the bank, particularly if you use the option of saving utterances for later use. These can be used to build a library of commonly used speech segments that can be played back as required.
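One way to build such a library is a thin cache around whatever synthesis call you use, so each distinct utterance is only billed once. A minimal sketch, where the file naming scheme and cache layout are assumptions and `synthesize` stands in for your actual service call:

```python
import hashlib
import os


def cached_speech(text, synthesize, cache_dir="tts_cache"):
    """Return the path to a WAV file for `text`, synthesising only on
    a cache miss. `synthesize(text, path)` is any function that writes
    synthesised audio for `text` to `path`."""
    os.makedirs(cache_dir, exist_ok=True)
    # Hash the utterance so any text maps to a safe, stable filename.
    name = hashlib.sha1(text.encode("utf-8")).hexdigest() + ".wav"
    path = os.path.join(cache_dir, name)
    if not os.path.exists(path):
        synthesize(text, path)  # pay for synthesis once...
    return path                 # ...then replay from disk for free
```

On a cache hit the stored file is played back directly, so repeated prompts cost nothing beyond the first synthesis.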

With speech an increasingly important accessibility tool, it's good to see the state of the art moving beyond stilted, clearly artificial voices. Microsoft's launch of neural voices in its Cognitive Services suite is an important step forward. Now it needs to bring them to more languages and to more regions, so we can all get the benefit of these new speech synthesis techniques.
