Google’s AI-powered text-to-speech system is nearly indistinguishable from human speech

Google’s Tacotron 2 is an AI-powered text-to-speech system. It trains neural networks to produce human-sounding speech from written text with almost no built-in knowledge of grammar. In a recently published research paper, Google reports that the synthesized speech is close to indistinguishable from a human voice reading the same text.

If you are curious to know how it works, Dave Gershgorn gives the details in a Quartz post:

The system is Google’s second official generation of the technology, which consists of two deep neural networks. The first network translates the text into a spectrogram (pdf), a visual way to represent audio frequencies over time. That spectrogram is then fed into WaveNet, a system from Alphabet’s AI research lab DeepMind, which reads the chart and generates the corresponding audio elements accordingly.
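
To make the idea of a spectrogram concrete, here is a minimal Python sketch that turns a short audio signal into a grid of frequency magnitudes over time. It is only an illustration, not Google’s code: the function name `spectrogram`, the frame size, the hop length and the 100 Hz test tone are all assumptions chosen for the example, and it computes a plain magnitude spectrogram rather than the mel-scale variant that Tacotron 2 predicts.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Return one list of frequency-bin magnitudes per overlapping frame.

    Hypothetical helper for illustration: a naive short-time Fourier
    transform written with the standard library only.
    """
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # A Hann window softens the frame edges to reduce spectral leakage.
        windowed = [
            s * 0.5 * (1 - math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        magnitudes = []
        # Only bins 0..frame_size/2 are needed for a real-valued signal.
        for k in range(frame_size // 2 + 1):
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


# Assumed test signal: a pure 100 Hz tone sampled at 800 Hz.
sample_rate = 800
tone_hz = 100
samples = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(256)]

spec = spectrogram(samples)
# The strongest bin in the first frame should sit at the tone's frequency.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
```

Each row of `spec` is one slice of time and each column is one frequency band, which is the "chart" that the second network reads. In the real system the direction is reversed: the first network predicts such a chart from text, and WaveNet turns it back into audio.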

Google has published comparative samples online, which you can listen to now, to show how well it works. It is hard to say which samples come from a human and which are generated by the artificial intelligence (AI), but Gershgorn offers a clue: view the page source on the Google research website, and the filenames give it away.

It is worth mentioning that the system has only been taught to mimic one female voice. If the need arises for it to speak like a male or another female, Google would have to train the system again.

