High quality, emotional, fluent and variable Text-to-Speech engine?

Question 1

audio text-to-speech voice speech-synthesis

tomsseisums · Jun 19, 2011 · Viewed 24k times · Source

Answer

I don't know if you're looking for an open solution, but if you have a Mac, you should check out OS X advanced speech markup and the "Repeat After Me" phrase building tool. It's really powerful. The Alex voice built into Mac OS X 10.5 and later is more advanced than the other voices.

On a Mac, highlight the following text, control-click, and go to Speech > Start Speaking:

You talkin' to me
[[inpt PHON]] [[slnc 500]] [[rate -30]]
+yUW _1tAOl=kIHn ~AX [[pbas +3]]+mIY?

http://www.mattmontag.com/personal/mac-os-x-speech-synthesis-markup
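If you'd rather script this than use the context menu, here is a minimal sketch that passes embedded speech commands to the macOS `say` command-line tool from Python. The helper names `build_markup` and `speak` are made up for illustration. It assumes `say` is on the PATH (macOS only) and only uses the `[[slnc]]` and `[[rate]]` commands from the example above, on plain text rather than phonemes.

```python
import shutil
import subprocess
from typing import Optional


def build_markup(text: str, pause_ms: int = 500, rate_delta: int = -30) -> str:
    """Prefix text with embedded commands: a silence, then a relative rate change."""
    # "+d" always prints the sign, so the rate is a relative adjustment.
    return f"[[slnc {pause_ms}]] [[rate {rate_delta:+d}]] {text}"


def speak(text: str, voice: Optional[str] = None) -> bool:
    """Speak text with macOS `say`; return False if `say` is not available."""
    say = shutil.which("say")
    if say is None:
        return False  # not on macOS, or `say` is missing from PATH
    cmd = [say]
    if voice is not None:
        # Assumption: the named voice (e.g. "Alex") is installed on this machine.
        cmd += ["-v", voice]
    cmd.append(text)
    subprocess.run(cmd, check=True)
    return True


if __name__ == "__main__":
    speak(build_markup("You talkin' to me?"))
```

Call `speak(..., voice="Alex")` to request the Alex voice mentioned above; if that voice isn't installed, `say` will most likely exit with an error and `subprocess.run` will raise.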

Question 2

After looking at some services and tools, I've come to a conclusion: most Text-to-Speech tools sound too techy and robotic. In other words, the voices are bad quality.

And yeah, on top of that, it looks like they come with "hard-coded" voice templates, which limits variety and customization. Some tools let you set the reading speed and pitch, but that's not enough.

My guess about the problem behind the emotional aspect: it's hard to judge emotions from plain text, even more so if it's just a sentence or two. Plus, the good ol' PC is a machine, and machines don't have emotions, but that's a different story.

The thing that bothers me the most is quality. For example, there are tools out there that tend to cut off the apex of words, resulting in these techy voices. It feels like there's a problem with sentence construction or something. And while people are working on such tools, I wonder what keeps them from working a little more to fix that... cutting off the apex is no small deal! Plus, keep in mind that good, quality Text-to-Speech software is worth, well... A LOT! So it would make a pretty profitable product.

Oh, under fluency I'm including questions, exclamations and so on. (It's possible that those don't fall under fluency, but I'm not a native English speaker, so please excuse me if that's the case.)

A list of tools I've looked into:

Quite impressive, but still have room for improvement (++)

- Loquendo: lacks voice variety, has some minor apex/fluency problems (depending on the sentence), and too much coughing and excuses in the examples!
- Nuance Vocalizer: still lacks variety, but some of the provided voices are worthy.

Could just as well cooperate to pool resources rather than work on different, but almost identical products (--)

- eSpeak: one of the best robots out there, hence the program logo(?!)
- Natural Reader (dumb autoplay!!): well, it has some fluency, but that techy feeling still kicks in.
- iSpeech: good for a laugh when setting the voice to Japanese with English text. I bet Japanese guys aren't very happy about it.
- Cepstral + Enhanced Voices ... plus the enhanced voices give the good ol' crappy result, so, except for ~5 more voices, nothing has been enhanced.
- AT&T: decent fluency, but has problems with sentence endings and too much robo!
- LumenVox TTS: looks like it comes from a background with lots of speech tools, but still results in robotic voices.
- And some more...

In case I've missed something worth a look, please share. Can be free, commercial, super expensive... as long as it works, I'm interested!

And the question(s):

- What do you think are the main issues behind the quality, fluency and variety of those voices? Since the emotional aspect is hard to judge, I don't mind if you skip it, but if you have an idea or two, I wouldn't mind if you shared your thoughts.
- How is text transformed into speech? What algorithms are used behind these tools? Maybe a fresh theory or two could come in handy.
- Are those actually different engines/drivers, or just different voice patterns for the same driver/engine?
- Is it just me, or has the quality hardly changed (or not changed at all) since the first Text2Speech tools? (I have to admit that this old-school Apple tool gives better results than some of the year 2000+ tools, at least when comparing the video with what I've looked into.)
