
Does Your Text-To-Speech System Sound Like a Zombie?

26 June, 2005
According to my super-stimulus theory of music, music perception is actually the perception of an aspect of speech, and this aspect provides information about the consciousness of the speaker. It follows that if you are creating a text-to-speech system that does not take account of my theory, your system will probably talk like a zombie.
by Philip Dorrell

Zombie Theory

In the field of consciousnessology, a zombie is a hypothetical human being, or an entity as similar as possible to a human being, that is not conscious. Philosophers of consciousness ask whether a zombie could do this thing or that thing, and "what it is like" to be a zombie.

The common-sense view of zombies is that consciousness is an essential component of the human mind/brain, and a human being without consciousness would suffer severe deficits in their information processing capabilities. This is fairly obvious to anyone who has ever watched a zombie movie.

The Perception of Consciousness in Others

Another common question that philosophers of consciousness ask is: even if I know that I am conscious and therefore not a zombie, how do I know that other people aren't zombies?

A common-sense answer to this question is that we perceive consciousness in other people indirectly, by observing speech and actions that could not be produced by a non-conscious individual. Implicitly, this assumes that we have some notion of which information-processing capabilities our faculty of consciousness provides, and of how those capabilities contribute to our decisions about what to do and what to say.

But the super-stimulus theory of music raises a more radical possibility: that we do more than deduce the existence of consciousness from an individual's response to their circumstances, and that we directly and constantly monitor the level of activity of the conscious faculty within another person as they speak. In particular, the perceived level of musicality of speech provides a direct estimate of the current level of consciousness in the speaker.

Musicality and TTS

This has immediate consequences for anyone in the Text-To-Speech (TTS) business, especially if you want a software application to produce human-like speech that sounds completely natural. If the perception of varying levels of musicality is a component of speech perception, and a component that tells the listener about the conscious nature of the speaker, then any TTS system that fails to add appropriate levels of musicality to its output will not sound natural, and risks sounding like a non-conscious zombie.

How Do I Add Musicality to my TTS System?

The long answer is: go and read my book What is Music? Solving a Scientific Mystery. The short answer is:

  • Consider the different perceived aspects of speech that musicality applies to.
  • Determine the level of consciousness appropriate to different portions of the output speech.
  • Modulate the musicality of each of the perceived aspects in proportion to the determined level of consciousness.
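
The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the aspect names, the 0.0 to 1.0 consciousness scale, and the per-aspect caps in MAX_BOOST are hypothetical placeholders, not values taken from the theory or from any real TTS engine.

```python
from dataclasses import dataclass

# Step 1: the perceived aspects that musicality applies to.
# Hypothetical names and caps, chosen only for illustration.
MAX_BOOST = {"pitch_range": 0.5, "rhythmic_regularity": 0.3}


@dataclass
class Segment:
    text: str
    # Step 2: the level of consciousness judged appropriate for this
    # portion of the output. Assumed scale: 0.0 (none) to 1.0 (full).
    consciousness: float


def musicality_plan(segments):
    """Return (text, boosts) pairs, one per segment.

    Step 3: each aspect's boost is proportional to the segment's
    consciousness level, up to that aspect's cap.
    """
    plan = []
    for seg in segments:
        level = min(1.0, max(0.0, seg.consciousness))  # clamp to the scale
        boosts = {aspect: level * cap for aspect, cap in MAX_BOOST.items()}
        plan.append((seg.text, boosts))
    return plan


if __name__ == "__main__":
    script = [Segment("Your call is important to us.", 0.1),
              Segment("Wait, I think I found your order!", 0.9)]
    for text, boosts in musicality_plan(script):
        print(text, boosts)
```

How each boost would actually be rendered (for example, as a wider pitch contour) depends on the synthesizer and is left open here.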

Of course it may be very difficult for a software application to determine an appropriate level of consciousness for a given utterance, unless the application is itself conscious. However, in practice, any kind of variation in musicality, applied simultaneously to different aspects, even if it is partly random, may be enough to make generated speech sound more natural.
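
The partly random fallback could look something like the sketch below. It assumes a hypothetical pair of aspect names and a smoothing parameter of my own choosing; the one idea it takes from the text is that a single varying level is applied to every aspect at the same time, rather than varying each aspect independently.

```python
import random

# Hypothetical aspect names, used only for illustration.
ASPECTS = ("pitch_range", "rhythmic_regularity")


def random_levels(n_segments, seed=None, smoothing=0.5):
    """Return one dict of musicality levels (0.0 to 1.0) per segment.

    The level drifts partway toward a fresh random target at each segment,
    so it is partly random but neighbouring segments stay related. The same
    level is given to every aspect, so the aspects vary together.
    """
    rng = random.Random(seed)
    level = rng.random()
    levels = []
    for _ in range(n_segments):
        level += smoothing * (rng.random() - level)
        levels.append({aspect: level for aspect in ASPECTS})
    return levels


if __name__ == "__main__":
    for per_segment in random_levels(4, seed=2005):
        print(per_segment)
```

Passing a seed makes the variation repeatable, which helps when comparing the same generated sentence with and without the variation applied.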

A second difficulty is determining all the perceived aspects of speech that musicality applies to. Even if you read my book, you will only learn about some of them, and the failure to identify all of them is a major cause of the incompleteness of the super-stimulus theory. But in the pragmatic world of TTS, something may be better than nothing, so it's worth trying to apply musicality to as many aspects of generated speech as possible, to see if it helps at all.
