Part I
Physics and Physiology
It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience.
On the Method of Theoretical Physics
Albert Einstein
The Herbert Spencer Lecture
Oxford, June 10, 1933
A good understanding of human voice production is the starting point for improving the voice and for developing algorithms for voice and speech technology. Part I presents a theory of human voice production, which also serves as the scientific foundation for Part II, Mathematical Representations, and for the applications in speech and voice technology.
Chapter 1 presents the background theory of acoustic waves. For simplicity, only the one-dimensional wave equation in a uniform tube is presented, which is sufficient for understanding the entire book.
Chapter 2 presents the basic anatomy and physiology of the voice-producing organs, including the vocal folds and the vocal tract. Instruments for probing and measuring the functions of the voice organs are presented. Special emphasis is placed on the non-invasive probing methods, namely the electroglottograph (EGG) and miniature pressure sensors, which can be applied simultaneously with the microphone during normal voicing.
Chapter 3 presents the experimental facts of the human voice. First, the superposition principle formulated by Edward W. Scripture [84] is illustrated by numerous examples of voice signals. Next, the universal temporal correlation of the voice signal with the electroglottograph signal and with the subglottal and supraglottal pressures is presented. This temporal correlation strongly implies that glottal closings play a critical role in voice production.
In Chapter 4, a theory of human voice production, especially for vowels, is inferred from the experimental facts presented in Chapter 3. Briefly, the theory is as follows. Immediately before a glottal closure, there is a steady airflow in the vocal tract. A glottal closing abruptly blocks the airflow from the trachea into the vocal tract and triggers a zero-particle-velocity d'Alembert wavefront, which propagates and resonates in the vocal tract to form a decaying acoustic wave. The kinetic energy of the airflow in the vocal tract immediately before a closure is converted into acoustic energy. Linear superposition of these elementary resonance waves constitutes voice. The acoustic wave in the vocal tract triggered by a glottal closing is determined by the geometry of the vocal tract at that moment, and thus represents the instantaneous timbre. It is reasonable to term the decaying acoustic wave triggered by a glottal closing a "timbron". The timbrons are literally the elements of human voice. The production mechanism of consonants, which is relatively straightforward, is then presented.
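The superposition idea can be illustrated with a minimal numerical sketch. It is not the solution of the wave equation developed in Chapter 4: each elementary wave is simply assumed to be a sum of exponentially damped sinusoids, and the resonance frequencies, bandwidths, sampling rate, and closure instants below are hypothetical values chosen only for illustration. The function names are likewise invented for this example.

```python
import math

def decaying_wave(freqs_hz, bandwidths_hz, duration_s, sample_rate):
    """One elementary decaying wave, modeled (as an assumption) as a sum of
    exponentially damped sinusoids that all start at the closure instant."""
    n = round(duration_s * sample_rate)
    wave = []
    for i in range(n):
        t = i / sample_rate
        wave.append(sum(
            math.exp(-math.pi * bw * t) * math.sin(2.0 * math.pi * f * t)
            for f, bw in zip(freqs_hz, bandwidths_hz)
        ))
    return wave

def superpose(closure_times_s, element, total_s, sample_rate):
    """Add one copy of `element` starting at every glottal closing instant."""
    total_n = round(total_s * sample_rate)
    out = [0.0] * total_n
    for tc in closure_times_s:
        start = round(tc * sample_rate)
        for k, value in enumerate(element):
            if start + k >= total_n:
                break  # drop the part that extends past the output buffer
            out[start + k] += value
    return out

# Hypothetical illustration values, not measurements from the book.
SAMPLE_RATE = 8000                      # samples per second
element = decaying_wave(
    freqs_hz=[700.0, 1200.0, 2600.0],   # assumed vocal-tract resonances
    bandwidths_hz=[80.0, 90.0, 120.0],  # assumed decay rates
    duration_s=0.02,
    sample_rate=SAMPLE_RATE,
)
closures = [0.0, 0.01, 0.02, 0.03]      # one closing every 10 ms (100 Hz pitch)
voice = superpose(closures, element, total_s=0.05, sample_rate=SAMPLE_RATE)
```

Because each elementary wave lasts longer than one pitch period in this example, the tail of one wave overlaps the beginning of the next, and the output in the overlap region is simply their sum.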
In the history of the theory of human voice, especially for vowels, there are two schools of thought, analogous to the centuries-long controversy over the theory of light: the particle theory of Isaac Newton and the wave theory of Christiaan Huygens [31]. The first school, the transient theory or inharmonic theory, was proposed by the British scientist Robert Willis (1800-1875) in 1829 [99]. Motivated by the similarity between the human voice organ and the pipe organ proposed by Leonhard Euler (1707-1783), Willis designed a series of mechanical models to imitate human voice production artificially. Following Euler's theoretical analysis, he showed that vowel sounds are composed of a series of decaying acoustic waves excited by pulsations emitted from the vocal folds. After the invention of the phonograph by Thomas Edison in 1877, speech waveforms could be recorded and displayed. The German physiologist Ludimar Hermann (1838-1914), using an optical amplification system to record speech waveforms on photographic plates, then verified Willis's theory with extensive data [40, 41, 42]. In 1902, the American physiologist Edward Wheeler Scripture (1864-1945) published the monograph The Elements of Experimental Phonetics [83], which systematically expounds the transient theory of human voice.
However, the early transient theories conjectured that the source of excitation is the air puff that comes through the glottis after it has been pushed open by the pressure in the trachea. After the invention of the electroglottograph by the French physiologist Philippe Fabre in 1956 [27], a universal experimental fact was found: the speech signal is triggered by the closing of the glottis, rather than by its opening. The waveform of the air puff during the open phase of the glottis has little effect on the voice. To elucidate that experimental fact, this book studies the acoustic process inside the vocal tract immediately after a glottal closing by solving the time-dependent wave equation. The solution, a dynamic acoustic process inside the vocal tract, is a quantitative representation of the transient sound wave.
An alternative theory of human voice production, the overtone-resonance theory or source-filter theory, was proposed in 1837 by Sir Charles Wheatstone (1802-1875) in a comment on Willis's paper [98]. Wheatstone agreed in every respect with Willis's theory, but added an alternative view in terms of overtones and resonance [73]. Wheatstone's view was elaborated by Hermann von Helmholtz (1821-1894) in On the Sensations of Tone [38].
Wheatstone and Helmholtz assumed that the vibration of vocal folds is truly periodic with a frequency f0. A periodic function can be treated as a Fourier series, which consists of a fundamental component with frequency f0, and the overtones with frequencies 2f0, 3f0, and so on. The vocal tract can be treated as a Helmholtz resonator with resonance frequencies F1, F2, F3, and so on, which are called formant frequencies. (An interesting historical fact is that the term "formant" was coined by Ludmir...