Speech: A dynamic process
René Carré, Pierre Divenyi, Mohamad Mrayati

  1. 241 pages
  2. English

About This Book

Speech: A dynamic process takes readers on a rigorous exploratory journey to expose them to the inherently dynamic nature of speech. The book addresses an intriguing question: based on physical principles alone, can the exploitation of a simple acoustic tube evolve into an optimal speech production system comparable to the one we possess? In the work presented, the tube is deformed step by step with the sole criterion of expending minimum effort to obtain maximum acoustic variation. At the end of this process, the tube is found divided into distinctive regions, and an acoustic space emerges that is capable of generating speech sounds. Attaching this tube to a model creates an inherently dynamic and efficient system. In the resulting system, optimal primitive trajectories arise naturally in the acoustic space, and the regions defined in the tube correspond to the main places of articulation for oral vowels and plosive consonants. All this implies that these speech sounds are inherent properties not only of the modeled acoustic tube but also of the human speech production system. The book stands as a valuable resource for accomplished and aspiring speech scientists, as well as for anyone in search of an introduction to speech acoustics that takes an unconventional path.


Information

Publisher: De Gruyter
Year: 2017
ISBN: 9781501502057

1 Speech: results, theories, models

1.1 Background

Speech is a communication process: a talker-to-listener flow of information traversing an often noisy medium. From an informational viewpoint, this flow has to be a sequence of events that are choices between alternatives. Of course, as Claude Shannon (1948) demonstrated in his classic but forever fresh paper on information theory, the flow is not forced to consist of Morse-code-like binary or alphabet-like multi-valued discrete elements; it may be an ensemble of features describing continuous processes, such as those in a noisy telephone conversation. Interestingly, the linguist Roman Jakobson was among the first to understand the value of Shannon’s system, as he re-formulated his original treatise on distinctive features of phonemes (Jakobson, Fant, and Halle, 1951) to correspond to information-theoretic concepts (Jakobson and Halle, 1956), but it was Alvin Liberman and his colleagues who first talked about the “speech code” (Liberman et al., 1967). Yet it should be clear that the above flow constitutes only one of the two arcs of communication, the other being the explicit or tacit indication by the listener to the talker that the information was received and decoded correctly. In the dynamic framework, information is defined, among other things, by properties of gestural deformations of the vocal tract’s area function or, equivalently, by the continuous change of formant resonances. We can apply a sparse code for communicating articulatory motions via the resulting acoustic changes: the code can portray a formant transition by specifying its point of departure (most often a time-domain discontinuity of a formant trajectory), its direction, its velocity, and its duration, and send a packet of bits only when any of these properties changes.
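Such an event-driven sparse code can be sketched in a few lines. The sketch below is ours, not the book's: it assumes a piecewise-linear formant trajectory sampled at discrete times, and it emits a "packet" (onset, starting frequency, slope, duration) only when the local slope changes by more than a tolerance. All names and threshold values are illustrative.

```python
from dataclasses import dataclass

@dataclass
class TransitionEvent:
    """One packet of the sparse code: a formant transition described by
    its point of departure, its direction/velocity, and its duration."""
    onset_s: float         # point of departure (start of the linear stretch)
    start_hz: float        # formant frequency at onset
    slope_hz_per_s: float  # direction (sign) and velocity (magnitude)
    duration_s: float      # how long this stretch lasts

def encode_sparse(times, freqs, slope_tol_hz_per_s=50.0):
    """Emit a packet only when the local slope of the formant trajectory
    changes appreciably; between packets the receiver can extrapolate
    the last reported slope."""
    if len(times) < 2:
        return []
    events = []
    seg_start = 0
    last_slope = None
    for i in range(1, len(times)):
        slope = (freqs[i] - freqs[i - 1]) / (times[i] - times[i - 1])
        if last_slope is None:
            last_slope = slope
        elif abs(slope - last_slope) > slope_tol_hz_per_s:
            # slope change detected: close the current stretch as one packet
            events.append(TransitionEvent(times[seg_start], freqs[seg_start],
                                          last_slope,
                                          times[i - 1] - times[seg_start]))
            seg_start = i - 1
            last_slope = slope
    # close the final stretch
    events.append(TransitionEvent(times[seg_start], freqs[seg_start],
                                  last_slope, times[-1] - times[seg_start]))
    return events
```

For a trajectory that is steady at 1500 Hz for 50 ms and then rises linearly to 1800 Hz over the next 50 ms, eleven samples collapse into just two packets, which is the sparseness the text describes.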
Although such a communication framework does share features with several of the computational speech analysis systems currently used for automatic speech recognition (ASR; e.g., Deng, 2006), and although those systems do represent and code information in the form of spectral changes in the speech waveform, the system we will present is significantly sparser. When it comes to information, we also believe that production had to evolve using decision trees, which at first had to be simply binary and then had to expand logarithmically by factors of 2, as Shannon’s (1948) original system predicts.
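The logarithmic expansion the paragraph refers to can be made concrete: distinguishing one of N equally likely alternatives requires about log₂ N binary decisions, so each doubling of the inventory adds exactly one yes/no choice. A one-function illustration (our own, not from the book):

```python
import math

def bits_needed(n_alternatives: int) -> int:
    """Number of binary (yes/no) decisions needed to single out one of
    n equally likely alternatives: ceil(log2 n), per Shannon (1948)."""
    return math.ceil(math.log2(n_alternatives))
```

Thus a binary contrast costs 1 bit, an 8-way contrast costs 3, and a 40-phoneme inventory (a typical order of magnitude for a language) costs 6.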
But, as we have already said in the Preface, it is dynamics – the constantly changing flow of information – that constitutes the most essential feature of the communication process and of the system we are about to describe. Our interest in the dynamics of speech production and perception was piqued early on, when we were first exposed to the off-the-beaten-track studies by Ludmilla Chistovich and her collaborator-husband Valery Kozhevnikov (Kozhevnikov and Chistovich, 1965; Chistovich et al., 1972). This couple, while investigating the known phenomenon of gesture anticipation in coarticulated consonants, was the first to suggest that speech should be studied as a dynamic flow of syllables rather than as a sequence of phonemes. Under the Soviet system much of this research was classified, and to this date only a small fragment of the vast productivity at the Pavlov Institute’s speech laboratory has been translated from the Russian. Sadly, first-rate scientists though Kozhevnikov and Chistovich were, they remain largely unfamiliar to Western audiences, and their work has stayed outside the main body of current scientific knowledge.
Above all, we are greatly indebted to Gunnar Fant, from whom we learned so much about speech acoustics and production, by way of reading his extensive work and also through invaluable personal contact. His seminal book (Fant, 1960) has been a constant source of reference while writing this volume. We also owe much to the work and ideas of Kenneth Stevens and James Flanagan, two scientists whose contributions to speech science are universally recognized. Like anybody with an interest in learning speech acoustics beyond the basics, we were students of their respective treatises in this field, “Acoustic Phonetics” by Stevens (2000) and “Speech Analysis Synthesis and Perception” by Flanagan (1972). Our book has also benefited from Stevens’ quantal theory of speech (Stevens, 1972), which defined articulatory-acoustic relations in dynamic terms. The model we shall describe in Chapter 4 rests on foundations laid by Manfred Schroeder, a physicist whose research had a significant impact in several branches of acoustics. We particularly appreciate his elegant mathematical treatment of area function dynamics and his solution (one of the earliest) to the thorny problem of converting articulatory motion to acoustics (Schroeder, 1967; Schroeder and Strube, 1979). We gladly acknowledge Sven Öhman’s contribution to our view on the dynamics of coarticulated consonant–vowel (CV) and vowel–consonant (VC) utterances (Öhman, 1966a, 1966b), from the points of view of both production and perception. Finally, we have learned much from Hiroya Fujisaki’s writings about his imaginative experiments and precise models on the perception of static and dynamic speech sounds (Fujisaki and Kawashima, 1971; Fujisaki and Sekimoto, 1975; Fujisaki, 1979).
Our understanding of the importance of articulatory dynamics was also stimulated by Björn Lindblom, a linguist of broad interests with whom all three of us had the good fortune to interact over several years, through his “H & H” (hyper- and hypo-speech) theory (Lindblom, 1990b) and his prediction of vowel systems (Lindblom, 1986b). We also credit him for his investigations of incomplete vowel transitions (“vowel reduction,” Lindblom and Studdert-Kennedy, 1967). Incidentally, it was Lindblom who first championed taking a deductive approach to the study of speech (Lindblom, 1992). We also benefited from the pioneering research of investigators at Haskins Laboratories – Alvin Liberman, Frank Cooper, and many of their colleagues – who, using their innovative spectrogram-to-audio “Pattern Playback” talking machine, were among the first to study the role of CV and VC transitions in speech perception (Delattre et al., 1955). Although still focused on the essentially static phoneme as the smallest element of speech, as stated by the motor theory, Liberman (1957; 1967) nonetheless viewed speech as a dynamic process, and so did Fowler (1986) when stating the dynamic property of perceptual direct realism. However, it was Catherine Browman’s and Louis Goldstein’s hypothesis (1986) positing a direct relation between articulatory movements and phonological representation (articulatory phonology) that moved the motor theory to a new level at which dynamics became essential – as it is for our own approach. Articulatory phonology is an attempt to represent the speech code in terms of articulatory gestures. Associated with a motor equivalence mechanism (Saltzman, 1986; Guenther, 1995), this approach allows the study of the respective roles these gestures play in diverse speech phenomena, such as coarticulation, vowel reduction, etc.
This approach was also encouraged by the then-surprising results of silent consonant experiments by Winifred Strange and James Jenkins (Strange, Verbrugge, Shankweiler, et al., 1976; Strange et al., 1983). Their experimental results, confirmed by automatic speech recognition tests by Furui (1986b), showed that transitions in nonsense CVC (consonant-vowel-consonant) syllables are sufficient for the recognition of a silenced vowel or silenced initial and final consonants (see also Hermansky, 2011).
In recent years the view of speech as a dynamic process has gained an ever-increasing number of adherents, and many studies have demonstrated that, when listening to speech, people recognize the dynamic progression of sounds rather than a sequence of static phonemes. This dynamic view has been given support by neurophysiological results showing that perceptual systems respond principally to changes in the environment (Barlow, 1972; Fontanini and Katz, 2008) and has become a cornerstone of kinetic theories of speech perception (Kluender et al., 2003). At the same time, more precise measurements have made it possible to discover vowel-inherent spectral changes (e.g., Nearey and Assmann, 1986) in vowel formant frequencies once considered static (Morrison and Assmann, 2013). Interestingly, while formant frequencies measured at the midpoint of natural American English vowels display large inter-speaker variability (Peterson and Barney, 1952), CV and VC formant transitions for a given vowel and a given consonant are remarkably stable across talkers (Hillenbrand et al., 1995, 2001).
Nevertheless, the human acoustic communication system could not be optimally efficient without the articulatory apparatus being paired with an efficient receiver – an auditory counterpart capable of detecting and discriminating both the sounds produced and the dynamic flow of the articulatory information. But we know that the vertebrate auditory system, including the human one, developed into the fine device it is – probably long before the human vocal tract evolved to serve the purposes of speech communication – mostly to detect and identify predators, prey, and sexual mates. So, the two pertinent questions are: (1) What particular auditory capabilities are needed for the detection, discrimination, and identification of sounds emitted by a possibly distant talker communicating in an environment where the speech information is often masked by noise? And (2) are the particular auditory functions that comprise those capabilities fine-tuned to perform optimally when the parameters of the functions match those of the sounds produced dynamically by the articulatory apparatus?
To answer these two questions, one should first list the speech signal’s physical characteristics and describe them from the listener’s point of view. A good point of departure is Rainer Plomp’s characterization of speech as a signal slowly varying in amplitude and frequency (Plomp, 1983). In other words, speech is a modulated signal and, in order to perceive it, the auditory system needs to do more than deal effectively with basic auditory attributes – audibility, signal-to-noise ratio (SNR) at threshold, intensity, frequency and pitch, timbre, duration, masking, localization – a list that some 150 years of research in psychoacoustics has been attacking with greater and greater success. On the whole, the human auditory system seems an excellent match for analyzing the speech waveform: the normal speaking intensity at conversational range is sufficient to transmit information even for a whisper; frequency resolution is finer than necessary to resolve neighboring formants for all vowels and consonants; pitch resolution is so exquisite that minute prosody-driven fundamental-frequency fluctuations (signaling some subtle emotional change) can be perfectly decoded; gaps of only 1–2 milliseconds in duration can be recognized; and horizontal localization of a source is good enough to track even slight displacements of a talker. But, thinking of Plomp’s definition of speech as a modulated waveform, it is necessary to investigate the perception of both amplitude (AM) and frequency (FM) modulations: AM for understanding the mainly syllabic-rate fluctuations of the speech waveform, and FM especially for studying the perception of formant frequency modulations (i.e., sweeps).
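Plomp's characterization can be illustrated directly: a signal whose amplitude fluctuates at a syllable-like rate (AM) while its frequency sweeps slowly (FM). The sketch below is our own illustration, not material from the book; the carrier, modulation rate, depth, and sweep range are arbitrary illustrative values.

```python
import math

def am_fm_tone(dur_s=0.5, fs=16000,
               am_rate_hz=4.0, am_depth=0.8,
               fm_start_hz=500.0, fm_end_hz=700.0):
    """Synthesize a tone that is amplitude-modulated at a syllable-like
    rate (~4 Hz) while its frequency sweeps linearly, mimicking the slow
    AM and FM that Plomp (1983) identified as characteristic of speech."""
    n = int(dur_s * fs)
    samples = []
    phase = 0.0
    for i in range(n):
        t = i / fs
        # instantaneous frequency: linear sweep from fm_start to fm_end (FM)
        f_inst = fm_start_hz + (fm_end_hz - fm_start_hz) * t / dur_s
        phase += 2 * math.pi * f_inst / fs
        # slow envelope at the syllabic rate, between 1 and 1 - am_depth (AM)
        env = 1.0 - am_depth * 0.5 * (1.0 - math.cos(2 * math.pi * am_rate_hz * t))
        samples.append(env * math.sin(phase))
    return samples
```

The resulting waveform carries its "information" entirely in these two slow modulations, which is exactly the aspect of speech that basic attributes such as audibility or pitch acuity alone do not capture.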
The earliest measurements of dynamic changes in the formant frequencies of natural speech are credited to Lehiste and Peterson (1961), although perceptual effects of formant frequency changes in both natural and synthesized vowels were also investigated early on by the Leningrad group (Kozhevnikov and Chistovich, 1965; Chistovich, 1971; Chistovich, Lublinskaja, et al., 1982). When presented with non-speech single-formant (vowel-like) tone complexes – harmonic series shaped by a formant filter – trained listeners are able to perceive very rapid formant frequency changes at very short durations, and low-velocity changes at somewhat longer but still quite short durations (e.g., Schouten, 1985). These experimental results shed light on how we must perceive speech: in natural speech samples, formant changes are inherent even in vowels that have long been considered steady-state (Nearey, 1989).
In real-life settings, however, we are often faced with the task of listening to a specific talker while other persons are also talking in the background (a situation named the “cocktail party effect” by Colin Cherry [1953; 1966]). Interference in the perception of one signal by another has been studied for a very long time; at first it was framed as a higher-intensity interferer, mostly noise, masking the audibility of a lower-intensity signal. Such masking, however, was more recently identified as only one of two kinds of interference, the other being one in which the masker does not abolish or diminish detectability of the signal but decreases its identifiability or discriminability – i.e., it lowers the ability to extract information from it. Appropriately, the first kind is now called “energetic” and the second “informational” masking (see Durlach et al., 2003, for a thorough explanation of the difference between the two). But masking in cocktail party situations is more complicated: there we have the AM/FM “noise” of the talkers in the background interfering with the AM/FM signal of the target talker...

Table of contents

  1. Cover
  2. Title Page
  3. Copyright
  4. Preface
  5. Contents
  6. Dedication
  7. Introduction
  8. 1 Speech: results, theories, models
  9. 2 Perturbation and sensitivity functions
  10. 3 An efficient acoustic production system
  11. 4 The Distinctive Region Model (DRM)
  12. 5 Speech production and the model
  13. 6 Vowel systems as predicted by the model
  14. 7 Speech dynamics and the model
  15. 8 Speech perception viewed from the model
  16. 9 Epistemological considerations
  17. 10 Conclusions and perspectives
  18. Bibliography
  19. Index of terms
  20. Author Index