Speech Synthesis and Recognition
Wendy Holmes

About This Book

With the growing impact of information technology on daily life, speech is becoming increasingly important for providing a natural means of communication between humans and machines. This extensively reworked and updated new edition of Speech Synthesis and Recognition is an easy-to-read introduction to current speech technology.
Aimed at advanced undergraduates and graduates in electronic engineering, computer science and information technology, the book is also relevant to professional engineers who need to understand enough about speech technology to be able to apply it successfully and to work effectively with speech experts. No advanced mathematical ability is required and no specialist prior knowledge of phonetics or of the properties of speech signals is assumed.


CHAPTER 1

Human Speech Communication

1.1 VALUE OF SPEECH FOR HUMAN-MACHINE COMMUNICATION

Advances in electronic and computer technology are causing an explosive growth in the use of machines for processing information. In most cases this information originates from a human being, and is ultimately to be used by a human being. There is thus a need for effective ways of transferring information between people and machines, in both directions. One very convenient way in many cases is in the form of speech, because speech is the communication method most widely used between humans; it is therefore extremely natural and requires no special training.
There are, of course, many circumstances where speech is not the best method for communicating with machines. For example, large amounts of text are much more easily received by reading from a screen, and positional control of features in a computer-aided design system is easier by direct manual manipulation. However, for interactive dialogue and for input of large amounts of text or numeric data, speech offers great advantages. Where the machine is only accessible from a standard telephone instrument, there is no practicable alternative.

1.2 IDEAS AND LANGUAGE

To appreciate how communication with machines can use speech effectively, it is important to understand the basic facts of how humans use speech to communicate with each other. The normal aim of human speech is to communicate ideas, and the words and sentences we use are not usually important as such. However, development of intellectual activity and language acquisition in human beings proceed in parallel during early childhood, and the ability of language to code ideas in a convenient form for mental processing and retrieval means that to a large extent people actually formulate the ideas themselves in words and sentences. The use of language in this way is only a convenient coding for the ideas. Obviously a speaker of a different language would code the same concepts in different words, and different individuals within one language group might have quite different shades of meaning they normally associate with the same word.

1.3 RELATIONSHIP BETWEEN WRITTEN AND SPOKEN LANGUAGE

The invention of written forms of language came long after humans had established systems of speech communication, and individuals normally learn to speak long before they learn to read and write. However, the great dependence on written language in modern civilization has produced a tendency for people to consider language primarily in its written form, and to regard speech as merely a spoken form of written text—possibly inferior because it is imprecise and often full of errors. In fact, spoken and written language are different in many ways, and speech has the ability to capture subtle shades of meaning that are quite difficult to express in text, where one’s only options are in choice of words and punctuation. Both speech and text have their own characteristics as methods of transferring ideas, and it would be wrong to regard either as an inferior substitute for the other.

1.4 PHONETICS AND PHONOLOGY

The study of how human speech sounds are produced and how they are used in language is an established scientific discipline, with a well-developed theoretical background. The field is split into two branches: the actual generation and classification of speech sounds falls within the subject of phonetics, whereas their functions in languages are the concern of phonology. These two subjects need not be studied in detail by students of speech technology, but some phonetic and phonological aspects of the generation and use of speech must be appreciated in general terms. The most important ones are covered briefly in this chapter.

1.5 THE ACOUSTIC SIGNAL

The normal aim of a talker is to transfer ideas, as expressed in a particular language, but putting that language in the form of speech involves an extremely complicated extra coding process (Figure 1.1). The actual signal transmitted is predominantly acoustic, i.e. a variation of sound pressure with time. Although particular speech sounds tend to have fairly characteristic properties (better specified in spectral rather than waveform terms), there is great variability in the relationship between the acoustic signal and the linguistic units it represents. In analysing an utterance linguistically the units are generally discrete—e.g. words, phrases, sentences. In speech the acoustic signal is continuous, and it is not possible to determine a precise mapping between time intervals in a speech signal and the words they represent. Words normally join together, and in many cases there is no clear acoustic indication of where one word ends and the next one starts. For example, in “six seals” the final sound of the “six” is not significantly different from the [s] at the beginning of “seals”, so the choice of word boundary position will be arbitrary. All else being equal, however, one can be fairly certain that the [s] sound in the middle of “sick seals” will be shorter, and this duration difference will probably be the only reliable distinguishing feature in the acoustic signal for resolving any possible confusion between such pairs of words. The acoustic difference between “sick seals” and “six eels” is likely to be even more subtle.
Although the individual sound components in speech are not unambiguously related to the identities of the words, there is, of course, a high degree of systematic relationship that applies most of the time. Because speech is generated by the human vocal organs (explained further in Chapter 2), the acoustic properties can be related to the positions of the articulators. With sufficient training, phoneticians can, based entirely on listening, describe speech in terms of a sequence of events related to articulatory gestures. This auditory analysis is largely independent of the age or sex of the speaker. The International Phonetic Alphabet (IPA) is a system of notation whereby phoneticians can describe their analysis as a sequence of discrete units. Although there will be a fair degree of unanimity between phoneticians about the transcription of a particular utterance, it has to be accepted that the parameters of speech articulation are continuously variable. Thus there will obviously be cases where different people judge a particular stretch of sound to be on opposite sides of a phonetic category boundary.
Figure 1.1 Illustration of the processes involved in communicating ideas by speech. It is not easy to separate the concepts in the brain from their representation in the form of language.

1.6 PHONEMES, PHONES AND ALLOPHONES

Many of the distinctions that can be made in a narrow phonetic transcription, for example between different people pronouncing the same word in slightly different ways, will have no effect on meaning. For dealing with the power of speech sounds to make distinctions of meaning it has been found useful in phonology to define the phoneme, which is the smallest unit in speech where substitution of one unit for another might make a distinction of meaning. For example, in English the words “do” and “to” differ in the initial phoneme, and “dole” and “doll” differ in the middle (i.e. the vowel sound). There may be many different features of the sound pattern that contribute to the phonemic distinction: in the latter example, although the tongue position during the vowel would normally be slightly different, the most salient feature in choosing between the two words would probably be vowel duration. A similar inventory of symbols is used for phonemic notation as for the more detailed phonetic transcription, although the set of phonemes is specific to the language being described. For any one language only a small subset of the IPA symbols is used to represent the phonemes, and each symbol will normally encompass a fair range of phonetic variation. This variation means that there will be many slightly different sounds which all represent manifestations of the same phoneme, and these are known as allophones.
Phonologists can differ in how they analyse speech into phoneme sequences, especially for vowel sounds. Some economize on symbols by representing the long vowels in English as phoneme pairs, whereas they regard short vowels as single phonemes. Others regard long and short vowels as different single phonemes, and so need more symbols. The latter analysis is useful for acknowledging the difference in phonetic quality between long vowels and their nearest short counterparts, and will be adopted throughout this book. We will use the symbol set that is most widely used by the current generation of British phoneticians, as found in Wells (2000) for example. With this analysis there are about 44 phonemes in English. The precise number and the choice of symbols depends on the type of English being described (i.e. some types of English do not make phonetic distinctions between pairs of words that are clearly distinct in others). It is usual to write phoneme symbols between oblique lines, e.g. /t/, but to use square brackets round the symbols when they represent a particular allophone, e.g. [t]. Sometimes the word phone is used as a general term to describe acoustic realizations of a phoneme when the variation between different allophones is not being considered.
Many of the IPA symbols are the same as characters of the Roman alphabet, and often their phonetic significance is similar to that commonly associated with the same letters in those languages that use this alphabet. To avoid the need for detailed knowledge of IPA notation in this book, use of IPA symbols will mostly be confined to characters whose phonemic meaning should be obvious to speakers of English.
There is a wide variation in the acoustic properties of allophones representing a particular phoneme. In some cases these differences are the result of the influence of neighbouring sounds on the positions of the tongue and other articulators. This effect is known as co-articulation. In other cases the difference might be a feature that has developed for the language over a period of time, which new users learn as they acquire the language in childhood. An example of the latter phenomenon is the vowel difference in the words "coat" and "coal" as spoken in southern England. These vowels are acoustically quite distinct and are produced with slightly different tongue positions. However, they are regarded as allophones of the same phoneme because they are never used as alternatives to distinguish between words that would otherwise be identical. Substituting one vowel for the other in either word would not cause the word identity to change, although it would certainly give a pronunciation that would sound odd to a native speaker.

1.7 VOWELS, CONSONANTS AND SYLLABLES

We are all familiar with the names vowel and consonant as applied to letters of the alphabet. Although there is not a very close correspondence in English between the letters in conventional spelling and their phonetic significance, the categories of vowel and consonant are for the most part similarly distinct in spoken language.
During vowels the flow of air through the mouth and throat is relatively unconstricted and the original source of sound is located at the larynx (see Chapter 2), whereas in most consonants there is a substantial constriction to air flow for some of the time. In some consonants, known as stop consonants or plosives, the air flow is completely blocked for a few tens of milliseconds. Although speech sounds that are classified as vowels can usually be distinguished from consonants by this criterion, there are some cases where the distinction is not very clear. It is probably more useful to distinguish between vowels and consonants phonologically, on the basis of how they are used in making up the words of a language. Languages show a tendency for vowels and consonants to alternate, and sequences of more than three or four vowels or consonants are comparatively rare. By considering their functions and distributions in the structure of language it is usually fairly easy to decide, for each phoneme, whether it should be classified as a vowel or a consonant.
In English there are many vowel phonemes that are formed by making a transition from one vowel quality to another, even though they are regarded as single phonemes according to the phonological system adopted in this book. Such vowels are known as diphthongs. The vowel sounds in “by”, “boy” and “bough” are typical examples, and no significance should be assigned to the fact that one is represented by a single letter and the others by “oy” and “ough”. Vowels which do not involve such a quality transition are known as monophthongs.
There are some cases where differences in phonological structure cause phonetically similar sounds to be classified as vowels in one language and consonants in another. For example the English word "pie" and the Swedish word "paj", which have the same meaning, also sound superficially rather similar. The main phonetic difference is that at the end of the word the tongue will be closer to the palate in the Swedish version. However, the English word has two phonemes: the initial consonant, followed by a diphthong for the vowel. In contrast the Swedish word has three phonemes: the initial consonant, followed by a monophthong vowel and a final consonant. The final consonant is very similar phonetically to the initial consonant in the English word "yet".
All spoken languages have a syllabic structure, and all languages permit syllables consisting of a consonant followed by a vowel. This universal fact probably originates from the early days of language development many thousands of years ago. The natural gesture of opening the mouth and producing sound at the larynx will always produce a vowel-like sound, and the properties of the acoustic system during the opening gesture will normally generate some sort of consonant. Some languages (such as Japanese) still have a very simple syllabic structure, where most syllables consist of a single consonant followed by a vowel. In languages of this type syllable sequences are associated with alternate increases and decreases of loudness as the vowels and consonants alternate. In many other languages, however, a much wider range of syllable types has evolved, where syllables can consist of just a single vowel, or may contain one or more consonants at the beginning and the end. A syllable can never contain more than one vowel phoneme (although that one may be a diphthong), but sometimes it may not contain any. In the second syllable of many people’s pronunciation of English words such as “button”, “prism” and “little”, the final consonant sound is somewhat lengthened, but is not preceded by a vowel. The articulatory constriction needed for the consonant at the end of the first syllable is followed immediately by that for the final consonant. Other English speakers might produce a short neutral vowel between the two consonants; the end result will sound fairly similar,...

Table of contents

  1. Cover
  2. Half Title
  3. Title Page
  4. Copyright Page
  5. Table of Contents
  6. Preface to the First Edition
  7. Preface to the Second Edition
  8. List of Abbreviations
  9. 1 Human Speech Communication
  10. 2 Mechanisms and Models of Human Speech Production
  11. 3 Mechanisms and Models of the Human Auditory System
  12. 4 Digital Coding of Speech
  13. 5 Message Synthesis from Stored Human Speech Components
  14. 6 Phonetic Synthesis by Rule
  15. 7 Speech Synthesis from Textual or Conceptual Input
  16. 8 Introduction to Automatic Speech Recognition: Template Matching
  17. 9 Introduction to Stochastic Modelling
  18. 10 Introduction to Front-End Analysis for Automatic Speech Recognition
  19. 11 Practical Techniques for Improving Speech Recognition Performance
  20. 12 Automatic Speech Recognition for Large Vocabularies
  21. 13 Neural Networks for Speech Recognition
  22. 14 Recognition of Speaker Characteristics
  23. 15 Applications and Performance of Current Technology
  24. 16 Future Research Directions in Speech Synthesis and Recognition
  25. 17 Further Reading
  26. References
  27. Solutions to Exercises
  28. Glossary
  29. Index