Behavioral and Neural Foundations of Multisensory Face-Voice Perception in Infancy
Daniel C. Hyde, Ross Flom and Chris L. Porter
ABSTRACT
In this article, we describe behavioral and neurophysiological evidence for infants' multimodal face-voice perception. We argue that the behavioral development of face-voice perception, like multimodal perception more broadly, is consistent with the intersensory redundancy hypothesis (IRH). Furthermore, we highlight that several recently observed features of the neural responses in infants converge with the behavioral predictions of the intersensory redundancy hypothesis. Finally, we discuss the potential benefits of combining brain and behavioral measures to study multisensory processing, as well as some applications of this work for atypical development.
Overview
Human communication is a multisensory experience. For example, we can see and hear our communicative partners. Cues from these different sense modalities often need to be coordinated in order to understand the intent of another person's thoughts and actions. Although human communication is a multisensory experience, historically much of the research and theorizing on infant perception, cognition, and learning related to face-voice perception has been done from a unimodal, or single sense modality, perspective. The purpose of this review article is to summarize and integrate research on the behavioral and neural foundations of face-voice perception in human infants. Specifically, we review selected behavioral studies that examine the development of attention, learning, and memory relevant to infants' multisensory perception of faces and voices. We then review the rapidly emerging literature examining the neural basis of multisensory processing in infants, including some relevant findings from the nonhuman animal and human adult literatures. From the outset, we propose that the intersensory redundancy hypothesis (IRH) provides a useful framework for understanding the perceptual and cognitive processes associated with multisensory communication and may provide insights into the neural foundations of face-voice perception as well. Finally, we briefly review some emerging translational applications of this work.
Unimodal perception of faces and voices
As previously noted, our day-to-day interactions, including communicative exchanges, are multimodal or multisensory. While research examining how auditory and visual information collectively leads to face-voice or person perception has increased in recent years, the majority of research on infants' face-voice perception has examined either the visual or the auditory modality alone. Therefore, before we describe research on the process and development of infants' multisensory perception of faces and voices, we first briefly summarize findings on infants' perception of faces and voices from a unimodal or unisensory perspective.
Visual perception of faces
Over the past several decades a vast literature has accrued demonstrating that infants are excellent perceivers of faces (see Farah, Wilson, Drain, & Tanaka, 1998; Nelson, 2001; Pascalis & Kelly, 2009; for reviews). For example, newborns prefer faces over other visual stimuli (e.g., Fantz, 1963; Maurer & Barrera, 1981), and 2- to 4-day-old newborns discriminate dynamic images of their mother's face and a stranger's face (Field, Cohen, Garcia, & Greenberg, 1984; Sai & Bushnell, 1988). Certainly by 3 months of age, and possibly from birth, infants can recognize individual faces based on facial features alone and from different orientations (Kelly et al., 2007; Pascalis, De Haan, Nelson, & de Schonen, 1998; Turati, Bulf, & Simion, 2008). Furthermore, the infant brain shows some degree of cortical specialization for face processing. For example, studies using functional near-infrared spectroscopy (fNIRS) have shown that occipital and temporal regions known to respond selectively to faces in adults (e.g., Haxby, Hoffman, & Gobbini, 2000; Kanwisher, McDermott, & Chun, 1997) also respond more to face than to nonface control stimuli in young infants (e.g., Blasi et al., 2007; Csibra et al., 2004; Otsuka et al., 2007). Brain electrophysiology measured from the scalp (using electroencephalography) likewise shows selectivity in the cortical response to faces compared to nonface stimuli in young infants (e.g., de Haan, Pascalis, & Johnson, 2002). Thus, while infants' face perception is shaped and modified by visual experience throughout the first year of life, even young infants are adept perceivers of faces (see Bar-Haim, Ziv, Lamy, & Hodes, 2006; Pascalis & Kelly, 2009; for reviews). Similarly, while young infants show neural selectivity for faces compared to nonface stimuli early in development, their neural responses become increasingly specialized for the types of faces (race, species, etc.) experienced in their environment (e.g., Scott & Monesson, 2010; Scott, Shannon, & Nelson, 2006).
Auditory perception of voices
Just as infants are excellent perceivers of faces, they are also excellent perceivers of voices. Fetuses, for example, are sensitive to auditory stimulation during the last trimester (Querleu, Renard, Boutteville, & Crepin, 1989; Querleu, Renard, Versyp, Paris-Delrue, & Crepin, 1988), and shortly after birth infants show a preference for their mother's voice compared to a stranger's voice or the voice of their father (DeCasper & Fifer, 1980; DeCasper & Prescott, 1984). Additionally, 4-day-old infants discriminate between their "own" language and an unfamiliar language, but not between two unfamiliar languages (Mehler et al., 1988; Moon, Cooper, & Fifer, 1993). Likewise, 2-day-old infants who prenatally heard their mother read a story once a day during the last trimester showed a preference for their mother reading the familiar story compared to her reading of a novel story (DeCasper & Spence, 1986). Studies using functional magnetic resonance imaging (fMRI) and fNIRS have shown that from the first few months of life the infant brain is specialized for language processing (Dehaene-Lambertz, Dehaene, & Hertz-Pannier, 2002; Dehaene-Lambertz et al., 2010; Peña et al., 2003). More specifically, left-lateralized regions of the frontal and temporal lobes respond selectively to speech and other language-like stimuli compared to backwards speech or nonlanguage sounds within the first few months of life (Dehaene-Lambertz et al., 2002, 2010; Peña et al., 2003). Furthermore, between 6 and 9 months of age both the neural responses and looking behavior of infants suggest that language processing becomes increasingly tuned to the particular patterns and sounds of their own language (e.g., Jusczyk, Cutler, & Redanz, 1993; see Kuhl, 2010, for a review). Thus, from very early in development, infants discriminate, recognize, preferentially attend to, and engage specialized regions of the brain for processing faces and voices.
As previously reviewed, infants are adept at perceiving faces and voices early in development. For the most part, however, infants do not encounter faces or voices in isolation; rather, they are typically exposed to coordinated, that is, temporally and spatially collocated, faces and voices. Therefore, just as it is important to understand how infants perceive faces and voices separately (i.e., from a unimodal perspective), it is also important to account for how, when, and under what conditions infants are able to perceive, learn, and remember faces and voices from a multimodal or multisensory perspective.
Multisensory or multimodal communication
Communication between two organisms is the canonical multimodal experience. Within human social-communicative exchanges, visual cues from the face and body of one speaker, along with that speaker's auditory vocal cues, need to be coordinated in order to understand the intent and content of another person's thoughts and actions. In other words, everyday communication involves the coordination of a variety of sensory information. Broadly speaking, communicative exchanges, like various objects and events, provide two types of sensory information: amodal and modality specific information (see Bahrick, Lickliter, & Flom, 2004; J. J. Gibson, 1966; E. J. Gibson, 1969; for reviews).
Modality specific information is typically defined as information that is conveyed by, or tied to, one specific sense modality (see Bahrick et al., 2004). For instance, the pitch and timbre (i.e., complexity) of a person's voice are restricted to the auditory system. Likewise, the color of a person's hair or the appearance of their face is restricted to the visual system. In contrast, amodal information is information that is "without modality". In other words, a property specified by amodal information is common across two or more sense modalities and is not specific to any one sensory system. Examples of amodal properties include the tempo and rhythm of speech, as these properties can be both seen and heard. Furthermore, some social information, like affect or emotion, can also be categorized as amodal. That is, affect is often conveyed in both facial and vocal expressions.
Intersensory redundancy hypothesis
The fact that everyday objects and events, including communicative exchanges, provide both modality specific and amodal information raises the question of how human infants arrive at a veridical perception of their world. J. J. Gibson (1966, 1979) argued that the senses work together to pick up what he termed invariant or amodal information. E. J. Gibson (1969) extended this work to perceptual development, describing it as a process of increasing specificity in which more global, invariant, and amodal properties like tempo, rhythm, and intensity are typically perceived and learned prior to modality specific properties (Bahrick, 2001; E. J. Gibson, 1969; E. J. Gibson & Pick, 2000). More recently, Bahrick and her colleagues proposed the intersensory redundancy hypothesis (IRH), which builds on this ecological approach to perception as a framework for understanding how infants arrive at a veridical and unitary perception of the world, including infants' perception of faces and voices (Bahrick & Lickliter, 2000, 2002; Bahrick et al., 2004). More specifically, the IRH explains which properties of an object or event (including faces and voices) infants will attend to, process, and remember, based on the information available to the infant (Bahrick & Lickliter, 2000, 2002; Bahrick et al., 2004). One prediction of the IRH is that amodal properties such as tempo, rhythm, and intensity are more perceptually salient and will more easily capture infants' attention when experienced in multimodal or multisensory contexts compared to unimodal or unisensory contexts (see Bahrick et al., 2004; J. J. Gibson, 1966; E. J. Gibson, 1969). Conversely, the IRH predicts that infants will show an attenuated response to amodal properties when these are experienced in a unimodal context.
In contrast, infants' attention toward and learning of modality specific properties such as color, pitch, or visual orientation is facilitated in unimodal contexts and attenuated in multimodal contexts (Bahrick & Lickliter, 2002; Bahrick et al., 2004; Flom & Bahrick, 2010). Thus, attentional biases and subsequent learning and memory in infancy depend upon the nature of the information available for perceptual exploration. Finally, the IRH also offers a developmental prediction: as perceptual processing becomes more efficient later in infancy, or with increased perceptual experience, both amodal and modality specific properties will be detected, attended to, and learned in contexts of either redundant multisensory or nonredundant unimodal stimulation (Bahrick et al., 2004).
Evidence supporting the predictions of the intersensory redundancy hypothesis
For nearly two decades, researchers have tested the predictions of the intersensory redundancy hypothesis (see Bahrick, 2010, for a review). Some of the initial experiments examined infants' (i.e., 3- to 5-month-olds') discrimination of different amodal properties, such as rhythm and tempo, within unisensory or unimodal contexts as well as multimodal or multisensory contexts (Bahrick, Flom, & Lickliter, 2002; Bahrick & Lickliter, 2000). For example, Bahrick and Lickliter (2000) showed that 5-month-olds discriminate a change in the rhythm of a plastic toy hammer hitting a surface when provided redundant and temporally synchronous bimodal auditory-visual stimulation, but fail to discriminate a change in the event's rhythm when provided unimodal auditory stimulation, unimodal visual stimulation, or temporally asynchronous bimodal stimulation. Likewise, Bahrick and colleagues (2002) showed that 3-month-olds discriminate a change in the tempo, or rate, of the same toy hammer hitting a surface when provided redundant and temporally synchronous bimodal auditory-visual stimulation, but not when provided either unimodal auditory or unimodal visual stimulation. These experiments show that the amodal properties of rhythm and tempo are perceived when conveyed in redundant multimodal stimulation but not when conveyed in unimodal auditory or visual stimulation. Evidence from other studies supports the developmental prediction of the intersensory redundancy hypothesis: infants' attention is initially captured by amodal properties, and over the course of development infants come to perceive and discriminate changes in the same amodal property in either multimodal or unimodal stimulation (e.g., Bahrick et al., 2002; Bahrick & Lickliter, 2004).
For example, Bahrick and Lickliter (2004) found that 5-month-olds can discriminate a change in tempo when provided either redundant bimodal (audio-visual) stimulation or unimodal (auditory) stimulation, whereas younger infants (i.e., 3-month-olds) make the same discrimination only within the context of bimodal (audio-visual) stimulation (Bahrick et al., 2002; Bahrick & Lickliter, 2004). Similarly, older infants (8-month-olds) can discriminate a change in rhythm (i.e., of the tapping hammer) when provided either bimodal or unimodal stimulation, but younger infants (5-month-olds) show discrimination only when provided bimodal stimulation (see Bahrick & Lickliter, 2000). Thus, older infants show discrimination of tempo and rhythm in either bimodal or unimodal contexts, whereas younger infants do so only within the context of bimodal stimulation (Bahrick & Lickliter, 2004). Taken together, early in development infants appear sensitive to changes in the amodal properties of tempo and rhythm within the context of synchronous bimodal stimulation. Two to three months later, however, infants can discriminate changes in these same amodal properties when provided either bimodal or unimodal...