The ability to imitate sound using one’s voice is an early-emerging behavior in humans and is a critical behavior in language acquisition (for a review, see Kuhl, 2010). Beyond language development, the ability to vocally imitate sound, specifically pitch, is an important component in music production (i.e., singing). Singing is a ubiquitous activity that serves social, cultural, and educational functions. Singing emerges spontaneously in early childhood, as is the case with speech. However, unlike speech, singing abilities vary later in life, specifically with respect to vocal pitch imitation. Furthermore, a sizeable minority of the population never acquires the ability to match a sung pitch within a semitone of the target pitch (Dalla Bella, Giguère, & Peretz, 2007; Pfordresher & Brown, 2007; Pfordresher, Brown, Meier, Belyk, & Liotti, 2010; Pfordresher & Larrouy-Maestri, 2015). The research presented here addresses what basic cognitive processes are related to individual differences in vocal pitch imitation ability.

The vocal imitation of pitch recruits a broad network of processes related to perception, motor control, and cognition (Pfordresher, Halpern, & Greenspon, 2015). The core of this network involves associations between auditory perception and vocal motor control within the vocal sensorimotor loop (Berkowska & Dalla Bella, 2009), a simplified version of which is shown in Fig. 1. The research to date suggests that individual differences in vocal pitch imitation may primarily reflect variability in sensorimotor translation. Support for this claim has come from the following lines of research. First, inaccurate singers often exhibit precise pitch discrimination skills, indicating that inaccurate vocal imitation is not a direct consequence of impaired perception (Dalla Bella et al., 2007; Pfordresher & Brown, 2007). Second, a purely motor-control account of inaccurate pitch imitation is not sufficient, either, given evidence that inaccurate singers exhibit vocal ranges similar to those of accurate singers during spontaneous vocal production and show precision similar to that of accurate singers when sustaining a constant pitch (Pfordresher & Mantell, 2009). Finally, inaccurate singers are able to match pitch outside the vocal motor system (Hutchins & Peretz, 2012). These lines of research have motivated a sensorimotor account of inaccurate pitch imitation, such that inaccurate singers are ineffective at mapping auditory representations onto motor representations of the vocal tract, resulting in inaccurate vocal performance (Hutchins & Peretz, 2012; Pfordresher & Brown, 2007; Pfordresher et al., 2015). Although sensorimotor translation may be the source of individual differences in pitch imitation ability, it is not clear from these observations what processes are necessary for sensorimotor translation to occur.

Fig. 1. A simplified version of the vocal sensorimotor loop

One process that might facilitate sensorimotor translation is the formation of an accurate auditory image. Various lines of evidence suggest that auditory imagery integrates both perceptual and motor representations, and thus might provide a basis for sensorimotor translation. Importantly, auditory imagery recruits brain activity in both secondary auditory areas and motor-planning areas (Leaver, Van Lare, Zielinski, Halpern, & Rauschecker, 2009; Lima et al., 2015; McNorgan, 2012; Zatorre & Halpern, 2005). Motor components of auditory images have also been established using vocal–motor interference tasks, such as subarticulation, which can interfere with auditory imagery (Smith, Wilson, & Reisberg, 1995). Finally, the assumption of a link between imagery and performance has a long-standing history in music education, in which notational audiation, or the formation of an auditory image from notation, is assumed to be critical for sight-reading (Gordon, 1975). Consistent with these observations, inaccurate singers report experiencing less vivid auditory images than do accurate singers (Greenspon, Pfordresher, & Halpern, 2017; Pfordresher & Halpern, 2013). However, no studies to date have related vocal pitch imitation to the accuracy with which auditory images can be formed online.

Although some pitch imitation tasks involve the reproduction of single pitches, which have a low short-term memory load, most singing involves listening to and then reproducing sequences of pitches in a melody: a task involving high short-term memory demands. Therefore, individual differences in pitch imitation accuracy might be due in part to differences in short-term memory capacity. However, research to date has been mixed regarding the relationship between auditory short-term memory capacity and vocal pitch imitation ability. Pfordresher and Brown (2007) found that inaccurate singers do not exhibit a greater deficit for sequences with more complex pitch patterns. In fact, the largest deficits among this group are found for imitating single pitches (Pfordresher & Brown, 2007). On the other hand, a more recent study did uncover an association between backward digit span and vocal pitch imitation, but only when participants were required to produce mental transformations of pitch sequences (Greenspon et al., 2017). In support of the role of short-term memory capacity in vocal performance, Christiner and Reiterer (2013) reported a positive correlation of both forward and backward digit span with a measure of singing ability, but it is not clear how closely associated the measure of singing was to accuracy in pitch imitation. In sum, although auditory short-term memory is a theoretically critical process involved in pitch imitation, the relationship between short-term capacity and pitch imitation ability is currently not well-defined. More important, to date, no studies have focused specifically on the relationship between pitch imitation and memory span for pitch information (Williamson & Stewart, 2010). As such, it remains to be determined whether these variable results reflect measurements of a memory store that is not critical for singing or differences in the operationalizing of short-term memory and singing ability across the studies.

Thus, in the present study we analyzed the correlations between accuracy of vocal pitch imitation and online measures of both auditory imagery and short-term memory, as is shown in Fig. 2. Although it is difficult to separate imagery and short-term memory, given their likely integration (Baddeley & Andrade, 2000), we separated these processes as clearly as possible on the basis of task demands. As such, our imagery tasks were designed to place minimal demands on memory capacity, while requiring participants to simulate a perceptual experience based on imagery. In contrast, our memory tasks involved the physical presentation of stimuli—thus placing minimal demands with respect to imagery—but could include many items to store, thus placing demands on capacity. In this way, in the present study we evaluated the degree to which auditory imagery and auditory short-term memory are involved in sensorimotor translation between auditory pitch perception and vocal–motor control of pitch.

Fig. 2. Critical measures and the associated hypotheses

In addition to inconsistencies in how auditory imagery and auditory short-term memory have been operationalized in the past singing literature, previous studies have also been inconsistent with respect to the dimensions (e.g., pitch or phonetic) that are reflected in measures of auditory imagery, auditory short-term memory, and vocal imitation ability (Christiner & Reiterer, 2013; Greenspon et al., 2017; Pfordresher & Halpern, 2013). For instance, studies investigating the role of auditory imagery in pitch imitation have used self-report measures of imagery that include verbal items (Greenspon et al., 2017; Pfordresher & Halpern, 2013), and studies investigating the role of short-term memory in pitch imitation have used verbal span tasks to measure short-term memory capacity (Christiner & Reiterer, 2013; Greenspon et al., 2017). However, the degree to which pitch imitation relies on pitch-specific processes with regard to imagery and memory has not been directly addressed—a primary aim of the present study, as we outline in Fig. 2. If pitch imitation does involve pitch-specific processes, then the inconsistent relationships between short-term memory and pitch imitation in the current literature could be attributed to the reliance on verbal measures of short-term memory capacity in previous studies.

Current evidence for the domain specificity of music and language processing has been mixed (for reviews, see Peretz, 2009; Peretz, Vuvan, Lagrois, & Armony, 2015; Schulze & Koelsch, 2012). On the one hand, the modular approach, motivated by case studies, states that music and language rely on separate neural resources (Fodor, 1983; Peretz & Coltheart, 2003). Recently, neuroimaging studies have shown that pitch and verbal representations rely on overlapping neural regions, but the patterns of activation within these regions differ for speech and music (Norman-Haignere, Kanwisher, & McDermott, 2015; Peretz et al., 2015; Rogalsky, Rong, Saberi, & Hickok, 2011; Schulze, Zysset, Mueller, Friederici, & Koelsch, 2011). On the other hand, Mantell and Pfordresher (2013) found that individuals who were inaccurate at imitating pitch in melodies tended also to be inaccurate at imitating pitch in speech, a result that aligns with previous work suggesting that pitch processing relies on shared resources for speech and music, as opposed to an encapsulated system (Mantell & Pfordresher, 2013; Patel, 2003; Pfordresher & Brown, 2009).

To address the question of dimension specificity in pitch imitation, we incorporated measures of imagery and short-term memory that focused on either pitch or verbal processing. For our auditory imagery measures, we included tasks that required the formation of pitch images or images based on the sequencing of phonemes into pronounceable nonwords (verbal imagery). To measure short-term memory, we included memory span tests for digits (verbal memory) as well as for pitch. We used the Seattle Singing Accuracy Protocol (Demorest & Pfordresher, 2015; Demorest et al., 2015) to measure pitch imitation ability and pitch discrimination thresholds and to assess music background. Figure 2 lays out four tests of basic cognitive processes (boxes) that could be related to the accuracy of vocal pitch imitation. With this design, we focused on two key hypotheses. One hypothesis is that vocal pitch imitation might participate in a network of process-specific mechanisms: Correlations between pitch imitation ability and performance on auditory imagery tasks—both pitch-based and verbal—but not performance on memory span tasks would support this hypothesis. The second hypothesis is that vocal pitch imitation might participate in a network of dimension-specific mechanisms: Correlations between pitch imitation ability and performance on tasks that involve the processing of pitch—either imagery or short-term store—and not tasks that involve the processing of verbal information would support this hypothesis. Finally, we also addressed whether auditory imagery mediates the relationship between auditory short-term memory and pitch imitation errors. This analysis was motivated by the proposed roles of auditory imagery and an auditory store in the sensorimotor loop shown in Fig. 1, in combination with the close association between self-reported imagery vividness and working memory that has been proposed by Baddeley (Baddeley, 2012; Baddeley & Andrade, 2000).

Method

Participants

A sample of 285 participants was recruited through the Introductory Psychology subject pool at the University at Buffalo and completed the study in exchange for course credit. All participants reported normal hearing, speech, and reading abilities. Eleven participants were removed from the sample because they exhibited extremely high pitch discrimination thresholds, above 1200 cents (range: 3435–9187 cents); thresholds in this range imply that the participant did not understand the instructions. An additional 58 participants were removed because they lacked data from at least one predictor measure, due to coding or experimenter error. In the final sample of 216 participants, 33 reported having vocal training (15%), 61 reported having six or more years of music experience (28%), and 62 reported no music experience (29%). All participants were native English speakers, and only three reported experience with a language other than English (French, Russian, and Yoruba) (see Footnote 1).

Apparatus

Participants completed vocal production tasks in either an acoustically treated room or a sound-attenuated recording booth (Whisper Room SE 2000). Auditory stimuli were presented over Sennheiser HD 280 Pro headphones, while vocal performance was recorded using a Shure WH30 headset microphone connected through a Lexicon Omega I/O box.

Materials

Seattle Singing Accuracy Protocol (SSAP)

This protocol consists of a battery of measures designed to assess singing accuracy, pitch perception, and musical background (Demorest & Pfordresher, 2015; Demorest et al., 2015). Participants first completed tasks designed to measure their comfortable singing range, which consisted of singing a familiar song and producing a sustained “comfort pitch” using the syllable “doo” for several seconds.

Participants then completed pitch imitation tasks in which they would listen to a target stimulus and then reproduce it by singing. The trials were blocked by the type of task and progressed from trials in which the target stimulus comprised a single vocal pitch (ten trials), to a single piano pitch (ten trials), and then to a four-note vocal pitch sequence (six trials). Vocal pitches were based on recordings of a male or a female trained singer (vocal gender matched to the participant) using the syllable “doo.” The range of sung pitches was adapted to the participant’s comfort pitch as exhibited in the warm-up phase.

After the vocal imitation tasks, participants sang a familiar song from memory (first with lyrics and then on “doo”), completed an adaptive pitch discrimination task (Loui, Alsop, & Schlaug, 2009; Loui, Guenther, Mathys, & Schlaug, 2008), and completed a musical background questionnaire. The SSAP analyzes pitch errors automatically by extracting the F0 trajectory with the YIN algorithm (De Cheveigné & Kawahara, 2002), identifying tone onsets on the basis of amplitude, removing the beginnings and ends of tones so as to avoid the influence of “scoops,” and taking the median F0 from inside each tone. Errors are defined as pitches that differ from the target by more than 50 cents (half a musical semitone) (see Footnote 2).
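The error criterion described above can be illustrated with a minimal Python sketch. It is not the SSAP implementation: the function names, the trimming proportion, and the example values are hypothetical, and F0 extraction and onset detection are assumed to have already produced a list of F0 samples for each tone.

```python
import math
from statistics import median

def cents(f0_hz, target_hz):
    """Signed deviation of f0_hz from target_hz in cents (100 cents = 1 semitone)."""
    return 1200.0 * math.log2(f0_hz / target_hz)

def trim_edges(samples, proportion=0.1):
    """Drop the first and last `proportion` of samples (hypothetical value)
    to reduce the influence of pitch "scoops" at the edges of a tone."""
    k = int(len(samples) * proportion)
    return list(samples[k:len(samples) - k]) if k > 0 else list(samples)

def is_pitch_error(f0_samples_hz, target_hz, threshold_cents=50.0):
    """True if the median F0 of the trimmed tone misses the target by more
    than threshold_cents in either direction."""
    produced = median(trim_edges(f0_samples_hz))
    return abs(cents(produced, target_hz)) > threshold_cents

def error_rate(tones):
    """Proportion of tones scored as errors.
    `tones` is a list of (f0_samples_hz, target_hz) pairs."""
    flags = [is_pitch_error(samples, target) for samples, target in tones]
    return sum(flags) / len(flags)
```

Under this sketch, a tone sung a full semitone sharp (100 cents) counts as an error, whereas a tone within a few cents of the target does not.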

Short-term memory measures

We estimated short-term memory span for pitch and verbal sequences using an adaptive two-up, one-down staircase procedure based on Williamson and Stewart (2010). Participants heard a sequence of items that were randomly sampled from a set of ten items, followed by a second sequence of the same length. The second sequence was either the same as the first or contained two items that were reversed in serial position. The tone sequences started at a length of two tones, and the digit sequences started at a length of four digits, reflecting the different baseline levels of difficulty for the tasks as determined by Williamson and Stewart. The task ended after eight reversals of sequence length (increasing to decreasing or vice versa). Each participant’s final span length was calculated by averaging across sequence lengths for the last six reversals.
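The logic of the staircase can be sketched in Python as follows. This is an illustration only, not the software used in the study: it assumes that "two-up, one-down" means the sequence length increases by one item after two consecutive correct responses and decreases by one item after a single incorrect response, and the names `run_staircase` and `respond` (and the `max_trials` safeguard) are hypothetical.

```python
def run_staircase(respond, start_length, n_reversals=8, n_averaged=6,
                  min_length=1, max_trials=1000):
    """Return the span estimate: the mean sequence length at the last
    `n_averaged` reversals. `respond(length)` returns True for a correct
    same/different judgment at the given sequence length."""
    length = start_length
    correct_streak = 0
    last_direction = 0          # +1 = length went up, -1 = down, 0 = no change yet
    reversal_lengths = []
    for _ in range(max_trials):  # safeguard against a run that never reverses
        if len(reversal_lengths) >= n_reversals:
            break
        if respond(length):
            correct_streak += 1
            if correct_streak < 2:
                continue         # need two correct in a row before moving up
            direction = 1
        else:
            direction = -1       # a single error moves the length down
        correct_streak = 0
        if last_direction != 0 and direction != last_direction:
            reversal_lengths.append(length)  # direction of change flipped here
        last_direction = direction
        length = max(min_length, length + direction)
    if not reversal_lengths:
        raise ValueError("no reversals occurred; span is undefined")
    tail = reversal_lengths[-n_averaged:]
    return sum(tail) / len(tail)
```

For an idealized listener who is always correct for sequences of up to five items and always wrong beyond that, the track oscillates between lengths 5 and 6, and the estimate settles midway between them.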

The pitch span task used ten triangle-waveform tones between 262 Hz (C4) and 741 Hz (F#5), such that each tone was separated by at least one whole step (200 cents) in the scale. The rise and fall times of each tone were linear over a duration of 20 ms, and total tone durations were 500 ms. The sequences were atonal, thus preventing chunking strategies.
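As a rough illustration of how such a tone set could be constructed, the Python sketch below generates ten frequencies spaced a whole step apart starting from C4, and a triangle-wave tone with linear onset and offset ramps. The equal-tempered tuning (A4 = 440 Hz), the 44.1-kHz sample rate, and the function names are assumptions; under this tuning the highest tone comes out at roughly 740 Hz, close to the value reported above.

```python
import math

A4_HZ = 440.0                      # assumed tuning reference
C4_HZ = A4_HZ * 2 ** (-9 / 12)     # about 261.63 Hz

def whole_tone_frequencies(n_tones=10, base_hz=C4_HZ):
    """Frequencies separated by one whole step (two semitones, 200 cents)."""
    return [base_hz * 2 ** (2 * k / 12) for k in range(n_tones)]

def triangle_tone(freq_hz, duration_s=0.5, ramp_s=0.02, sample_rate=44100):
    """Triangle-wave samples in [-1, 1] with linear rise and fall ramps."""
    n = int(duration_s * sample_rate)
    ramp_n = int(ramp_s * sample_rate)
    samples = []
    for i in range(n):
        phase = freq_hz * i / sample_rate                  # elapsed cycles
        wave = 2.0 * abs(2.0 * (phase - math.floor(phase + 0.5))) - 1.0
        envelope = min(1.0, i / ramp_n, (n - 1 - i) / ramp_n)
        samples.append(wave * envelope)
    return samples
```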

The digit span task used digits ranging from 0 to 9, spoken by a male speaker in American English with a Midland dialect for male participants, or by a female speaker with an Inland North dialect for female participants. Each digit was resynthesized to have a constant fundamental frequency of 102 Hz, for male stimuli, or 193 Hz, for female stimuli, and a constant intensity. The duration of each spoken digit was approximately 500 ms.

Imagery measures

We estimated the accuracy with which participants could generate auditory images online on the basis of pitch or phonetic (verbal) qualities.

The pitch imagery task was adapted from Gelding, Thompson, and Johnson (2015); an example is shown in Fig. 3. Each trial began with a C-major scale that provided a tonal context (each tone used the same acoustic dimensions as in the pitch span task), followed by a single cue tone. Participants then heard a sequence of three pitches, accompanied by visually presented arrows that pointed up or down, with the arrow direction corresponding to the direction of the change in pitch from the previous to the current tone. Next, participants observed a sequence of one to three more arrows and had to imagine what pitch should accompany these arrows, based on the tonal context and the preceding auditory pitches. Finally, participants heard a probe tone (with no accompanying arrow) and judged whether that pitch matched the last pitch that was imagined (the correct response for the example in Fig. 3 would be “yes”).

Fig. 3. Illustration of a trial in the pitch imagery task

The verbal imagery task used line drawings (Duñabeitia et al., 2018) representing high-frequency monosyllabic words that consisted of three phonemes. For half of the words, the phonological and orthographic representations were congruent (e.g., bus), and for the other half the phonological and orthographic representations were incongruent (e.g., thumb), to prevent the use of orthographic strategies. Participants first completed a familiarization phase, in which they were presented a line drawing on a computer screen while the word illustrated by the picture was spoken over headphones. In an experimental trial, participants observed three line drawings presented on a computer screen in a sequential, additive order (see the example in Fig. 4). Beneath each line drawing was a row of three dashes, representing three phonemes, with the phoneme to be imagined indicated by an asterisk. Participants were asked to combine the phonemes from each word in order to imagine a pronounceable nonword. Participants then heard a spoken nonword over headphones (produced by the same talker as for the male digit span stimuli) and were asked to indicate whether this stimulus matched the imagined nonword.

Fig. 4. Illustration of a trial in the verbal imagery task

Procedure

After providing informed consent, participants were seated in an acoustically treated experiment room or sound-attenuated vocal booth. They completed the SSAP after being instructed by an experimenter on the appropriate posture for singing. Participants then completed the imagery and short-term memory tasks, with the order of these tasks counterbalanced through a Latin square design. Then they completed music and language background questionnaires, as well as the Bucknell Auditory Imagery Scale (BAIS; Halpern, 2015), a self-report measure of auditory imagery ability consisting of Vividness and Control subscales. Finally, participants were debriefed and given course credit for their participation.

Results

Descriptive statistics for each task are presented in Appendix Table 3. We focus here on correlations across the measures, given that the primary purpose of this study was to measure what core cognitive abilities are associated with singing accuracy. Bivariate correlations across all measures are shown in Table 1. Pitch imitation error rates were significantly correlated with all variables, with the exceptions of digit span and BAIS Control. Correlations between the predictor variables were also examined.

Table 1 Bivariate correlations between tasks

Next we evaluated the relative contributions of the significant predictor variables for pitch error rates that were associated with the theoretical framework shown in Fig. 2 (this was limited to the online measures, and thus excluded the BAIS subtests). We conducted a five-step hierarchical linear regression with pitch imitation error rate as the dependent variable. The variables were ordered according to their conceptual status, with those that had a less critical theoretical role or that already enjoyed strong support from previous research being entered before variables that had a more critical and hypothetical role. The results are shown in Table 2. Not surprisingly, musical experience (Step 1) and pitch discrimination threshold (Step 2) were strong predictors of pitch error rates during singing. More importantly, short-term memory for pitch (Step 3) and online pitch imagery (Step 5) contributed significantly beyond the other predictors. However, performance on the verbal imagery task (Step 4) did not contribute beyond the predictors entered at earlier steps, despite having a significant bivariate relationship (Table 1). This significant bivariate relationship may in large part reflect the shared variance between verbal imagery performance and short-term memory for pitch (also shown in Table 1; see Footnote 3). Other orderings of the predictors, carried out after the reported ordering for purposes of testing alternatives, yielded the same results.
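The logic of a hierarchical regression, in which predictors are entered one step at a time and each step is evaluated by the change in explained variance, can be sketched in plain Python. This is a minimal illustration and not the analysis code used in the study: the function names are hypothetical, significance tests for each step are omitted, and any data passed to it would need to be supplied by the reader.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def r_squared(columns, y):
    """R^2 of an ordinary least-squares fit of y on the given predictor
    columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)]
           for j in range(p)]
    xty = [sum(row[j] * y[i] for i, row in enumerate(design)) for j in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def hierarchical_steps(named_predictors, y):
    """Enter predictors in the given order; return (name, R^2, delta R^2)
    for each step."""
    results, entered, previous = [], [], 0.0
    for name, column in named_predictors:
        entered.append(column)
        r2 = r_squared(entered, y)
        results.append((name, r2, r2 - previous))
        previous = r2
    return results
```

A predictor "contributes beyond" earlier steps when its increment in R² is reliably greater than zero; reordering the `named_predictors` list corresponds to testing alternative orderings.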

Table 2 Five-step hierarchical linear regression of significant predictors on pitch imitation errors

Finally, we tested the relationship between short-term memory and imagery from the vocal sensorimotor loop model (Fig. 1) through a mediation analysis. We reasoned that auditory imagery is necessary to make pitch representations in short-term memory accessible for associations with vocal motor planning, on the basis of the hypothesis that pitch imagery supports sensorimotor translation. This analysis was predicated on the fact that pitch imagery and pitch span were correlated with each other, in addition to the finding that both pitch imagery and pitch memory were unique predictors of pitch imitation error rates. A nonparametric bootstrapping procedure with 1,000 simulations revealed a significant indirect effect, B = –.01, 95% CI [–.02, –.003]. Large pitch spans were associated with high scores for pitch imagery, which in turn were associated with low rates of pitch imitation errors; see Fig. 5. The proportion of the indirect effect with respect to the total effect was .20. In addition, the direct effect was also significant, B = –.04, 95% CI [–.09, –.03]; high pitch span thresholds were associated with low rates of pitch imitation errors, above and beyond the contribution of pitch imagery. In sum, these results indicate that pitch imagery was a partial mediator of the effect of pitch span on pitch imitation errors.
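The structure of such a mediation analysis can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the procedure or software used in the study: it estimates the paths by ordinary least squares, uses a simple percentile bootstrap, and all function names are hypothetical. Here `x` stands for the predictor (e.g., pitch span), `m` for the mediator (e.g., pitch imagery), and `y` for the outcome (e.g., pitch imitation error rate).

```python
import random

def _mean(v):
    return sum(v) / len(v)

def _cov(u, v):
    mu, mv = _mean(u), _mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def mediation_paths(x, m, y):
    """Return (indirect, direct, total) effects of x on y through m.
    a = slope of m on x; b and direct = partial slopes in y ~ x + m."""
    sxx, smm, sxm = _cov(x, x), _cov(m, m), _cov(x, m)
    sxy, smy = _cov(x, y), _cov(m, y)
    det = sxx * smm - sxm ** 2
    a = sxm / sxx
    b = (smy * sxx - sxy * sxm) / det
    direct = (sxy * smm - smy * sxm) / det
    total = sxy / sxx
    return a * b, direct, total

def bootstrap_indirect_ci(x, m, y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the indirect effect,
    resampling participants with replacement."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ms = [m[i] for i in idx]
        ys = [y[i] for i in idx]
        estimates.append(mediation_paths(xs, ms, ys)[0])
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

In this framework the total effect equals the sum of the direct and indirect effects, so the proportion mediated is the indirect effect divided by that sum; with the values reported above, –.01 / (–.01 + –.04) = .20. An interval for the indirect effect that excludes zero, together with a reliable direct effect, corresponds to partial mediation.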

Fig. 5. Mediation analysis with pitch imagery as the mediator of the relationship between pitch span and pitch imitation. Unstandardized regression estimates are presented, along with standard errors in parentheses. **p < .01, ***p < .001

Discussion

We have reported associations between pitch imitation ability and measures of basic cognitive functions that are hypothesized to play a role in the vocal sensorimotor loop, shown in Fig. 1 (Berkowska & Dalla Bella, 2009; cf. Pfordresher et al., 2015). Accurate imitators exhibited larger short-term memory spans for pitch and greater accuracy in generating auditory images of musical pitch than did inaccurate imitators. These relationships were found to be independent of other factors, including pitch perception and musical experience, which, not surprisingly, also correlated with pitch imitation ability. We also showed that pitch imagery partially mediates the relationship between pitch span capacity and pitch imitation errors, providing evidence for a possible pathway by which auditory short-term memory and auditory imagery may be involved in vocal pitch imitation.

The significance of the correlations of pitch imitation ability with imagery and memory span measures is twofold. First, these associations draw connections between the theoretical construct of the vocal sensorimotor loop model (Fig. 1) and basic cognitive functions. Our results support auditory imagery’s role in sensorimotor translation of auditory perception and vocal motor planning (Pfordresher et al., 2015) and suggest that short-term memory is necessary for maintaining a representation of a target sequence for purposes of planning a series of motor actions (cf. Palmer & Pfordresher, 2003). Whereas other studies have shown an association between singing accuracy and self-report of auditory imagery vividness (Greenspon et al., 2017; Pfordresher & Halpern, 2013), this is the first study we know of to have reported an association with the accuracy of pitch images formed online. Importantly, we also established the unique contribution of pitch short-term memory capacity in pitch imitation. Second, the general cognitive capacities we outlined—and the measures we used here—are not limited to music performance contexts. Vocal imitation of pitch may therefore not reflect an encapsulated ability, but may instead reflect the integration of several coordinated functions.

Although the cognitive functions associated with vocal pitch imitation serve broader purposes, it is also important to note that these associations are specific to the representation of pitch and are not measures of imagery and memory for phonological information. At first glance, these results appear consistent with claims that music cognition, whether with respect to perception or performance, relies on domain-specific resources that are not shared with speech (e.g., Peretz & Coltheart, 2003). However, pitch also plays an important role in language processing. The present design does not distinguish between language-based and music-based uses of pitch, in that the tasks only involved musical pitch processing. Other studies of vocal pitch imitation have suggested that imitations of sung and spoken pitch rely on shared resources (e.g., Mantell & Pfordresher, 2013; Pfordresher & Brown, 2009; Wisniewski, Mantell, & Pfordresher, 2013). Therefore, we only make claims about dimension-specific processing of pitch, rather than domain-specific processing of music. In this context, it is worth noting that an earlier study did report an association of verbal span with singing accuracy, which was not found here (Greenspon et al., 2017; cf. Christiner & Reiterer, 2013). However, the association was found using backward span and was only found for singing tasks that involved a particularly demanding reproduction task—that is, reproduction of mental transformations of melodies. Here we found that simple reproduction of melodies correlates specifically with pitch span.

Interestingly, there was a general tendency for participants to perform more accurately with verbal than with pitch stimuli in both the imagery and memory span tasks (cf. Benassi-Werke, Queiroz, Araujo, Bueno, & Oliveira, 2012). It is possible that these differences relate to the nature of the representations underlying pitch and verbal items. With respect to the span task, digits are highly familiar items and reflect both visual (numeric and orthographic) and auditory (phonological) representations. Pitches, on the other hand, may reflect both visual (musical notation) and auditory representations for musicians (≥ 6 years music experience, 28% of the sample); however, for nonmusicians (72% of the sample), pitches reflect auditory-only representations. Similarly, phonological images, even for the nonwords used here, may relate to orthographic visual representations that enhance performance. At this point, however, it is hard to draw firm conclusions from these tasks, given that the tests we used were not calibrated with respect to the perceptual salience of the individual items.

In conclusion, the present research helps clarify the role of basic cognitive processes in the vocal imitation of pitch. The role of auditory imagery in vocal imitation has been established by recent research (Greenspon et al., 2017; Pfordresher & Halpern, 2013; Pfordresher et al., 2015). However, a novel finding from the present research is that auditory imagery relies on pitch-specific resources that uniquely relate to pitch imitation ability. In addition, the present study established a pitch-specific role of auditory short-term memory in vocal pitch imitation, further supporting the notion that pitch and phonological representations may rely on separate memory stores within the phonological loop (Schulze et al., 2011).

Open Practices Statement

This research was supported by National Eye Institute grant no. R01 EY025275.