Introduction

The study of cognitive control and conflict monitoring has played a major role in cognitive neuroscience in the last decade, with a wealth of new data coming from functional neuroimaging, electrophysiology, and lesion studies (see Mansouri, Tanaka, & Buckley, 2009, for a review). Whereas in daily life, sensory stimuli that guide behavioral responses belong to multiple modalities, much of our knowledge about cognitive control processes derives from studies investigating stimulus conflict within the visual modality. Paradigms such as the Stroop (Stroop, 1935) and flanker (Eriksen & Eriksen, 1974) tasks have been extensively used to examine both behavioral and underlying neural mechanisms of the processing of visual stimulus conflict. Despite this wealth of studies, there is still much debate about the specific processes involved in conflict detection and resolution. In addition, due to the focus on conflict processing in the visual modality, the generalizability of the reported findings across modalities and tasks is not well characterized.

Behaviorally, the visual Stroop paradigm (Stroop, 1935) has been studied for decades. The causes for the observed behavioral and neural effects have been considered almost exclusively within the visual domain, leading to various accounts for visual cognitive control. In its classic form, participants are visually presented with a color word (e.g., red) and are asked to report the color of the font, which can be either congruent or incongruent with the meaning of the word. According to an early popular account, word reading is relatively automatic and rapid. When the word meaning and font color do not match, additional processing is required to resolve associated cognitive conflict because of the need to select the appropriate response (to the font color) and suppress the alternative and competing response (to color word meaning; Posner & Snyder, 1975). Consistent with this account, participants are slower and less accurate in responding to incongruent than to congruent color words (for review see MacLeod, 1991). Computational models have been developed to explain the observed behavioral patterns, in terms of the relative strengths or speed of processing of the different stimulus inputs (e.g., Botvinick, Braver, Barch, Carter, & Cohen, 2001). In behavioral studies, researchers have pursued various approaches to tease apart the underlying conflict processes, including manipulating the stimulus onset asynchrony and/or the location of the relevant and irrelevant stimuli (Appelbaum, Meyerhoff, & Woldorff, 2009; M. O. Glaser & Glaser, 1982; W. R. Glaser & Dungelhoff, 1984; W. R. Glaser & Glaser, 1989; Lu & Proctor, 2001; Weekes & Zaidel, 1996). Such studies typically support the conclusion that the interference effect arises from “higher-order” conflict at the stage of response selection, rather than stemming from perceptual conflict among different feature dimensions of the stimuli (Botvinick et al., 2001; Cohen, Dunbar, & McClelland, 1990).

The neural underpinnings involved in cognitive control during the processing of visual stimulus conflict have also been extensively investigated. Functional neuroimaging studies of the Stroop task employing positron emission tomography and, more recently, functional magnetic resonance imaging (fMRI) have consistently reported greater activity in the anterior cingulate cortex (ACC) and/or the nearby pre-supplementary motor area (pre-SMA) in response to high-conflict stimuli (incongruent), as compared with low-conflict ones (congruent or neutral; Aarts, Roelofs, & van Turennout, 2009; MacDonald, Cohen, Stenger, & Carter, 2000; Mars et al., 2009; Milham & Banich, 2005; Roberts & Hall, 2008; Ruff, Woodward, Laurens, & Liddle, 2001). These results have provided a solid foundation for the notion that the ACC/pre-SMA play a central role in cognitive control and error-monitoring operations (Botvinick et al., 2001; van Veen & Carter, 2002). Another higher-level (i.e., control-oriented) region, the dorsolateral prefrontal cortex (DLPFC), has also been implicated in conflict processing. In particular, it has been proposed to help implement cognitive adjustments to conflicting stimulus inputs after being signaled to do so by the ACC (e.g., Egner, 2007; MacDonald et al., 2000; but see Silton et al., 2010).

An additional critical aspect in the understanding of the underlying neural mechanisms of conflict monitoring concerns the temporal dynamics of such processes during the cascade of brain activity triggered by a visual conflict-monitoring task, dynamics that are determinable only by high-temporal-resolution techniques such as electroencephalography (EEG) or magnetoencephalography (MEG). For example, using manual responses in the classic color Stroop task, several studies (e.g., Liotti, Woldorff, Perez, & Mayberg, 2000; Markela-Lerenc et al., 2004; West, 2003; West & Alain, 1999) have reported an early fronto-central negative-polarity wave peaking around 450 ms (which has been termed the “N450”) and a later, more sustained posterior positivity (SP, 500–800 ms), that are greater in voltage for incongruent than for congruent color words. Our earlier study of the visual Stroop effect (Liotti et al., 2000) also examined the influence of alternative response modalities (manual, overt verbal, and covert verbal) and found that the centrally maximal negative N450 had a more anterior scalp topography for both of the verbal response modalities than for the manual response one. The source of the N450 in both response modalities was also modeled as arising, at least in part, from the dorsal ACC (Liotti et al., 2000), although perhaps from different parts thereof (Swick & Turken, 2002). Various later ERP studies of the color-naming visual Stroop task have reported similar results (Atkinson, Drysdale, & Fulham, 2003; Badzakova-Trajkov, Barnett, Waldie, & Kirk, 2009; Hanslmayr et al., 2008; Huster et al., 2009; Markela-Lerenc et al., 2004; West, 2003), with the same general pattern of conflict-related neural response in these tasks.

Another aspect of conflict monitoring that has recently received considerable attention involves trial-by-trial behavioral adjustments. In tasks such as the Eriksen flanker task (Eriksen & Eriksen, 1974), behavioral findings have suggested that participants tend to modulate their attentional allocation as a function of the previous trial type, with faster response times (RTs) observed for incongruent trials that are preceded by an incongruent trial, as compared with those preceded by a congruent trial (e.g., Gratton, Coles, & Donchin, 1992). This RT difference has been interpreted as deriving from ongoing behavioral-monitoring processes, wherein the detection of conflict on one trial leads to increased allocation of attention on the next trial to the relevant stimulus dimension, thereby enabling the relative speeding of the response. Studies using other conflict paradigms, including the Stroop task, have reported similar sequential-trial modulations of RT (e.g., Kerns et al., 2004). While these effects have been challenged as deriving, at least in part, from facilitation due to response repetition and other factors (Mayr, Awh, & Laurey, 2003), some level of sequential-trial effects attributable to cognitive adjustments has still been observed even when these factors are taken into account (e.g., Egner & Hirsch, 2005; Kerns et al., 2004). Importantly, studies employing neural activity measures have suggested a dynamic interplay between the ACC and the frontal cortex, wherein increased ACC activity on one trial (e.g., presumably reflecting robust conflict detection) is followed by increased DLPFC activity to facilitate performance on the next trial (e.g., attentional allocation; Kerns et al., 2004; see Mansouri et al., 2009, for a review).

Taken together, the aforementioned visual-modality work strongly suggests a dynamic processing model wherein conflict is detected in the 400- to 500-ms range by the ACC or pre-SMA, which then signals to other attentional control regions (such as the DLPFC) to allocate increased attention to the relevant visual feature, at least by the next trial. Yet it is unclear whether this temporal cascade is unique to the visual modality or, rather, whether it is a supramodal process, representing a more general cognitive control mechanism that would be applicable across other modalities. That is, if the operation of cognitive control during conflict processing is modality independent, both within-trial and between-trial conflict monitoring and control effects previously observed for the visual modality should operate with similar temporal patterns and within common spatial regions, regardless of modality.

In the present study, we sought to characterize the temporal flow of the neural processing of conflict in the auditory modality. To this end, an auditory EEG version of the Stroop task was implemented that was patterned closely after the visual one employed in one of our previous studies (Liotti et al., 2000), including employing three different types of behavioral response (overt verbal, covert verbal, manual). Here, participants made discriminations about the pitch of a spoken word while ignoring its meaning, which could be either congruent or incongruent with the pitch. Behavioral and neural responses were measured so that the temporal cascade of conflict processing in the auditory modality could be studied, both within trial and as a function of sequence.

Previous behavioral studies using auditory versions of the Stroop task have shown conflict-related RT effects, providing a foundation for the present electrophysiological investigation (Jerger, Martin, & Pirozzolo, 1988; Most, Sorber, & Cunningham, 2007; Shor, 1975). The auditory Stroop tasks differed from their visual counterparts due to inherent differences in the physical nature of auditory and visual stimuli. Whereas, with vision, the font color and meaning of the word can be independently varied, in audition the conflict must be created from another inherent physical property. Shor addressed this problem by varying the pitch of the word (high or low pitch) with the identity (i.e., meaning) of the word itself (“high” or “low”), with the task being to discriminate the pitch and with the results indicating slower and less accurate responses for incongruent than for congruent combinations. In a slightly different variant, Jerger and colleagues used the gender of the speaker to create conflict for children, wherein a female voice would say “daddy” and a male voice would say “mommy,” which also resulted in slower responses for the incongruent trials. The behavioral results from these manipulations thus suggest that the conflict that is generated in the visual Stroop task can be generalized to the auditory domain. Given this apparent behavioral generalization of conflict, it seems likely that the neural bases of the visual and auditory conflict effects may share some overlapping mechanisms.

A recent fMRI study compared visual and auditory versions of the Stroop task in a block design and reported some areas of common activation associated with stimulus conflict across the two modalities (Roberts & Hall, 2008). Accordingly, we expected to observe electrophysiological correlates of conflict processing in an auditory Stroop task similar to those that had been found in our previous visual version (Liotti et al., 2000) on which the present study was patterned. Additionally, a recent EEG study by Larson, Kaufman, and Perlstein (2009) reported sequential EEG effects in a visual Stroop task for the late posterior SP, and thus we also expected that we might find a similar sequential effect in the auditory modality.

More specifically, we predicted that previously identified ERP reflections of the visual Stroop incongruency effect—namely, the frontocentral N450 and the posterior SP—would also be manifested in the auditory version of the task. Moreover, we hypothesized that these effects might show similar scalp distribution differences across behavioral response modalities, reflecting “higher-order” conflict processing at the stage of response selection, rather than perceptual conflict among different feature dimensions of the stimuli, thereby further providing evidence for supramodal characteristics of these electrophysiologically manifested stages of conflict processing. On the other hand, given that auditory processing employs different sensory pathways and association areas and the timing of cortical activation is different from that for the visual system (Hillyard, Mangun, Woldorff, & Luck, 1995), we anticipated that ERP effects related to conflict processing in the auditory modality might have earlier latencies than their visual counterparts.

Method

Participants

Eleven healthy right-handed volunteers (age, 18–48 years; M = 25.9; 3 males) participated in this study. Two additional participants were excluded due to excessive blink and muscle artifacts in their EEG recordings. All participants gave their informed consent in compliance with the Institutional Review Board of the University of Texas Health Science Center at San Antonio.

Stimuli and task

An auditory version of the Stroop task (Stroop, 1935; Shor, 1975) was employed. Participants were seated with their eyes 50 cm away from a computer monitor, on which a central fixation cross was displayed for the duration of the experiment. The auditory stimuli consisted of words spoken by a male native English speaker. Auditory thresholds were individually determined prior to the experiment, and speech sounds were delivered through headphones at 60 dB above threshold (60 dBSL). The duration of each auditory word stimulus was ~300 ms, and the stimulus onset asynchronies were jittered randomly between 1,700 and 2,200 ms.

Congruent and incongruent stimuli were randomly presented with equal probability. Congruent trials consisted of the word “high” spoken in a high pitch and the word “low” spoken in a low pitch. In contrast, incongruent trials were made of the word “high” in a low pitch and the word “low” in a high pitch. Participants’ task was to decide as quickly as possible whether the pitch of the word was high or low, regardless of the meaning of the actual word.

The stimuli above were presented in randomized order within three different types of task blocks that varied in terms of the form of the behavioral response (Liotti et al., 2000). In the overt verbal response condition, participants said aloud whether the pitch of the word they just had heard was high or low, and vocal onset time was measured from the signal obtained from a microphone placed in front of the participant (distance 30 cm). In the covert verbal condition, participants were instructed to say the pitch silently in their mind (without moving their lips, tongue, or jaw). In the manual condition, participants responded by pressing one of two buttons on a gamepad corresponding to the high or low pitch. There were a total of 18 runs, with six blocks for each response modality, giving a total of 130 trials in each of the four trial types (e.g., “low” spoken in a low pitch), yielding 260 congruent and 260 incongruent trials per response mode.

The order of the overt verbal, covert verbal, and manual blocks was counterbalanced within and between participants. Accuracy was analyzed for the manual condition, and RTs were measured for both the manual and overt verbal conditions. Accuracy could not be determined for the overt verbal condition because the verbal responses themselves were not recorded (only voice onset was measured). For each participant, mean RT for the manual and overt verbal conditions for the congruent and incongruent trial types was entered in a repeated measures ANOVA, with the factors being response modality (manual vs. overt) and trial type (congruent vs. incongruent).

EEG recording and analysis

Each participant’s brain electrical activity was continuously recorded via a customized, extended-coverage, 64-channel electrocap (Electrocap Inc, Eaton, OH; Woldorff et al., 2002), referenced to the right mastoid. Amplifier settings were the following: band-pass filter, 0.01–100 Hz; sampling rate, 500 Hz; gain, 10 K; impedances, < 5 kΩ. Blinks and eye movements were monitored from four electrodes placed at the external canthi and below the orbits.

Offline data processing included blink and eye movement artifact rejection (implemented by means of an automated computer algorithm after the manual setting of artifact amplitude rejection levels), low-pass filtering (< 57 Hz), and re-referencing to the algebraic average of the left and right mastoids. Selective ERP averages, time-locked to stimulus onset, were extracted for each trial type (congruent and incongruent) and response-mode condition (overt, covert, and manual) from each participant, relative to a 200-ms prestimulus baseline. Finally, grand averages across participants were computed for each condition and trial type, as well as for the incongruent minus congruent difference waves for each condition. Topographical maps of each trial type and condition and difference wave were also generated.
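The averaging and difference-wave steps described above can be illustrated with a minimal sketch. It assumes single-channel epochs stored as plain lists of voltages that begin at the start of the baseline; the function names and the toy values are hypothetical and are not taken from the analysis software actually used. Artifact rejection, filtering, and re-referencing are not shown.

```python
from statistics import mean

SAMPLING_RATE_HZ = 500   # from the amplifier settings reported above
BASELINE_MS = 200        # prestimulus baseline length reported above


def baseline_correct(epoch, n_baseline):
    """Subtract the mean of the first n_baseline samples from every sample."""
    base = mean(epoch[:n_baseline])
    return [v - base for v in epoch]


def average_erp(epochs, n_baseline):
    """Baseline-correct each epoch, then average sample by sample."""
    corrected = [baseline_correct(e, n_baseline) for e in epochs]
    return [mean(samples) for samples in zip(*corrected)]


def difference_wave(incongruent_erp, congruent_erp):
    """Incongruent minus congruent, sample by sample."""
    return [i - c for i, c in zip(incongruent_erp, congruent_erp)]


# A 200-ms baseline at 500 Hz spans 100 samples.
n_baseline_samples = BASELINE_MS * SAMPLING_RATE_HZ // 1000

# Toy epochs (made-up values; only 2 baseline samples each for brevity).
congruent_epochs = [[1.0, 1.0, 2.0, 3.0], [3.0, 3.0, 4.0, 5.0]]
incongruent_epochs = [[0.0, 0.0, 0.0, 1.0], [2.0, 2.0, 2.0, 3.0]]
toy_difference = difference_wave(
    average_erp(incongruent_epochs, 2), average_erp(congruent_epochs, 2)
)
```

In this toy case the difference wave is zero during the baseline and negative afterward, which is the sign convention used for the incongruency effects throughout.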

Inspection of the grand-average waveforms for incongruent and congruent trials and their difference waves revealed main effects similar to those previously reported in the closely parallel visual Stroop study from our group (Liotti et al., 2000). More specifically, the incongruent stimuli elicited an enhanced central negativity, followed by a sustained posterior positivity (SP). However, here in the auditory modality the conflict-related negativity started and peaked substantially earlier (around 150 ms earlier) than the analogous visual-conflict Stroop response (previously called the N450, due to its peak latency). Accordingly, for greater generality across the modalities, we will term this effect the Ninc, for incongruency negativity. In order to compare the present auditory-conflict results with the visual ones, the Ninc effect was analyzed in a similar manner to that in the Liotti et al. study, employing 100-ms consecutive time windows for the three response conditions and sampling from four adjacent midline electrodes (Fcz, Cz, Pzs, and Pzi), as well as four immediately adjacent sites to the left and right of this midline. Similarly, as in Liotti et al., the later SP effect (500–800 ms) was analyzed with consecutive 100-ms time windows for each response condition, sampling from three pairs of sensors over frontal (F3a–F4a), central (C5a–C6a), and parietal (P3i–P4i) scalp sites.

Statistical analysis of the ERP data employed repeated-measures analyses of variance (ANOVAs) on mean amplitude values for each 100-ms time window. For the early Ninc analyses, the factors were laterality (left, midline, right), anterior–posterior location (frontal, parietal), and trial type (congruent vs. incongruent), conducted separately for each behavioral response mode. For the late SP analysis, three-way ANOVAs were conducted for each response modality in each time window, with factors being hemisphere (left vs. right), anterior–posterior location (frontal, central, parietal), and trial type (congruent vs. incongruent).
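As an illustration of the dependent measure, the following sketch computes mean amplitudes in consecutive 100-ms windows. It assumes an averaged waveform that starts at −200 ms and is sampled at 500 Hz (2 ms per sample), and it treats each window as half-open (start included, end excluded); that boundary convention, the function name, and the toy waveform are assumptions made for the example.

```python
from statistics import mean


def window_mean(erp, start_ms, end_ms, epoch_start_ms=-200, srate_hz=500):
    """Mean amplitude of erp over [start_ms, end_ms) relative to stimulus onset."""
    first = (start_ms - epoch_start_ms) * srate_hz // 1000
    last = (end_ms - epoch_start_ms) * srate_hz // 1000
    return mean(erp[first:last])


# Analysis windows named in the text.
NINC_WINDOWS = [(200, 300), (300, 400), (400, 500)]
SP_WINDOWS = [(500, 600), (600, 700), (700, 800)]

# Toy waveform whose value at each sample equals its latency in ms
# (-200, -198, ..., 798), so the window means are easy to check by hand.
toy_erp = [-200 + 2 * i for i in range(500)]
ninc_means = [window_mean(toy_erp, a, b) for a, b in NINC_WINDOWS]
sp_means = [window_mean(toy_erp, a, b) for a, b in SP_WINDOWS]
```

One such value per participant, electrode, and trial type would then be entered into the ANOVAs described above.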

We also directly analyzed the incongruency effects to see whether they differed as a function of behavioral response mode. In particular, we conducted additional ANOVAs of the amplitudes of the Ninc and SP activity derived from the incongruency difference waves (incongruent minus congruent), with response mode as the factor. These analyses were done for all the aforementioned time windows, focusing on the midline sites for the Ninc effect.

Our final analysis sought to test a specific hypothesis concerning possible differences in distribution of the Ninc incongruency effect between the verbal (overt and covert) response mode conditions, as compared with the manual response mode, parallel to that seen in Liotti et al. (2000) for the N450 (that study, as mentioned above, observed a more posterior distribution for the manual response mode). To do this, we took the incongruency effects from the incongruency difference waves (incongruent minus congruent) for the three response mode conditions. We then subjected these values to a 2 × 2 ANOVA (response mode: manual vs. verbal [covert and overt, collapsed] × anterior–posterior location [two anterior channels vs. two posterior channels]) to determine the presence of an interaction between these factors. We also performed this analysis with the normalized amplitude values (McCarthy & Wood, 1985) to ensure that any effects we were observing were not being confounded by overall differences in main effect amplitudes. We also conducted an analogous analysis for the maximal peak of the SP effect to determine whether any such distributional by congruency effects were unique to the early Ninc effect or persisted throughout the entire trial.
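The logic of amplitude normalization can be shown with a short sketch. The text does not state which scaling variant was applied, so the example assumes vector-length scaling, in which each condition's amplitudes across electrodes are divided by the root sum of squares; the function name and the toy topographies are hypothetical.

```python
import math


def vector_scale(amplitudes):
    """Divide each electrode's amplitude by the vector length across electrodes.

    Assumes at least one nonzero amplitude (otherwise the length is zero).
    """
    length = math.sqrt(sum(a * a for a in amplitudes))
    return [a / length for a in amplitudes]


# Two made-up topographies with the same shape but different overall size.
topo_small = [1.0, 2.0, 2.0]
topo_large = [2.0, 4.0, 4.0]
scaled_small = vector_scale(topo_small)
scaled_large = vector_scale(topo_large)
```

After scaling, the two toy topographies are identical, so a condition × electrode interaction that survives normalization cannot be attributed to a difference in overall amplitude alone.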

For all analyses, significance was set at p < .05, and the degrees of freedom were appropriately adjusted with the Greenhouse–Geisser epsilon method, with epsilon values reported.
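For reference, the Greenhouse–Geisser epsilon can be computed from the double-centered covariance matrix of the repeated measures, as sketched below. The sketch assumes a participants × conditions table of scores with non-constant condition differences; the function name and the toy data are hypothetical. For a factor with only two levels, epsilon necessarily equals 1, so no adjustment occurs; for k levels it lies between 1/(k − 1) and 1.

```python
def gg_epsilon(data):
    """Greenhouse-Geisser epsilon for a participants x conditions table."""
    n, k = len(data), len(data[0])
    col_means = [sum(row[j] for row in data) / n for j in range(k)]
    # Sample covariance matrix of the k conditions.
    cov = [
        [
            sum((row[i] - col_means[i]) * (row[j] - col_means[j]) for row in data)
            / (n - 1)
            for j in range(k)
        ]
        for i in range(k)
    ]
    # Double-center: remove row means and column means, add back the grand mean.
    row_means = [sum(cov[i]) / k for i in range(k)]
    grand = sum(row_means) / k
    dc = [
        [cov[i][j] - row_means[i] - row_means[j] + grand for j in range(k)]
        for i in range(k)
    ]
    trace = sum(dc[i][i] for i in range(k))
    sum_sq = sum(v * v for r in dc for v in r)
    return trace ** 2 / ((k - 1) * sum_sq)


# Made-up scores: 4 participants, 2 conditions and 3 conditions.
two_level = [[1.0, 2.0], [2.0, 4.0], [3.0, 3.0], [5.0, 7.0]]
three_level = [[1.0, 2.0, 4.0], [2.0, 3.0, 7.0], [3.0, 5.0, 6.0], [4.0, 4.0, 9.0]]
eps_two = gg_epsilon(two_level)
eps_three = gg_epsilon(three_level)
```

The adjusted test multiplies both the numerator and denominator degrees of freedom by epsilon before the p value is looked up.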

Sequential analyses

Behavior

For the conditions in which there were measurable behavioral responses (manual and overt), the data were additionally analyzed to examine any sequential behavioral effects of conflict monitoring or conflict adaptation (Gratton et al., 1992; Larson et al., 2009). In particular, behavioral differences among the conditions were examined for when an incongruent trial was preceded by a congruent versus an incongruent trial (CI vs. II) and for when a congruent was preceded by a congruent versus an incongruent trial (CC vs. IC). A repeated-measures ANOVA was conducted for these behavioral data in the manual and overt conditions using the two factors of current trial (congruent vs. incongruent) and previous trial (congruent vs. incongruent).
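The sequential coding can be sketched as follows. Each trial from the second one onward is labeled by the previous and current trial types (CC, CI, IC, II), and the current × previous interaction is summarized as the interference effect after a congruent trial minus the interference effect after an incongruent trial. The function names and the RT values are hypothetical, and handling of block boundaries and error trials is omitted.

```python
from statistics import mean


def label_sequences(trial_types):
    """Label trial n (n >= 2) as previous type + current type, e.g. 'CI'."""
    return [prev + cur for prev, cur in zip(trial_types, trial_types[1:])]


def cell_means(trial_types, rts):
    """Mean RT for each of the four sequence cells; the first trial is skipped."""
    cells = {"CC": [], "CI": [], "IC": [], "II": []}
    for label, rt in zip(label_sequences(trial_types), rts[1:]):
        cells[label].append(rt)
    return {label: mean(values) for label, values in cells.items()}


def sequence_interaction(means):
    """(CI - CC) - (II - IC): positive when interference shrinks after conflict."""
    return (means["CI"] - means["CC"]) - (means["II"] - means["IC"])


# Made-up trial sequence and RTs in ms.
toy_types = ["C", "I", "I", "C", "C", "I"]
toy_rts = [400, 500, 460, 430, 410, 510]
toy_means = cell_means(toy_types, toy_rts)
toy_effect = sequence_interaction(toy_means)
```

In the toy data the interference effect is 95 ms after a congruent trial and 30 ms after an incongruent trial, giving an interaction contrast of 65 ms.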

ERP analysis

For the ERPs, analogous sequential analyses were performed on the two main incongruency-related ERP effects, separately for each of the three behavioral response conditions. More specifically, incongruency-related sequence variations were calculated for both the early negative-polarity effect, Ninc (200–500 ms), and the longer-latency positive-polarity effect, SP (500–800 ms). These were obtained at a set of posterior sites (P3i–P4i) and at a set of frontal sites (F3a–F4a). Repeated-measures ANOVAs were conducted for each scalp region, with factors being trial position (current vs. previous trial) and trial type.

Results

Behavior

The accuracy analysis (which could be performed in the manual condition only) showed that participants were significantly more accurate for congruent than for incongruent trials (error rates: 0.9% vs. 5.9%), t(10) = 4.36, p = .001. The RT analysis, which could be performed in both the manual and overt conditions, revealed a main effect of behavioral response mode, F(1, 10) = 177.5, p < .001, with the overt verbal condition being slower than the manual condition, as well as a main effect of trial type, F(1, 10) = 68.4, p < .001, with the expected response slowing for incongruent trials relative to congruent trials (Fig. 1). However, no response condition × trial type interaction was observed (p > .05).
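For readers less familiar with the notation, a paired t test of the kind reported for accuracy can be sketched as below. With 11 participants the test has 10 degrees of freedom, which is the t(10) given above. The function name and the error rates in the example are made up for illustration and are not the study data.

```python
import math
from statistics import mean, stdev


def paired_t(cond_a, cond_b):
    """Paired-samples t statistic and degrees of freedom for a minus b."""
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Made-up error rates (%) for three hypothetical participants.
toy_incongruent = [5.0, 7.0, 9.0]
toy_congruent = [4.0, 5.0, 6.0]
t_value, df = paired_t(toy_incongruent, toy_congruent)
```

For a within-participant factor with two levels, the square of this t statistic equals the F(1, n − 1) statistic of the corresponding repeated-measures ANOVA main effect.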

Fig. 1 Response times (RTs). Longer RTs were observed in the overt condition (i.e., verbal), as compared with the manual condition, and for both response mode conditions participants were slower on incongruent than on congruent trials. Error bars represent standard error of the mean (SEM) values

EEG

Analogous to the findings in Liotti et al. (2000) for the visual Stroop effect, the auditory EEG data here revealed both a significant early (200–500 ms) effect (Fig. 2) and a later (500–800 ms) effect (Fig. 3) of incongruency. As can be seen in both figures, for all behavioral response conditions, the early effect was a centrally distributed negativity (which we are terming “Ninc” here), with incongruent being more negative than congruent, and the later effect was a positivity with a superior–posterior distribution (the SP), with incongruent being more positive than congruent. These effects are discussed in detail below.

Fig. 2 Early negative-polarity incongruency effect (Ninc). a Time-locked ERP waveforms for the incongruent and congruent conditions (overt congruent, 1,644 sums; overt incongruent, 1,751 sums; covert congruent, 1,754 sums; covert incongruent, 1,836 sums; manual congruent, 1,925 sums; manual incongruent, 1,991 sums). The early negative-wave incongruency effects emerged starting at around 200 ms in all conditions. (Each tick mark represents 100 ms for the traces.) b Topographic distributions for incongruent minus congruent trials for overt, covert, and manual conditions from 200 to 300 ms and from 400 to 500 ms. In the covert and overt conditions, by 400–500 ms, the negativity shifted to a more anterior (and somewhat right-lateralized) location, while the effect was basically gone in the manual condition by this time. In addition, the initial negative-polarity effect in the manual condition was distributed more posteriorly than in the other conditions

Fig. 3 Late posterior-positivity incongruency effect (SP): Traces showing the late parietal differences between incongruent and congruent trials over site POz for the overt, covert, and manual conditions. (Sums shown here are the same as reported in Fig. 2.) Topographic distributions show the difference of incongruent minus congruent trials from 600 to 700 ms

Ninc (200–500 ms)

Table 1 shows the main ANOVA results. Paralleling the analyses of the analogous visual Stroop study (Liotti et al., 2000), direct tests were done in the individual response-mode conditions. In each of these conditions, there was a significant effect of trial type in the 200- to 300-ms time window (Fig. 2), with significantly greater central negativity for incongruent trials, relative to the congruent trials, for the covert [F(1, 10) = 14.89, p = .003, ε = 1], manual [F(1, 10) = 4.95, p = .05, ε = 1], and overt [F(1, 10) = 7.94, p = .02, ε = 1] conditions. The incongruency effect persisted into the 300- to 400-ms time window for the overt and covert conditions [overt, F(1, 10) = 5.63, p = .04, ε = 1; covert, F(1, 10) = 4.95, p = .05, ε = 1], while only the covert condition yielded a significant incongruency effect into the 400- to 500-ms time window [F(1, 10) = 8.6, p = .02, ε = 1]. Figure 2 shows the traces and topographic maps of this early negative-polarity ERP effect across the three different response modes.

Table 1 Summary of results from ANOVAs conducted for Ninc effect from 200 to 500 ms

To test whether the incongruency effects varied as a function of response mode, we took the difference waves (incongruent minus congruent) for each of the three response modes and analyzed these values in each time window with a one-way repeated-measures ANOVA with response mode (three levels: covert, overt, and manual) as a factor. The only significant effect that occurred during any time period was between 300 and 400 ms, F(2, 20) = 6.28, p = .02, ε = 0.65, with the manual condition showing a diminished incongruency effect during this time period.

Finally, to test our specific prediction of a differential response distribution as a function of modality (i.e., manual more posterior than verbal) based on the distributional differences observed in the earlier visual Stroop study (Liotti et al., 2000), we conducted a 2 × 2 ANOVA (manual vs. verbal [covert and overt collapsed] × anterior vs. posterior electrode location) on the difference waves (incongruent minus congruent) for the early Ninc effect (200–300 ms, the window in which the Ninc was significantly present for all the response modes). This revealed a robust significant interaction, F(1, 10) = 9.30, p = .01, ε = 1.0, due to the presence of a more posterior distribution of this early incongruency effect for the manual response-mode condition, as compared with the verbal ones, similar to that observed in the parallel visual Stroop task. To eliminate any possible contribution due to overall amplitude differences, we also ran this analysis on the data after amplitude normalization (McCarthy & Wood, 1985), which still gave a highly significant interaction effect, F(1, 10) = 11.17, p = .007, ε = 1.0.

Late SP (500–800 ms)

From 500 to 800 ms, there was a superior-posterior positivity that was greater for incongruent than for congruent trials, corresponding to what has been referred to as a late sustained positivity (SP). Concurrently, however, there was also an anterior (frontal) negativity that was greater for incongruent than for congruent trials. The maximum of both of these effects occurred around 600–700 ms (Fig. 3). Because it is not clear whether these effects reflected opposite sides of a dipole (or of a set of dipoles), they were analyzed both together and separately. To analyze both together, the 500- to 800-ms time period was divided into 100-ms bins for each response mode, and these periods were analyzed for main effects of trial type (congruent vs. incongruent), trial type × site (anterior, central, posterior) interactions, and trial type × hemisphere (left vs. right) interactions.

These analyses indicated that no time window for any of the response-mode conditions had a significant main effect of trial type, likely due to the reversal in polarity across the scalp (i.e., the frontal negativity with the posterior positivity); however, for all three response-mode conditions, all three windows had a significant trial type × site interaction (Table 2), presumably directly reflecting this frontal-parietal polarity inversion. Separate analyses for the frontal and parietal sites were also performed. For the frontal sites, these analyses indicated that in the 500- to 600-ms time window, there was a significant difference between congruent and incongruent activity for the covert condition [t(10) = 2.67, p = .02, incongruent more negative than congruent]. In the 600- to 700-ms time window, there were significant differences between congruent and incongruent activity for the covert condition for frontal sites [t(10) = 2.89, p = .02, more negative for incongruent], as well as for parietal ones [t(10) = 2.26, p = .05, more positive for incongruent]. The overt condition also revealed similar patterns of significant effects of incongruency for the frontal, t(10) = 2.39, p = .04, and parietal, t(10) = 3.63, p = .005, sites, as did the manual condition [frontal, t(10) = 2.25, p = .05; parietal, t(10) = 3.13, p = .01]. In the 700- to 800-ms time period, the covert condition still showed the same significant pattern of incongruency effects at frontal, t(10) = 2.99, p = .01, and parietal, t(10) = 2.45, p = .03, sites. The significant difference between congruent and incongruent trial types persisted in the overt condition at parietal sites during this late time period, t(10) = 4.73, p = .001. These effects reflect sustained differential processing for the incongruent, as compared with the congruent, conditions across all response modes that continued even after (in the case of the overt and manual modalities) a response had been made.

Table 2 Summary of results from ANOVAs conducted for SP (sustained late positivity) effect from 500 to 800 ms

For the analyses that included response mode as a factor on the SP incongruency effect (incongruent minus congruent activity), there were no main effects of this factor across any of the 100-ms time windows for the SP, nor did it interact with the anterior–posterior distribution factor.

Finally, from 600 to 700 ms, analogous to our Ninc analyses, we collapsed across the verbal conditions (covert and overt) and ran a 2 (verbal vs. manual) × 3 (anterior, central, posterior) ANOVA on the difference waves to determine whether these two factors interacted significantly. For neither the unnormalized data nor the normalized set (McCarthy & Wood, 1985) were there significant interactions.

Sequential analysis

Behavior

The ANOVA for the RT measures for both the manual and overt response mode conditions (the two conditions in which RTs could be measured) revealed a significant interaction of current trial and previous trial [manual, F(1, 10) = 42.3, p < .001; overt, F(1, 10) = 32.0, p < .001; see Fig. 4]. Specific t-test contrasts revealed that participants were slower on an incongruent trial when it was preceded by a congruent trial than when preceded by an incongruent trial, for both the manual, t(10) = 5.3, p < .001, and the overt, t(10) = 6.50, p < .001, conditions. Additionally, participants were significantly faster to respond to a congruent trial when it was preceded by a congruent trial than when it was preceded by an incongruent trial [manual, t(10) = 3.40, p < .01; overt, t(10) = 2.88, p = .02]. Together, these two findings both contributed to a reduced RT interference effect following an incongruent trial, as compared with following a congruent trial.

Fig. 4

Sequential behavioral effects. a Response times in the manual response condition for the incongruent (I) and congruent (C) trials as a function of the previous trial type. Participants lengthened their response times for incongruent trials when they followed a congruent trial, as compared with following another incongruent trial. b A similar sequential lengthening pattern was observed in the overt response condition. Error bars represent SEM values

The sequential analyses above were for all trials, including those in which the required response was repeated. To rule out the possibility that the sequential effects were explained by repetition of the last response (e.g., two “high” responses in a row), rather than being accounted for, at least partially, by cognitive adjustment (Funes, Lupiáñez, & Humphreys, 2010; Hommel, Proctor, & Vu, 2004; Mayr et al., 2003), the sequential analyses were replicated after exclusion of trials with subsequent repeated responses. In such further analyses, the current-trial × previous-trial interaction was still present for the manual condition, F(1, 10) = 6.70, p = .03; however, it was no longer significant for the overt verbal condition, F(1, 10) = 1.27, p = .29.
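The sequential analysis described above can be illustrated with a short sketch: each trial is labeled by the congruency of the previous and the current trial (CC, CI, IC, II), trials that repeat the previous response can optionally be dropped, and the conflict adaptation contrast is computed from the four cell means. The data structure, function names, and RT values are hypothetical. Other exclusion criteria (e.g., error trials) are not modeled here.

```python
from collections import defaultdict
from statistics import mean


def sequential_cell_means(trials, drop_response_repeats=False):
    """Mean RT per previous/current congruency pair (CC, CI, IC, II).

    `trials` is a list of dicts in presentation order with the keys
    'congruent' (bool), 'response' (str), and 'rt' (ms).
    The first letter of each label is the previous trial, the second the current one.
    """
    cells = defaultdict(list)
    for prev, cur in zip(trials, trials[1:]):
        if drop_response_repeats and prev["response"] == cur["response"]:
            continue  # skip trials that repeat the preceding response
        label = ("C" if prev["congruent"] else "I") + ("C" if cur["congruent"] else "I")
        cells[label].append(cur["rt"])
    return {label: mean(rts) for label, rts in cells.items()}


def conflict_adaptation(cell_means):
    """Interference after congruent trials minus interference after incongruent trials."""
    return (cell_means["CI"] - cell_means["CC"]) - (cell_means["II"] - cell_means["IC"])


# Hypothetical trial sequence; not data from the study.
trials = [
    {"congruent": True, "response": "high", "rt": 500.0},
    {"congruent": True, "response": "low", "rt": 480.0},
    {"congruent": False, "response": "high", "rt": 620.0},
    {"congruent": False, "response": "low", "rt": 570.0},
    {"congruent": True, "response": "high", "rt": 510.0},
    {"congruent": True, "response": "high", "rt": 470.0},
    {"congruent": False, "response": "low", "rt": 630.0},
    {"congruent": False, "response": "low", "rt": 560.0},
    {"congruent": True, "response": "high", "rt": 520.0},
]

effect_all = conflict_adaptation(sequential_cell_means(trials))
effect_no_repeats = conflict_adaptation(
    sequential_cell_means(trials, drop_response_repeats=True)
)
```

A positive contrast means that the RT interference effect was smaller after incongruent than after congruent trials. Removing response repetitions leaves fewer trials per cell, which is one reason the contrast can become less reliable.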

ERPs

During the time period of the Ninc (200–500 ms), there were no significant electrophysiological sequential interactions observed. During the later SP period (500–800 ms), however, there was a significant current-trial × previous-trial interaction at the posterior sites for the manual condition, F(1, 10) = 29.04, p < .001, ε = 1, as well as for the covert, F(1, 10) = 33.19, p < .001, ε = 1, and overt, F(1, 10) = 7.15, p < .02, ε = 1, conditions (see Fig. 5). In contrast, the interaction was not significant at the frontal sites for any response condition (all ps > .1). Specific t-test contrasts revealed that in the manual condition, the incongruent trials had a greater posterior positivity when they were preceded by a congruent trial than when preceded by an incongruent trial [t(10) = 5.12, p < .001], whereas the congruent trials showed a greater late posterior positivity when they were preceded by an incongruent trial versus by a congruent trial [t(10) = 2.61, p = .03]. Likewise, in the covert condition, the incongruent trials had a greater positivity when they were preceded by a congruent trial than when preceded by an incongruent trial [t(10) = 4.31, p < .005]; in the overt condition, this difference was in the same direction but did not reach significance [t(10) = 1.46, p < .17]. For both the covert and overt conditions, the congruent trials had a greater late posterior positivity when they were preceded by an incongruent than when preceded by a congruent trial [covert, t(10) = 3.82, p < .005; overt, t(10) = 2.38, p = .04].

Fig. 5

Sequential EEG effects. a Differential responses to incongruent trials when they were preceded by a congruent versus an incongruent trial (CI versus II). A greater late (500–800 ms) posterior positivity was elicited for the former case, visible in both the main traces and the difference waves (second column). Topographic distributions of the CI minus II differences are shown for 600–700 ms for the overt (1,164 sums), covert (1,292 sums), and manual (1,376 sums) conditions. b Differential responses to congruent trials (difference waves) when they were preceded by incongruent (IC), as compared with congruent (CC), trials for the overt (964 sums), covert (1,142 sums), and manual (1,217 sums) conditions. A greater late (500–800 ms) posterior positivity was seen for IC trials, as compared with CC trials, shown in the first column of traces and in the difference waves in the second column. Topographic distributions of the IC minus CC differences are shown for 600–700 ms. c Bar graphs for the manual condition comparing differences in response times (RTs) with differences in neural activity. The left graph shows the differences of the mean amplitude for IC minus CC and CI minus II from 600 to 700 ms averaged over sites POz, PO1, and PO2. The right graph shows the differences in RTs for these conditions, respectively. Here, the neural activity and RTs paralleled each other in the relative size of the differences. Error bars represent SEM values

In sum, over all the response-mode conditions, there was differential posterior positivity for an incongruent trial when it was preceded by a congruent versus by another incongruent trial, suggesting that a form of conflict monitoring is occurring with this longer-latency effect. Further supporting this theory is the differential activity pattern observed in all the response-mode conditions for the current congruent trial as a function of the incongruency versus congruency of the previous trial.
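Because the current-trial × previous-trial interactions reported above have a single numerator degree of freedom, each is equivalent to a one-sample t test on a per-participant interaction contrast, with F(1, n − 1) = t². The sketch below illustrates this relationship using hypothetical contrast values for 11 participants. The sample size is an assumption inferred from the reported error degrees of freedom of 10, and the numbers are not data from the study.

```python
import math
from statistics import mean, stdev


def interaction_t(contrasts):
    """One-sample t statistic testing whether the mean contrast differs from zero."""
    n = len(contrasts)
    return mean(contrasts) / (stdev(contrasts) / math.sqrt(n))


# Hypothetical per-participant contrasts, (CI - CC) - (II - IC);
# illustrative values only, one per participant.
contrasts = [40.0, 55.0, 20.0, 65.0, 35.0, 50.0, 10.0, 45.0, 60.0, 30.0, 25.0]

t_value = interaction_t(contrasts)
f_value = t_value ** 2          # equals F(1, n - 1) for the 2 x 2 interaction
df_error = len(contrasts) - 1   # error degrees of freedom
```

Read in the other direction, a reported F(1, 10) = 29.04 corresponds to |t(10)| ≈ 5.39 on the interaction contrast.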

ERP–behavior correlations

Since the brain activity on the current trial did appear to be affected by the nature of the previous trial type, particularly in the manual condition, we examined whether differences in brain activity on the previous trial could predict differences in behavior on a current trial in that condition. Specifically, we examined whether any of the ERP sequential effects were significantly associated with the differences in RT for II trials (incongruent trial preceded by an incongruent trial) versus CI trials (incongruent trials preceded by a congruent trial). The differential activity from 500 to 800 ms for incongruent, as compared with congruent, trials (when preceding an incongruent trial) was calculated for sites POz, Pz, and an adjacent channel on either side (left and right), creating a region of interest that encompassed the center of the posterior effects noted above. This previous-trial neural response difference in the manual condition did, indeed, significantly correlate across subjects with the current-trial RT (r = −.72, p = .01), in that participants who had a greater difference in activity on the SP component of the previous trial had a greater RT difference for incongruent current trials (i.e., on the II trials vs. CI trials; see Fig. 6). In contrast, there was no analogous behavior/brain-activity correlation present for congruent current trials (IC versus CC trials). Thus, the behavioral sequential effect observed for the latter contrast presumably comes by way of a differential neural mechanism.

Fig. 6

Brain–behavior correlation, manual condition. The x-axis shows the difference in neural activity for II minus CI trials in microvolts averaged across sites POz, Oz, PO1, and PO2 for 500–800 ms. The y-axis shows the difference in response time for II trials minus CI trials in milliseconds. Differential activity in previous trial correlates with differential response times across participants on the current trial when the current trial is incongruent
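As a worked illustration of the across-participant correlation reported above, the sketch below computes a Pearson coefficient and converts it to a t statistic with n − 2 degrees of freedom. The function names are hypothetical, and the sample size of 11 is an assumption inferred from the t(10) values reported elsewhere in the Results.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_to_t(r, n):
    """t statistic (df = n - 2) for testing a Pearson correlation against zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Reported coefficient with the assumed sample size of 11 participants.
t_reported = r_to_t(-0.72, 11)  # approximately -3.11 with 9 degrees of freedom
```

Under that assumption, a t of about −3.11 on 9 degrees of freedom is in line with the reported two-tailed significance level of about .01.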

Discussion

The aim of the present study was to explore the temporal dynamics of cognitive control and conflict adaptation in an auditory version of the classic visual Stroop task. As in a previous ERP study from our group employing the visual Stroop task (Liotti et al., 2000), behavioral performance and neural activation patterns during the task were measured using three different behavioral response modes (covert, overt, and manual). Behaviorally, responses were less accurate and slower for incongruent than for congruent trials for all response modes, reproducing the expected behavioral interference effects associated with stimulus conflict, here in the auditory modality. The event-related brain activity elicited by the incongruent trials, relative to the congruent ones, revealed two distinct effects: an early negative-polarity effect (Ninc) and a later positive-polarity effect (SP). Consistent with our main hypothesis, these effects were similar in temporal order and scalp topography to those observed for the classic Stroop task in the visual modality, although the early negative-polarity effect occurred earlier in the auditory version. The parallels to the visual Stroop responses also included a similar influence of response modality (cf. Liotti et al., 2000) on the topographic distribution of the early negative-polarity effect, with the Ninc effect in the manual response mode condition being more posteriorly distributed than in the verbal response mode ones. (See Supplemental Fig. 1 for the respective timing and distributions of the incongruency effects for the auditory modality from the present study and analogous effects for the visual modality in Liotti et al., 2000.)

Additionally, behavioral sequential trial effects (i.e., conflict adaptation effects) were found that were similar to those previously reported in the visual modality, with behavioral adjustments following high conflict that differentially affected subsequent congruent versus incongruent trials (e.g., Botvinick et al., 2001; Egner, Delano, & Hirsch, 2007; Kerns et al., 2004; Larson et al., 2009). In addition, corresponding sequential ERP effects were observed here that paralleled what has recently been reported in the visual modality (Larson et al., 2009). The similarity with conflict adaptation effects observed in the visual modality suggests that they likely reflect higher order, supramodal aspects of conflict processing and adjustment. Finally, we found that differences across participants in ERP activity on the previous trial predicted differences in RT on the current trial, further supporting the view that ongoing stimulus conflict processing and cognitive control are at play in a supramodal manner across stimulus modalities.

The incongruency-related Ninc (200–500 ms)

The earliest ERP effect to differentiate incongruent from congruent auditory stimuli was a negative-polarity difference (Ninc) that showed a central topographic distribution similar to that of the “N450” reported in a number of visual Stroop studies (Liotti et al., 2000; Perlstein, Larson, Dotson, & Kelly, 2006; West & Alain, 1999) and previously associated using source modeling with the ACC (Liotti et al., 2000; Szucs, Soltesz, & White, 2009; West, 2003). Interestingly, however, its onset in the present study was about 150 ms earlier than that for the visual Stroop N450 (200 vs. 350 ms; see, e.g., Liotti et al., 2000). A potential explanation for the earlier latency for auditory conflict detection may be that the auditory system has a shorter latency from stimulus onset to processing in the primary sensory cortex than does the visual system (15–20 ms in audition vs. 40–50 ms in vision; see Hillyard, 1993, for a review). Furthermore, considering that the onset and peak of the auditory-conflict Ninc were a full 150 ms earlier than those of its counterpart in the visual modality, it is possible that subsequent processing of auditory stimuli beyond the primary auditory cortex and before reaching anterior control-monitoring systems may be more rapid than the corresponding processing cascade that visual stimuli undergo in the extrastriate visual pathways beyond the striate cortex.

On the other hand, a shortening of the auditory Ninc by 150 ms, relative to the visual N450, seems rather large to derive solely from the somewhat speedier sensory processing in the auditory modality. Thus, there may be other modality-specific processing factors at play that contribute to this between-modality difference. Another possibility might be that the early negativity in the present study is, at least in part, a modulation of the earlier-latency, anterior N2 wave (or N2c; Gehring, Gratton, Coles, & Donchin, 1992), a component that has also been associated with conflict monitoring. For example, studies of the Eriksen flanker task and go/no-go tasks have observed this waveform, which has also been linked to the ACC through source modeling (Ridderinkhof, Ullsperger, Crone, & Nieuwenhuis, 2004; van Veen & Carter, 2002). These negative-polarity effects have also been proposed to be related to the error-related negativity (Gehring et al., 1992) as reflecting different levels of conflict (Yeung & Cohen, 2006). In the present study, we were able to analyze accuracy only in the manual-response mode, and, given the very high level of accuracy in that condition (i.e., there were relatively few errors), we were not able to evaluate this possible relationship here.

A related possibility to be considered, however, is that a two-choice RT task, such as those used in both flanker and go/no-go tasks, as well as in the present auditory Stroop study, may entail faster conflict detection and/or selection than does the four-choice RT task typically used in the visual Stroop task, resulting in an early modulation at the stage of the N2, rather than being reflected as an N450 effect. Assuming that the speed of processing from sensory input to conflict detection circuits is actually relatively similar in the auditory and visual modalities, this hypothesis would predict that the onset of the Ninc for an auditory Stroop task with a four-choice format would be substantially later, more like that of the visual-conflict N450, or that the incongruency effect in a two-choice visual Stroop task would be reflected at the earlier latencies reported here. On the other hand, it is very possible that there are both modality-specific influences and response-choice-number influences on the timing of incongruency detection effects, as well as interactions between these two factors. These are possibilities that will be important to explore in future studies.

Importantly, while the auditory Ninc effect occurred earlier than the corresponding N450 incongruency effect in the visual Stroop task, the scalp topographies appeared to be rather similar, including their variation as a function of behavioral response mode (cf. Liotti et al., 2000). In particular, in both sensory modalities, the effect had a fronto-central distribution in the covert and overt response conditions but a more posterior, parietal distribution in the manual condition (see Fig. 3B and Supplemental Fig. 1). This suggests that, while the speed of conflict detection and processing may be modulated by modality-related factors in the two versions of the Stroop task, the neural activity associated with response conflict/selection may operate similarly across modalities once it occurs.

The late SP (500–800 ms)

The later main ERP effect of auditory stimulus conflict in the present study manifested as an enhanced sustained positive-polarity wave over the posterior scalp spanning from 500 to 800 ms, with differential activity of opposite polarity over the anterior frontal scalp during some of that time period. This auditory conflict-related ERP effect appears highly similar in timing and scalp distribution to the analogous sustained potential conflict effect (SP; e.g., Larson et al., 2009; Liotti et al., 2000; West, 2003; West & Alain, 1999) previously observed in visual Stroop tasks.

Interestingly, the onset of the late SP did not appear to differ in timing for the auditory Stroop task (present study), relative to visual Stroop tasks (Liotti et al., 2000; West & Alain, 1999), in clear contrast to the earlier latency of the auditory Ninc relative to the visual N450, suggesting that the late SP effect may have a more fully supramodal character, a distinction that could be made here due to the high temporal resolution of the electrophysiological recordings. In addition, the auditory conflict SP tended to be similar in both timing and distribution across all the behavioral response modes (manual, covert, and overt), further supporting its independence from the response mode, and again in contrast to the earlier-latency Ninc effect. Moreover, its presence for the covert condition rules out the possibility that it reflects the processing of an overt verbal or motor response. Furthermore, the SP onset was approximately the same for the overt verbal and manual conditions, in spite of significantly longer RTs (by more than 150 ms) for the former.

Previous studies reporting the late SP have speculated that, given its proximity to Wernicke’s area and its tendency to be left-lateralized (see also West & Alain, 1999), it may reflect the need for further semantic processing of word meaning by posterior word-processing areas (Liotti et al., 2000). In contrast, West and Alain (1999) proposed that the late SP might index the processing of perceptual-level color information on trials when conceptual information cannot guide a response. In either case, extra processing of the incongruent word by posterior brain areas would facilitate conflict resolution. Therefore, the late SP might reflect controlled processes that adjust to the level of control necessary to accurately complete a trial and/or to better direct attention in a subsequent trial (Larson et al., 2009), since it often lasts beyond the behavioral-response window. West (2003) provided a three-dipole source localization with generators in bilateral middle/inferior frontal gyri and left extrastriate cortex, potentially suggesting that this component reflects an interaction between frontal and sensory regions as part of an attentional/cognitive control process (e.g., Appelbaum, Smith, Boehler, Chen, & Woldorff, 2011; Grent-'t-Jong & Woldorff, 2007; Wu, Weissman, Roberts, & Woldorff, 2007).

A different functional interpretation and generator for the late conflict SP have been suggested by a recent fMRI study of the auditory Stroop task, which revealed incongruency-specific activity in the left inferior parietal region (Roberts & Hall, 2008). Similar activations in the left inferior parietal lobule have been reported in earlier neuroimaging studies of the visual Stroop task (Bench et al., 1993; Carter, Mintun, & Cohen, 1995; Leung, Skudlarski, Gatenby, Peterson, & Gore, 2000), as well as by a recent comprehensive meta-analysis of Stroop task studies (Laird et al., 2005). This combined evidence suggests that the late SP may represent activity in parts of a supramodal attentional control system, for example, one for facilitating attention to relevant information while suppressing the influence of distracting information (Weissman, Warner, & Woldorff, 2004).

Sequential effects

Many studies have previously reported sequential effects for visual conflict tasks such as the Stroop (Egner & Hirsch, 2005; Kerns et al., 2004) and Eriksen flanker (Gratton et al., 1992) tasks, with longer RTs on incongruent trials preceded by a congruent versus an incongruent trial and shorter RTs for congruent trials preceded by a congruent versus an incongruent trial. To our knowledge, the present study is the first to show similar conflict adaptation behavioral effects for an auditory version of the Stroop task. Thus, the present study provides evidence that conflict-related cognitive control processes operate similarly across modalities. Notably, upon accounting for response repetition, the current-trial × previous-trial interaction remained significant in the manual response condition, but not in the overt one. Since the overt RTs were measured via a thresholding procedure with a microphone, however, it is possible that we did not have sufficient sensitivity in this condition to pick up on subtle differences in RTs, particularly given the smaller number of trials available per bin when performing a sequential analysis. Further work should be carried out in the auditory domain to investigate the conflict adaptation effects in overt responses.

Moreover, in the present study, these conflict-related sequential effects were reflected in both behavioral measures and neural activity measures, with the caveat that trials with response repetitions were included in this analysis to provide a sufficient signal-to-noise ratio. More specifically, the electrophysiological counterparts of the sequential effects for the present auditory Stroop task indicated that the late SP, but not the Ninc, showed sequential variations that were associated with the behavioral sequential effects. Increased amplitude of the parietal late SP was found for incongruent trials that were preceded by congruent trials, as compared with incongruent trials preceded by incongruent ones. Larson and colleagues (2009) found a similar pattern of sequence-related activity for the late SP in a visual Stroop task, also with no such sequential modulations of the N450. It is important to note that in the present study, the late conflict-related negative-polarity effect over the prefrontal scalp did not exhibit corresponding sequential effects, suggesting that the concurrent prefrontal negativity may arise from generators independent from those of the posterior positive-polarity effect, such as from frontal regions that have been implicated in other aspects of cognitive control (Aron, Behrens, Smith, Frank, & Poldrack, 2007; Botvinick et al., 2001; West, 2003). Because of the quite early onset of the first conflict-related activity observed here (~200 ms), the Ninc may be occurring more automatically (just after the sensory-evoked components) and, therefore, may reflect slightly different processes than what other studies have previously attributed to the ACC in terms of conflict adaptation (e.g., Kerns et al., 2004).

Finally, it is worth pointing out that the late SP modulation with conflict adaptation could be accounted for by the known role of regions of the inferior parietal lobule/sulcus in the implementation of attentional control, as reported in neuroimaging studies of the visual and auditory Stroop tasks (Laird et al., 2005; Roberts & Hall, 2008). The microlevel adjustments in attentional allocation to the relevant (pitch) information after encountering a difficult (incongruent) trial observed here support the supramodal nature of the attentional control system and its role in adaptive behavior.

ERP–behavior correlations

Further evidence in support of a global conflict-processing account that extends to auditory stimuli comes from the ERP–behavior across-subject correlations observed in the manual condition for this study. More specifically, those participants who had a greater difference in the late parietal SP amplitude for the incongruent versus congruent trials that preceded an incongruent trial also had a greater RT difference for the subsequent incongruent trial (i.e., II versus CI). Such a pattern suggests that the larger SPs on incongruent (vs. congruent) trials in these participants reflected the implementation of more cognitive control for the incongruent trial, which would, in turn, have led to enhanced selective attention to the relevant stimulus feature on the next trial. Conversely, if a congruent trial came before an incongruent trial, less cognitive control was implemented (as indicated by a relatively decreased positivity), leading to the subsequent incongruent trial having a relatively long RT due to a lesser level of preparatory activity (Egner & Hirsch, 2005).

While our analyses did not indicate a correlation between the current-trial neural activity and the current-trial behavior, more recent findings have suggested that performance can operate independently of the level of dACC activity under circumstances of high DLPFC activity (Silton et al., 2010). Indeed, those participants who likely had more attentional control through up-regulation of DLPFC also would likely have an up-regulation of frontal-parietal attentional networks in general, potentially reflected electrophysiologically in the SP, and this would benefit them in situations of high conflict. Such an up-regulation of the DLPFC would be in line with the cascade-of-control model (Banich, 2009), with our results showing how this varies across participants.

Conclusion

In the present study, we found that a Stroop stimulus-conflict paradigm implemented in the auditory modality elicited incongruency-related neurophysiological responses similar to those for an analogous paradigm in the visual modality, including showing similar scalp distribution variations as a function of behavioral response mode. However, the early, negative-polarity, conflict-related ERP effect, here termed the Ninc, occurred substantially earlier in time (by about 150 ms) for the auditory Stroop task than has the corresponding N450 in typical visual Stroop studies, suggesting that additional processing delays in secondary or association areas may be occurring for visual conflict tasks, relative to auditory ones, or possibly that the earlier latency results from the simple two-alternative forced-choice nature of this particular task. In contrast, the timing and the distribution of the longer-latency, conflict-related, parietally distributed, positive-polarity effects (SP) were much more similar between the visual and auditory task versions, suggesting that this effect may reflect a more fully supramodal mechanism regulating more controlled aspects of conflict resolution for incongruent stimuli, or perhaps a supramodal mechanism for deploying additional selective attentional resources. Additionally, the present study provides further evidence for the cognitive-control hypothesis, showing both behavioral and neural evidence of conflict adaptation for sequential trials in the auditory modality, while also providing correlations of the behavioral and neural measures across participants.