When an organism learns that a stimulus results in some consequence, it will often generalize that learning to similar novel stimuli (Shepard, 1987). For instance, Watson and Rayner (1920) famously demonstrated, in experiments with Little Albert, that fear associated with a white rat can generalize to other stimuli, such as a rabbit, a fur coat, or even a piece of cotton. Numerous theoretical efforts have focused on explaining and predicting generalization patterns, with varying degrees of success (for reviews, see Ghirlanda & Enquist, 2003; Thomas, 1993). Computational models of discrimination learning, in particular, have proven to be adept at simulating many of the empirically observed generalization patterns (Ghirlanda & Enquist, 1998, 2006; Guillette et al., 2010; Livesey, Pearson, & McLaren, 2005; Saksida, 1999; Shepard, 1990; Staddon & Reid, 1990; Suret & McLaren, 2002; Thomas, 1993). Such models are becoming increasingly useful tools for generating hypotheses about the mechanisms and cues that participants use during learning and generalization. In the present study, we assessed the ability of a neural network model of discrimination learning and generalization to account for recently observed learning-related shifts in generalization during an auditory learning task.

A phenomenon commonly observed after discrimination learning is that the highest levels of responding may occur for stimuli other than those experienced during training. Generally, a peak shift results when an individual is trained to respond to one stimulus (S+) and not to some other stimulus (S−) that varies along a common dimension. When generalization is measured after training, responding is strongest not to the trained S+, but to a stimulus that is shifted along the dimension even further from S−. In a classic example of this, Hanson (1959) trained pigeons to peck a key when presented with a 560-nm light (S+), but not when presented with a 570-nm light (S−). During generalization tests, pigeons responded most strongly to wavelengths other than 560 nm (such as 540 nm) that were farther along the continuum from the trained S−, 570-nm light. A similar phenomenon (area shift) is seen when a generalization gradient shows a peak at the S+ but an asymmetry in responding to stimuli on either side of the S+, such that stimuli on the side away from S− are responded to more frequently. Such learning-related shifts have been well documented experimentally (Baron, 1973; Bizo & McMahon, 2007; Cheng, Spetch, & Johnston, 1997; Derenne, 2010; Galizio & Baron, 1979; Guillette et al., 2010; Hanson, 1959; Landau, 1968; Lewis & Johnston, 1999; Moye & Thomas, 1982; Newlin, Rodgers, & Thomas, 1979; Nicholson & Gray, 1972; Purtle, 1973; Rowe & Skelhorn, 2004; Spetch, Cheng, & Clifford, 2004; Terrace, 1966; Thomas, Mood, Morrison, & Wiertelak, 1991; Verzijden, Etman, van Heijningen, van der Linden, & ten Cate, 2007; Wills & Mackintosh, 1998; Wisniewski, Church, & Mercado, 2009, 2010) and have been replicated using artificial neural networks (Ghirlanda & Enquist, 1998, 2006; Guillette et al., 2010; Livesey et al., 2005; Saksida, 1999; Suret & McLaren, 2002).

Most of the experimental studies of learning-related shifts have focused on the presence, size, or generality of the effects. For instance, several studies have shown that how far a peak in generalization shifts after training depends on the similarity of the stimuli that the subject initially learned to discriminate (Baron, 1973; Ghirlanda & Enquist, 1998; Hanson, 1959; Purtle, 1973; Thomas et al., 1991). Typically, the more similar the stimuli being discriminated are, the further the learner will shift (Baron, 1973; Ghirlanda & Enquist, 1998; Hanson, 1959; Spence, 1937). Shifts occur not only along simple, single dimensions like wavelength (Hanson, 1959) and sine wave frequency (Baron, 1973), but also along more complex acoustic (Guillette et al., 2010; Verzijden et al., 2007; Wisniewski et al., 2009, 2010) and visual (Derenne, 2010; Livesey et al., 2005; Spetch et al., 2004; Suret & McLaren, 2002; Wills & Mackintosh, 1998) dimensions. The ubiquitous nature of learning-related shifts has made it possible to use the effects to identify natural dimensions along which stimuli vary (Guillette et al., 2010; Rowe & Skelhorn, 2004). For instance, Guillette et al. used the peak shift effect to identify a continuum along which chickadees discriminated notes that make up their species’ call; they were also able to use similar generalization shifts in an artificial neural network to identify the acoustic features that defined that continuum.

Most theories of discrimination learning’s impact on generalization have focused on explaining the basic experimental effects. For example, several different associative learning theories (Blough, 1975; McLaren & Mackintosh, 2002; Saksida, 1999; Spence, 1937) adequately explain the direction of shift and the changes to the size of the effect that result from variations in stimulus similarity. Computational models have been used to test how well conflicting associative theories explain peak shift and related generalization phenomena under different conditions (Ghirlanda & Enquist, 1998; Livesey et al., 2005; Saksida, 1999), to generate new hypotheses about how stimuli may be represented (Ghirlanda & Enquist, 1998; Livesey et al., 2005; Suret & McLaren, 2002; Tijsseling & Gluck, 2002), and to determine what aspects of stimuli may be given the most associative weight (Guillette et al., 2010).

Although there has been much work exploring the basic predictions of current theories regarding peak shift, relatively few studies, experimental or theoretical, have looked at the dynamics of generalization over time. A few studies have shown that extinction can occur during testing, such that shifts in generalization dissipate as more nondifferentially reinforced testing trials are experienced (Cheng et al., 1997; Purtle, 1973). For instance, pigeons that exhibit a strong peak shift in the first block of testing show a reduction in the strength of that shift in the following test blocks (Cheng et al., 1997). Other studies have shown that both in humans (Wisniewski et al., 2009) and in nonhumans (Moye & Thomas, 1982), peak shift is stable over time and lasts at least 24 h post-discrimination-training. Recently, in an effort to more fully characterize how learning-related shifts change over the course of experience, Wisniewski et al. (2010) trained humans for 60, 100, 140, 180, 220, or 260 trials on a task requiring the discrimination of two complex sounds that varied in their rate of periodic frequency modulation. Participants who were trained with the fewest trials (60 or 100) or the most trials (220 or 260) did not show a peak shift effect. However, participants who were trained with an intermediate number of trials (140 or 180) did exhibit shifts. These results suggest that in at least some training conditions, peak shift may occur only at intermediate levels of learning.

The dynamics of generalization over the course of learning have been explored in past computer simulations of discrimination learning. For example, the shape of generalization gradients produced by connectionist networks changes from Gaussian to exponential as more training iterations are experienced (Shepard, 1990; Staddon & Reid, 1990). However, simulations investigating how learning-induced generalization shifts vary over different amounts of training are lacking. One study found that a multilayer connectionist model trained to recognize faces along a continuum showed a gradually emerging shift in generalization, such that exaggerations of trained faces were recognized more easily than the original faces (Tanaka & Simon, 1996). With extensive training, however, this advantage disappeared. Nevertheless, the network continued to respond to novel faces falling along the continuum of trained facial features just as strongly as it did to the originally trained faces. The basic trajectory of generalization changes in this network included the emergence of a peak shift effect at intermediate levels of training that later transformed into an area shift after extensive training.

The purpose of the present study was to assess whether a simple connectionist model could replicate the gradual emergence and dissipation of a peak shift effect that occurs over the course of auditory discrimination training in humans (Wisniewski et al., 2010). Toward this goal, variants of a previously developed connectionist model (Dawson, 2004, 2005, 2008; Dawson & Schopflocher, 1992) used to study auditory perception in chickadees and known to exhibit peak shift effects (Guillette et al., 2010) were used to simulate the learning of complex sounds by humans. The hypothesis was that the model would show transitory peak shifts in generalization. Additionally, because there can be large individual differences in stimulus-evoked neural representations (Miglioretti & Boatman, 2003; Orduña, Mercado, Gluck, & Merzenich, 2005), and because different stimulus representations can yield drastically different generalization patterns (Enquist & Ghirlanda, 2005; Ghirlanda, 2002; Ghirlanda & Enquist, 1998), we tested how altering the representations used to train the model would affect the dynamics of generalization. Because many neural network studies have focused on replicating the effect of stimulus similarity on the strength of shift without examining how that strength varies with the amount of learning, we also tested how stimulus similarity would affect the dynamics of shifts in networks. Finally, because past studies of generalization shifts have often involved different amounts of training (Baron, 1973; Bizo & McMahon, 2007; Derenne, 2010; Galizio & Baron, 1979; Lewis & Johnston, 1999; Newlin et al., 1979; Thomas et al., 1991; Wisniewski et al., 2009, 2010), and because there can be large individual differences in performance gains during training and generalization (Nicholson & Gray, 1972; Withagen & van Wermeskerken, 2009), we assessed how different criteria for ending training affected the variability of generalization gradients.

Simulation 1

In simulation 1, we attempted to produce a transient peak shift during the learning of a discrimination task, using variants of a neural network known as a perceptron. The perceptron is a simple connectionist model developed in the 1950s that transforms a set of inputs to generate a desired output (Rosenblatt, 1957). In single-layer perceptrons, several input units are connected to a single output unit (Fig. 1a). Each input unit connection is associated with a numerical value (called a weight) that determines the contribution of that input to the final output. Multilayer perceptrons are essentially a series of sequentially applied single-layer perceptrons (Fig. 1b), in which each layer typically contains a different number of processing units. The computations performed by output units, which are the same for both single- and multilayer perceptrons, are described in the Appendix.
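To make this computation concrete, the following sketch (in Python) shows how a single output unit combines weighted inputs into an output between 0 and 1. The logistic form of the sigmoid, the bias term, and the specific numbers are illustrative assumptions; the exact equations used in the simulations are given in the Appendix.

```python
import numpy as np

def unit_output(inputs, weights, bias=0.0):
    """One processing unit: weight each input, sum, then squash with a
    sigmoid so the output falls between 0 and 1."""
    net = np.dot(weights, inputs) + bias      # weighted sum of the inputs
    return 1.0 / (1.0 + np.exp(-net))         # logistic (sigmoid) activation

# Hypothetical example: 40 input units feeding a single output unit.
rng = np.random.default_rng(0)
pattern = rng.random(40)                      # an arbitrary input pattern
weights = rng.uniform(-0.1, 0.1, size=40)     # small random initial weights
print(unit_output(pattern, weights))
```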

Fig. 1

Illustrations of the architectures of (a) single-layer and (b) multilayer perceptrons used in the simulations

Associative learning theories often characterize discrimination learning as a process in which links between specific inputs (or elements of inputs) and particular outputs or associated stimuli become stronger or weaker during learning (e.g., Blough, 1975; McLaren & Mackintosh, 2002; Saksida, 1999; Spence, 1937). When properties of an S+ overlap with those of the S−, these theories predict that features of the S+ that are not shared with the S− gain associative weight as training progresses. When such theories have been instantiated as connectionist models (Ghirlanda & Enquist, 1998, 2006; Livesey et al., 2005; Suret & McLaren, 2002), they have proven to be able to explain the direction of shifts in generalization gradients, as well as the finding that learning to discriminate more similar stimuli can lead to larger shifts in generalization (Ghirlanda & Enquist, 1998). Because few experiments have attempted to track changes in generalization shifts over time, past modeling studies have seldom reported simulations of such changes.

As was noted earlier, a recent study examining the development of generalization shifts over time revealed that such learning-induced shifts in generalization can be transient (Wisniewski et al., 2010). In Wisniewski et al. (2010), participants learned to press one key on a keyboard to indicate that a target sound had occurred and to press another key to indicate that a different sound had occurred. Unlike most past studies with animals, this task reinforced two behavioral responses equally, rather than reinforcing responses to only one of the stimuli (i.e., the “S−” was not a nonreinforced stimulus). Nevertheless, a peak shift effect was observed after training, such that sounds further from the nontarget provoked stronger responses in generalization tests (Fig. 2). The finding that shifts in generalization were observed only after intermediate amounts of training had not been previously described in the literature, and did not appear to be predicted by existing models of discrimination learning. The current simulation tested whether an existing computational model, which was known to show a peak shift effect, would also show a dissipation of this effect with extensive training.

Fig. 2

The generalization gradients reported by Wisniewski, Church, and Mercado (2010) for groups of participants trained for different numbers of trials. Participants trained for either 140 or 180 trials showed both a peak shift and an area shift. Shifts were also reflected in the empirical data by the gradient means and modes

We attempted to simulate this effect with a basic perceptron because this neural network has previously been shown to produce a peak shift effect in simulations of auditory discrimination by chickadees (Guillette et al., 2010) and because perceptron learning algorithms are known to capture many of the main features of associative learning theory (Dawson, 2008; Gluck, 1991).

Method

Network architectures

Networks consisted of either a single-layer or a multilayer perceptron. In both architectures, the input layer consisted of 40 units. For the single-layer perceptrons (Fig. 1a), each unit in the input layer was connected to a single output unit. For the multilayer perceptrons (Fig. 1b), each unit in the input layer was connected to three units in a hidden layer (the layer between the input layer and the output unit), and every hidden layer unit was connected to a single output unit. There were no direct connections between input units and the output unit in the multilayer networks. Networks used sigmoid activation functions (see the Appendix for details). This type of function is often used in perceptron models and has been used extensively to model generalization and peak shift (Dawson, 2004, 2005, 2008; Ghirlanda & Enquist, 1998, 2006; Guillette et al., 2010; Livesey et al., 2005; Suret & McLaren, 2002; Tanaka & Simon, 1996).
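A minimal sketch of the two architectures follows. The hidden-layer size of three units is inferred from the description above and should be treated as an assumption, as should the bias terms; the Appendix gives the exact activation equations.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

class SingleLayerPerceptron:
    """40 input units fully connected to one sigmoid output unit (Fig. 1a)."""
    def __init__(self, n_inputs=40, rng=None):
        rng = rng or np.random.default_rng()
        self.w = rng.uniform(-0.1, 0.1, size=n_inputs)   # input -> output weights
        self.b = rng.uniform(-0.1, 0.1)

    def forward(self, x):
        return sigmoid(self.w @ x + self.b)

class MultilayerPerceptron:
    """40 inputs -> hidden layer (assumed three units) -> one output unit (Fig. 1b);
    no direct input-to-output connections."""
    def __init__(self, n_inputs=40, n_hidden=3, rng=None):
        rng = rng or np.random.default_rng()
        self.W1 = rng.uniform(-0.1, 0.1, size=(n_hidden, n_inputs))  # input -> hidden
        self.b1 = rng.uniform(-0.1, 0.1, size=n_hidden)
        self.w2 = rng.uniform(-0.1, 0.1, size=n_hidden)              # hidden -> output
        self.b2 = rng.uniform(-0.1, 0.1)

    def forward(self, x):
        h = sigmoid(self.W1 @ x + self.b1)
        return sigmoid(self.w2 @ h + self.b2)
```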

Stimulus representations

The stimulus set used by Wisniewski et al. (2010) consisted of eight stimuli, rank-ordered 1–8, with stimulus 5 used as the “S+” (target stimulus) and stimulus 4 used as the “S–” (nontarget stimulus). Two additional stimuli, which were not part of the generalization distribution, were used during pretraining. Here, we used overlapping patterns of Gaussian-shaped inputs to represent the stimuli used by Wisniewski et al. (2010). These inputs had a variance of 5 and a maximum value of 1. Similar representations have been used previously in connectionist models of peak shift (Ghirlanda & Enquist, 1998, 2006; Livesey et al., 2005; Suret & McLaren, 2002). The input stimulus sets used for pretraining, S+/S– discrimination, and generalization testing are shown in Fig. 3.
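The following sketch generates such input patterns. The placement of the eight stimulus centers along the 40-unit input layer is a hypothetical choice made for illustration; the actual placements used are shown in Fig. 3.

```python
import numpy as np

N_INPUTS = 40

def gaussian_pattern(center, variance=5.0, n_units=N_INPUTS):
    """Gaussian-shaped input pattern with a maximum value of 1 at `center`."""
    units = np.arange(n_units)
    return np.exp(-((units - center) ** 2) / (2.0 * variance))

# Hypothetical, evenly spaced centers for the eight generalization stimuli
# (stimulus 5 = S+, stimulus 4 = S-); the true placements appear in Fig. 3.
centers = np.linspace(6, 34, 8)
stimuli = {rank + 1: gaussian_pattern(c) for rank, c in enumerate(centers)}
s_plus, s_minus = stimuli[5], stimuli[4]
```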

Fig. 3

Representations of stimuli used in (a) pretraining, (b) S+/S– discrimination, and (c) generalization testing. For the generalization stimulus set, stimuli are ranked from low to high (1–8) on the basis of their position from left (low) to right (high)

Training and testing

Single-layer networks were trained with the standard perceptron learning algorithm for sigmoid units, and multilayer networks were trained with error backpropagation (see the Appendix for details). Connection weights were adjusted to minimize the error between the actual activation of the output unit and the desired activation (Dawson, 2004, 2005, 2008; Rumelhart, Hinton, & Williams, 1986). Initial network weights were set at random between –0.1 and 0.1.

Networks were initially pretrained with a desired value of 1 in the output unit for the S+ and 0 for two other stimuli that were displaced along the dimension to either side of the S+. Network pretraining continued until the sum of squared error (SSE) in the output unit response was .05 or less. Error was a measure of the difference between the model’s output and the desired output. For instance, if the output unit had a desired response of 1 and produced an output of .75, the error would be .25 (1 – .75), and the squared error would be .0625. SSE is the sum of the squared errors across all input patterns. The stimuli used to pretrain the networks, with the exception of the S+, were not part of the S+/S– discrimination training or generalization stimulus sets. The pretraining procedure is analogous to the pretraining used previously in experimental studies (Spetch et al., 2004; Wisniewski et al., 2009, 2010). It also allowed the networks to start S+/S– discrimination with some ability to discriminate stimuli along the dimension, rather than being completely naive.

After pretraining, all networks were given S+/S– discrimination training. The desired output for the networks was set at 1 for the S+ and 0 for the S–. In order to compare each model’s generalization after different levels of training experience, we trained groups of networks to six different criteria defined by the SSE in the output unit. The six criteria were SSEs of .5, .3, .2, .1, .05, and .02. Network training was stopped after the respective SSE level was reached. After training, generalization was assessed by presenting networks with the S+, the S–, and six novel stimuli. In the generalization phase, networks were tested only for their responding to stimuli and did not experience more training trials with the new stimuli. Five networks were trained per network type (multi- or single-layer) and SSE criterion.
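A minimal sketch of single-layer training to an SSE criterion is shown below, assuming the standard delta rule for a sigmoid unit. The learning rate and the example stimulus patterns are arbitrary illustrations, not values taken from the article.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def train_to_criterion(patterns, targets, criterion, lr=0.5, rng=None):
    """Delta-rule training of one sigmoid output unit until SSE <= criterion."""
    rng = rng or np.random.default_rng()
    w = rng.uniform(-0.1, 0.1, size=len(patterns[0]))
    b = rng.uniform(-0.1, 0.1)
    while True:
        sse = 0.0
        for x, t in zip(patterns, targets):
            o = sigmoid(w @ x + b)
            err = t - o
            sse += err ** 2                   # squared error for this pattern
            w += lr * err * o * (1 - o) * x   # standard delta rule for sigmoid units
            b += lr * err * o * (1 - o)
        if sse <= criterion:                  # e.g., .5, .3, .2, .1, .05, or .02
            return w, b

# Illustrative S+/S- discrimination with hypothetical Gaussian patterns.
units = np.arange(40)
s_plus = np.exp(-((units - 22) ** 2) / 10.0)
s_minus = np.exp(-((units - 18) ** 2) / 10.0)
w, b = train_to_criterion([s_plus, s_minus], [1.0, 0.0], criterion=0.2)
```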

Results and discussion

The mean output unit activation of each group of five single-layer perceptrons for all test patterns is shown in Fig. 4a. The mean output unit activation was also used to compute the gradient mean and mode for each criterion level, shown in Fig. 4b. In this case, gradient means and modes reflect the strength of shift such that values higher than 5 are shifted from the S+, in a direction away from the S–. The higher the value, the further the shift. Even though several networks were trained per condition, the standard errors for output activations, gradient means, and gradient modes were extremely low (0–.02). For this reason, we report the values without standard errors for simulation 1 and the rest of the simulations in this article.
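The summary statistics can be computed as sketched below. Treating the gradient mean as the activation-weighted average of the stimulus ranks and the mode as the rank with the highest activation is our reading of the description above, so the exact formula should be considered an assumption.

```python
import numpy as np

def gradient_mean_and_mode(activations, ranks=np.arange(1, 9)):
    """Summarize a generalization gradient over stimuli ranked 1-8.
    Values above 5 indicate a shift away from the S+ in the direction
    opposite the S- (stimulus 4)."""
    activations = np.asarray(activations, dtype=float)
    mean = np.sum(ranks * activations) / np.sum(activations)   # weighted mean rank
    mode = int(ranks[np.argmax(activations)])                  # rank of peak response
    return mean, mode

# Hypothetical gradient peaking at stimulus 6 (one step past the S+)
print(gradient_mean_and_mode([.02, .05, .1, .2, .6, .9, .7, .3]))
```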

Fig. 4

Generalization gradients (a) and gradient modes and means (b) for single-layer perceptrons with sigmoid units, as well as for the multilayer perceptrons (c and d)

Single-layer perceptrons with sigmoid units showed an increase in shift as SSE decreased on the S+/S– discrimination. For instance, the networks trained to an SSE of .02 responded most strongly to stimuli 7 and 8. The gradient means and modes also shifted from the S+, even in the most extensively trained group of networks. The gradient mean increased with each level of decrease in SSE. The gradient mode peaked at SSEs of .2 and .1 and then decreased at SSEs of .05 and .02, but these modes were still displaced from the S+ in the direction opposite the S–.

Generalization gradients from multilayer networks are shown in Fig. 4c, d. The multilayer perceptrons showed an increasing amount of shift with decreased levels of SSE that was highly similar to the pattern observed in single-layer perceptrons.

These simulation results are not specific to the parameter settings, input representations, or number of network units used. Numerous additional simulations were conducted using different learning rates, numbers of input and hidden units, and input representations, all of which produced similar patterns of generalization (unpublished data). Furthermore, published simulations of face recognition learning (Tanaka & Simon, 1996) produce comparable patterns of generalization after different amounts of training. Collectively, these results suggest that standard perceptron models of discrimination learning (and related error-correction-based associative models of learning) predict that learning-related shifts in generalization are a stable asymptotic end state of learning. These models predict that increased training should lead to either greater shifts in generalization or no changes in generalization once a maximally shifted state has been reached. Consequently, these models appear to be unable to account for transient shifts in generalization that first appear and then dissipate as learning progresses, as reported by Wisniewski et al. (2010). Further attempts to modify perceptrons so that they could replicate the observed pattern of shift revealed that the use of an activation function other than the standard sigmoid function (described below in simulation 2) could account for this pattern of learning-induced shifts in generalization.

Simulation 2

Past connectionist models of discrimination learning provide an elegant and efficient way of instantiating associative learning theories and of describing a wide range of empirical data. It would thus be advantageous if these models could also account for the temporal dynamics of generalization shifts without requiring major modifications or increased computational complexity. One feature of perceptrons that is known to impact how they learn is their activation functions. These functions determine how weighted input values are transformed into output values (see the Appendix for details). Indeed, the extensive use of sigmoid activation functions in current perceptron simulations reflects the fact that this function makes efficient training of multilayer perceptrons feasible using error backpropagation. Because activation functions are known to affect how artificial neural networks learn (e.g., Dawson, 2004, 2005, 2008; Dawson & Schopflocher, 1992), we explored whether using an activation function other than the sigmoid function might change the trajectory of generalization shifts observed during learning.

One activation function that has proven effective in past pattern recognition systems is the Gaussian function (similar to a bell curve). This activation function is used extensively in a type of neural network known as a radial basis function network (e.g., Park & Sandberg, 1993) and has also been used in simple perceptron simulations (Dawson, 2004, 2005, 2008; Dawson & Schopflocher, 1992); in this context, units with Gaussian activation functions have been described as value units. Value units are like sigmoid units in that they transform the sum of the weighted values from each input unit into an output value that ranges between 0 and 1. Unlike sigmoid units (and more like a dose response curve), value units transform both very low and very high sums into outputs smaller than those for intermediate sums. This allows units to become selective to a range of input values, as is seen in the receptive fields of many cortical neurons (e.g., Elhilali, Fritz, Chi, & Shamma, 2007; Linden, Liu, Sahani, Schreiner, & Merzenich, 2003). Earlier work comparing the effects of stimulus coding on learning-induced shifts in generalization found that simulating different neural codes for stimuli at the input level could dramatically impact how generalization shifts developed with training (Enquist & Ghirlanda, 2005; Ghirlanda, 2002; Ghirlanda & Enquist, 1998). We similarly expected that switching from sigmoid units to value units might significantly impact the patterns of generalization exhibited by perceptrons, even if the use of value units did not affect their ability to learn the trained discrimination.
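The contrast between the two unit types can be illustrated as follows. The exp(−π·net²) form for the value unit follows the form commonly used by Dawson and colleagues, but the exact function used in these simulations is specified in the Appendix, so treat this as a sketch.

```python
import numpy as np

def sigmoid_activation(net):
    """Monotonic: a larger net input always yields a larger output."""
    return 1.0 / (1.0 + np.exp(-net))

def value_unit_activation(net, mu=0.0):
    """Gaussian ('value unit'): output peaks when the net input equals mu and
    falls off on both sides, so the unit is tuned to a band of net inputs.
    (Form assumed from Dawson & Schopflocher, 1992; see the Appendix.)"""
    return np.exp(-np.pi * (net - mu) ** 2)

nets = np.linspace(-3, 3, 7)
print(np.round(sigmoid_activation(nets), 3))      # rises from ~0 toward ~1
print(np.round(value_unit_activation(nets), 3))   # peaks at net = 0, low at both ends
```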

Method

Network architectures

Networks consisted of either a single-layer or a multilayer perceptron, the architectures of which were identical to those described for simulation 1. Networks used Gaussian activation functions for all layers (see the Appendix for details).

Stimulus representations

The stimulus set used for these simulations was the same as that used in simulation 1.

Training and testing

Single-layer networks were trained with the perceptron learning algorithm for value units developed by Dawson and Schopflocher (1992), and backpropagation was used to adjust weights between hidden and input layer units in multilayer perceptrons (see the Appendix for details). All other training and testing methods were identical to those used in simulation 1.
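As an illustration only, the sketch below applies plain gradient descent on squared error to a single Gaussian output unit. Dawson and Schopflocher's (1992) published rule differs in detail (it augments the error term), and the exact algorithm used here is given in the Appendix.

```python
import numpy as np

def value_unit(net, mu=0.0):
    return np.exp(-np.pi * (net - mu) ** 2)

def value_unit_step(w, b, x, target, lr=0.1, mu=0.0):
    """One gradient-descent step on squared error for a Gaussian output unit
    (a simplified sketch, not the full Dawson & Schopflocher rule)."""
    net = w @ x + b
    o = value_unit(net, mu)
    err = target - o
    d_o_d_net = -2.0 * np.pi * (net - mu) * o   # derivative of the Gaussian w.r.t. net
    w = w + lr * err * d_o_d_net * x            # shifts the net input relative to mu
    b = b + lr * err * d_o_d_net
    return w, b
```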

Results and discussion

The generalization gradients and the means and modes of those gradients for value unit networks are shown in Fig. 5. Single-layer perceptrons (n = 5) made up of value units that were trained to an SSE of .02 responded most strongly to the S+. Value unit networks that were trained to the criteria of .3, .2, and .1 SSE showed gradient shifts, with the most shift seen in networks trained to .2 SSE. The modes and means of value unit networks also showed a trajectory of generalization shift such that the largest shifts in the modes and means occurred at intermediate levels of SSE.

Fig. 5

Generalization gradients (a) and gradient modes and means (b) for single-layer value unit networks. Results for multilayer networks trained with value units (c and d)

Multilayer perceptrons composed of value units also showed a quadratic trend in generalization shift such that the strongest shifts were seen at intermediate levels of SSE. It should be noted here that even though the multilayer networks of value units showed a quadratic trend in mode and mean shift, the responses to stimuli 1 and 2 were qualitatively higher than in the empirical data or the single-layer perceptron simulations. Thus, the single-layer perceptron provided a better qualitative fit to the transient shifts in generalization reported by Wisniewski et al. (2010). This finding suggests that the better fit of the value-unit-based perceptron is not due to greater computational power (because the multilayer perceptron is more powerful) and that the use of Gaussian functions within a perceptron does not guarantee that the network will generalize symmetrically both early and late in training. In general, the differences in the generalization behavior of single- and multilayer perceptrons constructed from value units were much greater than those observed in simulation 1.

The results of these simulations demonstrate that switching from a sigmoid activation function to a Gaussian activation function enables the single-layer perceptron model to account for the appearance and later disappearance of a peak shift effect during discrimination learning. To our knowledge, this is the first report of a computational model showing temporary shifts in generalization during discrimination learning. It remains possible that other learning theories or computational models (e.g., radial basis function networks) might predict similar trajectories in generalization shifts, but if so, these properties of existing models have yet to be explicitly noted. Why perceptrons with Gaussian activation functions are better able to account for transient shifts in generalization gradients is considered in more detail in the General Discussion section.

Simulation 3

Manipulations to the type of stimulus representations given to computational models can strongly impact the way they generalize. For instance, within the same model, using different shapes of representations can determine whether a generalization gradient shows peak shift or a tendency to respond to stimuli with the most input layer activation after learning (Ghirlanda, 2002; Ghirlanda & Enquist, 1998). The degree to which stimulus representations overlap can impact generalization as well (Ghirlanda & Enquist, 1998). We performed an additional set of simulations to test how representing stimuli differently would impact the temporal dynamics of generalization shifts that were seen in simulations 1 and 2.

Method

Network architectures

Networks were single-layer perceptrons with the same architectures as single-layer perceptrons in simulations 1 and 2. Networks had a single output unit of the sigmoid or value unit type.

Stimulus representations

Different stimulus representations were tested in different sets of networks. Triangle- and center–surround-shaped input layer representations were used. We also tested modifications to the Gaussian used in simulations 1 and 2 by reducing the resolution of the original Gaussian or by using Gaussians with other variances. The resolution of the stimulus representation was reduced by taking every other value of the Gaussian, while maintaining the peak at 1, as well as its symmetry. Different variances of the Gaussian were tested by keeping the peak activation for a stimulus representation at its original input unit, but replacing the variance of the Gaussian with the values 1, 3, 7, and 10. Thus, representations with low variances were narrow and overlapped little, whereas representations with high variances were wide and overlapped greatly. Figure 6 displays the shapes of the center–surround, triangle, and reduced-resolution Gaussian representations.
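Plausible versions of these representations are sketched below. The exact shapes used are shown in Fig. 6, so the triangle width, the difference-of-Gaussians form of the center–surround pattern, and the thinning rule for the reduced-resolution Gaussian are all assumptions made for illustration.

```python
import numpy as np

UNITS = np.arange(40)

def gaussian_rep(center, variance=5.0):
    return np.exp(-((UNITS - center) ** 2) / (2.0 * variance))

def triangle_rep(center, half_width=5.0):
    """Linear ramp up to a peak of 1 at `center` and back down (assumed width)."""
    return np.clip(1.0 - np.abs(UNITS - center) / half_width, 0.0, None)

def center_surround_rep(center, variance=5.0, surround_scale=2.0):
    """Difference of Gaussians: excitatory center minus a broader, weaker
    surround (assumed form)."""
    center_part = np.exp(-((UNITS - center) ** 2) / (2.0 * variance))
    surround = np.exp(-((UNITS - center) ** 2) / (2.0 * variance * surround_scale ** 2))
    return center_part - 0.5 * surround

def reduced_resolution_gaussian(center, variance=5.0):
    """Keep every other value of the Gaussian while preserving the peak of 1
    and the symmetry around `center`."""
    keep = (UNITS - int(center)) % 2 == 0
    return gaussian_rep(center, variance) * keep
```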

Fig. 6

(a) Center–surround, (b) triangle, and (c) reduced resolution Gaussian representations

Training and testing

Training and testing were identical to the single-layer perceptron training and testing in simulations 1 and 2. Two single-layer networks were trained for each type of representation, unit type (sigmoid or value), and criterion level.

Results and discussion

For sigmoid unit networks trained with different types of representational shapes and with different resolutions of the original Gaussian, there was an increase in shift with decreasing SSE on the S+/S– discrimination, similar to the results from simulation 1. Networks trained extensively (i.e., to a criterion of .02 SSE) still showed shift from the S+. Value unit networks trained with these representations showed generalization patterns similar to those of simulation 2. That is, these networks showed maximum shift at intermediate SSE levels. There were few differences in how networks of the same unit type trained with different representations generalized. The means and modes for these networks are shown in Fig. 7.

Fig. 7

Gradient means (a, c) and modes (b, d) for sigmoid and value unit networks trained and tested with triangle, center–surround, or reduced resolution Gaussian representations

The means and modes of generalization gradients for sigmoid and value unit networks trained with Gaussian representations that had different variances are shown in Fig. 8. For sigmoid unit networks, the trend in shift with decreasing SSE on the S+/S– discrimination was such that the lowest SSE conditions still showed shift. This was true for all variances. Sigmoid unit networks trained with Gaussian representations that had a variance of 1 or 10 shifted later in learning than networks trained with variances of 3 and 7. For value unit networks, the quadratic trend we observed using a variance of 5 in simulation 2 did not hold when variances were much lower (i.e., 1) or much higher (i.e., 10). The gradient means increased with decreasing SSE, and the modes still showed shift at the lowest level of SSE. It thus seems that when representations are narrow with little overlap, or wide and highly overlapping, the trend is an increase in shift. The networks did show a quadratic trend, however, when variances of 3 and 7 were used.

Fig. 8

Gradient means (a, c) and modes (b, d) of sigmoid and value unit networks obtained for different levels of SSE with different variances of the Gaussian used to represent stimuli in the input layer

In summary, the generalization patterns of networks in simulation 3 demonstrate that the trends in gradient shifts over the course of learning that we found in simulations 1 and 2 are repeatable using different shapes of stimulus representation. In the case of sigmoid unit networks, this was also true for input representations that overlapped to different degrees (Gaussians with different variances), although the time course of shifts was slightly delayed in networks trained with narrow or wide Gaussian representations. For value unit networks, the degree of overlap did determine whether or not the quadratic trend in generalization shifts was observed. Part of the reason for this effect could be that training stimuli with different degrees of overlap yield different generalization patterns (Baron, 1973; Ghirlanda & Enquist, 1998; Hanson, 1959; Spence, 1937). We further explored the effect of S+/S– overlap and similarity in simulation 4.

Simulation 4

As was mentioned previously, the similarity between the S+ and S– during discrimination learning can affect how much the peak of generalization shifts. In Hanson’s (1959) classic peak shift demonstrations with pigeons, more similar S+ and S– wavelengths led to a larger shift in generalization away from the S–. The effect of S+/S– similarity on the peak shift effect is a consistent empirical result (Baron, 1973; Hanson, 1959; Purtle, 1973) that adequate discrimination learning models should replicate. Although perceptrons with sigmoid units do replicate the effect of S+/S– similarity on generalization (e.g., Ghirlanda & Enquist, 1998), to our knowledge, similar replications with value unit networks have not been reported. For this reason, in simulation 4, we tested whether a single-layer perceptron with a value output unit would produce this effect. In addition, no studies have yet examined, either empirically or theoretically, how stimulus similarity during discrimination training affects the trajectory of learning-related shifts. Thus, in simulation 4, we also tested the generalization of single-layer perceptrons with sigmoid and value units trained to different SSE criteria.

Method

Network architectures

Network architectures were identical to the single-layer perceptrons used in simulations 1, 2, and 3. Networks had either a sigmoid or a value output unit.

Stimulus representations

The stimulus representations for pretraining and generalization testing were the same as those used in simulations 1 and 2. The stimulus representation for the S+ in S+/S– discrimination training was the same as it was in simulations 1 and 2, but the S– stimulus was stimulus 2, 3, or 4 from the set of generalization stimuli shown in Fig. 3c.

Training and testing

Training criteria of .5, .3, .2, .1, .05, .02, and .01 SSE were used. Pilot testing revealed that some value unit networks did not show the dissipation of shift after being trained to an SSE of .02 but did show this effect after being trained to an SSE of .01. For this reason, we included the additional SSE criterion of .01. Two networks were trained per S– condition, criterion, and unit type. All other training and testing methods were identical to those used in simulations 1, 2, and 3.

Results and discussion

The means and modes of the gradients for each group of networks are shown in Fig. 9. For both sigmoid and value unit trained networks, S+/S– discrimination training with an S– more similar to the S+ led to larger gradient shifts. For instance, in sigmoid unit networks, the values for the largest gradient modes in each S– condition were 7.0 (S– = stimulus 2), 7.5 (S– = 3), and 8 (S– = 4). The largest gradient means for each group of sigmoid unit networks were 5.81 (S– = 2), 6.17 (S– = stimulus 3), and 6.48 (S– = 4). For value unit networks, the largest gradient modes were: 6 (S– = 2), 6 (S– = 3), and 7 (S– = 4); the largest gradient means were 5.65 (S– = 2), 6.00 (S– = 3), and 6.26 (S– = 4). The replication of the effect of S+/S– similarity on peak shift in the value unit neural networks indicates that value unit perceptron networks capture other effects of discrimination learning on generalization, in addition to the temporal dynamics of shift over the course of training.

Fig. 9

Generalization gradient means (a, c) and modes (b, d) for sigmoid and value unit networks trained to different levels of SSE with differing similarity between the S+ and S– stimuli during training

Although the general trend over the course of learning was for perceptrons with a sigmoid output unit to show increasing shift with learning and for value unit networks to show a quadratic trend, stimulus similarity did have an effect on how shifts progressed. Greater similarity of the stimuli resulted in generalization shifts appearing earlier in training. For example, for value unit networks that were trained with stimulus 4 as the S–, the peak mean shift was at an SSE of .2. However, for networks trained with stimulus 2 as the S–, the peak mean shift was seen at an SSE of .02. To our knowledge, these predicted differences in the temporal dynamics of generalization have yet to be empirically tested.

Simulation 5

The previous simulations show that generalization shifts can depend greatly on the degree to which neural networks learn an S+/S– discrimination. Many past studies of generalization and peak shift involved training participants for a specific number of trials (Baron, 1973; Bizo & McMahon, 2007; Derenne, 2010; Lewis & Johnston, 1999; Newlin et al., 1979; Thomas et al., 1991; Wisniewski et al., 2009, 2010), rather than to specific performance criteria. Recent experimental data also show that generalization differs strongly and, sometimes, nonmonotonically with changes in discrimination performance (Wisniewski et al., 2010). Therefore, training to a specific trial criterion may lead to more variability between participants in generalization than training to a performance criterion. To examine this possibility, in simulation 5, single-layer perceptrons were trained with different learning rates, to criteria defined by SSE or by the number of trials.

Method

Network architectures

Network architectures were identical to the single-layer perceptrons used in simulations 1–4.

Stimulus representations

Stimulus representations were identical to those used in simulations 1 and 2.

Training and testing

Training and testing procedures were the same as those used for single-layer perceptrons in simulations 1–3, with the following exceptions. Networks were trained either to six criterion levels, rank-ordered from lowest to highest and defined by the SSEs previously used in simulations 1–3, or for a prespecified number of training trials (10, 160, 320, 640, 1,280, or 2,560 trials). Different learning rates of .01, .02, .04, and .08 were used to simulate individual differences in learning capacity. One network was trained per unit type, learning rate, and criterion level (defined by SSE or by number of training trials).
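The two stopping rules can be sketched as follows. Treating a "trial" as a single pattern presentation, and the use of the plain delta rule for the sigmoid case, are assumptions made for illustration.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def train(patterns, targets, lr, sse_criterion=None, n_trials=None, rng=None):
    """Train one sigmoid unit either to an SSE criterion or for a fixed number
    of trials (here, a trial = one presentation of one pattern)."""
    rng = rng or np.random.default_rng()
    w = rng.uniform(-0.1, 0.1, size=len(patterns[0]))
    b = rng.uniform(-0.1, 0.1)
    trials = 0
    while True:
        sse = 0.0
        for x, t in zip(patterns, targets):
            o = sigmoid(w @ x + b)
            err = t - o
            sse += err ** 2
            w += lr * err * o * (1 - o) * x
            b += lr * err * o * (1 - o)
            trials += 1
            if n_trials is not None and trials >= n_trials:
                return w, b
        if sse_criterion is not None and sse <= sse_criterion:
            return w, b

# Individual differences simulated by varying the learning rate; gradient means
# would then be summarized (as in simulation 1) and their standard deviation
# taken across learning rates at each stopping criterion, e.g.:
# for lr in (.01, .02, .04, .08): train(patterns, targets, lr, sse_criterion=.1)
# for lr in (.01, .02, .04, .08): train(patterns, targets, lr, n_trials=640)
```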

Results and discussion

Figure 10 shows the standard deviation of the gradient means, computed across learning rates at each criterion level. Generally, the standard deviation of the generalization gradient means was higher for sigmoid and value unit networks that were trained for a specified number of trials than for networks that were trained to criterion levels 1–4. For value unit networks, criterion levels 5 and 6 showed similar standard deviations whether the criterion was defined by SSE or by number of trials. Overall, however, the results suggest that averaging over individuals who have learned to different degrees can yield more variable generalization gradients and that the degree of performance on the S+/S− discrimination is a more accurate predictor of generalization.

Fig. 10

Standard deviations of gradient means for sigmoid and value unit networks that were trained to different levels of SSE or for different numbers of trials

General discussion

The present simulations demonstrate that perceptron models of discrimination learning can replicate the quadratic trend for shifts in generalization gradients reported by Wisniewski et al. (2010), in which a peak shift effect emerges and then dissipates during the course of learning. However, perceptrons constructed with sigmoid units were not successful in capturing this trend. Sigmoid units have been popular in previous connectionist models of peak shift, which is understandable given that they predict peak shift after learning (Ghirlanda & Enquist, 1998, 2006; Livesey et al., 2005; Tanaka & Simon, 1996), but here they resulted in stronger shifts with more training on the S+/S− discrimination. This was true for both single- and multilayer architectures. In contrast, single- and multilayer networks trained with value units qualitatively replicated the quadratic trend in gradient shifts. In perceptrons with Gaussian activation functions, shifts in gradients occurred only after intermediate levels of performance were reached on the S+/S− discrimination. Performance on the S+/S− discrimination that was too poor, or too good, led to little or no shift in the generalization gradients of networks. Previous simulations have shown that networks constructed with Gaussian activation functions have several computational advantages (Dawson, 2004, 2005, 2008; Dawson & Schopflocher, 1992). Our results demonstrate that, in at least some cases, they may also better capture the temporal dynamics of learning and generalization after discrimination training.

Comparison with earlier models of generalization shifts in humans

In the past, special attention has been given to how well findings obtained in experiments with nonhumans generalize to human behavior. For instance, Thomas and colleagues (Newlin et al., 1979; Thomas, 1993; Thomas et al., 1991) suggested that humans are much more sensitive to manipulations to the range of stimuli presented during generalization testing than are other animals. Specifically, they noted that humans tend to respond to the center of a range of generalization stimuli. If the range of stimuli presented during generalization tests is shifted, the peak of generalization gradients for humans can shift as well. These range effects have led some researchers to question whether learning-related shifts in humans reflect the same mechanisms as shifts observed in nonhumans (Bizo & McMahon, 2007; Thomas, 1993; Thomas et al., 1991), whereas others have challenged this idea (Galizio & Baron, 1979; Ghirlanda & Enquist, 2006; Spetch et al., 2004; Wisniewski et al., 2009).

Models developed to account for range effects in humans have utilized relational representations to explain generalization shifts. For instance, Thomas (1993) extended Helson’s (1964) adaptation-level theory to computationally specify a relational process that replicated the range effects seen in human generalization. According to Thomas, humans develop a prototype of all stimuli along a dimension, which he called the adaptation level (AL), defined as the mean of all stimuli experienced. In this framework, participants learn to identify the S+ during discrimination learning by responding to the adaptation level plus some amount (AL + X). During generalization testing, a change in stimulus range can alter the mean, and as a result, the rule AL + X leads to a peak of responding at a stimulus other than the S+. The adaptation-level model has successfully replicated the central tendency effect and has predicted learning-related shifts in many of the conditions in which they occur.
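A minimal sketch of the AL + X rule follows, treating the adaptation level at test as the mean of the test range (a simplification of Thomas's model) and using hypothetical stimulus values.

```python
import numpy as np

def al_plus_x_peak(training_stimuli, s_plus, test_stimuli):
    """Peak of responding predicted by the AL + X rule: the S+ is encoded as the
    training adaptation level (AL) plus an offset X, and at test the same offset
    is applied to the adaptation level of the test range."""
    x = s_plus - np.mean(training_stimuli)         # offset of S+ from training AL
    return np.mean(test_stimuli) + x               # predicted response peak at test

training = [4, 5]                                  # hypothetical S- = 4, S+ = 5
print(al_plus_x_peak(training, 5, range(1, 9)))    # centered test range -> peak near S+
print(al_plus_x_peak(training, 5, range(3, 11)))   # shifted test range -> shifted peak
```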

In the present simulations, we were able to account for the temporal dynamics of human generalization shifts with a simple perceptron that has previously been used to model songbird perceptual discrimination of naturally occurring stimuli (Guillette et al., 2010). Importantly, Thomas’s (1993) adaptation-level model cannot account for the shifts in generalization observed by Wisniewski et al. (2010), because generalization testing in that experiment was designed to maintain a fixed adaptation level during both training and testing. Successful simulation of transient peak shift effects with a perceptron shows that relational mechanisms are not necessary to explain human peak shift effects in cases where range effects are controlled for (Galizio & Baron, 1979; Spetch et al., 2004; Wisniewski et al., 2009, 2010) and suggests that perceptrons can adequately model discrimination learning and generalization by both humans and nonhumans (Ghirlanda & Enquist, 2006; Mercado, 2008; Spetch et al., 2004; Wisniewski et al., 2009).

Impacts of stimulus encoding on generalization shifts

Value units may have been better for simulating the empirical results of Wisniewski et al. (2010) for at least two reasons. First, the receptive fields of many neurons in the cortex that code for the features of sound respond selectively to specific features of an input, with responses decreasing for properties that are dissimilar to those features (e.g., Elhilali et al., 2007; Linden et al., 2003). Sigmoid activation functions cannot capture this type of receptive field. It could be that Gaussian activation functions simulate these types of receptive fields and that such receptive fields are important for discrimination training. Second, Wisniewski et al. (2010), and many others studying generalization (Baron, 1973; Bizo & McMahon, 2007; Derenne, 2010; Lewis & Johnston, 1999; Spetch et al., 2004; Thomas et al., 1991; Wisniewski et al., 2009), used a task in which participants were told to respond only to the trained stimulus and to nothing else. Gaussian activation functions may be best for modeling tasks in which responses should be withheld in the presence of stimuli that differ from the trained stimulus in either direction along the relevant dimension (Dawson, 2004, 2005, 2008; Dawson & Schopflocher, 1992). In contrast, sigmoid functions may be more appropriate for modeling tasks in which participants are instructed to generalize responses to novel stimuli (as in studies of the caricature effect; Tanaka & Simon, 1996).

The simulations also suggest that differences between stimuli in training and testing can impact the temporal dynamics of generalization shifts. When the representations of stimuli used to train networks were very wide and overlapping, or very narrow and far apart, a monotonic increase in shift with learning was seen. We additionally found that the overlap of stimulus representations during discrimination learning impacted the strength and the time course of gradient shifts, such that greater overlap led to larger shifts with a faster time course (i.e., shifts developed earlier in learning). Neural representations evoked by stimuli vary from individual to individual (Miglioretti & Boatman, 2003; Orduña et al., 2005), and these representations can change over the course of learning (Elhilali et al., 2007; Goldstone, 1998; Saksida, 1999). It could be that transient shifts are not seen for all participants and that the state of an individual’s initial stimulus representations restricts both how those representations can be associated with responses and how that individual generalizes (Gibson, 1969; Mercado, 2008).

We also found that generalization was more variable when networks were trained to a criterion based on the number of training trials than to a criterion based on discrimination performance. This finding suggests that when investigating trajectories of changes in generalization, the method of training for a certain number of trials (Baron, 1973; Bizo & McMahon, 2007; Derenne, 2010; Galizio & Baron, 1979; Lewis & Johnston, 1999; Newlin et al., 1979; Thomas et al., 1991; Wisniewski et al., 2009, 2010) may be less effective than training to a performance criterion (Spetch et al., 2004; Wills & Mackintosh, 1998). In addition, there are large individual differences in how participants generalize (Guttman & Kalish, 1956; Landau, 1968; Nicholson & Gray, 1972; Orduña et al., 2005). Some of the previously reported differences in generalization may stem from participants’ not reaching similar levels of performance.

The present simulations lead to several novel predictions about the temporal dynamics of generalization shifts: (1) Shifts in generalization over the course of learning should gradually increase and then reach a stable asymptotic state when individuals learn to discriminate stimuli along a dimension that is encoded by absolute neural firing rate but may be transient when the relevant dimension is encoded by selectively tuned neurons; (2) in the latter case, transient shifts will occur only when stimulus representations are moderately overlapping, but not when there is no overlap in encoding or when there is extensive overlap; (3) more similar stimuli should lead to quicker transitions in generalization shifts; (4) training individuals to a performance criterion should lead to less variable generalization than training for a fixed number of trials; and (5) individuals with different learning trajectories (e.g., different learning rates) should show different shifts in generalization with training. These predictions not only provide a way to assess the adequacy of current models of discrimination learning but also suggest that many of the previously reported differences in generalization between humans and other animals (and across different stimulus dimensions) may reflect differences in stimulus encoding, training regimens, and learning capacities, rather than differences in generalization mechanisms.