A NEURAL NETWORK MODEL OF
ATTENTION BIASES IN DEPRESSION
Greg J. Siegle
San Diego State University / University of California, San Diego
In Press: In Reggia, J. and Ruppin, E. (Eds.) Disorders of brain, behavior, and cognition: The neurocomputational perspective. (pp. 415-441) Amsterdam: Elsevier
This research was supported by an NIMH grant to the UCSD Mental Health Clinical Research Center and a Sigma Xi Grant in Aid of Research. Thanks go to Rick Ingram, Georg Matt, and Eric Granholm for considerable assistance on empirical validation for the network model.
Correspondence concerning this research should be directed to Greg Siegle, currently at the Clarke Institute for Psychiatry, 11th floor, 250 College St, Toronto, Ontario M5T 1R8, gsiegle@psychology.sdsu.edu.
Abstract
A great deal of research suggests that depressed individuals pay attention to negative information, yet it is not clear what about negative information they pay attention to. A research program designed to clarify the role of biased attention in depression is described. A physiologically constrained computational neural network designed to simulate relevant aspects of attention to emotional information is presented. The model is used to make predictions for behavioral (reaction times and confusion rates) and physiological (pupil dilation) indices of performance on an affective lexical decision and valence identification task with depressed individuals. Empirical evidence supporting some of the model's predictions is presented. The model is used to generate hypotheses regarding the neurobiology of attention biases in depression, and to suggest ways to improve cognitive and pharmacological treatments for depression.
Depression is a disabling disorder characterized by negative moods, lack of interest in pleasurable activities, weight change, sleep disturbance, psychomotor retardation, fatigue, feelings of worthlessness, decreased attention, and suicidal ideation (APA, 1994). The point prevalence for depression in the US has been estimated at between 5% and 44% of the population (Flaherty, Gavira, & Val, 1992). The prevalence and seriousness of the disorder make understanding factors associated with its onset and maintenance a common goal of clinical researchers.
A great deal of research suggests that depressed individuals pay attention to negative information. "Seeing things negatively," or "hearing only negative things" are common complaints of people presenting for treatment of depressive symptoms. Research finds that depressed individuals and dysphoric individuals (dysphoria being a sad mood state thought to underlie depression) selectively attend to negative information over positive information (Matthews & Harley, 1996; Williams, Mathews, & MacLeod, 1996), remember negative information better than positive information (Blaney, 1986; Matt, Vazquez, & Campbell, 1992), and interpret as negative information that other people do not see as negative (Williams, Conner, Siegle, Ingram, & Cole, 1998).
Yet, it is unclear what aspects of negative information depressed and dysphoric individuals attend to, and whether biased attention to negative information occurs in the early stages of attention, having to do with initial perceptions of information (e.g., as suggested by Kitayama, 1990 and Matthews and Southall, 1991), or in late stages of attention, involving retrieval of associations from memory and elaboration (Macleod & Mathews, 1991). The current chapter presents a physiologically constrained framework for understanding the role of depression in attention to negative information. A computational neural network is used to generate predictions based on this framework. Data from a number of experiments are used to evaluate the model's predictions. Conclusions about depressive information processing biases, stemming from aspects of the model that appear valid, are then presented.
A Physiological Framework for Understanding Information Processing Biases in Depression
Physiological constraints on attention to emotional information help to resolve ambiguities regarding the location and temporal sequence of attention biases in depression. Le Doux (1997) suggests that emotional information is processed in parallel by brain systems responsible for identifying emotional aspects of information (the amygdala system) and nonemotional, conceptual or semantic aspects of information (the hippocampal system). Research documents the importance of the amygdala system in identifying information as either positive or negative (Halgren, 1992; Le Doux, 1992), and the hippocampal system in semantic association, suggesting that it acts as an index to the semantic memory system, moderating activation of semantic qualities associated with stimuli in cortex (e.g., Squire, 1992). Extensive feedback occurs between the hippocampal and amygdala systems (Amarel et al., 1992; Tucker & Derryberry, 1992). This feedback may allow individuals to associate emotional aspects of information with non-emotional aspects, but preserves the notion that attention can be separately allocated to affective and non-affective aspects of information (e.g., Kitayama, 1990; Matthews & Harley, 1996).
Le Doux’s (1997) model is conceptually similar to cognitive theories that suggest emotional information processing involves spreading activation throughout a semantic network (e.g., Collins & Loftus, 1975) in which both semantic and affective features are represented as nodes in the network (Bower, 1981). For example, a stimulus such as a crying person might activate both "person" and "sadness" nodes in an observer’s semantic network. Ingram (1984) suggests that people who are depressed suffer from strongly activated connections between negative affective nodes and multiple semantic concepts, creating feedback loops that maintain depressive affect and cognition.
Ingram’s (1984) theory could be used to explain depressive information processing biases in Le Doux’s model in a number of ways, each of which appeals to the notion that attention can be separately allocated to affective and non-affective aspects of information. Depressive attention biases could involve excessive attention to nonemotional features of information (biased hippocampal system processing), excessive attention to emotional features (biased amygdala system processing), or feedback between affective and semantic processing systems. Each scenario suggests different roles for attention in the onset and maintenance of depression. Different cognitive patterns (e.g., attending to emotional or non-emotional material) and different brain areas could be targeted for intervention as a result.
Ways to Evaluate Attention in Depression
Ideally, a model such as Le Doux’s would be tested by observing feedback and activations in the hippocampus and amygdala systems, as individuals attend to emotional information. This technique is difficult due to the small size of these structures and the temporal resolution of most imaging devices. Instead, behavioral and physiological responses to information processing tasks are used to infer these activations.
Based on Le Doux’s model, two tasks seem particularly useful for elucidating the nature of attention biases in depression. A lexical decision task, in which participants are asked to identify whether a string of letters spells a word, which may be positive, negative, or neutral, directs people’s attention towards nonemotional aspects of information. If depressed individuals focus on non-emotional features, negative information processing biases should be apparent on this task. In contrast, if depressed people focus more on the affective aspects of information, information processing biases would be more apparent on a task designed to focus people’s attention on affective features. A valence-identification task, in which individuals are asked to name whether a word is positive, negative, or neutral, can be used for this purpose. Attentional biases involving feedback between mental representations of affective and semantic aspects of information are assumed to result in biased information processing on both tasks.
By analyzing signal detection and confusion rates on the tasks, the extent to which attention to certain types of information is impaired can be quantified. By analyzing reaction times on the tasks, the extent to which biases are apparent in the early stages of attention can be quantified. To effectively analyze attentional biases over the entire time course of attention, a more continuous measure of cognitive load is needed. Pupil dilation is a strong candidate for such a measure.
Muscles controlling pupil dilation are innervated by structures essential to both cognitive and affective information processing. Thus, the effectiveness of pupil dilation, as a measure of cognitive load, has been repeatedly demonstrated using attention and memory tasks (see Beatty, 1982 for a review). For example, Kahneman and Beatty (1966) show that the pupil reliably dilates one millimeter for each digit research participants are asked to remember in a short term memory task. Pupils also dilate in proportion to the difficulty of tasks (Hess & Polt, 1964) and effort needed to perceive information (Hakerem & Sutton, 1966). These results can be explained by projections from semantic identification structures to the midbrain reticular formation, which is connected to the oculomotor nuclei; stimulation of the midbrain reticular formation has been shown to lead to changes in pupil dilation (Beatty, 1986). Emotional activity is often thought to be mediated by activity in the hypothalamo-thalamo-cortical axis. As such, activity in these structures, and in limbic structures connected to them, has been shown to result in pupil dilation (see Hess, 1972 for a review). Stimulation of the amygdala, in particular, increases pupil dilation in cats, dogs, and monkeys (Fernandez de Molina & Hunsberger, 1962; Koikegami & Yoshida, 1953).
Empirical Studies Using the Affective Lexical Decision and Valence Identification Tasks
There is a growing body of literature exploring affective lexical decision and valence identification tasks with depressed and nondepressed individuals (for reviews, see Siegle, Ingram, and Matt, 1998a; Siegle, 1998a) though the two tasks have rarely been examined together. Siegle et al. (1998a) performed an affective lexical decision task and affective valence identification task with 30 dysphoric and 46 nondysphoric undergraduates, measuring signal detection rates and reaction times. A number of computational simulations, described below, were generated to help understand this data. Based on results of the simulations and the first experiments, Siegle (1998a) performed the same tasks with 23 unmedicated clinically depressed and 25 nondepressed adults, measuring reaction times, signal detection rates, and pupil dilations, and incorporating variables that the simulations suggested might be relevant to test. To distinguish between personally relevant and non-relevant words, both normed word-lists and words generated by participants in the days preceding the experiment were employed. The results of simulations described in the following sections will be compared to data from these studies.
Why Computational Neural Networks are Particularly Appropriate for Investigating Attention Biases in Depression
The collection of signal detection rates, reaction times, and indices associated with pupil dilation allows generation of many theoretically motivated hypotheses regarding the role of affect in attention. Yet, Le Doux’s model involves complex interactions between highly nonlinear systems. The flow of information through these systems is difficult, if not impossible, to understand just by thinking about the model. Similarly, the flow of information through a network such as Bower’s (1981) model is quite complex; the role of feedback between components of such a model, especially if there is any noise in processing, is notoriously hard to predict (Movellan & McClelland, 1997). Computational modeling can provide a rigorous basis for understanding how negative experiences could impact information processing in models such as Bower’s or Le Doux’s, and can suggest ways to test these understandings empirically.
Computational neural networks are particularly useful for understanding attention in depression, because they are natural extensions of the notion of semantic networks, on which Bower’s (1981) network theory is based (Anderson, 1990; Blank, Meeden, & Marshall, 1991; Hinton, 1991; Yates & Nasby 1993). Their biological congruity allows an intuitive representation of Le Doux’s (1997) model to be integrated with the semantic network approach.
Additionally, neural network representations avoid the common difficulties associated with representing "hot" versus "cold" cognitions in semantic networks. That is, people can think about semantic aspects of emotional concepts such as sadness without necessarily feeling sad. As long as both the semantic and emotional aspects of sadness are represented by a single node in a semantic network, it is difficult to represent the idea of thinking about sadness without formally being sad. Because simulated neurons in a neural network represent "microfeatures" of concepts (Hinton, McClelland & Rumelhart, 1986), it is logical to assume that emotional and semantic aspects of a single concept would be represented by different nodes, corresponding to the different brain areas implicated by Le Doux’s model.
A Computational Framework for Investigating Affective Information Processing
The following sections augment a growing body of neural network models of unipolar depression (see Siegle (1998b) for a review) by presenting and extending a computational neural network model that embodies the essential features of Bower’s (1981) and Le Doux’s (1997) models of emotional processing (Siegle, 1996; Siegle, 1998a; Siegle & Ingram, 1997a,b; Siegle, Ingram, & Matt, 1995; Siegle, Ingram, Matt, & Granholm, 1998). The goal in producing the computational model was to reproduce salient aspects of attention to emotional information, including the gradual perception and recognition of emotional and nonemotional aspects of a stimulus. Nodes in the model were not intended to strictly represent groups of human neurons, or even the details of brain structures. Rather, the general hypothesis that parallel connected systems are responsible for the recognition of affective and semantic aspects of information was captured. The model can be used to make predictions regarding the time course of emotional information processing on the lexical decision and valence identification tasks.
Architecture
The model is shown in Figure 1. In the figure, each small circle represents an individual node. A large ellipse represents a cluster of nodes that perform the same conceptual function. In the model, nine orthographic nodes, representing perceptual characteristics of stimuli, are fully connected to, and feed activation forward to, nine nodes representing the semantic content of stimuli and two nodes representing the affective content of stimuli, in parallel. Feedback occurs between the nodes representing affective and semantic features. The semantic nodes roughly correspond to hippocampal system processing. The affective nodes correspond to amygdala system processing. The feedback between them captures Le Doux’s notion of feedback between these brain areas, as well as Ingram’s (1984) notion of feedback between mental representations of affective and semantic features.
-------------------------------------------------------------------------
Figure 1. A neural network model for investigating affective and semantic information processing

-------------------------------------------------------------------------
To allow the network to make an analog of a lexical decision or valence identification, both semantic and affective feature nodes feed activation forward to twelve nodes representing the network’s outputs (nine semantic concepts, three valences). The outputs thus represent products of decision processes assumed to occur in the frontal lobes. Inhibitory connections from the outputs back to the valence units were incorporated to approximate Davidson's (1997) idea that frontal activity inhibits amygdala firing. These connections were not used throughout the majority of simulations for reasons discussed at the end of this chapter.
Which task a person is performing, the lexical decision or valence identification task, is assumed only to affect her eventual decision, and not early attentional processes. This intuition is captured in the network by allowing task units, representing the context in which the stimulus is to be interpreted (either as a lexical decision or valence identification) to feed activation to the output nodes. These nodes are represented on the right side of Figure 1 to imply that they are an internal cognitive phenomenon, rather than perceptual inputs.
Conceptually, the recognition of a stimulus might proceed as follows. At the beginning of a simulation, activations of the input nodes are set to predetermined values representing a stimulus, subject to some perceptual noise. As activation from the input units feeds to the valence and semantic units, a pattern of activation in the semantic units would be formed, corresponding to some non-emotional features the network has learned (e.g., if the stimulus was "birthday", the notion of the date on which a person is born might be retrieved). At the same time, the positive and negative valence units would take on activations suggesting that the stimulus is either positive or negative. Feedback between the semantic and valence units might lead the network to change the pattern of activations in the semantic units, suggesting that the network has associated a different set of non-emotional features with the stimulus. Similarly, the feedback could lead to a different pattern of activations occurring in the valence units, suggesting, for example, that the network originally identified the stimulus as positive, but now identifies it as negative. During this process the output units become active in proportion to activation of the semantic and valence nodes, with additional contextual information from the task units. The fit of these activations to each stored pattern is evaluated simultaneously. When an overwhelming proportion of evidence for one output pattern is accumulated, the network can be said to have "recognized" the stimulus, as that pattern. This event would correspond to a person having recognized the non-emotional features of the stimulus, and having assigned it an affective valence. By allowing activation to continue within the model, associations occurring after a reaction time can also be observed.
The model thereby captures the time course of attention to a presented stimulus. To the extent that the model is valid, it provides information about how relationships between emotional and non-emotional aspects of information could influence attention, and can provide insight into what aspects of an emotional stimulus a person might pay attention to, both before and after the stimulus is recognized.
Representation of Stimuli
Representation of non-affective features was kept as simple as possible. Orthographic, semantic, and output features were bipolar and normalized, such that one node was activated with strength 1 and all others were activated with strength -2/(vectorlength-2). The lexical decision task was represented in the task nodes as activations: 1, -1. Valence identification was represented as activations -1, 1.
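The coding scheme just described can be sketched in a few lines. This is only an illustration under stated assumptions: the function name one_hot_bipolar and the constant names are hypothetical, and the off-node value is copied verbatim from the formula in the text rather than derived independently.

```python
def one_hot_bipolar(active_index, length):
    """Return a feature vector with 1.0 at active_index.

    All other nodes receive -2/(length - 2), the off value quoted in the
    text (an assumption carried over verbatim, not re-derived here).
    """
    if length <= 2:
        raise ValueError("length must exceed 2 for the off value to be defined")
    off_value = -2.0 / (length - 2)
    return [1.0 if i == active_index else off_value for i in range(length)]


# Task context vectors, as given in the text.
LEXICAL_DECISION = [1.0, -1.0]
VALENCE_IDENTIFICATION = [-1.0, 1.0]

# A nine-node orthographic pattern whose first node is the active one.
stimulus = one_hot_bipolar(0, 9)
```

With nine nodes, each inactive node takes the value -2/7, so every stimulus differs from the others only in which node is active.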
Representation of affective features was based on conventional assumptions that positive and negative valences are either opposite or orthogonal. To represent positivity and negativity orthogonally, two nodes could be used. High activation of one node represents positive information while high activation of the other node represents negative information. Low activation of both nodes represents neutral information. The validity of using a near orthogonal representation of positivity and negativity can be supported empirically. Williams et al. (1998) had 600 undergraduates rate the positivity and negativity of 30 words normed for emotionality. On a scale of 1 (not emotional) to 5 (very emotional), they found that positive words were generally rated as somewhat positive and not negative (mean positivity, mean negativity = 3.62, 1.13). Negative words were rated as less positive and more negative (1.49, 3.45). Neutral words were rated as slightly positive, but lacking in negativity (2.36, 1.28). Given that these values were not perfectly orthogonal, activation of the valence units in the network was made proportional to Williams et al.’s (1998) means. Specifically, ideal activations of valence units for each valence were: positive: 1, .31; negative: .41, .95; neutral: .65, .35.
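The ideal activations listed above are consistent with dividing each mean rating by the largest mean (3.62). The text states only that activations were proportional to the means, so the divisor is an inference from the reported numbers; the names below are illustrative.

```python
# Mean (positivity, negativity) ratings reported by Williams et al. (1998).
MEAN_RATINGS = {
    "positive": (3.62, 1.13),
    "negative": (1.49, 3.45),
    "neutral": (2.36, 1.28),
}


def ideal_activations(ratings):
    """Scale every mean rating by the largest mean, rounded to two places.

    Dividing by the maximum is an assumption that reproduces the values
    quoted in the text; the source says only 'proportional'.
    """
    scale = max(max(pair) for pair in ratings.values())
    return {
        valence: tuple(round(value / scale, 2) for value in pair)
        for valence, pair in ratings.items()
    }


activations = ideal_activations(MEAN_RATINGS)
```

Under this assumption the positive, negative, and neutral targets come out as (1.0, 0.31), (0.41, 0.95), and (0.65, 0.35), matching the values given in the text.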
Training: Simulation of Normal and Depressed Experiences in the Model
Following the idea that nondepressed individuals are exposed to a variety of positive, negative, and neutral information, an analog of normal experience was induced in the network by training it on equal numbers of positive, negative, and neutral exemplars, using a Hebb (1949) learning rule. Practically, this was done by multiplying input vectors by the transpose of desired output vectors to obtain a weight matrix, for each set of connections. This technique is equivalent to using Hebb training with the network on equal presentations of each stimulus with no noise and no forgetting.
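A minimal sketch of this one-shot Hebb construction follows. It sums the outer products of target and input vectors to form a weight matrix and then checks recall. The three one-hot input patterns are an illustrative simplification (not the bipolar coding used in the chapter), chosen so that recall is exact; the function names are hypothetical.

```python
def hebb_weights(inputs, targets):
    """Sum the outer products target x input over all training pairs.

    Rows index receiving nodes and columns index sending nodes, so that
    propagate(x, weights) computes x times the transposed weight matrix.
    """
    n_out, n_in = len(targets[0]), len(inputs[0])
    weights = [[0.0] * n_in for _ in range(n_out)]
    for x, y in zip(inputs, targets):
        for j in range(n_out):
            for i in range(n_in):
                weights[j][i] += y[j] * x[i]
    return weights


def propagate(x, weights):
    """Feed a layer's activation forward through a weight matrix."""
    return [sum(xi * w for xi, w in zip(x, row)) for row in weights]


# Illustrative orthogonal inputs paired with the valence targets from the text.
inputs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
targets = [[1.0, 0.31], [0.41, 0.95], [0.65, 0.35]]
input_valence = hebb_weights(inputs, targets)
recalled = propagate(inputs[1], input_valence)
```

Because each pattern is presented once, with no noise and no decay, the resulting matrix is the same as the one many equal Hebb presentations would approach up to a scale factor.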
Many theorists suggest that the induction of depression involves one or a few pervasive negative life events or loss experiences (e.g., Beck, 1974; Brewin, Andrews, & Gotlib, 1993; Paykel, 1979) that are continuously thought about. This process was operationalized in the neural network model by training the network on a single negative stimulus for a prolonged period after it had been trained on equal numbers of positive, negative, and neutral stimuli. Specifically, the products of the valence and semantic features for a single negative stimulus were repeatedly added to the connections between the valence and semantic units. This technique implements a Hebb rule. To bound the increase in weights, a slight decay factor on previously learned information (a forgetting rule) was imposed.
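The overtraining step can be sketched as a repeated Hebb increment with decay. The exact form of the forgetting rule is not given in the text, so the sketch assumes the old weights are multiplied by a retention factor before each increment; the default values (5 epochs, learning rate 1.0, retention .89) follow the parameters reported for the Hebb simulations, and the example feature values are made up.

```python
def overtrain(weights, semantic, valence, epochs=5, learn_rate=1.0, retention=0.89):
    """Repeatedly add the semantic x valence outer product to the weights.

    Assumed update (not confirmed by the source):
        W <- retention * W + learn_rate * outer(semantic, valence)
    Rows index semantic nodes and columns index valence nodes.
    """
    for _ in range(epochs):
        weights = [
            [retention * w + learn_rate * s * v for w, v in zip(row, valence)]
            for row, s in zip(weights, semantic)
        ]
    return weights


# Hypothetical two-node semantic pattern paired with the negative valence target.
semantic = [1.0, -0.5]
negative_valence = [0.41, 0.95]
start = [[0.0, 0.0], [0.0, 0.0]]
after_five = overtrain(start, semantic, negative_valence)
after_many = overtrain(start, semantic, negative_valence, epochs=2000)
```

Under this assumed rule the increment for each connection converges to its Hebb product divided by (1 - retention), which is how the decay bounds weight growth no matter how long the negative stimulus is rehearsed.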
A number of different algorithms for allowing the network to "learn" initial exemplars, as well as negative information, have been explored. In initial descriptions of the network (Siegle, 1996; Siegle & Ingram, 1997a,b), a back-propagation learning algorithm was used. This algorithm treats learning as an error correction process, in which connections within the network are adjusted to minimize the discrepancies between the network’s outputs and predetermined expected outputs. This procedure is widely used throughout the neural modeling literature, but can be criticized on two grounds. First, there is little evidence suggesting that back-propagation is a biologically plausible learning rule (Jobe, Fitchner, Port, & Gavira, 1995). Second, given the small number of exemplars on which the current network is trained, unless a great deal of noise is included in the network, it learns all training exemplars perfectly, very quickly. Thus overtraining modifies weights in the network very little. As such, it is difficult to simulate the effects of various levels of overtraining without a great deal of noise. To overcome these barriers, new simulations conducted here were done using a Hebb learning rule, which strengthens connections between active nodes during training. The resulting network’s behavior is qualitatively similar to that of the backpropagation trained network. For these simulations, noise could be considerably reduced. Differences between the Hebb and backpropagation trained networks are discussed in the Appendix.
Network Activation During Tasks
The network is cascaded, meaning that each node’s activation is a function of its input over time. Unless otherwise noted, all multiplication described below is matrix multiplication. Activation of a layer is represented by that layer’s name. Connections are represented by the concatenated names of the layers they join, sending layer first. For example, "InputSemantic" represents connections from the input to the semantic nodes.
Before and after stimulus presentation, only noise entered the system. Activation of nodes represents the average firing rate of a population of neurons at a given time t. Initial and late activations of the semantic and valence units occurred according to the rules:
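$$\mathrm{Semantic}_{t} = (1-\tau)\,\mathrm{Semantic}_{t-1} + \tau\,(\mathrm{noise}\cdot\mathrm{InputSemantic}^{T})$$
$$\mathrm{Valence}_{t} = (1-\tau)\,\mathrm{Valence}_{t-1} + \tau\,(\mathrm{noise}\cdot\mathrm{InputValence}^{T})$$
where the subscript t indexes time and τ (listed as t in Table 2) is the diffusion rate for inputs. Noise was bipolar and uniformly distributed. During the presentation of a stimulus, the stimulus also contributed to input activations as:
$$\mathrm{Semantic}_{t} = (1-\tau)\,\mathrm{Semantic}_{t-1} + \tau\,((\mathrm{Input}+\mathrm{noise})\cdot\mathrm{InputSemantic}^{T})$$
$$\mathrm{Valence}_{t} = (1-\tau)\,\mathrm{Valence}_{t-1} + \tau\,((\mathrm{Input}+\mathrm{noise})\cdot\mathrm{InputValence}^{T})$$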
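The cascaded input rule can be sketched as a leaky running average of the noisy, weighted input. The sketch assumes that "bipolar and uniformly distributed" noise means a uniform draw on [-magnitude, +magnitude]; the defaults (diffusion rate 0.1, noise magnitude 0.05) follow the reported Hebb parameters, and the identity weight matrix in the example is purely illustrative.

```python
import random


def propagate(x, weights):
    """Compute x times the transposed weight matrix (rows = receiving nodes)."""
    return [sum(xi * w for xi, w in zip(x, row)) for row in weights]


def cascade_step(previous, drive, rate):
    """One cascaded update: (1 - rate) * previous + rate * drive."""
    return [(1.0 - rate) * p + rate * d for p, d in zip(previous, drive)]


def present(weights, stimulus, epochs, rate=0.1, noise_mag=0.05, seed=0):
    """Present a stimulus (all zeros for noise only) for a number of epochs."""
    rng = random.Random(seed)
    activation = [0.0] * len(weights)
    for _ in range(epochs):
        # Assumed noise model: uniform on [-noise_mag, +noise_mag] per input node.
        noisy = [s + rng.uniform(-noise_mag, noise_mag) for s in stimulus]
        activation = cascade_step(activation, propagate(noisy, weights), rate)
    return activation


identity = [[1.0, 0.0], [0.0, 1.0]]
clean = present(identity, [1.0, 0.0], epochs=10, noise_mag=0.0)
noise_only = present(identity, [0.0, 0.0], epochs=50)
```

With a diffusion rate of 0.1 and no noise, a unit driven at 1.0 reaches 1 - 0.9^10 (about 0.65) after the 10 presentation epochs, so stimulus offset occurs well before the activation has saturated.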
Stimuli were presented for 10 epochs, representing the brief (150 ms) presentation time for empirical stimuli in Siegle's (1998a) tasks, after which the network operated entirely on noise input plus feedback between the semantic and affective feature units for 250 epochs. Feedback between semantic and valence nodes was operationalized according to the update equations:
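$$\mathrm{Semantic}_{t} = (1-b)\,\mathrm{Semantic}_{t-1} + b\cdot\mathrm{lyapunov}\cdot(\mathrm{Valence}\cdot\mathrm{ValenceSemantic}^{T})$$
$$\mathrm{Valence}_{t} = (1-b)\,\mathrm{Valence}_{t-1} + b\cdot\mathrm{lyapunov}\cdot(\mathrm{Semantic}\cdot\mathrm{SemanticValence}^{T})$$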
where b governed the amount of feedback between the structures, and lyapunov governed how quickly the network settled on a set of activations. Values of lyapunov below one act as a decay factor, allowing activations to approach zero. Values above one tend to preserve and increase activation, creating a positive feedback loop between the affective and semantic structures.
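The effect of the lyapunov parameter can be illustrated with a deliberately tiny loop: one semantic unit and one valence unit joined by unit weights. The sketch assumes both layers are updated simultaneously from the previous step's activations and omits the activation limit, so it shows only the direction of change; the names are illustrative.

```python
def propagate(x, weights):
    """Compute x times the transposed weight matrix (rows = receiving nodes)."""
    return [sum(xi * w for xi, w in zip(x, row)) for row in weights]


def feedback_step(semantic, valence, valence_semantic, semantic_valence, b=0.02, gain=0.2):
    """One simultaneous feedback update of both layers.

    gain plays the role of the lyapunov parameter; simultaneous updating
    from the previous activations is an assumption.
    """
    sem_drive = propagate(valence, valence_semantic)
    val_drive = propagate(semantic, semantic_valence)
    new_semantic = [(1.0 - b) * s + b * gain * d for s, d in zip(semantic, sem_drive)]
    new_valence = [(1.0 - b) * v + b * gain * d for v, d in zip(valence, val_drive)]
    return new_semantic, new_valence


def settle(gain, steps=100):
    """Run the one-unit loop from activation 1.0 and return the semantic value."""
    semantic, valence = [1.0], [1.0]
    unit = [[1.0]]
    for _ in range(steps):
        semantic, valence = feedback_step(semantic, valence, unit, unit, gain=gain)
    return semantic[0]


decayed = settle(gain=0.2)
amplified = settle(gain=1.5)
```

In this symmetric toy case each step multiplies activation by 1 + b(gain - 1), so a gain of 0.2 shrinks activation toward zero while a gain of 1.5 lets it grow until, in the full model, the activation limit would stop it.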
Activation of the output units was based on the activation of all units feeding to them as:
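$$\mathrm{Output}_{t} = \mathrm{Semantic}\cdot\mathrm{SemanticOutput}^{T} + \mathrm{Valence}\cdot\mathrm{ValenceOutput}^{T} + \mathrm{TaskPriority}\cdot(\mathrm{Task}\cdot\mathrm{TaskOut}^{T})$$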
where TaskPriority governed how much the context could affect the task. Nonlinearity was introduced by limiting activations of nodes to the range -2 to 2 (Table 2). This technique was used rather than a sigmoid activation function because even small deviations from zero, using a sigmoid, tended to magnify on feedback as a function of the squashing function rather than other properties of the network. Using a piecewise linear function allowed all observed biasing effects to be based on the architecture and training of the network.
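The output rule and the piecewise linear limit can be sketched together. The sketch assumes the limit is applied to the summed output activation (the text says only that node activations were limited), and the single-output weights in the example are made-up numbers chosen to show the limit taking effect.

```python
def propagate(x, weights):
    """Compute x times the transposed weight matrix (rows = receiving nodes)."""
    return [sum(xi * w for xi, w in zip(x, row)) for row in weights]


def clip(values, limit=2.0):
    """Piecewise linear activation: identity inside [-limit, limit]."""
    return [max(-limit, min(limit, v)) for v in values]


def output_activation(semantic, valence, task, sem_out, val_out, task_out, task_priority=0.5):
    """Sum the semantic, valence, and weighted task contributions, then clip."""
    sem_part = propagate(semantic, sem_out)
    val_part = propagate(valence, val_out)
    task_part = propagate(task, task_out)
    total = [s + v + task_priority * t for s, v, t in zip(sem_part, val_part, task_part)]
    return clip(total)


# Hypothetical one-output network driven under the lexical decision context.
strong = output_activation([1.0], [1.0], [1.0, -1.0], [[1.5]], [[1.0]], [[1.0, 0.0]])
weak = output_activation([0.2], [0.1], [1.0, -1.0], [[1.5]], [[1.0]], [[1.0, 0.0]])
```

Inside the limits the unit responds linearly, so small deviations from zero are passed on unchanged rather than being reshaped by a squashing function.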
Soft competition was introduced for output nodes by subtracting the maximum activation of any other node in the output layer from each node’s activation. Matches were determined in a manner analogous to that used by Cohen, Dunbar, and McClelland (1990) to represent word and color naming in a connectionist model of the Stroop task. Following Ratcliff’s (1978) notion that semantic identification is a diffusion process, they suggest that a semantic identification occurs when the activation of the mental representation of a stimulus reaches a threshold. Counters were therefore defined to represent the accumulated evidence for each possible item the network might identify. The counters added evidence for a given stimulus proportional to the difference between the fit of the outputs to the expected output for the stimulus and the maximum fit to any other trained output, subject to Gaussian noise (magnitude 0 for Hebb-trained simulations). Fit was computed as the cosine of the output vector with an expected output vector. When any counter exceeded a threshold (arbitrarily set to 2.5), the network was said to have made an identification. That epoch was counted as the network’s reaction time.
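The evidence counters can be sketched as follows. Each epoch, every counter gains the cosine fit to its own template minus the best fit to any rival template, and the first epoch at which a counter exceeds the threshold is returned as the reaction time. Accumulation noise is left at zero, as in the Hebb-trained simulations; the proportionality constant is assumed to be 1, counters are assumed to be allowed to go negative, and the two-template example is illustrative.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def accumulate(outputs_over_time, templates, threshold=2.5):
    """Return (winning template index, epoch) or (None, None) if no decision.

    Requires at least two templates so that every counter has a rival.
    """
    counters = [0.0] * len(templates)
    for epoch, output in enumerate(outputs_over_time, start=1):
        fits = [cosine(output, template) for template in templates]
        for k, fit in enumerate(fits):
            best_rival = max(f for j, f in enumerate(fits) if j != k)
            counters[k] += fit - best_rival
        winner = max(range(len(counters)), key=counters.__getitem__)
        if counters[winner] > threshold:
            return winner, epoch
    return None, None


templates = [[1.0, 0.0], [0.0, 1.0]]
clear_case = accumulate([[1.0, 0.0]] * 10, templates)
ambiguous_case = accumulate([[1.0, 0.5]] * 10, templates)
```

An output that matches one template exactly crosses the threshold at epoch 3, whereas an output that partly resembles the rival template needs 6 epochs, which is the sense in which competing evidence lengthens the simulated reaction time.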
Using Valence Ratings to Empirically Estimate Decay
To check that the processes used to represent valence in the network, as well as to induce an analog of depression, were effective, an analog of Williams et al.’s (1998) valence-rating procedure, in which dysphoric and non-dysphoric individuals rated the positivity and negativity of a normed word set, can be used. The network was presented with each stimulus on which it was trained for 200 epochs (long enough to reach asymptotic activations in the valence units). The resulting activation values for the valence units representing positivity and negativity were recorded as analogs of ratings of how positive and how negative each stimulus was. These values were scaled from 1 to 3.62 (mean valence rating for positivity). The median valence ratings for stimuli of each valence, over 10 trials, are shown in Table 1. For each valence, ratings for all three stimuli were within 1/10th of a point of each other. As expected, positive stimuli were primarily positive. Negative stimuli were somewhat positive. Neutral stimuli were more positive than negative stimuli.
Whether the overtraining procedure behaved as expected can also be tested using this method. Williams et al. (1998) found that more dysphoric college students generally rated stimuli progressively more negatively and less positively. Inspection showed that the rate of forgetting in the Hebb network governs the magnitude of resulting weights after overlearning. With too little forgetting, all ratings go up after overtraining (i.e., too much positivity for all stimuli). With too much forgetting, previously learned stimuli are no longer recognized after overlearning. The desired effect was obtained for a minimum forgetting rate of .89, which was therefore adopted for subsequent simulations. Table 1 also presents the simulated valence ratings for the network, overtrained five times, with a forgetting rate of .89. In each case, ratings are more negative for the overtrained network. While ratings for the positive information are lower on positivity than in the original network, ratings for neutral stimuli are similar, and positivity ratings for negative stimuli are higher in the overtrained network.
------------------------------------------------------------------
Table 1
Median valence ratings for each stimulus from 10 simulated rating sessions
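| Stimulus                      | Nonovertrained positivity | Nonovertrained negativity | Overtrained (5 epochs) positivity | Overtrained (5 epochs) negativity |
|-------------------------------|---------------------------|---------------------------|-----------------------------------|-----------------------------------|
| positive                      | 3.6                       | 1.4                       | 2.5                               | 1.9                               |
| negative                      | 2.2                       | 2.3                       | 1.9                               | 3.1                               |
| negative, personally relevant | 2.2                       | 2.3                       | 5.0                               | 5.0                               |
| neutral                       | 2.7                       | 1.1                       | 2.0                               | 1.9                               |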
-------------------------------------------------------------------
Implementation
The Hebb trained network was implemented in Matlab on an Intel Pentium II computer. The backpropagation network was implemented in the PlaNet modeling environment (Miyata, 1991) on a Sun SPARC 1 computer. The code used to implement all networks is available from the author upon request. The parameters used for Hebb trained network simulations included here are shown in Table 2.
------------------------------------------------------------------
Table 2
Parameters used in the Hebb trained neural network simulations
| Parameter                                                                    | Value      |
|------------------------------------------------------------------------------|------------|
| Network construction                                                         |            |
| Number of input nodes                                                        | 9          |
| Number of semantic nodes                                                     | 9          |
| Number of valence nodes                                                      | 2          |
| Activation parameters                                                        |            |
| t (input diffusion rate)                                                     | 0.1        |
| b (affective-semantic loop diffusion rate)                                   | 0.02       |
| lyapunov (Lyapunov exponent)                                                 | .2         |
| TaskPriority                                                                 | .5         |
| Maximum network activation                                                   | 2.0        |
| Minimum network activation                                                   | -2.0       |
| Noise magnitude                                                              | 0.05       |
| Task parameters                                                              |            |
| Stimulus duration                                                            | 10 epochs  |
| Total measured duration                                                      | 250 epochs |
| Accumulation noise                                                           | 0.0        |
| Positive determination accumulation threshold                                | 1.0        |
| Negative determination accumulation threshold                                | 1.0        |
| Learning parameters                                                          |            |
| Additional epochs of training on negative stimuli                            | 5          |
| Rate at which new training exemplars are assimilated                         | 1.0        |
| Preservation of old learning during new learning (i.e., the forgetting rate) | .89        |
| Training set                                                                 |            |
| Number of stimuli                                                            | 9          |
| Number of negative stimuli representing depressogenic loss                   | 1          |
-------------------------------------------------------------------
Use of the Network to Predict the Time Course of Attention to Emotional Information
The validity of the current model can be evaluated by examining how well it captures behavioral and physiological aspects of depressed and nondepressed individuals’ responses to the affective valence identification and lexical decision tasks. Three aspects of task performance were modeled using the network: reaction times, signal detection rates, and pupil dilation. For each aspect, the network’s behavior is examined first. Results of empirical experiments derived from the network’s predictions are then described. Finally, implications of confirmed predictions are discussed.
Modeling Reaction Times
Attention to the emotional and nonemotional aspects of information can be examined separately by observing how quickly individuals respond to questions regarding either the emotional or nonemotional aspects of a stimulus. Reaction times have long been assumed to reflect the amount of attention an individual pays to aspects of information (Massaro, 1988). Longer reaction times are associated with paying less attention to a task, potentially because an individual is attending to aspects of information not related to the task.
Network Predictions. As noted previously, the network’s reaction times are thought of as the culmination of a diffusion process, in which evidence is accumulated for various possible responses, until one reaches some threshold. The network could be said to make a lexical decision when evidence for some learned semantic pattern in the output layer (semantic layer in early simulations) reaches a threshold. Similarly, the network can be judged to have identified the valence associated with a stimulus when a pattern of activation in the valence nodes in the output layer (valence layer in early simulations) reaches some threshold. Simulated reaction times are presented for simulations done by Siegle and Ingram (1997a) in Figure 2.
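The accumulation-to-threshold idea can be sketched as follows. This is an illustration under stated assumptions, not the simulation code: the function name and the evidence streams are hypothetical, there is no noise or decay, and the threshold (1.0) and maximum duration (250 epochs) are taken from Table 2.

```python
def simulated_reaction_time(evidence, threshold=1.0, max_epochs=250):
    """Return (epoch, response) for the first accumulator to reach threshold.

    evidence maps each candidate response to its per-epoch evidence increments.
    Returns (None, None) if no accumulator reaches threshold in max_epochs.
    """
    totals = {response: 0.0 for response in evidence}
    for epoch in range(max_epochs):
        for response, increments in evidence.items():
            if epoch < len(increments):
                totals[response] += increments[epoch]
        leader = max(totals, key=totals.get)
        if totals[leader] >= threshold:
            return epoch + 1, leader
    return None, None


# Hypothetical evidence streams for an overtrained network shown a negative word.
evidence = {"positive": [0.125] * 250, "negative": [0.25] * 250}
rt, response = simulated_reaction_time(evidence)
print(rt, response)
```

In this toy case the "negative" accumulator gathers evidence twice as fast, so it crosses threshold first and determines both the response and the simulated reaction time.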
-------------------------------------------------------------------------
Figure 2. Reaction time predictions from simulated affective and lexical decision tasks, from Siegle and Ingram (1997a)
Simulation details: Backpropagation trained network

-------------------------------------------------------------------------
Because the overtrained network tends to associate incoming information with the negative stimulus on which it was overtrained (Siegle & Ingram, 1997a; Siegle, 1996), the overtrained network is shown in Figure 2 to recognize negative stimuli as negative quickly on the simulated valence identification task. In contrast, it is slower to recognize positive information as positive, because of competing activation from the representation of negativity. The network’s behavior suggests that depressed individuals will be slow to report that positive words are positive, but quick to report that negative words are negative, on a valence identification task. If neutral decisions are made in the same way as positive decisions, these would also be slowed (since depressed people would think of negativity rather than neutrality). Alternatively, Siegle and Ingram (1997a) suggest that neutral decisions may be exclusionary, made when neither a positive nor a negative decision is reached after a variable temporal threshold. If depressed and nondepressed individuals have the same threshold for neutral decisions, neutral decision making would not be biased.
On a simulated lexical decision task, the network is slowest to make associations with negative stimuli on which it is not overtrained, because the representation of negative information on which it was overtrained competes for activation. It is fastest at making associations with the negative stimuli on which it is overtrained. The network’s behavior thus suggests that depressed individuals will be slow to say that negative words not specifically associated with their particular depression are words, because they will be reminded so strongly of personally relevant information that they will not immediately respond to the task. In contrast, depressed individuals are expected to be especially fast at responding to negative personally relevant words on a lexical decision task.
Human Data. Human reaction time data largely support the network’s predictions. In a meta-analysis of affective lexical decision task studies, Siegle (1996; Siegle et al., 1998a) found that depressed people generally appeared to react more slowly to negative words than to positive or neutral words, in comparison to nondepressed people. Results from the network simulations parallel Siegle et al.’s (1998a) study, in which the difference (>0) between negative and positive reaction times was larger for dysphoric undergraduates than nondysphoric undergraduates on an affective lexical decision task. The same dysphoric undergraduates were slower to respond to positive than negative words on an affective valence identification task. Siegle (1998a) found similar results with clinically depressed individuals. Specifically, depressed individuals were slowest to name the affective valence of positive words, and were no faster to say that negative words were words than positive words. For comparison with model predictions, Siegle et al.’s (1998a) and Siegle’s (1998a) data are shown in Figure 3.
Implications. Similar reaction time biases were observed in the network and in people. In the network, biased reaction times were due to association of affective and semantic aspects of stimuli with personally relevant negative information. To the extent that mechanisms behind delays in the network match people, it is suggested that depressed individuals could have personally relevant negative thoughts in response to environmental stimuli. Biases observed in the network happen as a function of feedback between structures responsible for identifying affective and semantic features of information. If mechanisms responsible for information processing biases in the network are similar to those in humans, the amount of feedback between structures in the amygdala and hippocampal systems could moderate such biases. Depressed individuals with biases similar to those exhibited by the network are expected to have particular difficulty processing positive information.
-------------------------------------------------------------------------
Figure 3. Reaction times from Siegle et al.'s (1998a) and Siegle's (1998a) affective lexical decision and valence identification tasks

-------------------------------------------------------------------------
Effects of rumination on reaction times
The previous analysis intuitively suggests that greater information processing biases should be associated with more feedback between structures responsible for representing affective and semantic aspects of information. Yet it is unclear from these predictions what "more feedback" means, e.g., more connections between the structures, larger synaptic weights, or greater numbers of iterations through a feedback cycle. Additionally, it is unclear when this increased feedback is expected to take place. Simulations are useful for examining these variables.
Network Predictions. Siegle and Ingram (1997a) suggest that increasing the number of cycles in which a network engages in feedback between the affective and semantic nodes could be considered an analog for depressive rumination. They show that information processing biases on the valence identification task tend to increase when the number of feedback cycles is increased throughout the network’s training. This type of rumination may represent a coping style in which individuals think excessively about emotional information throughout their lives.
In contrast, when excessive feedback, representative of ruminative coping, is invoked only during overtraining in a system that uses a back-propagation learning algorithm, information processing biases on the valence identification task decrease. This situation may represent individuals who invoke ruminative coping processes (e.g., contemplate emotional aspects of environmental stimuli, consciously or unconsciously) only after a personally relevant negative event, as a way of dealing with it. The decrease in biases comes because the would-be stressor is immediately associated with already-learned information through autoassociative feedback, and thus little new learning of the stressful information takes place using a back-propagation learning rule. Siegle and Ingram (1997a) suggest that such a coping strategy may be protective against depression, but may also hinder recovery from depression, as the model with increased feedback also has difficulty relearning positive information after overtraining. This finding is not preserved using a Hebb learning rule. Since the Hebb rule does not attempt to minimize errors in decision making, previous training does not affect the network’s response to current stimuli. Thus biases would be expected to be exaggerated in ruminative copers based on a Hebb-training model.
Human Data. To test the prediction that increased feedback, representing rumination, is associated with increasing information processing biases, Siegle (1998a) gave depressed and nondepressed individuals a measure of ruminative coping (Nolen-Hoeksema and Morrow’s (1991) Response Styles Questionnaire; RSQ) along with the valence identification task. The RSQ is a self-report measure that asks test takers to endorse thoughts and behaviors they engage in while in a depressed mood. It contains a rumination subscale, composed of questions that ask about how often individuals think of aspects of their depression, e.g., "think ‘I am ruining everything'".
Scores on the rumination scale of the RSQ were compared to depressive information processing biases on the valence identification task, operationalized as the difference in reaction times to positive and negative stimuli. Nearly all high rumination scores on the RSQ were associated with depression. Because the network’s predictions affected both depressed and nondepressed individuals, a hierarchical regression was performed on valence identification biases in which depression status was entered on the first step and an individual’s score on the rumination scale of the RSQ was entered on the second step. Depression accounted for 20.3% of the variation in valence identification biases. Rumination was positively linked to depressive information processing biases, accounting for an additional 7.6% of the variation in biases, which was statistically significant.
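The logic of the two-step regression can be sketched with the standard identity for the squared multiple correlation with two predictors. The data below are invented purely for illustration and do not reproduce the reported percentages; the function and variable names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def hierarchical_r2(y, step1, step2):
    """Return R^2 for step1 alone, R^2 for both predictors, and the increment."""
    r_y1, r_y2, r_12 = pearson(y, step1), pearson(y, step2), pearson(step1, step2)
    r2_first = r_y1 ** 2
    r2_both = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return r2_first, r2_both, r2_both - r2_first


# Invented data: depression status (0/1), rumination score, and bias
# (positive minus negative reaction time, in ms).
depressed = [0, 0, 0, 0, 1, 1, 1, 1]
rumination = [1, 2, 3, 4, 3, 4, 5, 6]
bias = [10, -5, 30, 20, 60, 90, 70, 120]

r2_first, r2_both, r2_change = hierarchical_r2(bias, depressed, rumination)
print(round(r2_first, 3), round(r2_change, 3))
```

The increment in R^2 at the second step is the share of variation in bias that rumination explains beyond depression status, which is the quantity reported above.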
Implications. Results examining relationships between information processing biases and rumination were consistent with predictions from the backpropagation model in which rumination was operationalized as a constantly operative personality factor, and with the Hebb learning model in which rumination was considered a coping mechanism, but not with the backpropagation model in which rumination was operationalized as a coping mechanism. This finding suggests that rumination could be understood as a process which occurs throughout some individuals’ lives, independent of the context in which they are situated. If rumination happens only after a negative event, then it is probably not a type of process in which individuals consider the information before learning it in an attempt to understand it; rather, they overlearn their initial perceptions of negative information.
Personally Relevant Information
Network Predictions. The network’s performance highlights a distinction between stimuli rarely made in empirical experiments. The network responded very differently to negative information on which it was overtrained and other negative information. On the valence identification task, biases were especially strong for negative information on which it was overtrained. On the lexical decision task, the network responded especially quickly to negative stimuli on which it was overtrained, but especially slowly to all other negative stimuli.
These differences suggest that it could be important to examine depressed individuals’ responses to negative information that is representative of the stimuli on which they could have been overtrained, and information on which they have not been overtrained. Based on the network’s performance, Siegle (1996) suggests that confounding of such personally relevant and nonrelevant information may contribute to the wide variability in effect sizes obtained on affective lexical decision tasks with depressed people.
Human Data. To test the hypothesis that depressed individuals would respond differently to personally relevant and non-relevant information on the tasks, Siegle (1998a) asked individuals to generate stimuli they considered representative of what they thought about when they were depressed. These stimuli were included along with normed stimuli on the valence identification and lexical decision tasks. Results suggested that depressed individuals responded especially slowly to personally relevant negative words on the valence identification task, in comparison to other negative words and in comparison to positive words. Depressed individuals did not appear to respond particularly quickly to personally relevant negative words on the lexical decision task.
Implications. There are a number of possible implications of the human data. Potentially, personal relevance is not representative of overtraining in the sense in which it is implemented in the model. Another possible explanation is that the model is missing components relevant to explaining the range of depressive information processing biases. In support of this idea, many depressed individuals commented on their long reaction times during the debriefing, saying that they had actively attended to personally relevant negative words. In fact, their attention was so taken by these stimuli that they had been unable to respond to the task. Two depressed individuals broke down in tears whenever personally relevant negative words were displayed.
These reports suggest that slow reaction times for personally relevant negative information could be due to motor inhibition, present when depressed people think about negative information. Potentially, when depressed people think hard about negative information, their entire attention is drawn to the stimulus, and away from their motor response. This type of motor inhibition was not represented in the neural network, and was thus not accounted for in initial predictions. To "fix" the network, aspects of the motor system could be incorporated. This technique would add considerable complexity to the network, and the knowledge to be gained by such an endeavor is questionable. Rather, it may be more useful to examine whether other behavioral and physiological indices that are not subject to motor slowing show the same delays.
Modeling Valence Confusion Rates
Reaction times yield information regarding how quickly individuals recognize information, but not about what they have recognized, e.g., whether they deemed positive information to be positive when they reacted to it. To understand whether information processing biases could lead to valence confusions (e.g., saying that a positive word is negative) a number of simulations were performed.
Network Predictions. Error and confusion rates can be simulated by examining whether the pattern of activation in relevant sections of the network, at its simulated reaction time, is closer to the expected pattern than to other, erroneous patterns of activation. Using this metric, the predicted confusion rates for various levels of overtraining, based on Siegle and Ingram’s (1997b) model, are presented in Figure 4.
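A minimal sketch of this nearest-pattern metric follows. The text specifies only that the activation should be "closer" to the expected pattern, so the use of squared Euclidean distance, the two-unit valence templates, and the trial activations are all assumptions made for illustration.

```python
def closest_pattern(activation, templates):
    """Name of the template with the smallest squared distance to activation."""
    def distance(name):
        return sum((a - t) ** 2 for a, t in zip(activation, templates[name]))
    return min(templates, key=distance)


def confusion_rate(trials, templates):
    """Fraction of (expected, activation) trials classified as another pattern."""
    errors = sum(
        1 for expected, activation in trials
        if closest_pattern(activation, templates) != expected
    )
    return errors / len(trials)


# Hypothetical target patterns over (positive unit, negative unit).
templates = {"positive": [1.0, 0.0], "negative": [0.0, 1.0], "neutral": [0.0, 0.0]}

# Hypothetical valence-unit activations at the simulated reaction time.
trials = [
    ("positive", [0.9, 0.2]),
    ("positive", [0.3, 0.8]),  # drifts toward the negative pattern
    ("neutral", [0.1, 0.7]),   # drifts toward the negative pattern
    ("negative", [0.1, 0.9]),
]
print(confusion_rate(trials, templates))
```

Here the two trials whose activation has drifted toward the negative pattern are counted as valence confusions, while the negative trial is classified correctly.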
-------------------------------------------------------------------------
Figure 4. Valence Confusion Rate predictions from the simulated valence identification task, from Siegle and Ingram (1997b)
Simulation details: Backpropagation trained network, mean of 25 simulations

------------------------------------------------------------------------
As shown in the figure, the non-overtrained model rarely makes valence identification errors. As a consequence of the model’s tendency to associate incoming information with the stimulus on which it was overtrained, the overtrained model displays a tendency to label neutral and positive stimuli as negative. The frequency of valence confusions increases as the model is overtrained. The model never made a valence confusion for negative words. These predictions can be summarized by suggesting that depressed individuals will be biased to label all stimuli as negative.
Human Data. To examine whether depressed individuals were biased to name words as negative, Siegle (1998a) calculated confusion rates for each valence with each other valence. To account for the possibility that some valences were closer to each other in semantic space than others, leading to a false appearance of bias, Luce’s (Luce & Narens, 1983) response rule was used to calculate distance-independent bias estimates for each valence. Bias terms for negative words were over twice as high as those for positive or neutral words, for depressed individuals, whereas bias terms were roughly equal for nondepressed individuals. This result suggests depressed people are biased to say that many stimuli are negative on a valence-identification task, whereas nondepressed individuals are not.
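How a distance-independent bias estimate can be obtained is illustrated below with the biased choice model, one standard formalization of a Luce-style response rule. Whether this exact estimator was used by Siegle (1998a) is not stated, so the sketch is an assumption; the bias and similarity values are hypothetical. In the model, the probability of response r to stimulus s is proportional to bias[r] * similarity[s][r], with symmetric similarities and self-similarity of 1.

```python
from math import sqrt


def choice_probabilities(bias, similarity):
    """Biased choice model: P(r | s) is proportional to bias[r] * similarity[s][r]."""
    table = {}
    for s in bias:
        denom = sum(bias[r] * similarity[s][r] for r in bias)
        table[s] = {r: bias[r] * similarity[s][r] / denom for r in bias}
    return table


def bias_ratio(p, i, j):
    """Estimate bias[i] / bias[j] from proportions p[stimulus][response].

    Symmetric similarity cancels out of the product of the two odds,
    so the estimate does not depend on how close i and j are.
    """
    return sqrt((p[i][i] * p[j][i]) / (p[i][j] * p[j][j]))


# Hypothetical parameters: a response bias toward "negative" of 2 to 1.
bias = {"positive": 1.0, "negative": 2.0, "neutral": 1.0}
similarity = {
    "positive": {"positive": 1.0, "negative": 0.1, "neutral": 0.4},
    "negative": {"positive": 0.1, "negative": 1.0, "neutral": 0.3},
    "neutral": {"positive": 0.4, "negative": 0.3, "neutral": 1.0},
}
p = choice_probabilities(bias, similarity)
print(round(bias_ratio(p, "negative", "positive"), 3))
```

Because the similarity terms cancel, the recovered ratio reflects only the tendency to give one response over another, which is the sense in which such estimates are independent of the distance between valences.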
Implications. To the extent that these data generalize to situations outside the laboratory, it is suggested that depressed individuals will rarely have difficulty in perceiving negative information. In contrast, they may have difficulty processing nonnegative information. Mechanisms behind the model’s similar performance suggest that depressed individuals may tend to see even positive information as negative, because it becomes associated with personally relevant negative information. This type of bias may help to maintain depressive affect, as few environmental stimuli would appear positive. Relearning of positive information would thus be impaired (Siegle, 1996; Siegle & Ingram, 1997a). Siegle and Ingram (1997b) have used the model’s performance to explain the occurrence of individuals who are "too depressed" to complete information processing tasks, or who make excessive errors on these tasks, suggesting that they interpret nearly all incoming information as negative.
Modeling Pupil Dilation
Using the neural network model to produce analogs of reaction times and valence confusion rates shows whether a snapshot of the network’s course of attention can reflect a single moment of human attention. The network provides a great deal more information regarding the nature of attention to emotional information than the moment at which an individual reacts to information. To capture this additional information, it is useful to examine the network's activation during the entire course of attention.
To understand the time course of attention in the network, activation of the valence units (amygdala system functions), semantic units (hippocampal system functions) and output accumulators (frontal functions) can be compared for positive, negative, or neutral information. The sum of positive activations throughout these layers (the network’s energy) is thus a rough estimate of total cognitive load over time. This sum was therefore used as an analog of pupil dilation.
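The pupil dilation analog described here amounts to summing the positive activations across layers at each time step. A minimal sketch follows; the function name, the layer sizes, and the activation values are hypothetical.

```python
def simulated_dilation(valence, semantic, accumulators):
    """Per-epoch sum of positive activations across the three layers.

    Each argument is a list with one entry per epoch, and each entry is the
    list of unit activations in that layer at that epoch.
    """
    trace = []
    for layers in zip(valence, semantic, accumulators):
        trace.append(sum(a for layer in layers for a in layer if a > 0))
    return trace


# Two epochs of hypothetical activations.
valence = [[0.5, -0.25], [0.25, 0.0]]
semantic = [[1.0, -1.0, 0.0], [0.5, 0.25, -0.5]]
accumulators = [[0.0, 0.25], [0.5, 0.5]]

print(simulated_dilation(valence, semantic, accumulators))
```

Negative (inhibited) activations are ignored, so the trace rises only when units become more active, in keeping with its use as a rough index of cognitive load.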
Network Predictions. To make predictions for pupil dilation, it is useful to examine the activity of the network in response to positive, negative, and neutral information on each task after a moderate amount of overtraining. Figure 5 presents the Hebb trained network’s median response from 5 trials, in response to positive and negative stimuli on the valence identification task, over time.
As shown in the figures, before overtraining (left side), the network generally responds to the presentation of positive and negative stimuli by activating its representation of semantic aspects of the incoming stimulus and its valence. This activation falls off after a period. This behavior is seen in the single peak in the top left panel of each sub-figure. Similarly, for positive or negative stimuli, there is a peak and decay for the appropriate valence unit (top right panel). Activation of the appropriate output units leads to a sustained match for the correct output (bottom left panel). Consequently, there is a peak and dip in the expected pupil dilation waveform (bottom right panel). When the network is overtrained on negative information (right side), the appropriate valence and semantic patterns are initially kindled, but after a short time the network’s representation of the negative information on which it has been overtrained becomes more highly activated. A similar reversal occurs for neutral words. For personally relevant negative words, no reversal occurs because semantic nodes representing personally relevant information are initially activated. The reversal happens most quickly for negative words on which the network has not been overtrained, followed by neutral words, and finally by positive words. Similarly, the negative valence unit becomes activated late in the course of attention, even for positive stimuli. As the task nodes do not affect semantic and valence unit activations, the corresponding graphs are nearly identical for the lexical decision task, with the match unit for the personally relevant stimulus eventually becoming very activated.
It is therefore predicted that average dilations will be highest to negative personally relevant words, followed by nonpersonally relevant negative stimuli, neutral, and positive stimuli on both tasks. These predictions are consistent with the idea that consideration of personally relevant negative information interferes with the processing of environmental stimuli by depressed individuals.
More intriguing predictions emerge when simulated pupil dilations are examined as a function of depressive overtraining. Examples of how pupil dilation changes continuously with overtraining on the valence identification task are shown in Figure 6. The figure on the bottom right shows that initial activations in the non-overtrained network are generally high, and decay slowly. As the network is overtrained, its initial activations become lower, but later activations are higher, for all non-personally relevant information, as a function of the activation of personally relevant negative information. This pattern becomes stronger as overtraining increases. The same patterns shown in Figure 6 are apparent on the simulated lexical decision task. The prediction for pupil dilations is thus that nondepressed people should have relatively high early dilations and considerably lower late dilations, for all stimuli, on both tasks. Depressed individuals should have lower early dilations and higher late dilations for all non-personally relevant information. Depressed individuals are expected to have high early and late dilation for personally relevant negative information. With no forgetting function on learning, the same pattern emerges for late simulated dilations, but there is no decrease with overtraining in simulated early dilations.
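The early and late summaries can be computed from a simulated dilation trace as window means. The sketch below uses the windows named in the Figure 6 caption (the first 30 epochs, and everything after the first 100 epochs); the function name and the two toy traces are hypothetical and are shaped only to mimic the qualitative predictions.

```python
def early_late_means(trace, early_end=30, late_start=100):
    """Mean activation in the early window and in the late window of a trace."""
    early = trace[:early_end]
    late = trace[late_start:]
    return sum(early) / len(early), sum(late) / len(late)


# Toy 250-epoch traces (invented values, not simulation output).
non_overtrained = [2.0] * 30 + [1.0] * 70 + [0.5] * 150
overtrained = [1.0] * 30 + [1.0] * 70 + [1.5] * 150

print(early_late_means(non_overtrained))
print(early_late_means(overtrained))
```

With these toy traces the non-overtrained case shows the higher early mean and the overtrained case the higher late mean, which is the contrast the predictions describe.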
-------------------------------------------------------------------------
Figure 5. Simulated valence identification task network responses. Sub-figures on the left represent the network’s behavior before overtraining. Sub-figures on the right represent the network’s behavior after 4 epochs of overtraining. Each sub-figure represents the network’s response to an exemplar of a different valence. The x-axis in each panel represents time, and the y-axis represents activation. In each of the sub-figures, the top left panel represents the activation of the network’s semantic features. The activation of each trained exemplar is represented by a single line in the panel. The top right panel is the activation of its affective features. The bottom left panel is the accumulation of evidence for a given valence or semantic exemplar. On the bottom right is the sum of activations for these layers, used as an analog of pupil dilation.
Simulation details: Hebb trained network, median of 5 simulations

--------------------------------------------------------------------------
Figure 6. Simulated early and late network activation on the valence identification task
For each set of graphs, representing the median activation over 5 simulations, the top left panel is the network’s average activation in the first 30 processing epochs. The top right panel is the network’s average activation after the first 100 epochs. The bottom left panel is the network’s simulated reaction time. The bottom right panel shows the network’s activation on the vertical Z axis. Time is represented on the horizontal X axis, and overtraining is on the horizontal Y axis.
Simulation details: Hebb trained network, 8 levels of overtraining. The median of 5 simulations was taken at each level.

--------------------------------------------------------------------------
Human Data. To examine whether the predicted attentional patterns could be obtained empirically, Siegle (1998a) measured pupil dilation for six seconds after stimuli were presented, in depressed and nondepressed individuals, on each task. To rule out effects of differential response times, responses were averaged time-locked to reaction times. The average response curves for depressed and non-depressed individuals were similar for all valences; curves averaged over all individuals, for all trials, of all valences are shown in Figure 7. Principal components analysis was performed on pupil dilation waveforms for each valence, for each condition, for each individual to establish separate early and late dilation intervals. As predicted by the network model, factors representing early dilations were uniformly lower for depressed than non-depressed individuals. Late dilations were uniformly higher. Depressed individuals' dilations in response to personally relevant stimuli were not, on average, dramatically different from those to other stimuli, as predicted by the network. Still, tests of parameters associated with dilation suggested that in comparison to nondepressed individuals, the slopes of depressed individuals’ pupil dilation curves in the late phases of attention were flatter in response to personally relevant negative words than to positive words on the valence identification task.
-------------------------------------------------------------------------
Figure 7. Mean of median pupil dilations averaged over valences, for depressed and nondepressed individuals, from Siegle (1998a). Responses were time-locked to reaction times. Time on the X axis is relative to individuals' reaction times.

-------------------------------------------------------------------------
Implications. The results above suggest that depressed individuals attend relatively little to information in the early stages of attention, and that their attention is particularly sustained in the late stages. Mechanisms behind the model’s similar performance suggest that environmental stimuli could serve primarily as cues for depressed individuals to think about overtrained information, which they do long after stimuli are presented. This interpretation is consistent with the idea that depressive information processing biases are primarily associated with late elaborative or ruminative processes, rather than the earliest perceptual processes (e.g., Mathews & MacLeod, 1991).
On using different sets of parameters
Different sets of parameters, and slightly different models developed over the last five years, were used to create the different simulations discussed here. In some sense, this practice is problematic in that a consistent set of parameters was not used to make a consistent set of predictions. The rationale for this decision speaks to the seriousness, or lack thereof, with which the models are intended to be taken. The models presented here are simply sets of differential equations that help to formalize intuitions about the behavior of groups, and to generate predictions about ways these groups may behave under certain conditions. Humans may have enough actual degrees of freedom that slight changes in parameters approximated by the models would not lead to catastrophic differences in human behavior as they would in the highly constrained models. Different humans may have different values for human analogs of model parameters (Siegle & Ingram, 1997a). The models’ behaviors should, at most, be interpreted as interesting ways to understand aspects of behavior, one at a time, that allow predictions about aspects of the behavior of possibly different, and variable, depressed and nondepressed individuals.
What This Experiment Could Say About Depressed People
Depression leaves people miserable, thinking about negative things, feeling bad, and, frequently, becoming suicidal. By understanding the processes that transform normal patterns of attention and association into very negative ones, insight can be gained regarding the experiences of depressed people. The current model specifically lends insight into aspects of attention occurring in the seconds after information is presented.
Taken together, the model’s predictions present a novel picture of attention in depression. Depressed individuals are hypothesized to pay little attention to information as it is presented. As time passes they begin to think and ruminate on the affect associated with the information. They turn things negative. Seconds after a stimulus is presented, they are still thinking, but about whatever negative information is central to their depression, rather than the presented stimulus. In contrast, nondepressed individuals are suggested to process stimuli quickly, and to be done with them. Their responses will therefore differ based on the valence of stimuli.
Some Speculations About the Neuropsychology of Information Processing Biases in Depression
To better understand the mechanisms behind the observed information processing biases, it is instructive to view the internal representations within the network that produced them. Figure 8 graphically presents the weights for every connection within the network in a Hinton diagram. Each layer of connections within the network is represented. The layer’s inputs are graphed across the top and its outputs down the left. Filled circles represent positive weights and can be thought of as excitatory connections. Empty circles represent negative weights and can thus be thought of as inhibitory. The condition in which the network is overtrained on one negative stimulus, 5 times, is presented in the bottom part of the figure. In each case, the first three stimuli are positive, the next three are negative, and the final three stimuli are neutral.
-------------------------------------------------------------------------
Figure 8. Strength of connections within the network before and after overtraining
Simulation details: Hebb trained network, 5 epochs overtraining

-------------------------------------------------------------------------
In the non-overtrained network, connections suggest that each input strongly activates one semantic unit. Each input and semantic unit is slightly negatively associated with one valence, and more negatively associated with the other. Thus, positive stimuli strongly inhibit the activation of negative associations both from inputs and from the semantic units. Negative stimuli inhibit the activation of positive associations. Similarly, positive associations in the valence layer inhibit negative semantic associations, and negative valence associations inhibit positive semantic associations. Neutral associations inhibit both positive and negative associations. In the output layer, semantic units activate exactly one semantic output. Semantic units activate one valence output, and inhibit the others. Valence units activate either the positive or negative valence, and inhibit the semantic and other valence units.
After overtraining, connections throughout the network are affected. Each input activates the semantic units for all other units to a greater degree than in the non-overtrained network, with the exception of the one negative exemplar on which the network was overtrained. Unintuitively, all inputs inhibit the activation of this exemplar, while that exemplar inhibits the activations of all other semantic units. When presented, the personally relevant stimulus activates its semantic representation to a greater degree than in the non-overtrained network. Also after overtraining, the negative valence unit is inhibited to a greater extent by all stimuli other than that for the personally relevant stimulus. The valence units inhibit the activation of all semantic units other than the personally relevant stimulus unit, to a greater extent than before the overtraining.
Biases in the network can thereby be explained in the following way. All valences are represented by positive activations of both valence units to some small degree. In the overtrained network, even slight positive activation of the valence units excites the semantic unit for the personally relevant stimulus. This in turn excites the negative valence unit to a large degree. All other units are strongly inhibited as the vicious cycle of activation of a personally relevant negative thought excites the mental representation of sadness.
While analogs of such a simplistic system to a human brain are tenuous at best, basic application of the same processes to physiology may be useful. Were the semantic units to represent hippocampal system activity and the valence units to represent amygdala system activity, the following interpretation would be offered. When people become depressed, the transfer of information between the hippocampal and amygdala systems is inhibited for all but a few personally relevant thoughts. Normal semantic associations based on emotional valence would thus be disrupted. Any excitatory amygdala activity would be assumed to kindle thoughts of the personally relevant stimulus, which would rekindle the amygdala.
Predictions about Inhibition of Amygdalar Activity by Cortex
These connection weights help to derive predictions that are consistent with recent proposals regarding amygdala activity in depressed individuals. Davidson (1998) suggests that, in nondepressed individuals, inhibition of amygdalar activity by prefrontal cortex prevents affect from interfering with normal semantic associations. He further suggests that amygdala activity becomes generally uninhibited in depressed individuals, such that any stimulus would lead to amygdalar activity. In support of this idea, Davidson has shown that depressed individuals have lower activity in the left dorsolateral prefrontal cortex (PFC) than do nondepressed individuals, and hence less amygdalar inhibition.
The following experiment with the neural network suggests that amygdalar inhibition might help to maintain normal information processing in the face of overtraining. Figure 9 shows the simulated responses from a network in which feedback from the output nodes (representative of PFC activity) was allowed to inhibit the valence nodes (representing amygdalar activity). The top segment of the figure contains panels representing the network’s simulated pupil dilation as a function of inhibition strength. The bottom right panel in this segment shows time on the horizontal x axis, simulated dilation on the vertical y axis, and inhibition strength on the depth z axis, increasing towards the observer. The bottom segments show the response in each network layer at the minimum (0) and maximum (.85) simulated levels of inhibition. As shown in the figure, as inhibition increased, early dilation did not change appreciably, but late dilation decreased, as did the match to negativity and the activation of the personally relevant semantic stimulus. The resulting intuition is that inhibition of the amygdala by the PFC could prevent depressive rumination after an individual has negative experiences. Still, there is no reason, based on the model, to believe that low inhibition of the amygdala alone would cause depressive information processing biases. Rather, frontal activation could serve as a protective factor against depression.
--------------------------------------------------------------------------
Figure 9. Simulated pupil dilation for increasing inhibitory feedback from output nodes to valence nodes, on presentation of a non-personally relevant negative stimulus on the valence identification task
Simulation details: Hebb trained network, 8 epochs overtraining, median of 5 simulations.

--------------------------------------------------------------------------
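The qualitative idea behind the inhibition experiment, that feedback from an output node can damp a self-sustaining semantic-valence loop late in processing while leaving the earliest response unchanged, can be sketched with a three-unit toy loop. This is not the chapter's network: the update rule, the weights (0.5 self-retention, 0.6 cross-excitation), the input pulse, and the function name `run_loop` are all assumptions chosen only to make the behavior easy to inspect. The inhibition strengths 0 and .85 are the only values borrowed from the text.

```python
def clip(x, lo=0.0, hi=1.0):
    """Keep an activation inside [lo, hi]."""
    return max(lo, min(hi, x))


def run_loop(inhibition, cycles=60, pulse=5):
    """Toy semantic <-> valence loop with inhibitory feedback from an output unit.

    s: semantic unit, v: valence unit, out: output unit that mirrors v and
    feeds back to inhibit v with strength `inhibition`. All weights and the
    input pulse are illustrative assumptions, not fitted parameters.
    Returns the valence activation after each cycle.
    """
    s = v = out = 0.0
    trace = []
    for t in range(cycles):
        inp = 0.5 if t < pulse else 0.0           # brief external input to s
        new_s = clip(0.5 * s + 0.6 * v + inp)     # valence excites semantics
        new_v = clip(0.5 * v + 0.6 * s - inhibition * out)
        out = new_v                               # output reads out valence activity
        s, v = new_s, new_v
        trace.append(v)
    return trace


no_inhibition = run_loop(0.0)
max_inhibition = run_loop(0.85)
print("early:", no_inhibition[1], max_inhibition[1])
print("late: ", no_inhibition[-1], max_inhibition[-1])
```

In this toy, the first response of the valence unit is identical under both settings, because no output activity exists yet to feed back, whereas the late activity stays saturated without inhibition and decays with it. That mirrors, only loosely, the pattern reported for Figure 9.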
Some predictions derived from the network do appear to contrast with Davidson's proposal, though. Under Davidson's theory, hypofrontal activation could lead to a constant disinhibition of amygdalar activity. In contrast, the weight diagrams in Figure 9 suggest that only thoughts of personally relevant negative information will lead to amygdala activity in depressed individuals, and other stimuli will effectively inhibit it, until they remind the individual of personally relevant negative information. The theories could be experimentally differentiated in the following manner. Davidson’s (1997, 1998) theory of a disinhibited amygdala would suggest that in depressed individuals the amygdala would be especially active throughout the course of attention. The current theory suggests that the amygdala (and hippocampus) in depressed individuals would be less active than in nondepressed people in the early stages of attention, but as they began to associate a stimulus with personally relevant negative information in the late stages of attention, their amygdala would become progressively more active. Neuroimaging of the amygdala system during a valence identification task could thus differentiate between these theories.
Potential Treatment Implications
Ideally, information about depressive attentional styles can be used to create interventions that account for them. For example, if clinicians understand what aspects of information a depressed person focuses on, these aspects of information could be focussed on during cognitive interventions. Moreover, experiments can be done to understand what strategies remediate information processing biases in the network; these strategies could then be applied to people.
The current findings have potential implications for the cognitive and pharmacological treatment of depression. Cognitive therapies are often based on the idea that by identifying and challenging negative thoughts, depressed individuals can see things more positively. The current model represents many aspects of negative information processing without ever accounting for a person’s subjective belief in the validity of his or her negative thoughts; the thought, irrespective of its subjective validity, kindles the vicious spiral of depression. Any therapy that focussed primarily on identifying negative cognitions could serve to help the depressed individual learn them better, and thus to have more negative thoughts! Other techniques seem necessary to make such a strategy work.
Replacing dysfunctional negative cognitions with more adaptive or useful positive cognitions seems especially promising. In terms of the current model, it could be construed as training a person on positive thoughts even though that person's entire limbic system is actively inhibiting the consideration of such material.
The following experiment with the simulated neural network shows that positive retraining can largely overcome biases induced by overtraining. Figure 10 shows the network’s response to a positive stimulus, along with connection weights between the semantic and valence layers, after the network is retrained on one positive and one neutral exemplar for five epochs. As shown in the figure, the retrained network’s valence activation, match accumulation, and simulated pupil dilation curves for the valence identification task look increasingly like they did before the overtraining. In the semantic nodes, it can be seen that the retrained network responds to the presented retrained stimulus by activating the retrained stimuli, to the extent that the overtrained personally relevant negative stimulus cannot usurp their activation. As shown in the Hinton diagrams, the retrained network still inhibits positive information more than it originally did, but activation from the new personally relevant positive and neutral patterns allows competition from valence nodes representing positivity.
---------------------------------------------------------------------
Figure 10. Network response to a positive stimulus before and after positive retraining
Simulation details: Hebb trained network, median of 5 simulations

---------------------------------------------------------------------
Retraining depressed people on positive exemplars is thus expected to lead people to think about specific positive exemplars, even when negative cognitions are not challenged. The trick will be to make positive cognitions "stick" for depressed people in the same way that negative cognitions do. The more a depressed person associates incoming information with learned negative exemplars, the less likely a positive exemplar is to be learned as such. Siegle (1996) and Siegle and Ingram (1997a) have shown that the amount of feedback occurring between the affective and semantic representations of information in the brain governs how likely information is to be turned negative. Potentially, ruminative response styles can be targeted in therapy before positive retraining is undertaken, to aid in relearning. Meditative interventions such as mindfulness training (e.g., Teasdale, Segal, & Williams, 1995) may help individuals to stop associating affective aspects of stimuli with semantic associations, thereby breaking the ruminative cycle that drives the network's information processing biases.
There are also implications of the current analyses for pharmacologic interventions. It was noted that the primary function of depressive overtraining was to increase inhibition of cognitions that are not personally relevant and negative. This analysis suggests that a pharmacologic agent that could block inhibition in the amygdala and hippocampal systems might be useful in the remediation of depression. Park (1998, unpublished) presents converging evidence suggesting that serotonergic pathways stemming from the median raphe may serve a primarily inhibitory function, and may thus be candidates for pharmacologic intervention. Additionally, because biases are hypothesized to occur as a result of inhibitory feedback between the hippocampus and amygdala systems, drugs targeting either of these structures could break the cycle. If later research shows that certain depressed individuals attend primarily to the affective or semantic aspects of information, drugs specifically targeting one or the other of these structures could be considered.
Conclusions
The work presented in this chapter has brought together a number of converging lines of research. Using a computational neural network model, cognitive and physiological theories of attention to emotional information were shown to be equivalent. A theory of depression, originally advanced for semantic networks (Ingram, 1984) was shown to translate, with only slight modification, to a plausible set of physiological structures. By implementing the model computationally, predictions regarding the time course of attention in these models, for depressed and nondepressed individuals, were advanced. Using behavioral and physiological measures, these predictions were tested. Based on the results of the research, integrative conclusions regarding the behavioral, cognitive, and physiological underpinnings of depression were advanced. Most notably, depressed people were observed to ruminate more than nondepressed people.
Depression is, by definition (APA, 1994), a disorder that affects people’s behavior, cognitions, and physiology. It therefore seems necessary to account for all three domains in explaining the onset and maintenance of the disorder. It is my hope that this type of integrative research, tied together using computational models, may prove to be a powerful research tool in future studies of depression. I believe that integrating behavioral, cognitive, and physiological research on depression, through computational modeling and empirical model evaluation, has the power to advance our understanding of depression in all of these fields.
APPENDIX -- DIFFERENCES BETWEEN THE BACKPROPAGATION AND HEBB TRAINED NEURAL NETWORKS
This appendix details the design of the backpropagation trained network used in early simulations. Because the code for the backpropagation network is similar to the code for the Hebb trained network, only salient differences will be described.
Representation. In most simulations activation fed forward from orthographic inputs, through a layer of generalization nodes, to activate semantic feature nodes and 2 valence nodes. Feedback occurred between the semantic and valence units. Siegle (1996) did not connect inputs to valence nodes, and used 18 input and semantic nodes and 12 generalization nodes. Siegle & Ingram (1997a) also did not connect inputs to valence nodes, eliminated the hidden nodes, and included only 10 orthographic and 10 semantic nodes. For most simulations, only the first nine nodes were used. The tenth node was reserved for simulations involving "novel" stimuli to which the network was not exposed during its initial training period.
Siegle’s (1996) network was trained on twelve positive, twelve negative and twelve neutral stimuli. Stimuli were generated as pseudo-random strings of .5’s and -.5’s in which 2/3 of the stimuli were made to be -.5’s. To simplify analyses, three positive, three negative, and three neutral stimuli were used for Siegle & Ingram’s (1997a) study. Stimuli were represented in a localist fashion in which only one orthographic, one semantic and one valence node was expected to be active for a given stimulus. The restriction to a localist representation was not essential for the simulations, but was useful for illustrating how network connections changed when various aspects of personality were simulated, and pre-empted concerns regarding the differential feature frequencies in a distributed representation.
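The two stimulus representations described above can be sketched as follows. This is a hedged illustration, not the original stimulus-generation code: it assumes that "2/3" refers to the proportion of elements within each pseudo-random string, and that a localist pattern uses 1.0 for the single active node and 0.0 elsewhere. The function names are hypothetical.

```python
import random


def distributed_stimulus(n_nodes=18, rng=None):
    """Pseudo-random string of .5's and -.5's with two-thirds of elements at -.5.

    Assumption: the 2/3 proportion applies to the elements of each string.
    """
    rng = rng or random.Random()
    n_low = (2 * n_nodes) // 3
    values = [-0.5] * n_low + [0.5] * (n_nodes - n_low)
    rng.shuffle(values)                 # randomize positions, keep the proportion
    return values


def localist_stimulus(index, n_nodes=10):
    """Localist pattern: exactly one active node (1.0 is an assumed 'on' value)."""
    return [1.0 if i == index else 0.0 for i in range(n_nodes)]


sample = distributed_stimulus(18, random.Random(0))
third_stimulus = localist_stimulus(2, 10)
print(sample)
print(third_stimulus)
```

With a localist code each stimulus touches only one row or column of a weight matrix, which is what makes the connection changes after overtraining easy to read from a weight diagram.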
In Siegle’s (1996) study, valences were represented orthogonally (positive: .5,-.5; negative: -.5,.5; neutral: -.5,-.5). In subsequent simulations, positive affective features were coded as activations of .2,-.2, negative features were coded as -.2,.1, and neutral features were coded as 0,-.2. These valences were empirically determined from Williams et al.'s (1998) valence rating experiments.
Training. Training involved presenting a simulated orthographic representation of a stimulus to the network, using the presentation parameters for the tasks, for 10 cycles, observing the network’s semantic and valence nodes, and adjusting the weights within the network until the desired semantic and valence representations were achieved using a modified back-propagation learning algorithm (Rumelhart, Hinton, and Williams, 1986). Weights were modified based on the error after 10 cycles rather than according to the standard backpropagation through time algorithm which updates weights based on the average error at each cycle, because it is assumed that learning only occurs after associations are made. Training continued until the sum of the mean squared error in the semantic and valence nodes was below 0.001 for Siegle’s (1996) study, for a block of all inputs. Due to the greater error incurred by not using hidden nodes, Siegle & Ingram (1997a) used an error threshold of 0.004 for all stimuli. Training for connections from the semantic and valence nodes to output nodes was done separately. To represent the induction of depression, the network was trained on a single negative stimulus for 100 epochs after the network’s initial training was complete. Due to their smaller network, Siegle and Ingram (1997a) used 70 epochs of overtraining.
Network Activation During Tasks. The rules governing the network’s activation were the same as for the Hebb trained network, with the exception that nonlinearity was introduced as a logistic function rather than a piecewise linear function. t was generally set to .5, and b was generally set to .1. The threshold for affective and semantic determinations on match filters was .46. Gaussian noise was incorporated on all layers. After a specified stimulus onset asynchrony, network inputs were eliminated entirely, rather than propagating noise through the network, as in the Hebb network simulations. To allow for neutral judgements in a network with unipolar weights (i.e., neutrality is represented as the absence of positivity and negativity), the network was said to judge a stimulus to be neutral when little evidence was accumulated for either valence (both accumulators less than 0.8) after a temporal threshold of 132 epochs plus Gaussian noise. Nonword decisions were made using a temporal threshold.
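The decision rule for neutral judgements, and the two nonlinearities being contrasted, can be sketched as below. This is an illustrative reading of the paragraph rather than the simulation code. The 0.8 ceiling and the 132-epoch deadline come from the text, and the accumulation threshold of 1.0 comes from Table 3. The standard deviation of the deadline noise (10), the bounds of the piecewise linear function, and all function names are assumptions.

```python
import math
import random


def logistic(x):
    """Logistic nonlinearity (backpropagation trained network)."""
    return 1.0 / (1.0 + math.exp(-x))


def piecewise_linear(x, lo=0.0, hi=1.0):
    """Piecewise linear nonlinearity (Hebb trained network); bounds assumed."""
    return max(lo, min(hi, x))


def noisy_deadline(rng, mean=132.0, sd=10.0):
    """Temporal threshold plus Gaussian noise; sd is an assumed value."""
    return mean + rng.gauss(0.0, sd)


def valence_decision(pos_evidence, neg_evidence, cycle, deadline,
                     accumulation_threshold=1.0, neutral_ceiling=0.8):
    """Return 'positive', 'negative', 'neutral', or None if still undecided."""
    if max(pos_evidence, neg_evidence) >= accumulation_threshold:
        return "positive" if pos_evidence >= neg_evidence else "negative"
    # Neutral only after the deadline, and only if both accumulators are low.
    if (cycle >= deadline and pos_evidence < neutral_ceiling
            and neg_evidence < neutral_ceiling):
        return "neutral"
    return None


deadline = noisy_deadline(random.Random(1))
print(valence_decision(0.3, 0.2, cycle=200, deadline=deadline))
```

Note that under this reading an accumulator sitting between the neutral ceiling and the accumulation threshold leaves the network undecided, so the response is determined by whichever condition is met first on a later cycle.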
------------------------------------------------------------------
Table 3
Parameters used in the backpropagation trained neural network simulations
| Parameter | Value |
| --- | --- |
| Number of input nodes | 10 |
| Number of semantic nodes | 10 |
| Number of valence nodes | 2 |
| Activation parameters | |
| t (input diffusion rate) | 0.5 |
| b (affective-semantic loop diffusion rate) | 0.2 |
| maximum network activation | 1.0 |
| minimum network activation | 0 |
| network noise | 0.05 |
| Task parameters | |
| accumulation noise | 1.0 |
| temporal threshold for "Nonword" decisions | 200 epochs |
| temporal threshold noise | 10 |
| positive determination accumulation threshold | 1.0 |
| negative determination accumulation threshold | 1.0 |
| Learning parameters | |
| eta (learning rate) | 0.2 |
| alpha (learning momentum) | 0.4 |
| error threshold for initial learning | 0.004 |
| additional epochs of training on negative stimuli | 70 |
| Activations in one training epoch | 10 |
| Training set | |
| Number of stimuli | 9 |
| Number of negative stimuli representing depressogenic loss | 1 |
-------------------------------------------------------------------
Siegle & Ingram (1997a) allowed less feedback between the network’s representation of semantic and valence identification than did Siegle (1996). The network performed relatively similarly to Siegle’s (1996) original network with the exception that when overtrained on negative stimuli it was facilitated on negative stimuli on the valence identification task with respect to the network which was not overtrained.
Differences between Hebb and Backpropagation learning rules
Networks trained with Hebb and backpropagation learning differ in their behavior, in how that behavior is interpreted, and in the types of parameters available for changing it. There are theoretical arguments for the biological relevance of each system. For example, many people argue that there is no biological analog of backpropagation; others argue that the brain surely has hidden layers, and it is unclear how to incorporate hidden layers in a supervised Hebb learning system. The more interesting issues occur at a more theoretical level.
Differences in assumptions about the nature of learning. Backpropagation treats learning as a procedure for minimizing error between actual and expected responses to stimuli. As such, when information is learned sufficiently well, little new learning takes place, unless there is a great deal of noise in a system (i.e., unless error is always introduced into the system’s output, so that there is something to be minimized). Two consequences of this approach affected modeling efforts. First, enough noise was added to the system that its behavior was often erratic. Second, when feedback was increased in the network after original training, no new learning took place during overtraining since the network’s outputs tended to resemble learned patterns; the feedback acted like a "cleanup" system, minimizing the network’s errors. One downside of this rule is that someone who continues to have negative experiences would not be expected to become progressively more depressed, based on the network’s behavior.
In contrast, Hebb learning assumes that each experience strengthens connections, regardless of how strong those connections were previously. Thus, new learning can always occur. In this case, rumination could not be considered a coping mechanism. More feedback would strengthen associations with negativity, and thus, would allow the network to relearn negative associations more strongly. Someone who has more negative experiences would be predicted to become progressively more depressed.
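The contrast between the two assumptions about learning can be made concrete with a single connection that is repeatedly exposed to the same experience. The sketch below uses the delta rule on one linear unit as a stand-in for error-driven (backpropagation-style) learning, and an unbounded supervised Hebb rule for comparison. The single-weight setup, the clamped activations, and the function names are assumptions made for illustration; only the learning rate of 0.2 is borrowed from Table 3.

```python
def error_driven_updates(eta=0.2, target=1.0, presentations=100):
    """Delta rule on one linear unit: learning stops once error vanishes."""
    w, x = 0.0, 1.0
    history = []
    for _ in range(presentations):
        y = w * x                      # actual response
        w += eta * (target - y) * x    # change is proportional to remaining error
        history.append(w)
    return history


def hebb_updates(eta=0.2, presentations=100):
    """Supervised Hebb rule: every co-activation strengthens the weight."""
    w, pre, post = 0.0, 1.0, 1.0       # post-synaptic unit clamped to its target
    history = []
    for _ in range(presentations):
        w += eta * pre * post          # change does not depend on current strength
        history.append(w)
    return history


error_driven = error_driven_updates()
hebb = hebb_updates()
print("error-driven, last step:", error_driven[-1] - error_driven[-2])
print("Hebb, last step:        ", hebb[-1] - hebb[-2])
```

After many presentations the error-driven weight has effectively stopped changing, while the Hebb weight is still growing by the same amount on every presentation. This is the sense in which continued negative experience is predicted to deepen depression under Hebb learning but not under backpropagation.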
Parameter differences. The two learning rules also offer the researcher different parameters to investigate as analogs of cognitive variables. In backpropagation, two parameters are available: one representing the rate at which new stimuli are learned, and one representing the effect of recent previous learning on new learning. Siegle & Ingram (1997a) interpret these parameters as representing different dimensions of the personality variable Openness to Experience. In Hebb learning, a parameter representing the effect of new experiences on connection strengths is relatively analogous to the learning rate in backpropagation. A parameter representing the rate at which new learning occludes old learning, i.e., a forgetting function, is also available.
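The roles of these parameters can be sketched with one update step for each rule. The momentum form below is the standard gradient-descent-with-momentum update, with eta = 0.2 and alpha = 0.4 taken from Table 3. The Hebb rule with multiplicative decay is only one common way to write a forgetting function; the chapter does not specify its exact form here, so that form, the forgetting value of 0.1, and both function names are assumptions.

```python
def momentum_step(w, grad, prev_delta, eta=0.2, alpha=0.4):
    """One backpropagation-style weight update with momentum.

    eta scales the effect of the current error gradient; alpha carries
    over part of the previous weight change (recent prior learning).
    """
    delta = -eta * grad + alpha * prev_delta
    return w + delta, delta


def hebb_step_with_forgetting(w, pre, post, rate=0.2, forgetting=0.1):
    """One Hebb update in which new learning partly occludes old learning.

    The multiplicative decay is an assumed form of the forgetting function.
    """
    return (1.0 - forgetting) * w + rate * pre * post


w_after, last_delta = momentum_step(w=1.0, grad=0.5, prev_delta=0.1)

w_hebb = 0.0
for _ in range(500):
    w_hebb = hebb_step_with_forgetting(w_hebb, pre=1.0, post=1.0)
print(w_after, last_delta, w_hebb)
```

One consequence of this assumed forgetting form is that repeated exposure no longer grows a connection without bound: the weight settles where the gain from each experience balances the decay (rate × pre × post / forgetting, here 2.0).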
A nonjudgemental conclusion. The differences between Hebb and back-propagation learning rules do not immediately suggest that one is better than another. Rather, they afford different interpretations of similar phenomena involving exposure to new information. Simulations using both architectures may be valuable for better understanding disorders that may involve overlearning information, such as depression.
References
Amaral, D., Price, J., Pitkanen, A., & Carmichael, S. T. (1992). Anatomical organization of the primate amygdaloid complex. In J. P. Aggleton (Ed.) The amygdala: Neurobiological aspects of emotion, memory, and mental dysfunction. (p. 1-66). New York, NY: Wiley-Liss.
American Psychiatric Association, (1994). Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition, Washington, D. C.: American Psychiatric Association.
Anderson, J. A. (1990). Hybrid computation in cognitive science: Neural networks and symbols. Applied Cognitive Psychology, 4, 337-347.
Beatty, J. (1982). Task-evoked pupillary responses, processing load, and the structure of processing resources. Psychological Bulletin, 91, 276-292.
Beatty, J. (1986). The pupil system. In M. G. H. Coles et al. (Eds.), Psychophysiology: Systems, Processes, and Applications. New York: Guilford.
Beck, A. T. (1974). The development of depression. In R. J. Friedman & M. M. Katz (Eds.), The psychology of depression. New York: Winston-Wiley.
Blaney, P. (1986). Affect and memory: a review. Psychological Bulletin, 99, 229-246.
Blank, D. S., Meeden, L. A., & Marshall, J. B. (1991). Exploring the symbolic/subsymbolic continuum: A case study of RAAM. In J. Dinsmore (Ed.), Closing the Gap: Symbolism vs. Connectionism, Hillsdale, NJ: Erlbaum.
Bower, G. (1981). Mood and memory. American Psychologist, 36, 129-148.
Brewin, C. R., Andrews, B., & Gotlib, I. (1993). Psychopathology and early experience: A reappraisal of retrospective reports. Psychological Bulletin, 113, 82-98.
Cohen, J. D., Dunbar, K., & McClelland, J. (1990). On the Control of Automatic Processes: A Parallel Distributed Processing Account of the Stroop Effect, Psychological Review, 97, 332-361.
Collins, A., & Loftus, E., (1975). A Spreading-Activation Theory of Semantic Processing, Psychological Review, 82, 407-428.
Davidson, R. (1997). Affective style and affective disorders: Perspectives from affective neuroscience. Address given at the meeting of the Society for Research in Psychopathology. Palm Springs, CA.
Davidson, R. (1998). Affective style and affective disorders: Perspectives from affective neuroscience. Address given at the Fourth Annual Wisconsin Symposium on Emotion: Affective Neuroscience. Madison, WI.
Fernandez de Molina, A. & Hunsberger, R. W. (1962). Organization of the subcortical system governing defence and flight reactions in the cat. Journal of Physiology, 7, 200-213.
Flaherty, J. A., Gavira, F. M. & Val, E. R. (1982). Diagnostic considerations. In E. R. Val, F. M. Gavira, & J. A. Flaherty (Eds.), Affective disorders: Psychopathology and treatment. Chicago: Year Book Medical Publishers.
Halgren, E. (1992). Emotional neurophysiology of the amygdala within the context of human cognition. In J. P. Aggleton (Ed.) The amygdala: Neurobiological aspects of emotion, memory, and mental dysfunction. (p. 191-228). New York, NY: Wiley-Liss.
Hakerem, G., & Sutton, S. (1966). Pupillary Response at Visual Threshold. Nature, 212, 485-486.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley.
Hess, E. H. (1972). Pupillometrics: A method of studying mental, emotional, and sensory processes. In N. S. Greenfield & R. A. Sternbach (Eds.), Handbook of psychophysiology. (pp. 491-531). New York, N.Y.: Holt, Rinehart & Winston.
Hess, E. H. & Polt, J. H. (1964). Pupil size in relation to mental activity during simple problem solving. Science, 182, 177-180.
Hinton, G. E. (Ed.). (1991). Connectionist symbol processing, Cambridge, MA: MIT Press.
Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed Representations. In J. L. McClelland, & D. E. Rumelhart (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition, Vol 1 (pp. 77-109). Cambridge, MA: MIT Press.
Ingram, R. (1984). Towards an information processing analysis of depression, Cognitive Therapy and Research, 8, 443-478.
Jobe, T. H., Fichtner, C. G., Port, J. D., & Gavira, M. M. (1995). Neuropoiesis: Proposal for a connectionistic neurobiology. Medical Hypotheses, 45, 147-163.
Kahneman, D., & Beatty, J. (1966). Pupil diameter and load on memory. Science, 154, 1583-1585.
Kitayama, S. (1990). Interaction between affect and cognition in word perception. Journal of Personality & Social Psychology, 58, 209-217.
Koikegami, H. & Yoshida, K. (1953). Pupillary dilation induced by stimulation of amygdaloid nuclei. Folia Psychiatrica Neurologica Japonica, 7, 109-125.
Le Doux, J. (1997). Emotion, Memory, and the Brain. Presentation at the meeting of the American Psychological Association.
Le Doux, J. (1992). Emotion and the amygdala. In J. P. Aggleton (Ed.) The amygdala: Neurobiological aspects of emotion, memory, and mental dysfunction. (p. 339-351). New York, NY: Wiley-Liss.
Luce, R. D. & Narens, L. (1983). Symmetry, scale types, and generalizations of classical physical measurement. Journal of Mathematical Psychology, 27, 44-85.
MacLeod, C., & Mathews, A. M. (1991). Cognitive-experimental approaches to the emotional disorders. In Paul R. Martin, (Ed.), Handbook of behavior therapy and psychological science: An integrative approach. Vol. 164 (pp. 116-150). New York: Pergamon Press.
Massaro, D. (1988). Experimental Psychology: An Information Processing Approach. San Diego, CA: Harcourt, Brace, Jovanovich.
Matt, G., Vazquez, C., & Campbell, W. (1992). Mood-congruent recall of affectively toned stimuli: A meta-analytic review, Clinical Psychology Review, 12, 227-255.
Matthews, G. & Harley, T. A. (1996). Connectionist models of emotional distress and attentional bias. Cognition & Emotion, 10, 561-600.
Matthews, G. & Southall, A. (1991). Depression and the Processing of Emotional Stimuli: A study of Semantic Priming, Cognitive Therapy and Research 15, 283-302.
Miyata, Y. (1991). A user’s guide to PlaNet version 5.6: A tool for constructing, running, and looking into a PDP network. (Available from Yoshiro Miyata, Department of Computer Science, Univ. of Colorado at Boulder, Boulder, CO 80309-0430).
Movellan, J. R., & McClelland, J. L. (1994). Stochastic interactive processing, channel separability, and optimal perceptual interference: An examination of Morton’s law. Department of Psychology, Carnegie Mellon University, Technical Report PDP.CNS.95.4.
Nolen-Hoeksema, S. & Morrow, J. (1991). A prospective study of depression and posttraumatic stress symptoms after a natural disaster: The 1989 Loma Prieta earthquake. Journal of Personality & Social Psychology, 61, 115-121.
Park, B. (1998, unpublished) A connectionist account of antidepressant action. Available from the Connectionist models of cognitive, affective, brain, and behavioral disorders website at www.sci.sdsu.edu/CAL/connectionist-models/.
Paykel, E. S. (1979). Causal relationships between clinical depression and life events. In Barrett, J. E. (Ed.), Stress and mental disorder (pp. 71-86). New York: Raven Press.
Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85(2), 59-108.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol 1., pp. 318-362). MA: MIT Press.
Siegle, G. J. (1996). Rumination on affect: Cause for negative attention biases in depression? Unpublished Master’s Thesis, San Diego State University.
Siegle, G. J. (1998a). The neuropsychology of cognitive bias and pupillary response in depression. Unpublished Doctoral Dissertation, San Diego State University / University of California, San Diego.
Siegle, G. J. (1998b). Connectionist Models of Cognitive, Affective, Brain, and Behavioral Disorders. World Wide Web site located at http://www.sci.sdsu.edu/CAL/connectionist-models/.
Siegle, G. J. (1998c). A neural network model of affective interference in depression. International Workshop on Neural Network Models of Cognitive and Brain Disorders, College Park, MD.
Siegle, G. J. & Ingram, R. E. (1997a). Modeling Individual Differences in Negative Information Processing Biases. In Matthews, G. (Ed.), Personality and Individual Differences in Psychopathology. Princeton, NJ: Erlbaum.
Siegle, G. J. & Ingram, R. E. (1997b). A neural network model of inability to process emotional information in depression. Presentation at the meeting of the Society for Research in Psychopathology. Palm Springs, CA.
Siegle, G., Ingram, R., Granholm, E., & Matt, G. (1998b). Modeling the time course of attention to negative information in depression. In G. Matthews (Chair), Cognitive science perspectives on personality and emotion. Presentation at the 9th European Conference on Personality, Surrey, England.
Siegle, G. J., Ingram, R. E., & Matt, G. E., (1995). A neural network model of information processing biases in depression. Poster session presented at the workshop Neural Modeling of Cognitive and Brain Disorders. College Park, Maryland.
Siegle, G. J., Ingram, R. E., & Matt, G. E., (1998a, submitted). Affective Interference: Cause for Negative Attention Biases in Depression? Note: This information is also presented in Siegle (1996).
Squire, L. R. (1992). Memory and the hippocampus: A synthesis from findings with rats, monkeys, and humans. Psychological Review, 99, 195-231.
Teasdale, J. D., Segal, Z., & Williams, J. M. (1995). How does cognitive therapy prevent depressive relapse and why should attentional control (mindfulness) training help? Behavior Research and Therapy, 33, 25-39.
Tucker, D. M. & Derryberry, D. (1992). Motivated attention: Anxiety and the frontal executive functions. Neuropsychiatry, Neuropsychology, & Behavioral Neurology, 5, 233-252.
Williams, G., Conner, J., Siegle, G., Ingram, R., & Cole, D. (1998). Is more negative less positive? Relating dysphoria to emotion ratings. Presentation at the meeting of the Western Psychological Association, Albuquerque, New Mexico.
Williams, J. M. G., Mathews, A., & MacLeod, C. (1996). The emotional Stroop task and psychopathology. Psychological Bulletin, 120, 3-24.
Yates, J., & Nasby, W. (1993). Dissociation, affect, and network models of memory: An integrative proposal. Journal of Traumatic Stress, 6, 305-326.