Abstract
Lists of semantically related words are better recalled on immediate memory tests than otherwise equivalent lists of unrelated words. However, measuring the degree of relatedness is not straightforward. We report three experiments that assess the ability of various measures of semantic relatedness, including latent semantic analysis (LSA), GloVe, fastText, and a number of measures based on WordNet, to predict whether two lists of words will be differentially recalled. In Experiment 1, all measures except LSA correctly predicted the observed better recall of the related than the unrelated list. In Experiment 2, all measures except JCN predicted that abstract words would be recalled as well as concrete words because of their enhanced semantic relatedness. In Experiment 3, LSA, GloVe, and fastText predicted no concreteness effect because the abstract words were more related than the concrete words; three WordNet measures predicted a small concreteness effect because the abstract and concrete words did not differ in semantic relatedness; and three other WordNet measures predicted an enhanced concreteness effect because the concrete words were more related than the abstract words. A small concreteness effect was observed. Over the three experiments, only two measures, both based on simple WordNet path length, predicted all three results. We suggest that the results are not unexpected because semantic processing in episodic memory experiments differs from that in reading, similarity judgment, and analogy tasks, which are the most common ways of assessing such measures.
The recent increase in the number of high-quality norms and databases allows the memory researcher to better equate two classes of stimuli such that they differ only on the dimension of interest. For example, age of acquisition refers to when a word is typically learned. Unfortunately for the experimenter, age of acquisition correlates with dimensions such as concreteness, number of phonemes, frequency, and orthographic and phonological Levenshtein distance, to mention only a few, and these confounds have led to quite different reports of the effect of age of acquisition on different memory tasks (for a review, see MacMillan, Neath, & Surprenant, in press). By using the new norms, which were unavailable to earlier researchers, MacMillan et al. were able to create a set of early and late age of acquisition words that were equated on all of these (and other) dimensions and found that whereas age of acquisition affected recognition, it had no effect on either free or serial recall. However, one issue they encountered was that there were no published assessments of which measure of semantic relatedness (Footnote 1) was appropriate for episodic memory tasks. That is, although many such measures exist, and although their predictions have been compared to human behavior, the tests chosen to evaluate these measures are typically lexical decision, similarity judgment, semantic categorization, and analogy tasks. It is not clear whether these measures capture the processing involved in, for example, immediate serial recall. The purpose of the current paper is to report an initial assessment.
Semantic relatedness effects are readily observable in immediate memory tasks. For example, Murdock and Vom Saal (1967), Murdock (1976), Poirier and Saint-Aubin (1995), and Chubala, Neath and Surprenant (2019) all found better serial recall of categorized lists, those containing exemplars from the same category, than of uncategorized lists, those containing exemplars from different categories. Such lists are typically constructed using norms such as Battig and Montague (1969) and Van Overschelde et al. (2004). Semantic relatedness can also affect memory when the words are related by association or meaning rather than by category membership. For example, Tse (2009), Tehan (2010), and Tse et al. (2011) all found better serial recall of associated than unassociated lists. Such lists are typically constructed using word association norms such as De Deyne et al. (2019) and Nelson et al. (2004).
However, one difficulty is how to assess the extent to which two lists differ in semantic relatedness. Consider two lists of nouns. The first is alien, companion, polo, soul, transient and the second is bout, clientele, diary, fuss, sensation. Do these lists differ in semantic relatedness?
Although word association norms are useful for creating new lists, they are less useful for assessing existing lists. For example, according to the De Deyne et al. (2019) norms, no word in either list is a forward or backward associate of any of the other words in the list. Another way of measuring semantic relatedness is based on analyzing co-occurrence within a large corpus and representing each word as a vector. The similarity measure reported is the cosine of the angle between the two vectors representing the two words of interest. This measure ranges from –1 to +1, with +1 indicating maximum similarity, 0 indicating a 90° angle between the vectors (i.e., the vectors are orthogonal), and –1 indicating maximum dissimilarity. LSA (latent semantic analysis; Landauer & Dumais, 1997; see http://lsa.colorado.edu) has been used previously as a measure of semantic relatedness in the immediate memory literature (e.g., Tse, 2009). Using the topic space “general reading up to 1st year college,” LSA indicates that the two lists do differ in semantic relatedness, t(8) = 2.446, p = 0.040. LSA indicates the alien list is more related (M = 0.160, SD = 0.080) than the bout list (M = 0.070, SD = 0.016). GloVe (Global Vectors for Word Representation; Pennington et al., 2014) is a distributional model that also represents words as vectors based on co-occurrence. Using a library of pre-trained words based on 840 billion tokens with a vocabulary of 2.2 million words, this measure indicates that there is no difference between the two lists, t(8) = 0.446, p = 0.659, with a mean cosine of 0.160 (SD = 0.071) for the alien list compared to 0.178 (SD = 0.047) for the bout list. A second distributional model, fastText (2021), using the pre-trained word vectors from Grave et al. (2018), also indicates no difference, 0.149 (SD = 0.034) for the alien list compared to 0.116 (SD = 0.462) for the bout list, t(8) = 1.218, p = 0.235.
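The cosine measure used by LSA, GloVe, and fastText is straightforward to compute from any pair of word vectors. As a minimal sketch, the following uses invented four-dimensional vectors purely for illustration (pre-trained GloVe and fastText vectors typically have 300 dimensions, but the computation is identical):

```python
from math import sqrt

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors:
    dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Invented 4-dimensional vectors standing in for two words' embeddings.
u = [0.2, -0.1, 0.4, 0.3]
v = [0.1, -0.2, 0.5, 0.1]
print(round(cosine(u, v), 3))  # 0.885
```

To compare two lists, one would compute the cosine for every pair of words within each list and then compare the two sets of pairwise cosines, as in the t tests reported above.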
A third way of measuring semantic relatedness is based on WordNet (Fellbaum, 2017; Miller et al., 1990), a lexical database in which words are organized into synonym sets (synsets) that represent the specific meaning of a word. Although WordNet also includes verbs, adjectives, and adverbs, we focus on nouns because the structure of WordNet does not afford a simple way of evaluating similarity across parts of speech. Nouns with multiple meanings appear in multiple synsets, one for each sense. For example, whereas beagle has only one sense (“a small short-legged smooth-coated breed of hound”), racket has four senses: (1) a “loud and disturbing noise,” (2) a “fraudulent scheme,” (3) “the auditory experience of sound that lacks musical quality,” and (4) “a sports implement” (Footnote 2).
WordNet can be seen as implementing an organization that developed from spreading activation models such as that of Collins and Loftus (1975). Nouns are organized in a hierarchy of concepts, one for each sense. These synsets are connected to each other via explicit semantic relations. For example, because beagle is a kind of hound (“any of several breeds of dog used for hunting typically having large drooping ears”), it is a hyponym of hound and hound is a hypernym of beagle. By following these is-a links, the path length between synsets can be computed. For example, the path length between beagle and greyhound is 3 (both are types of hounds), and the path length between beagle and sense 1 of dog (“domestic dog, Canis familiaris”) is 4 (beagle is a hyponym of hound, which is a hyponym of hunting dog, which in turn is a hyponym of dog). The shortest path length is 1 (between beagle and beagle) and the longest path lengths are in the 20s.
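The path-length computation described here amounts to a shortest-path search over the is-a links. The sketch below uses a hypothetical five-synset fragment in place of the full hierarchy (the edges are illustrative only; real WordNet contains roughly 80,000 noun synsets) and the node-counting convention from the text, so the path from a synset to itself has length 1:

```python
from collections import deque

# Hypothetical fragment of the noun hierarchy; each entry maps
# a synset to its hypernym.
hypernym = {
    "beagle": "hound",
    "greyhound": "hound",
    "hound": "hunting_dog",
    "hunting_dog": "dog",
    "dog": "canine",
}

def neighbors(node):
    """Treat is-a links as undirected edges for the search."""
    out = [hypernym[node]] if node in hypernym else []
    out.extend(child for child, parent in hypernym.items() if parent == node)
    return out

def wnpl(a, b):
    """Breadth-first search for the shortest path, counting nodes."""
    queue, seen = deque([(a, 1)]), {a}
    while queue:
        node, length = queue.popleft()
        if node == b:
            return length
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, length + 1))
    return None  # no path within this fragment

print(wnpl("beagle", "greyhound"))  # 3 (beagle - hound - greyhound)
```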
The simplest WordNet-based measure of semantic relatedness is WordNet path length (WNPL), which is the shortest path between any sense of either word. In addition to this measure, two others scale simple path length. Leacock and Chodorow’s (1998) measure, LCH, scales path length by the maximum path length found in the hierarchy containing the two words. Wu and Palmer’s (1994) measure, WUP, scales path length using the depth of the least common subsumer. Three other measures, JCN (Jiang & Conrath, 1997), LIN (Lin, 1998), and RES (Resnik, 1995), take into account not only path length but also information content, which is inversely related to how frequently a concept is encountered (see Meng et al., 2013, for details on each measure and for formulas). For the example lists, all of these WordNet-based measures indicate the alien list is more related than the bout list.
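The difference between the two scaled measures can be made concrete using the standard formulas summarized in Meng et al. (2013): LCH is the negative log of the path length divided by twice the maximum taxonomy depth, and WUP is twice the depth of the least common subsumer divided by the sum of the depths of the two synsets. The sketch below evaluates both on invented depths and path lengths, purely for illustration:

```python
from math import log

def lch(path_length, max_depth):
    """Leacock-Chodorow: -log(path / (2 * D)), where D is the maximum
    depth of the hierarchy containing the two synsets."""
    return -log(path_length / (2.0 * max_depth))

def wup(depth_lcs, depth_a, depth_b):
    """Wu-Palmer: similarity scaled by the depth of the least common
    subsumer (LCS) of the two synsets."""
    return (2.0 * depth_lcs) / (depth_a + depth_b)

# Invented values: two synsets at depths 13 and 14, their LCS at depth 12,
# a (node-counting) path length of 4, in a hierarchy of maximum depth 20.
print(round(lch(4, 20), 3))       # 2.303
print(round(wup(12, 13, 14), 3))  # 0.889
```

Note that for a fixed maximum depth, LCH is a monotonic transformation of path length, so it orders word pairs in the same way WNPL does whenever the pairs come from the same hierarchy.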
For the example lists, then, all of the WordNet-based measures as well as LSA indicate a difference, with the alien list being more related, whereas neither GloVe nor fastText indicates a difference. It should be noted that these measures were not developed to predict the effect of semantic relatedness on episodic memory tasks. Because we could find no published assessment of which, if any, of these measures best indicates whether two sets of stimuli will be differentially recalled because of differences in semantic relatedness, we began this series of studies.
As a starting point, we used simple WordNet path length (WNPL) to construct the stimuli in the first experiment. One reason is its simplicity: The primary assumption is that the longer the path between two nouns, the less the two items are semantically related. A second reason is that this idea accords well with notions of how spreading activation may help during encoding and retrieval in immediate memory tasks (e.g., Tehan, 2010). For example, relative to a list of words that are less related, processing the first item in a list of related words would cause activation to spread to related items. When the second item in the list is processed, it is already partially activated. This activation is reinforced by the remaining items. Similar activation may also aid retrieval. A third reason is that this measure indicated no difference in semantic relatedness in the age of acquisition stimuli of MacMillan et al. (in press), and there was also no difference in recall. The reason this is potentially informative is that if there had been a difference in semantic relatedness between their two conditions, there should also have been a recall difference.
The first experiment tests whether a difference in immediate serial recall will be observed when all but one of the measures indicates that one set of words is more semantically related than a second set. Experiments 2 and 3 then test the measures for our intended primary use case, assessing whether two sets of words vary in semantic relatedness. Experiment 2 shows that when a primary manipulation is confounded with semantic relatedness, the main effect of interest may not occur, whereas Experiment 3 shows that when the two sets of words are equated for semantic relatedness, the main effect of interest is now observable.
Experiment 1
Performance on immediate serial recall tests is generally better when the words are related than unrelated, even when relatedness is not defined by category membership (e.g., Tehan, 2010; Tse, 2009; Tse et al., 2011). Experiment 1 was designed to examine this effect. A set of stimuli was created such that the related and unrelated lists were equated on a number of other dimensions known to affect immediate serial recall (see Table A1 in the appendix). The related words all had a path length of 3 from each other; therefore, WNPL = 3.00 (SD = 0.00). The unrelated words had a WNPL of 9.33 (SD = 1.60). This difference is approximately the same as that found in the literature for categorized lists. For example, exemplars in 58 of the 64 categorized lists used by Chubala et al. (2019) are in WordNet. The mean WNPL of these categorized lists is 4.82 (SD = 1.24) compared to 11.97 (SD = 2.25) for 58 lists constructed by randomly sampling exemplars from different categories. As shown in Table 1, all but two of the measures of semantic relatedness indicate that the two lists differ. The exceptions are that LSA indicates no difference between the lists and GloVe indicates only a borderline difference.
Method
Subjects
Forty-four volunteers from Prolific Academic were paid £8.00 per hour (prorated) for their participation. The inclusion criteria were: (1) native speaker of English; (2) age between 19 and 39; and (3) at least a 90% approval rating on prior participation. The mean age was 29.48 (SD = 5.73, range 19–38), and 26 people self-identified as female while 18 self-identified as male. A power analysis determined the sample size: a sample of 44 has power of 0.90 to detect an effect size of d = 0.5 (Faul et al., 2009).
Stimuli
The related words are: air, clarity, difference, ease, good, mobility, nature, romance, stuff, sweetness, tone, ultimate, utility. The unrelated words are: art, clear, coverage, fortune, friendship, idealism, lack, like, loan, monitor, notable, notice, selection. See Table A1 for details.
Procedure
After reading an informed consent form and agreeing to participate, subjects were reminded of the instructions. They were informed that after clicking on the “Start next trial” button, they would see a fixation point in the center of the screen for 1 s. Then, one second after the fixation point disappeared, six words would be shown in the same location, one at a time, for 1 s each in 24-point Helvetica. One second after the end of the list, the subjects were prompted to type in the words they had just seen in strict serial order. They were informed that they needed to type in the first word first, the second word second, and so on. If they were unsure of a response, they were encouraged to guess or to click on a button labeled “skip”. Subjects initiated the next trial when ready, but could do so only after six responses had been made; a self-timed break could be taken simply by waiting to initiate the next trial.
There were 24 trials, 12 with related and 12 with unrelated words. The words for each list were randomly selected for each trial and were randomly ordered, and the order of the lists, related or unrelated, was randomly determined for each subject.
Results and discussion
The data were analyzed using both frequentist and Bayesian techniques using JASP (JASP Team, 2019). For the latter, a Bayes factor (BF) is reported. BF10 between 3 and 20 indicates positive evidence for the alternative hypothesis (and therefore evidence against the null hypothesis), BF10 between 20 and 150 indicates strong evidence, and BF10 greater than 150 indicates very strong evidence (Kass & Raftery, 1995). Similarly, BF01 indicates evidence in favor of the null hypothesis using the same scale. Default priors were used.
There was a semantic relatedness effect: The proportion of words correctly recalled in order was 0.607 (SD = 0.149) for the related lists compared to 0.565 (SD = 0.149) for the unrelated lists, t(43) = 3.056, p = 0.004, d = 0.461, BF10 = 9.01 (Footnote 3). Twenty-nine subjects recalled more related than unrelated words, 11 recalled more unrelated than related words, and 4 were tied (p = 0.006 by a two-tailed sign test). The effect size is comparable to the d = 0.54 observed by Tse (2009). Unlike that study, and those of Tse et al. (2011) and Tehan (2010), the stimuli in the current study were equated on more dimensions, thus ruling out potential confounds. All but two of the measures indicated that the related list was indeed more related than the unrelated list and therefore can be taken as predicting the observed results. The two exceptions are LSA, which indicated no difference in relatedness, and GloVe, whose measure was equivocal.
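The two-tailed sign test reported here is an exact binomial test on the non-tied subjects, with each subject counted as a "win" with probability .5 under the null hypothesis. It can be reproduced in a few lines; the function name below is ours:

```python
from math import comb

def sign_test_two_tailed(wins, losses):
    """Exact two-tailed sign test: ties are excluded, and each of the
    remaining n subjects is a 'win' with probability .5 under the null."""
    n = wins + losses
    k = max(wins, losses)
    one_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * one_tail)

# 29 subjects recalled more related than unrelated words, 11 the reverse,
# and the 4 ties are excluded.
print(round(sign_test_two_tailed(29, 11), 3))  # 0.006
```

The same function reproduces the sign tests reported for Experiments 2 and 3 (p = 0.441 and p = 0.044, respectively).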
Experiment 2
The primary use case considered in this paper is assessing whether two sets of words differ in semantic relatedness such that this factor can be ruled out as a reason for differential recall on an episodic memory test. For example, an experimenter interested in the concreteness effect in serial recall would use a set of abstract words and a set of concrete words; the usual result is a recall advantage for concrete items, even when using a small, closed set of words (e.g., Neath & Surprenant, 2019; Walker & Hulme, 1999). However, if the abstract and concrete words also differ on some other dimension, such as semantic relatedness, the concreteness effect might be either magnified (if the concrete words are more related) or eliminated (if the abstract words are more related). In this experiment, we examine whether the concreteness effect can be eliminated if the abstract words are more semantically related than the concrete words. Table 1 shows that all measures except for JCN indicate the abstract and concrete words differ in semantic relatedness, with the abstract words being more related than the concrete.
Method
Subjects
Forty-four different volunteers from Prolific Academic were paid £8.00 per hour (prorated) for their participation. The mean age was 26.97 (SD = 6.24, range 19–39), and 29 people self-identified as female while 15 self-identified as male.
Stimuli
The abstract words were: betrayal, bliss, foresight, greed, hardship, loyalty, malice, mercy, psyche, revenge, riddance, risk, urge, wisdom. The concrete words were: band, blazer, cadet, cauldron, creature, gauntlet, hike, ledge, liqueur, manor, ranch, rubble, throttle, ulcer (Footnote 4). Using the Brysbaert et al. (2014) concreteness norms (where 1 = abstract and 5 = concrete), the mean rating for the concrete words is 4.52 (SD = 0.25; range 4.07–4.79) and the mean rating for the abstract words is 1.64 (SD = 0.17; range 1.34–1.93). This difference in mean concreteness rating is comparable to numerous manipulations that resulted in a concreteness effect (see Table 1 of Neath & Surprenant, 2020). The abstract and concrete words were equated on a number of other dimensions known to affect memory (see Table A2 for details), but according to all measures except JCN, the abstract words are more semantically related than the concrete words (see Table 1). With this set of stimuli, then, it is possible that the usual concreteness effect may be diminished or even abolished because the recall of the abstract words is boosted by their higher degree of relatedness.
Procedure
The procedure was identical to that of Experiment 1.
Results and discussion
There was no concreteness effect: The proportion of words correctly recalled in order was 0.521 (SD = 0.201) for the concrete words compared to 0.533 (SD = 0.211) for the abstract words, t(43) = 0.720, p = 0.476, d = 0.108, BF01 = 4.80 (Footnote 5). Eighteen subjects recalled more concrete than abstract words, 24 recalled more abstract than concrete words, and two were tied (p = 0.441 by a two-tailed sign test). The results are consistent with the interpretation that the usual recall advantage of concrete words was not observed because the enhanced semantic relatedness of the abstract words offset the advantage that usually accrues to concrete words. All measures except JCN indicated the abstract words were more related and therefore all except JCN predicted this result.
Experiment 3
Experiment 3 used a new set of abstract and concrete stimuli. Unlike in Experiment 2, the words were equated for semantic relatedness as measured by WNPL. The WNPL for the abstract words was 10.68 (SD = 1.36) and that of the concrete words was 10.79 (SD = 1.30), which do not differ, t(26) = 0.219, p = 0.829. As can be seen in Table 1, the words also did not differ using LCH or JCN. However, the two sets of words do differ in semantic relatedness according to LSA, GloVe, and fastText, all of which indicate that the abstract words are more related than the concrete words. In contrast, WUP, RES, and LIN indicate the reverse, that the concrete words are more related than the abstract. The prediction depends on which measure better captures the effect of semantic relatedness in immediate serial recall. According to WNPL, LCH, and JCN, there should be a small concreteness effect because the only known difference between the two sets of words is concreteness. According to LSA, GloVe, and fastText, the situation is unchanged from Experiment 2: there should be no concreteness effect because the advantage that usually accrues to concrete words will be offset by the advantage of enhanced relatedness of the abstract words. Finally, according to WUP, RES, and LIN, there should be an enhanced concreteness effect because the advantage that usually accrues to concrete words will be supplemented by their also being more related than the abstract words.
Method
Subjects
Forty-four different volunteers from Prolific Academic were paid £8.00 per hour (prorated) for their participation. The mean age was 27.75 (SD = 5.24, range 19–38), and 32 subjects self-identified as female while 12 self-identified as male.
Stimuli
The abstract words were: adage, almighty, entail, gradual, holiness, hopeful, infamy, leeway, merit, noble, outlook, rapport, trifling, whimsy. The concrete words were: athlete, bachelor, calcium, diagram, diary, entree, mogul, noggin, ointment, pagoda, plaza, saffron, silencer, veneer. According to the Brysbaert et al. (2014) concreteness norms, the mean rating for the concrete words was 4.31 (SD = 0.19; range 4.04–4.67) and the mean concreteness rating for the abstract words was 1.81 (SD = 0.13; range 1.56 –1.97), comparable to the values in Experiment 2. See Table A3 for details.
Procedure
The procedure was identical to that of Experiments 1 and 2.
Results and discussion
There was a concreteness effect: The proportion of concrete words correctly recalled in order was 0.497 (SD = 0.177) compared to 0.463 (SD = 0.191) for the abstract words, t(43) = 3.243, p = 0.002, d = 0.489, BF10 = 14.21 (Footnote 6). Twenty-eight subjects recalled more concrete than abstract words, 14 recalled more abstract than concrete words, and two were tied (p = 0.044 by a two-tailed sign test). The concreteness effect was small in terms of the difference in proportion correct, but the effect size was comparable to that observed in Experiment 1 when semantic relatedness was manipulated. The results are not consistent with the predictions based on LSA, GloVe, or fastText; according to these measures, the results should have resembled the null effect observed in Experiment 2 because recall of the abstract words should have been boosted by their higher degree of relatedness. The results are consistent with predictions based on WNPL, LCH, and JCN, which indicated no difference in semantic relatedness, leaving concreteness as the only known difference between the lists. The results are problematic for WUP, RES, and LIN, which indicated that the concrete words were more related than the abstract words. In effect, this predicts an additional advantage for concrete words.
General discussion
There has recently been a huge increase in the number of databases that allow memory researchers to better equate two sets of stimuli. Although there exist many measures of semantic relatedness, none had been tested to assess which best reflects the effects of semantic relatedness on immediate memory tests. Experiment 1 replicated previous findings of a semantic relatedness effect in immediate serial recall. All measures except LSA indicated the words differed, with the caveat that the prediction from GloVe was equivocal. Experiment 2 found that when abstract words are more related than concrete words, the concreteness effect disappears, a result consistent with all measures except JCN. Finally, Experiment 3 found a small concreteness effect, consistent with the predictions based on WNPL, LCH, and JCN, which indicated no difference in semantic relatedness. In contrast, LSA, GloVe, and fastText predicted no concreteness effect because the abstract words were deemed more related than the concrete; in effect, these measures predicted that the result found in Experiment 2 should also have been observed in Experiment 3. WUP, RES, and LIN predicted an enhanced concreteness effect because the concrete words had two advantages: Not only were they more concrete, but they were also more semantically related. Only two measures were consistent with the results from all three experiments: WNPL and LCH. Both are based on simple WordNet path length and differ only in that the latter measure is scaled by the maximum path length found in the hierarchy containing the two words.
One simultaneous strength and weakness of WNPL is that it is derived from WordNet. The primary strength is that WordNet is based on psycholinguistic research and theory (Fellbaum, 2017; Miller et al., 1990). As a result, it is more closely aligned with the type of processing that likely occurs in the memory tests assessed in this paper. As Tehan (2010) suggested, an item in a more semantically related list would receive a boost from the activation that spreads from the other closely related words in the list, whereas an item in a list whose words are more distantly related would not get such a boost. Both WNPL and LCH would capture this because activation is thought to be directly related to path length. The scaling done by LCH involves the maximum path length in the hierarchy, which, in the current case, would not differ. Although WUP is also based on path length, it uses a quite different scaling process, the depth of the least common subsumer. This will vary more than the maximum path length, hence the divergence in predictions. The other three WordNet-based measures all include information content, but it is not clear how that would be involved in processes such as those used in immediate serial recall.
In contrast to measures based on simple WordNet path length, measures based on co-occurrence of words in a large corpus did not fare well in predicting the pattern of results. One reason may be that co-occurrence could involve more than just semantic relations and therefore these measures may be responding to additional factors. Some admittedly post hoc evidence consistent with this view is the finding that GloVe and fastText both consistently indicated that abstract words were more semantically related than concrete words. As noted earlier, they similarly indicated that the early acquired words used by MacMillan et al. (in press) are more related than the late acquired words, whereas WNPL indicated no difference. A second reason may be that none of these models were created to predict the effect of semantic relatedness on episodic memory tasks. For example, the goal of GloVe was to capture fine-grained syntactic and semantic regularities in word vectors and then to assess the properties of these vectors (Pennington et al., 2014). Although GloVe excels at some tasks, such as analogical reasoning, word similarity judgements, and named entity recognition tasks, it is not necessarily the case that the semantic information it is capturing will be appropriate for episodic memory processing.
Although simple WordNet path length appears, based on the limited available data, to be the better measure for episodic memory tasks, it is not without its own limitations. First, the measure applies only to nouns because in WordNet, nouns and other parts of speech are organized differently. This means that WNPL cannot be computed for lists that include verbs or adjectives. Second, the implied interval scale means that a pair of nouns with a given path length is considered to have identical semantic relatedness to any other noun pair with the same path length. This is likely not the case. A third potential weakness inherited from WordNet is that WNPL, like other measures of relatedness, may miss some forms of relatedness. For example, Li and Kohanyi (2017) note that whereas needle is expected to be associated with haystack, the path length in WordNet is 14. However, WordNet does capture other similar examples; for example, the path length between mountain and molehill is 5. A final limitation is that whereas WNPL and LCH may currently be the most useful measures for predicting human performance on episodic memory tasks, they are clearly far more limited in their generality than measures such as GloVe and fastText.
It should also be noted that in this work, we took a significant difference between two scores as indicative that there was a difference in semantic relatedness. That is, we interpreted the measures dichotomously. In reality, it is possible that a significant difference that is numerically small might not lead to a difference in recall. Evaluating each measure's scale properties requires far more data, and also requires that researchers publish their stimuli. A number of studies on memory for semantically related and unrelated lists do not include their stimuli.
Research methods are continually evolving, and practices that were acceptable not so long ago are now seen to be problematic. For current purposes, it is now clearer than ever that numerous lexical and psycholinguistic factors can affect performance on a wide number of tasks, and researchers need to take this into account when creating sets of stimuli. The recent increase in the number of norms available makes it possible to equate two sets of words on over two dozen dimensions. One conspicuous omission was an easy way to compute semantic relatedness with evidence to support the choice of the measure. We have shown that two measures based on simple WordNet path length, WNPL and LCH, aligned with the results from all three experiments. No other measure did so. Moreover, both measures are consistent with theoretical accounts of how semantic relatedness might benefit memory. The other measures all failed to account for at least one experimental result, and some failed to account for two results.
Pending more research on the appropriateness of the various measures of semantic relatedness for episodic memory tasks, the tentative recommendation is to use WNPL or LCH. Although the only data supporting this recommendation come from the studies reported here (and from the two serial recall studies reported by MacMillan et al., in press), this is more data than was available prior to this study. What is needed is further empirical work to assess the degree to which measures of semantic relatedness predict performance on memory tasks.
Computing WNPL
There are a number of ways to compute WNPL. One way is to download and install WordNet (wordnet.princeton.edu), and then download and install any of several packages designed to access WordNet. One such package is the collection of Perl tools called WordNet::Similarity (Pedersen et al., 2004; see metacpan.org/pod/WordNet::Similarity). Another package is available in Python as part of the Natural Language Toolkit (see https://www.nltk.org/howto/wordnet.html). A third way, which does not involve programming, is to use a database and software we created, which reads in a list of words and outputs a matrix of path lengths (see doi.org/10.17605/OSF.IO/3XTQD).
Author Notes
We thank Ted Pedersen for making the WordNet::Similarity database available. This research was supported, in part, by grants from the Natural Sciences and Engineering Research Council of Canada to IN and AMS. The authors are listed alphabetically. Correspondence may be addressed to ineath@mun.ca
Notes
The term semantic relatedness is used differently in different literatures. In the immediate memory literature, which is the focus of this paper, it is typically synonymous with semantic similarity and indicates only that items are somehow related, with the specific type of semantic relation (e.g., category membership, meaning, etc.) unspecified. In other literatures, different types of relatedness are distinguished, including, for example, taxonomic vs. thematic relatedness (e.g., Mirman et al., 2017). We use the term as it is used in the immediate memory literature.
According to WordNet, “the sense numbers and ordering of senses in WordNet should be considered random for research purposes.” All quotations are from WordNet (n.d.).
The typed responses were spell-checked. Out of 6336 total responses, 135 (74 related and 61 unrelated) words were flagged as misspelled, approximately 2.1%. Correcting these changed 56 responses to correct in the related condition and 34 to correct in the unrelated condition. Because the conclusions are the same regardless of whether the raw or the spell-checked data are analyzed, and because there is some ambiguity in how to correct spelling mistakes, we present only the analyses for the uncorrected data.
The stimuli are a subset of those from Experiment 2 of Pollock (2018).
The typed responses were spell-checked. Out of 6336 total responses, 395 (189 abstract and 206 concrete) words were flagged as misspelled, approximately 6.2%. Correcting these changed 107 responses to correct in the abstract condition and 121 to correct in the concrete condition. Because the conclusions are the same regardless of whether the raw or the spell-checked data are analyzed, and because there is some ambiguity in how to correct spelling mistakes, we present only the analysis for the uncorrected data.
The typed responses were spell-checked. Out of 6336 total responses, 304 (151 abstract and 153 concrete) words were flagged as misspelled, approximately 4.8%. Correcting these changed 78 responses to correct in the abstract condition and 105 to correct in the concrete condition. Because the conclusions are the same regardless of whether the raw or the spell-checked data are analyzed, we present only the analysis for the uncorrected data.
References
Balota, D. A., Yap, M. J., Cortese, M. J., Hutchison, K. A., Kessler, B., Loftis, B., & Treiman, R. (2007). The English Lexicon Project. Behavior Research Methods, 39, 445–459. https://doi.org/10.3758/BF03193014
Battig, W. F., & Montague, W. E. (1969). Category norms of verbal items in 56 categories: A replication and extension of the Connecticut category norms. Journal of Experimental Psychology, 80, 1–46. https://doi.org/10.1037/h0027577
Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46, 904–911. https://doi.org/10.3758/s13428-013-0403-5
Brysbaert, M., & Biemiller, A. (2017). Test-based age-of-acquisition norms for 44 thousand English word meanings. Behavior Research Methods, 49, 1520–1523. https://doi.org/10.3758/s13428-016-0811-4
Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41, 977–990. https://doi.org/10.3758/BRM.41.4.977
Chubala, C. M., Neath, I., & Surprenant, A. M. (2019). A comparison of immediate serial recall and immediate serial recognition. Canadian Journal of Experimental Psychology, 73, 5–27. https://doi.org/10.1037/cep0000158
Collins, A. M., & Loftus, E. F. (1975). A spreading-activation theory of semantic processing. Psychological Review, 82, 407–428. https://doi.org/10.1037/0033-295X.82.6.407
De Deyne, S., Navarro, D. J., Perfors, A., Brysbaert, M., & Storms, G. (2019). The “Small World of Words” English word association norms for over 12,000 cue words. Behavior Research Methods, 51, 987–1006. https://doi.org/10.3758/s13428-018-1115-7
FastText (2021). fastText (Version 0.9.2) [Computer software] https://fasttext.cc
Faul, F., Erdfelder, E., Buchner, A., & Lang, A.-G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41, 1149–1160. https://doi.org/10.3758/BRM.41.4.1149
Fellbaum, C. (2017). WordNet: An electronic lexical resource. In S. E. F. Chipman (Ed.), The Oxford handbook of cognitive science. (pp. 301–313). New York: Oxford University Press.
Grave, E., Bojanowski, P., Gupta, P., Joulin, A., & Mikolov T. (2018). Learning word vectors for 157 languages. Proceedings of the International Conference on Language Resources and Evaluation. Retrieved from https://arxiv.org/pdf/1802.06893.pdf
JASP Team (2019). JASP (Version 0.11.1) [Computer software] https://jasp-stats.org/
Jiang, J. J., & Conrath, D. W. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. Proceedings of the 10th Research on Computational Linguistics Conference, 19–33. Retrieved from www.aclweb.org/anthology/O97-1002
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795. https://doi.org/10.1080/01621459.1995.10476572
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240. https://doi.org/10.1037/0033-295X.104.2.211
Leacock, C., & Chodorow, M. (1998). Combining local context and WordNet similarity for word sense identification. In C. Fellbaum (Ed.), WordNet: An electronic lexical database (pp. 265–283). MIT Press.
Li, J., & Kohanyi, E. (2017). Towards modelling false memory with computational knowledge bases. Topics in Cognitive Science, 9, 102–116. https://doi.org/10.1111/tops.12245
Lin, D. (1998). An information-theoretic definition of similarity. In Proceedings of the International Conference on Machine Learning, Madison, August. Retrieved from https://www.cse.iitb.ac.in/~cs626-449/Papers/WordSimilarity/3.pdf
MacMillan, M. B., Neath, I., & Surprenant, A. M. (in press). Re-assessing age of acquisition effects in recognition, free recall, and serial recall. Memory & Cognition. https://doi.org/10.3758/s13421-021-01137-6
Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. J. (1990). Introduction to WordNet: An online lexical database. International Journal of Lexicography, 3, 235–244. https://doi.org/10.1093/ijl/3.4.235
Medler, D. A., & Binder, J. R. (2005). MCWord: An on-line orthographic database of the English language. Retrieved from http://www.neuro.mcw.edu/mcword/
Meng, L., Huang, R., & Gu, J. (2013). A review of semantic similarity measures in WordNet. International Journal of Hybrid Information Technology, 6, 1–13. Retrieved from www.earticle.net/Article/A208245
Mirman, D., Landrigan, J.-F., & Britt, A. E. (2017). Taxonomic and thematic semantic systems. Psychological Bulletin, 143, 499–520. https://doi.org/10.1037/bul0000092
Murdock, B. B. (1976). Item and order information in short-term serial memory. Journal of Experimental Psychology: General, 105, 191–216. https://doi.org/10.1037/0096-3445.105.2.191
Murdock, B. B., Jr., & Vom Saal, W. (1967). Transpositions in short-term memory. Journal of Experimental Psychology, 74, 137–143. https://doi.org/10.1037/h0024507
Neath, I., & Surprenant, A. M. (2019). Set size and long-term memory/lexical effects in immediate serial recall: Testing the impurity principle. Memory & Cognition, 47, 455–472. https://doi.org/10.3758/s13421-018-0883-8
Neath, I., & Surprenant, A. M. (2020). Concreteness and disagreement: Comment on Pollock (2018). Memory & Cognition, 48, 683–690. https://doi.org/10.3758/s13421-019-00992-8
Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (2004). The University of South Florida free association, rhyme, and word fragment norms. Behavior Research Methods, Instruments & Computers, 36, 402–407. https://doi.org/10.3758/BF03195588
Pedersen, T., Patwardhan, S., & Michelizzi, J. (2004). WordNet::Similarity - Measuring the relatedness of concepts. In S. Dumais, D. Marcu, & S. Roukos (Eds.), Proceedings of Fifth Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-2004), 38–41. Retrieved from www.aclweb.org/anthology/N04-3012
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Retrieved from www.aclweb.org/anthology/D14-1162.pdf
Poirier, M., & Saint-Aubin, J. (1995). Memory for related and unrelated words: Further evidence on the influence of semantic factors in immediate serial recall. Quarterly Journal of Experimental Psychology, 48A, 384–404. https://doi.org/10.1080/14640749508401396
Pollock, L. (2018). Statistical and methodological problems with concreteness and other semantic variables: A list memory experiment case study. Behavior Research Methods, 50, 1198–1216. https://doi.org/10.3758/s13428-017-0938-y
Resnik, P. (1995). Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, 448–453. Retrieved from https://www.ijcai.org/Proceedings/95-1/Papers/059.pdf
Storkel, H. L. (2004). Methods for minimizing the confounding effects of word length in the analysis of phonotactic probability and neighborhood density. Journal of Speech, Language, and Hearing Research, 47, 1454–1468. https://doi.org/10.1044/1092-4388(2004/108)
Tehan, G. (2010). Associative relatedness enhances recall and produces false memories in immediate serial recall. Canadian Journal of Experimental Psychology, 64, 266–272. https://doi.org/10.1037/a0021375
Tse, C-S. (2009). The role of associative strength in the semantic relatedness effect on immediate serial recall. Memory, 17, 874–891. https://doi.org/10.1080/09658210903376250
Tse, C.-S., Li, Y., & Altarriba, J. (2011). The effect of semantic relatedness on immediate serial recall and serial recognition. Quarterly Journal of Experimental Psychology: Human Experimental Psychology, 64, 2425–2437. https://doi.org/10.1080/17470218.2011.604787
van Heuven, W. B., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A new and improved word frequency database for British English. Quarterly Journal of Experimental Psychology, 67, 1176–1190. https://doi.org/10.1080/17470218.2013.850521
Van Overschelde, J. P., Rawson, K. A., & Dunlosky, J. (2004). Category norms: An updated and expanded version of the Battig and Montague (1969) norms. Journal of Memory and Language, 50, 289–335. https://doi.org/10.1016/j.jml.2003.10.003
Walker, I., & Hulme, C. (1999). Concrete words are easier to recall than abstract words: Evidence for a semantic contribution to short-term serial recall. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25, 1256–1271. https://doi.org/10.1037/0278-7393.25.5.1256
Warriner, A. B., Kuperman, V., & Brysbaert, M. (2013). Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45, 1191–1207. https://doi.org/10.3758/s13428-012-0314-x
WordNet | A Lexical Database for English. (n.d.). Retrieved from https://wordnet.princeton.edu/
Wu, Z., & Palmer, M. (1994). Verb semantics and lexical selection. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 133–138. https://doi.org/10.3115/981732.981751
Open Practices Statement
All files and raw data are available from https://doi.org/10.17605/OSF.IO/3XTQD
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Cite this article
Ensor, T.M., MacMillan, M.B., Neath, I. et al. Calculating semantic relatedness of lists of nouns using WordNet path length. Behav Res 53, 2430–2438 (2021). https://doi.org/10.3758/s13428-021-01570-0