
A Multidisciplinary Method for Constructing and Validating Word Similarity Datasets

  • Conference paper
Advances in Computational Intelligence Systems (UKCI 2017)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 650)

Abstract

Measuring semantic similarity is essential to many natural language processing (NLP) tasks. One widely used way to evaluate similarity models is to test their consistency with human judgments using human-scored gold-standard datasets, which consist of word pairs with similarity scores assigned by human subjects. However, previous work has often not described sufficiently how such datasets are constructed. Many questions, e.g., how the word pairs are selected and whether the scores are reasonable, are not clearly addressed. In this paper, we propose a multidisciplinary method for building and validating gold-standard semantic similarity datasets, which consists of three steps. First, word pairs are selected based on computational linguistic resources. Second, the similarity of each selected word pair is scored by human subjects. Finally, event-related potential (ERP) experiments are conducted to test the soundness of the constructed dataset. Using the proposed method, we constructed a Chinese gold-standard word similarity dataset with 260 word pairs and validated its soundness via ERP experiments. Although the paper focuses only on constructing a Chinese dataset, the proposed method is applicable to other languages.
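
The evaluation setup described in the abstract, testing a model's consistency with human-scored word pairs, can be sketched as follows. This is a minimal illustration that assumes Spearman rank correlation as the consistency measure; the abstract does not name the measure the authors use. The function names and the toy scores are hypothetical and are not data from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: one human score and one model score per word pair.
human_scores = [9.1, 7.4, 3.2, 0.8]
model_scores = [0.83, 0.71, 0.40, 0.05]
rho = spearman(human_scores, model_scores)
print(f"Spearman correlation with human judgments: {rho:.3f}")
```

A rank-based measure is a common choice here because human ratings and model scores usually live on different scales, so only the ordering of the word pairs is compared.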

Notes

  1. http://www.sogou.com/labs/resource/ca.php

  2. https://code.google.com/archive/p/word2vec/

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 61573294), the National Social Science Foundation of China (No. 16AZD049), and the Fujian Province 2011 Collaborative Innovation Center of TCM Health Management.

Author information

Corresponding author

Correspondence to Yidong Chen.

Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Wan, Y., Chen, Y., Shi, X., Cai, G., Cai, L. (2018). A Multidisciplinary Method for Constructing and Validating Word Similarity Datasets. In: Chao, F., Schockaert, S., Zhang, Q. (eds) Advances in Computational Intelligence Systems. UKCI 2017. Advances in Intelligent Systems and Computing, vol 650. Springer, Cham. https://doi.org/10.1007/978-3-319-66939-7_4

  • DOI: https://doi.org/10.1007/978-3-319-66939-7_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-66938-0

  • Online ISBN: 978-3-319-66939-7

  • eBook Packages: Engineering, Engineering (R0)