An Approach to Improving the Classification of the New York Times Annotated Corpus

Mozzherina, Elena

doi:10.1007/978-3-642-41360-5_7

Part of the book series: Communications in Computer and Information Science (CCIS, volume 394)

Included in the following conference series:

International Conference on Knowledge Engineering and the Semantic Web

Abstract

The New York Times Annotated Corpus contains over 1.5 million manually tagged articles. It could become a useful resource for evaluating document clustering algorithms. Because the documents were labeled over a period of twenty years, it is argued that the classification may contain errors, caused by possible disagreement between experts and by the need to add new tags over time. This paper presents an approach to improving the classification quality that uses the assigned tags as a starting point.

It is assumed that each tag can be described by a set of features. These features are selected based on the mutual information between the tag and the stems of the documents that carry it. An algorithm is presented for reassigning tags when a document does not contain the features of its labels. Experiments were performed on about ninety thousand articles published by the New York Times in 2005, and the results of applying the algorithm to this collection are discussed.
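The abstract does not give the details of the feature selection or of the reassignment rule, so the following Python sketch is only an illustration of the general idea, not the author's algorithm. It assumes binary stem presence, the standard 2x2 contingency-table form of mutual information, and a hypothetical reassignment rule (keep the tags whose features occur in the document; if none remain, fall back to the tag with the largest feature overlap). The function names, the `k` parameter, the positive-association filter, and the toy corpus are all assumptions made for the example.

```python
import math
from collections import Counter


def mutual_information(n11, n10, n01, n00):
    """MI (in bits) between stem presence and tag presence.

    n11: docs with stem and tag     n10: docs with stem, without tag
    n01: docs with tag, no stem     n00: docs with neither
    """
    n = n11 + n10 + n01 + n00
    cells = (
        (n11, n11 + n10, n11 + n01),  # stem present, tag present
        (n10, n11 + n10, n10 + n00),  # stem present, tag absent
        (n01, n01 + n00, n11 + n01),  # stem absent,  tag present
        (n00, n01 + n00, n10 + n00),  # stem absent,  tag absent
    )
    # Empty cells contribute nothing (limit of x * log x as x -> 0).
    return sum(
        (cell / n) * math.log2(n * cell / (row * col))
        for cell, row, col in cells
        if cell > 0
    )


def select_features(docs, tag, k=2):
    """Return the k stems with the highest MI for `tag`.

    docs is a list of (set_of_stems, set_of_tags) pairs. Only stems that are
    more frequent inside the tag than outside it are considered (assumption),
    because MI alone also rewards negatively associated stems.
    """
    n = len(docs)
    n_tag = sum(1 for _, tags in docs if tag in tags)
    df = Counter()         # document frequency of each stem
    df_in_tag = Counter()  # document frequency among docs carrying the tag
    for stems, tags in docs:
        df.update(stems)
        if tag in tags:
            df_in_tag.update(stems)
    scores = {}
    for stem, n11 in df_in_tag.items():
        n10 = df[stem] - n11
        n01 = n_tag - n11
        n00 = n - n11 - n10 - n01
        if n11 * (n - n_tag) > n10 * n_tag:  # positive association only
            scores[stem] = mutual_information(n11, n10, n01, n00)
    ranked = sorted(scores, key=lambda s: (-scores[s], s))
    return set(ranked[:k])


def reassign_tags(docs, features):
    """Hypothetical rule: drop tags whose features are absent from the doc."""
    new_labels = []
    for stems, tags in docs:
        kept = {t for t in tags if stems & features.get(t, set())}
        if not kept:
            overlap = {t: len(stems & f) for t, f in features.items()}
            best = max(sorted(overlap), key=overlap.get)
            # Keep the original tags if no tag's features match at all.
            kept = {best} if overlap[best] > 0 else set(tags)
        new_labels.append(kept)
    return new_labels


# Toy corpus of stemmed documents; the last one is deliberately mislabeled.
docs = [
    ({"goal", "match", "team"}, {"Sports"}),
    ({"goal", "team", "coach"}, {"Sports"}),
    ({"vote", "senat", "bill"}, {"Politics"}),
    ({"vote", "bill", "elect"}, {"Politics"}),
    ({"goal", "team", "score"}, {"Politics"}),
]
features = {tag: select_features(docs, tag) for tag in ("Sports", "Politics")}
new_labels = reassign_tags(docs, features)
```

In this toy setting the last document contains none of the stems selected for its assigned tag, so the sketch moves it to the tag whose features it does contain. A real run on the corpus would also need tokenization and stemming (for example with a Porter stemmer), which are left out here.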

Author information

Authors and Affiliations

Saint Petersburg State University, 7-9, Universitetskaya nab., St.Petersburg, 199034, Russia
Elena Mozzherina

Editor information

Editors and Affiliations

Institute of Artificial Intelligence, University of Ulm, 89069, Ulm, Germany
Pavel Klinov
Intelligence Systems Laboratory, Saint Petersburg National Research University of Information Technologies, Mechanics and Optics, Kronverksky prospekt 49, office 380, 197101, St. Petersburg, Russia
Dmitry Mouromtsev

About this paper

Cite this paper

Mozzherina, E. (2013). An Approach to Improving the Classification of the New York Times Annotated Corpus. In: Klinov, P., Mouromtsev, D. (eds) Knowledge Engineering and the Semantic Web. KESW 2013. Communications in Computer and Information Science, vol 394. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41360-5_7

DOI: https://doi.org/10.1007/978-3-642-41360-5_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41359-9
Online ISBN: 978-3-642-41360-5
eBook Packages: Computer Science, Computer Science (R0)
