Approximating Multi-class Text Classification Via Automatic Generation of Training Examples

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2017)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10762)

Abstract

Text classification is among the most widely used machine learning tools in computational linguistics. Web information retrieval is one of the areas that has benefited most from this technique. Applications range from page classification, used by search engines, to URL classification, used for focused crawling and on-line time-sensitive applications [2]. Because of the pressing need for the highest possible accuracy, a supervised learning approach is always preferred when an adequately large set of training examples is available. However, building such an accurate and representative training set often becomes impractical once the number of classes grows beyond a handful, so alternative unsupervised or semi-supervised approaches have emerged. Using standard web directories as a source of examples can produce undesired effects, for example because of maliciously misclassified web pages. In addition, this option depends on all the desired classes being present in the directory hierarchy.

In this paper we propose a new framework that takes as input a textual description of each class and a set of URLs, and automatically builds a representative training set able to reasonably approximate the classification accuracy obtained with a manually curated training set. Our approach leverages the observation that a non-negligible fraction of website names result from the juxtaposition of a few keywords. Indeed, the entire URL can often be converted into a meaningful text snippet. When this happens, we can label the URL by measuring its degree of similarity with each class description. The text contained in the pages corresponding to labelled URLs can then be used as a training set for any subsequent classification task (not necessarily on the web). Experiments on a set of 20,000 web pages belonging to 9 categories show that our auto-labelling framework attains more than 88% of the accuracy of a purely supervised classifier trained with manually curated examples.
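The URL-labelling idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy vocabulary, the three class descriptions, the dictionary-based word segmentation, the cosine similarity, and the 0.2 threshold are all assumptions chosen for illustration, and the abstract does not say which segmentation or similarity measure the paper actually uses.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical toy vocabulary; a real system would need a full dictionary.
VOCAB = {"cheap", "flights", "hotel", "rome", "pizza", "recipes", "football", "news"}

# Hypothetical class descriptions (the paper's 9 categories are not listed here).
CLASS_DESCRIPTIONS = {
    "travel": "flights hotel booking holiday travel rome",
    "food": "pizza recipes cooking restaurant food",
    "sport": "football news match team sport",
}


def segment(token, vocab=VOCAB):
    """Split a run of letters into vocabulary words; return None if impossible."""
    # best[i] holds one segmentation of token[:i], or None if none exists.
    best = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            if best[start] is not None and token[start:end] in vocab:
                best[end] = best[start] + [token[start:end]]
                break
    return best[len(token)]


def url_to_snippet(url):
    """Convert a URL into a list of keywords, or None if it is not fully readable."""
    parsed = urlparse(url)
    # Naive host handling (assumption): drop the last label and any "www".
    labels = [l for l in parsed.netloc.lower().split(".")[:-1] if l != "www"]
    words = []
    for chunk in labels + [parsed.path.lower()]:
        for token in re.findall(r"[a-z]+", chunk):
            pieces = segment(token)
            if pieces is None:
                return None
            words.extend(pieces)
    return words or None


def cosine(a, b):
    """Cosine similarity between two bags of words."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def label_url(url, descriptions=CLASS_DESCRIPTIONS, threshold=0.2):
    """Return the most similar class name, or None if the URL cannot be labelled."""
    words = url_to_snippet(url)
    if words is None:
        return None
    scores = {name: cosine(words, text.split()) for name, text in descriptions.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


if __name__ == "__main__":
    print(label_url("http://www.cheapflights.com/rome"))  # travel
    print(label_url("http://pizzarecipes.it/"))           # food
    print(label_url("http://xkqz.com/"))                  # None
```

In this sketch, URLs that cannot be fully segmented are left unlabelled rather than guessed at, which mirrors the "when this happens" condition in the abstract. The pages behind the labelled URLs would then supply the text of the training set.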

Notes

  1. http://www.dmoz.org/World/Italiano/.

References

  1. Adami, G., Avesani, P., Sona, D.: Clustering documents into a web directory for bootstrapping a supervised classification. Data Knowl. Eng. 54(3), 301–325 (2005). https://doi.org/10.1016/j.datak.2004.11.003

  2. Baykan, E., Henzinger, M., Marian, L., Weber, I.: Purely URL-based topic classification. In: Proceedings of the 18th International Conference on World Wide Web, pp. 1109–1110. ACM (2009)

  3. Baykan, E., Henzinger, M., Marian, L., Weber, I.: A comprehensive study of features and algorithms for URL-based topic classification. ACM Trans. Web 5(3), 15:1–15:29 (2011). https://doi.org/10.1145/1993053.1993057

  4. Baykan, E., Henzinger, M., Weber, I.: A comprehensive study of techniques for URL-based web page language classification. ACM Trans. Web 7(1), 3:1–3:37 (2013). https://doi.org/10.1145/2435215.2435218

  5. Bennett, G., Scholer, F., Uitdenbogerd, A.: A comparative study of probabilistic and language models for information retrieval. In: Proceedings of the Nineteenth Conference on Australasian Database, vol. 75, pp. 65–74. Australian Computer Society Inc. (2008)

  6. Boyd, D., Crawford, K.: Critical questions for big data: provocations for a cultural, technological, and scholarly phenomenon. Inf. Commun. Soc. 15(5), 662–679 (2012)

  7. Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: ACM Sigmod Record, vol. 29, pp. 93–104. ACM (2000)

  8. Broder, A.Z., Ciccolo, P., Fontoura, M., Gabrilovich, E., Josifovski, V., Riedel, L.: Search advertising using web relevance feedback. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM 2008, pp. 1013–1022. ACM, New York (2008). https://doi.org/10.1145/1458082.1458217

  9. Carpineto, C., Osiński, S., Romano, G., Weiss, D.: A survey of web clustering engines. ACM Comput. Surv. 41(3), 17:1–17:38 (2009)

  10. Castellanos, M., Daniel, F., Garrigós, I., Mazón, J.N.: Business intelligence and the web. Inf. Syst. Front. 15(3), 307–309 (2013)

  11. Charikar, M.S.: Similarity estimation techniques from rounding algorithms. In: Proceedings of STOC-02, 34th Annual ACM Symposium on the Theory of Computing, Montreal, CA, pp. 380–388 (2002)

  12. Dong, H., Hussain, F.: Focused crawling for automatic service discovery, annotation, and classification in industrial digital ecosystems. IEEE Trans. Ind. Electron. 58(6), 2106–2116 (2011)

  13. Eickhoff, C., Serdyukov, P., de Vries, A.P.: Web page classification on child suitability. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM 2010, pp. 1425–1428. ACM, New York (2010). https://doi.org/10.1145/1871437.1871638

  14. Erdélyi, M., Garzó, A., Benczúr, A.A.: Web spam classification: a few features worth more. In: Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality, WebQuality 2011, pp. 27–34. ACM, New York (2011). https://doi.org/10.1145/1964114.1964121

  15. Forman, G.: Quantifying counts and costs via classification. Data Min. Knowl. Discov. 17(2), 164–206 (2008)

  16. Fürnkranz, J.: Exploiting structural information for text classification on the WWW. In: Hand, D.J., Kok, J.N., Berthold, M.R. (eds.) IDA 1999. LNCS, vol. 1642, pp. 487–497. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48412-4_41

  17. de Groc, C.: Babouk: Focused web crawling for corpus compilation and automatic terminology extraction. In: 2011 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), vol. 1, pp. 497–498 (2011)

  18. Halvorson, T., et al.: The BIZ top-level domain: ten years later. In: Taft, N., Ricciato, F. (eds.) PAM 2012. LNCS, vol. 7192, pp. 221–230. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28537-0_22

  19. Hao, H.W., Mu, C.X., Yin, X.C., Li, S., Wang, Z.B.: An improved topic relevance algorithm for focused crawling. In: 2011 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 850–855, October 2011

  20. Hernández, I., Rivero, C.R., Ruiz, D., Corchuelo, R.: A statistical approach to URL-based web page clustering. In: Proceedings of the 21st International Conference Companion on World Wide Web, WWW 2012 Companion, pp. 525–526. ACM, New York (2012). https://doi.org/10.1145/2187980.2188109

  21. Hernández, I., Rivero, C.R., Ruiz, D., Corchuelo, R.: CALA: an unsupervised URL-based web page classification system. Knowl.-Based Syst. 57, 168–180 (2014). http://www.sciencedirect.com/science/article/pii/S0950705113003997

  22. Kriegel, H.P., Schubert, M.: Classification of websites as sets of feature vectors. In: Databases and Applications, pp. 127–132 (2004)

  23. Liu, T.Y.: Learning to rank for information retrieval. Found. Trends Inf. Retr. 3(3), 225–331 (2009). https://doi.org/10.1561/1500000016

  24. Liu, T.Y., Xu, J., Qin, T., Xiong, W., Li, H.: LETOR: Benchmark dataset for research on learning to rank for information retrieval. In: Proceedings of SIGIR 2007 Workshop on Learning to Rank for Information Retrieval, pp. 3–10 (2007)

  25. Mangai, J.A., Kumar, V.S., Alias Balamurugan, S.A.: A novel feature selection framework for automatic web page classification. Int. J. Autom. Comput. 9(4), 442–448 (2012)

  26. Marath, S.T., Shepherd, M., Milios, E., Duffy, J.: Large-scale web page classification. In: 2014 47th Hawaii International Conference on System Sciences (HICSS), pp. 1813–1822. IEEE (2014)

  27. Milli, L., Monreale, A., Rossetti, G., Giannotti, F., Pedreschi, D., Sebastiani, F.: Quantification trees. In: 2013 IEEE 13th International Conference on Data Mining (ICDM), pp. 528–536. IEEE (2013)

  28. Özel, S.A.: A web page classification system based on a genetic algorithm using tagged-terms as features. Expert Syst. Appl. 38(4), 3407–3415 (2011). https://doi.org/10.1016/j.eswa.2010.08.126

  29. Patil, A.S., Pawar, B.: Automated classification of web sites using naive Bayesian algorithm. In: Proceedings of the International Multi-Conference of Engineers and Computer Scientists, vol. 1, pp. 14–16 (2012)

  30. Qi, X., Davison, B.D.: Web page classification: features and algorithms. ACM Comput. Surv. (CSUR) 41(2), 12 (2009)

  31. Rajalakshmi, R., Aravindan, C.: Naive Bayes approach for website classification. In: Das, V.V., Thomas, G., Lumban Gaol, F. (eds.) AIM 2011. CCIS, vol. 147, pp. 323–326. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20573-6_55

  32. Robertson, S.E.: Overview of the okapi projects. J. Doc. 53(1), 3–7 (1997)

  33. Rose, D.E., Levinson, D.: Understanding user goals in web search. In: Proceedings of the 13th International Conference on World Wide Web, pp. 13–19. ACM (2004)

  34. Saad, M.K., Hewahi, N.M.: A comparative study of outlier mining and class outlier mining. Comput. Sci. Lett. 1(1) (2009)

  35. Sebastiani, F.: Text quantification. In: de Rijke, M. (ed.) ECIR 2014. LNCS, vol. 8416, pp. 819–822. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-06028-6_104

  36. Smith, M., Martinez, T.: Improving classification accuracy by identifying and removing instances that should be misclassified. In: The 2011 International Joint Conference on Neural Networks (IJCNN), pp. 2690–2697, July 2011

  37. Tang, L., Gao, H., Liu, H.: Network quantification despite biased labels. In: Proceedings of the Eighth Workshop on Mining and Learning with Graphs, pp. 147–154. ACM (2010)

  38. Taylan, D., Poyraz, M., Akyokus, S., Ganiz, M.: Intelligent focused crawler: learning which links to crawl. In: 2011 International Symposium on Innovations in Intelligent Systems and Applications (INISTA), pp. 504–508, June 2011

  39. Trivedi, A., Rai, P., Daumé III, H., DuVall, S.L.: Leveraging social bookmarks from partially tagged corpus for improved web page clustering. ACM Trans. Intell. Syst. Technol. (TIST) 3(4), 67 (2012)

  40. Wilkinson, R., Zobel, J., Sacks-Davis, R.: Similarity measures for short queries. In: Fourth Text REtrieval Conference (TREC-4), pp. 277–285 (1995)

  41. Xu, Z., Yan, F., Qin, J., Zhu, H.: A web page classification algorithm based on link information. In: 2011 Tenth International Symposium on Distributed Computing and Applications to Business, Engineering and Science (DCABES), pp. 82–86, October 2011

  42. Yu, H., Han, J., Chang, K.C.: PEBL: web page classification without negative examples. IEEE Trans. Knowl. Data Eng. 16(1), 70–81 (2004)

  43. Zhong, S., Zou, D.: Web page classification using an ensemble of support vector machine classifiers. J. Netw. 6(11), 1625–1630 (2011)

Funding

This work was supported by the Regione Toscana of Italy under the grant POR CRO 2007/2013 Asse IV Capitale Umano, and by Registro.it, the Italian Registry of the ccTLD “it”.

Author information

Corresponding author

Correspondence to Filippo Geraci.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Geraci, F., Papini, T. (2018). Approximating Multi-class Text Classification Via Automatic Generation of Training Examples. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2017. Lecture Notes in Computer Science, vol 10762. Springer, Cham. https://doi.org/10.1007/978-3-319-77116-8_44

  • DOI: https://doi.org/10.1007/978-3-319-77116-8_44

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-77115-1

  • Online ISBN: 978-3-319-77116-8

  • eBook Packages: Computer Science, Computer Science (R0)
