Approximating Multi-class Text Classification Via Automatic Generation of Training Examples

Geraci, Filippo; Papini, Tiziano

doi:10.1007/978-3-319-77116-8_44

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10762)

Included in the following conference series:

International Conference on Computational Linguistics and Intelligent Text Processing

Abstract

Text classification is among the most widely used machine learning tools in computational linguistics. Web information retrieval is one of the most important areas to have benefited from this technique. Applications range from page classification, used by search engines, to URL classification, used for focused crawling and on-line time-sensitive applications [2]. Because of the pressing need for the highest possible accuracy, a supervised learning approach is always preferred when an adequately large set of training examples is available. However, building an accurate and representative training set often becomes impractical once the number of classes grows beyond a handful, so alternative unsupervised or semi-supervised approaches have emerged. Using standard web directories as a source of examples can lead to undesired effects caused, for example, by maliciously misclassified web pages. In addition, this option depends on all the desired classes being present in the directory hierarchy.

In this paper we propose a new framework that takes as input a textual description of each class and a set of URLs, and automatically builds a representative training set able to reasonably approximate the classification accuracy obtained with a manually curated training set. Our approach leverages the observation that a non-negligible fraction of website names results from the juxtaposition of a few keywords. Moreover, the entire URL can often be converted into a meaningful text snippet. When this happens, we can label the URL by measuring its degree of similarity with each class description. The text contained in the pages corresponding to labelled URLs can then be used as a training set for any subsequent classification task (not necessarily on the web). Experiments on a set of 20 thousand web pages belonging to 9 categories show that our auto-labelling framework attains over 88% of the accuracy of a purely supervised classifier trained with manually curated examples.
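
The URL-labelling idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how URLs are converted into snippets or how similarity is measured, so the vocabulary-driven segmentation, the bag-of-words cosine similarity, the threshold value, and every name used here (`VOCAB`, `CLASS_DESCRIPTIONS`, `segment`, `url_to_snippet`, `label_url`) are assumptions made only for illustration.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical keyword vocabulary used to split juxtaposed words in host names.
VOCAB = {"cheap", "flights", "pizza", "recipes", "football", "news", "deals"}

# Hypothetical textual descriptions, one per class.
CLASS_DESCRIPTIONS = {
    "travel": "travel flights hotels holidays deals",
    "food": "food pizza recipes cooking restaurants",
    "sport": "sport football matches news teams",
}


def segment(token, vocab):
    """Split a run of letters into vocabulary words, preferring fewer words.

    Returns an empty list when the token cannot be fully covered.
    """
    best = [None] * (len(token) + 1)  # best[i] holds a segmentation of token[:i]
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            word = token[start:end]
            if best[start] is not None and word in vocab:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(token)] or []


def url_to_snippet(url, vocab=VOCAB):
    """Convert a URL into a list of words; unsegmentable tokens are dropped."""
    parsed = urlparse(url)
    labels = parsed.netloc.lower().split(".")[:-1]  # drop the TLD (simplification)
    chunks = [label for label in labels if label != "www"] + [parsed.path.lower()]
    words = []
    for chunk in chunks:
        for token in re.findall(r"[a-z]+", chunk):
            words.extend(segment(token, vocab))
    return words


def cosine(a, b):
    """Cosine similarity between two bags of words."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def label_url(url, descriptions=CLASS_DESCRIPTIONS, threshold=0.2):
    """Return the most similar class, or None when the URL is uninformative."""
    snippet = url_to_snippet(url)
    if not snippet:
        return None
    scores = {name: cosine(snippet, text.split()) for name, text in descriptions.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


print(label_url("http://www.cheapflights.com/deals"))
print(label_url("http://www.xqzt.org/"))
```

In a pipeline of this kind, URLs that receive a label would contribute the text of their pages to the training set, while URLs for which `label_url` returns `None` would simply be skipped.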

Notes

1. http://www.dmoz.org/World/Italiano/

References

Adami, G., Avesani, P., Sona, D.: Clustering documents into a web directory for bootstrapping a supervised classification. Data Knowl. Eng. 54(3), 301–325 (2005). https://doi.org/10.1016/j.datak.2004.11.003
Baykan, E., Henzinger, M., Marian, L., Weber, I.: Purely URL-based topic classification. In: Proceedings of the 18th International Conference on World Wide Web, pp. 1109–1110. ACM (2009)
Baykan, E., Henzinger, M., Marian, L., Weber, I.: A comprehensive study of features and algorithms for URL-based topic classification. ACM Trans. Web 5(3), 15:1–15:29 (2011). https://doi.org/10.1145/1993053.1993057
Baykan, E., Henzinger, M., Weber, I.: A comprehensive study of techniques for URL-based web page language classification. ACM Trans. Web 7(1), 3:1–3:37 (2013). https://doi.org/10.1145/2435215.2435218
Bennett, G., Scholer, F., Uitdenbogerd, A.: A comparative study of probabilistic and language models for information retrieval. In: Proceedings of the Nineteenth Conference on Australasian Database, vol. 75, pp. 65–74. Australian Computer Society Inc. (2008)
Boyd, D., Crawford, K.: Critical questions for big data: provocations for a cultural, technological, and scholarly phenomenon. Inf. Commun. Soc. 15(5), 662–679 (2012)
Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: ACM Sigmod Record, vol. 29, pp. 93–104. ACM (2000)
Broder, A.Z., Ciccolo, P., Fontoura, M., Gabrilovich, E., Josifovski, V., Riedel, L.: Search advertising using web relevance feedback. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM 2008, pp. 1013–1022. ACM, New York (2008). https://doi.org/10.1145/1458082.1458217
Carpineto, C., Osiński, S., Romano, G., Weiss, D.: A survey of web clustering engines. ACM Comput. Surv. 41(3), 17:1–17:38 (2009)
Castellanos, M., Daniel, F., Garrigós, I., Mazón, J.N.: Business intelligence and the web. Inf. Syst. Front. 15(3), 307–309 (2013)
Charikar, M.S.: Similarity estimation techniques from rounding algorithms. In: Proceedings of STOC-02, 34th Annual ACM Symposium on the Theory of Computing, Montreal, CA, pp. 380–388 (2002)
Dong, H., Hussain, F.: Focused crawling for automatic service discovery, annotation, and classification in industrial digital ecosystems. IEEE Trans. Ind. Electron. 58(6), 2106–2116 (2011)
Eickhoff, C., Serdyukov, P., de Vries, A.P.: Web page classification on child suitability. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM 2010, pp. 1425–1428. ACM, New York (2010). https://doi.org/10.1145/1871437.1871638
Erdélyi, M., Garzó, A., Benczúr, A.A.: Web spam classification: a few features worth more. In: Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality, WebQuality 2011, pp. 27–34. ACM, New York (2011). https://doi.org/10.1145/1964114.1964121
Forman, G.: Quantifying counts and costs via classification. Data Min. Knowl. Discov. 17(2), 164–206 (2008)
Fürnkranz, J.: Exploiting structural information for text classification on the WWW. In: Hand, D.J., Kok, J.N., Berthold, M.R. (eds.) IDA 1999. LNCS, vol. 1642, pp. 487–497. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48412-4_41
de Groc, C.: Babouk: Focused web crawling for corpus compilation and automatic terminology extraction. In: 2011 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), vol. 1, pp. 497–498 (2011)
Halvorson, T., et al.: The BIZ top-level domain: ten years later. In: Taft, N., Ricciato, F. (eds.) PAM 2012. LNCS, vol. 7192, pp. 221–230. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28537-0_22
Hao, H.W., Mu, C.X., Yin, X.C., Li, S., Wang, Z.B.: An improved topic relevance algorithm for focused crawling. In: 2011 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 850–855, October 2011
Hernández, I., Rivero, C.R., Ruiz, D., Corchuelo, R.: A statistical approach to URL-based web page clustering. In: Proceedings of the 21st International Conference Companion on World Wide Web, WWW 2012 Companion, pp. 525–526. ACM, New York (2012). https://doi.org/10.1145/2187980.2188109
Hernández, I., Rivero, C.R., Ruiz, D., Corchuelo, R.: CALA: an unsupervised URL-based web page classification system. Knowl.-Based Syst. 57, 168–180 (2014). http://www.sciencedirect.com/science/article/pii/S0950705113003997
Kriegel, H.P., Schubert, M.: Classification of websites as sets of feature vectors. In: Databases and Applications, pp. 127–132 (2004)
Liu, T.Y.: Learning to rank for information retrieval. Found. Trends Inf. Retr. 3(3), 225–331 (2009). https://doi.org/10.1561/1500000016
Liu, T.Y., Xu, J., Qin, T., Xiong, W., Li, H.: LETOR: Benchmark dataset for research on learning to rank for information retrieval. In: Proceedings of SIGIR 2007 Workshop on Learning to Rank for Information Retrieval, pp. 3–10 (2007)
Mangai, J.A., Kumar, V.S., Alias Balamurugan, S.A.: A novel feature selection framework for automatic web page classification. Int. J. Autom. Comput. 9(4), 442–448 (2012)
Marath, S.T., Shepherd, M., Milios, E., Duffy, J.: Large-scale web page classification. In: 2014 47th Hawaii International Conference on System Sciences (HICSS), pp. 1813–1822. IEEE (2014)
Milli, L., Monreale, A., Rossetti, G., Giannotti, F., Pedreschi, D., Sebastiani, F.: Quantification trees. In: 2013 IEEE 13th International Conference on Data Mining (ICDM), pp. 528–536. IEEE (2013)
Özel, S.A.: A web page classification system based on a genetic algorithm using tagged-terms as features. Expert Syst. Appl. 38(4), 3407–3415 (2011). https://doi.org/10.1016/j.eswa.2010.08.126
Patil, A.S., Pawar, B.: Automated classification of web sites using naive Bayesian algorithm. In: Proceedings of the International Multi-Conference of Engineers and Computer Scientists, vol. 1, pp. 14–16 (2012)
Qi, X., Davison, B.D.: Web page classification: features and algorithms. ACM Comput. Surv. (CSUR) 41(2), 12 (2009)
Rajalakshmi, R., Aravindan, C.: Naive Bayes approach for website classification. In: Das, V.V., Thomas, G., Lumban Gaol, F. (eds.) AIM 2011. CCIS, vol. 147, pp. 323–326. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20573-6_55
Robertson, S.E.: Overview of the Okapi projects. J. Doc. 53(1), 3–7 (1997)
Rose, D.E., Levinson, D.: Understanding user goals in web search. In: Proceedings of the 13th International Conference on World Wide Web, pp. 13–19. ACM (2004)
Saad, M.K., Hewahi, N.M.: A comparative study of outlier mining and class outlier mining. Comput. Sci. Lett. 1(1) (2009)
Sebastiani, F.: Text quantification. In: de Rijke, M. (ed.) ECIR 2014. LNCS, vol. 8416, pp. 819–822. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-06028-6_104
Smith, M., Martinez, T.: Improving classification accuracy by identifying and removing instances that should be misclassified. In: The 2011 International Joint Conference on Neural Networks (IJCNN), pp. 2690–2697, July 2011
Tang, L., Gao, H., Liu, H.: Network quantification despite biased labels. In: Proceedings of the Eighth Workshop on Mining and Learning with Graphs, pp. 147–154. ACM (2010)
Taylan, D., Poyraz, M., Akyokus, S., Ganiz, M.: Intelligent focused crawler: learning which links to crawl. In: 2011 International Symposium on Innovations in Intelligent Systems and Applications (INISTA), pp. 504–508, June 2011
Trivedi, A., Rai, P., Daumé III, H., DuVall, S.L.: Leveraging social bookmarks from partially tagged corpus for improved web page clustering. ACM Trans. Intell. Syst. Technol. (TIST) 3(4), 67 (2012)
Wilkinson, R., Zobel, J., Sacks-Davis, R.: Similarity measures for short queries. In: Fourth Text REtrieval Conference (TREC-4), pp. 277–285 (1995)
Xu, Z., Yan, F., Qin, J., Zhu, H.: A web page classification algorithm based on link information. In: 2011 Tenth International Symposium on Distributed Computing and Applications to Business, Engineering and Science (DCABES), pp. 82–86, October 2011
Yu, H., Han, J., Chang, K.C.: PEBL: web page classification without negative examples. IEEE Trans. Knowl. Data Eng. 16(1), 70–81 (2004)
Zhong, S., Zou, D.: Web page classification using an ensemble of support vector machine classifiers. J. Netw. 6(11), 1625–1630 (2011)

Funding

This work was supported by the Regione Toscana of Italy under the grant POR CRO 2007/2013 Asse IV Capitale Umano, and by the Italian Registry of the ccTLD “it”, Registro.it.

Author information

Authors and Affiliations

Istituto di Informatica e Telematica, CNR, Via G. Moruzzi, 1, 56124, Pisa, Italy
Filippo Geraci
Quest-it.com, Via Firenze 33, 53048, Sinalunga, Italy
Tiziano Papini

Corresponding author

Correspondence to Filippo Geraci.

Editor information

Editors and Affiliations

CIC, Instituto Politécnico Nacional, Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Geraci, F., Papini, T. (2018). Approximating Multi-class Text Classification Via Automatic Generation of Training Examples. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2017. Lecture Notes in Computer Science, vol 10762. Springer, Cham. https://doi.org/10.1007/978-3-319-77116-8_44

DOI: https://doi.org/10.1007/978-3-319-77116-8_44
Published: 10 October 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77115-1
Online ISBN: 978-3-319-77116-8
eBook Packages: Computer Science, Computer Science (R0)