Abstract
The amount of software that hosts spyware has increased dramatically. To avoid legal repercussions, vendors need to inform users about the inclusion of spyware via end user license agreements (EULAs) during the installation of an application. However, this information is intentionally written in a way that is hard for users to comprehend. We investigate how to automatically discriminate between legitimate software and spyware-associated software by mining EULAs. For this purpose, we compile a data set of 996 EULAs, of which 9.6% are associated with spyware. We compare the performance of 17 learning algorithms with that of a baseline algorithm on two data sets, one based on a bag-of-words model and one on a metadata model. The majority of learning algorithms significantly outperform the baseline regardless of which data representation is used. However, a non-parametric test indicates that the bag-of-words model is more suitable than the metadata model. Our conclusion is that automatic EULA classification can be applied to assist users in making informed decisions about whether to install an application without having read the EULA. We therefore outline the design of a spyware prevention tool and suggest how to select suitable learning algorithms for the tool by using a multi-criteria evaluation approach.
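To make the bag-of-words idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a crude regular-expression tokenizer and uses a multinomial naive Bayes classifier with add-one smoothing as a stand-in learner; the abstract does not say which 17 algorithms were evaluated or how the baseline was defined, so the majority-class baseline here is also an assumption. The EULA snippets and all function names are invented for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Map a text to word counts (crude tokenizer: runs of ASCII letters)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def train_naive_bayes(docs, labels):
    """Collect per-class word counts and class frequencies."""
    classes = sorted(set(labels))
    word_counts = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        word_counts[label].update(bag_of_words(doc))
    vocab = set().union(*word_counts.values())
    return classes, word_counts, Counter(labels), vocab


def predict(model, text):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    classes, word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for c in classes:
        total_words = sum(word_counts[c].values())
        score = math.log(doc_counts[c] / total_docs)
        for word, n in bag_of_words(text).items():
            if word in vocab:  # words never seen in training are ignored
                p = (word_counts[c][word] + 1) / (total_words + len(vocab))
                score += n * math.log(p)
        scores[c] = score
    return max(scores, key=scores.get)


def majority_baseline(labels):
    """Assumed baseline: always predict the most frequent training class."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical toy data; NOT taken from the 996-EULA data set.
train_docs = [
    "this software may display advertisements and collect browsing information for third parties",
    "we may share collected usage information with advertising partners",
    "you may install one copy of the software on a single computer",
    "the software is provided as is without warranty of any kind",
    "this license grants you the right to use the software",
]
labels = ["spyware", "spyware", "legitimate", "legitimate", "legitimate"]

model = train_naive_bayes(train_docs, labels)
unseen = "the program will collect information and display advertisements from partners"
print("baseline:", majority_baseline(labels))
print("classifier:", predict(model, unseen))
```

On this toy input the baseline always answers with the majority class, while the classifier can flag the unseen text based on its vocabulary. A real evaluation would need held-out data and a metric suited to the class imbalance noted in the abstract.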
Cite this article
Lavesson, N., Boldt, M., Davidsson, P. et al. Learning to detect spyware using end user license agreements. Knowl Inf Syst 26, 285–307 (2011). https://doi.org/10.1007/s10115-009-0278-z
DOI: https://doi.org/10.1007/s10115-009-0278-z