Document Analysis and Retrieval Tasks in Scientific Digital Libraries

Information Retrieval (RuSSIR 2014)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 505)

Abstract

Machine Learning (ML) algorithms have opened up new possibilities for the acquisition and processing of documents in Information Retrieval (IR) systems. Indeed, it is now possible to automate several labor-intensive document-related tasks, such as categorization and entity extraction. Consequently, the application of machine learning techniques to large-scale IR tasks has attracted significant research interest in both the ML and IR communities. This tutorial provides a reference summary of our research in applying machine learning techniques to diverse tasks in Digital Libraries (DL). Digital library portals are specialized IR systems that work on collections of documents related to particular domains. We focus on open-access, scientific digital libraries such as CiteSeer\(^x\), which involve several crawling, ranking, content analysis, and metadata extraction tasks. We elaborate on the challenges involved in these tasks and highlight how machine learning methods can successfully address them.

Notes

  1. http://citeseerx.ist.psu.edu/

  2. http://scholar.google.com/

  3. http://academic.research.microsoft.com/

  4. http://romip.ru/russir2014/

  5. http://dl.acm.org/

  6. http://www.ncbi.nlm.nih.gov/pubmed

  7. http://www.cs.cmu.edu/afs/cs/project/theo20/www/data/

  8. The WebKB dataset was created in 1997.

  9. http://wordnet.princeton.edu/

  10. http://nlp.stanford.edu/ner/index.shtml

Author information

Correspondence to Sujatha Das Gollapalli.

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Gollapalli, S.D., Caragea, C., Li, X., Giles, C.L. (2015). Document Analysis and Retrieval Tasks in Scientific Digital Libraries. In: Braslavski, P., Karpov, N., Worring, M., Volkovich, Y., Ignatov, D.I. (eds) Information Retrieval. RuSSIR 2014. Communications in Computer and Information Science, vol 505. Springer, Cham. https://doi.org/10.1007/978-3-319-25485-2_1

  • DOI: https://doi.org/10.1007/978-3-319-25485-2_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25484-5

  • Online ISBN: 978-3-319-25485-2

  • eBook Packages: Computer Science, Computer Science (R0)
