Abstract

We consider statistical pattern recognition in a heterogeneous, high-dimensional setting; in particular, the search for meaningful cross-category associations in a heterogeneous corpus of text documents. Our approach is “iterative denoising”: iteratively extracting (corpus-dependent) features and partitioning the document collection into sub-corpora. We present an anecdote in which this methodology discovers a meaningful cross-category association in a heterogeneous collection of scientific documents.
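The extract-then-partition loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not say how features are extracted or how the corpus is partitioned, so the code below swaps in a deliberately crude stand-in. It treats the term whose document frequency within the current sub-corpus is closest to 50% as the corpus-dependent feature, and splits on the presence of that term. The names `best_split_term`, `iterative_denoise`, `leaves`, `min_size`, and `max_depth` are all hypothetical.

```python
from collections import Counter


def best_split_term(docs):
    """Return the term whose document frequency in *this* sub-corpus is
    closest to 50% (an assumed stand-in for corpus-dependent feature
    extraction). Ties are broken alphabetically for determinism."""
    if not docs:
        return None
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    if not df:
        return None
    return min(df, key=lambda t: (abs(df[t] / n - 0.5), t))


def iterative_denoise(docs, min_size=2, max_depth=3, depth=0):
    """Recursively re-extract a feature on each sub-corpus and partition on it.

    Returns a nested dict: leaves have an empty "children" list.
    """
    node = {"docs": docs, "split_term": None, "children": []}
    if len(docs) <= min_size or depth >= max_depth:
        return node
    term = best_split_term(docs)
    if term is None:
        return node
    with_term = [d for d in docs if term in d]
    without_term = [d for d in docs if term not in d]
    if not with_term or not without_term:
        return node  # the feature does not separate this sub-corpus
    node["split_term"] = term
    node["children"] = [
        iterative_denoise(with_term, min_size, max_depth, depth + 1),
        iterative_denoise(without_term, min_size, max_depth, depth + 1),
    ]
    return node


def leaves(node):
    """Collect the final sub-corpora (leaf nodes) of the tree."""
    if not node["children"]:
        return [node["docs"]]
    return [sub for child in node["children"] for sub in leaves(child)]
```

Each document is assumed to be a list of tokens. The point of the sketch is only that the feature is recomputed on every sub-corpus rather than once for the whole collection, so terms that are uninformative globally can become informative after a split.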

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Priebe, C.E. et al. (2004). Iterative Denoising for Cross-Corpus Discovery. In: Antoch, J. (eds) COMPSTAT 2004 — Proceedings in Computational Statistics. Physica, Heidelberg. https://doi.org/10.1007/978-3-7908-2656-2_31

  • DOI: https://doi.org/10.1007/978-3-7908-2656-2_31

  • Publisher Name: Physica, Heidelberg

  • Print ISBN: 978-3-7908-1554-2

  • Online ISBN: 978-3-7908-2656-2

  • eBook Packages: Springer Book Archive
