Iterative Denoising for Cross-Corpus Discovery

Priebe, Carey E.; Marchette, David J.; Park, Youngser; Wegman, Edward J.; Solka, Jeffrey L.; Socolinsky, Diego A.; Karakos, Damianos; Church, Ken W.; Guglielmi, Roland; Coifman, Ronald R.; Lin, Dekang; Healy, Dennis M.; Jacobs, Marc Q.; Tsao, Anna

doi:10.1007/978-3-7908-2656-2_31

Abstract

We consider the problem of statistical pattern recognition in a heterogeneous, high-dimensional setting. In particular, we consider the search for meaningful cross-category associations in a heterogeneous text document corpus. Our approach involves “iterative denoising”, that is, iteratively extracting (corpus-dependent) features and partitioning the document collection into sub-corpora. We present an anecdote wherein this methodology discovers a meaningful cross-category association in a heterogeneous collection of scientific documents.
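
The loop described in the abstract (extract features that depend on the current corpus, partition, then repeat within each sub-corpus) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say which features or partitioning rule the chapter uses, so TF-IDF weights recomputed inside each sub-corpus and a two-seed cosine split are assumptions made purely for illustration, and every function name below is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Corpus-dependent features: TF-IDF weights computed within `docs` only."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # A term present in every document of this sub-corpus gets weight 0,
        # so the features change as the corpus is narrowed.
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def split_in_two(vectors):
    """Stand-in partitioner (an assumption): assign each document to the
    nearer of two seeds, the first document and the one least similar to it."""
    seed_a = 0
    seed_b = min(range(len(vectors)), key=lambda i: cosine(vectors[0], vectors[i]))
    left, right = [], []
    for i, v in enumerate(vectors):
        if cosine(v, vectors[seed_a]) >= cosine(v, vectors[seed_b]):
            left.append(i)
        else:
            right.append(i)
    return left, right


def iterative_denoise(docs, indices=None, depth=2, min_size=2):
    """Recursively re-extract features and partition; returns leaf index lists."""
    if indices is None:
        indices = list(range(len(docs)))
    if depth == 0 or len(indices) <= min_size:
        return [indices]
    # Features are recomputed from scratch on the current sub-corpus.
    vectors = tfidf([docs[i] for i in indices])
    left, right = split_in_two(vectors)
    if not left or not right:
        return [indices]
    leaves = []
    for part in (left, right):
        leaves += iterative_denoise(docs, [indices[i] for i in part], depth - 1, min_size)
    return leaves


toy_corpus = [
    "gene protein cell expression",
    "protein cell membrane gene",
    "galaxy star orbit telescope",
    "star telescope galaxy planet",
]
leaves = iterative_denoise(toy_corpus)
```

On this four-document toy corpus the first split separates the two biology documents from the two astronomy documents. The point of recomputing `tfidf` at every level is that terms which are uninformative across the whole collection can become discriminative inside a sub-corpus, and vice versa.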

Author information

Authors and Affiliations

AlgoTek, Inc., 3811 N. Fairfax Dr., Suite 700, USA
Carey E. Priebe, Edward J. Wegman, Diego A. Socolinsky, Ken W. Church, Roland Guglielmi, Ronald R. Coifman, Dekang Lin, Marc Q. Jacobs & Anna Tsao
NSWCDD B10, Dahlgren, VA, USA
David J. Marchette & Jeffrey L. Solka
Johns Hopkins University, Baltimore, MD, USA
Youngser Park & Damianos Karakos
DARPA, Arlington, VA, 22203, USA
Dennis M. Healy

Editor information

Editors and Affiliations

Faculty of Mathematics and Physics Department of Statistics and Probability, Charles University, Sokolovská 83, 18675, Prague 8 - Karlin, Czech Republic
Jaromir Antoch

About this paper

Cite this paper

Priebe, C.E. et al. (2004). Iterative Denoising for Cross-Corpus Discovery. In: Antoch, J. (eds) COMPSTAT 2004 — Proceedings in Computational Statistics. Physica, Heidelberg. https://doi.org/10.1007/978-3-7908-2656-2_31

DOI: https://doi.org/10.1007/978-3-7908-2656-2_31
Publisher Name: Physica, Heidelberg
Print ISBN: 978-3-7908-1554-2
Online ISBN: 978-3-7908-2656-2
eBook Packages: Springer Book Archive
