Uncertainty Estimation and Analysis of Categorical Web Data

  • Conference paper
Uncertainty Reasoning for the Semantic Web III (URSW 2012, URSW 2011, URSW 2013)

Abstract

Web data often manifest high levels of uncertainty. We focus on categorical Web data and represent these uncertainty levels as first- or second-order uncertainty. By means of concrete examples, we show how to quantify and handle these uncertainties using the Beta-Binomial and Dirichlet-Multinomial models, and how to take into account possibly unseen categories in our samples by using the Dirichlet process. We conclude by exemplifying how these higher-order models can be used as a basis for analyzing datasets once at least part of their uncertainty has been taken into account. We demonstrate how to use the Bhattacharyya statistical distance to quantify the similarity between Dirichlet distributions, and we use the results to analyze a Web dataset of piracy attacks both visually and automatically.
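
As an illustration of the last step, the following is a minimal sketch, not the authors' implementation, of how the Bhattacharyya distance between two Dirichlet distributions could be computed from categorical counts. It assumes a symmetric Dirichlet prior with all parameters equal to 1 and uses the closed form D_B = (ln B(a) + ln B(b)) / 2 - ln B((a + b) / 2), where B is the multivariate Beta function. The function names and the example counts are hypothetical.

```python
from math import lgamma


def log_multivariate_beta(alpha):
    """Log of the multivariate Beta function B(alpha)."""
    return sum(lgamma(a) for a in alpha) - lgamma(sum(alpha))


def dirichlet_posterior(counts, prior=1.0):
    """Posterior Dirichlet parameters from category counts.

    Assumes a symmetric Dirichlet prior (prior=1.0 is the uniform prior);
    this choice is an assumption of the sketch.
    """
    return [c + prior for c in counts]


def bhattacharyya_distance(alpha, beta):
    """Bhattacharyya distance between Dirichlet(alpha) and Dirichlet(beta)."""
    if len(alpha) != len(beta):
        raise ValueError("both distributions need the same number of categories")
    midpoint = [(a + b) / 2.0 for a, b in zip(alpha, beta)]
    return (
        0.5 * (log_multivariate_beta(alpha) + log_multivariate_beta(beta))
        - log_multivariate_beta(midpoint)
    )


# Hypothetical counts of three categories observed in two samples.
counts_a = [10, 5, 1]
counts_b = [3, 6, 9]

posterior_a = dirichlet_posterior(counts_a)
posterior_b = dirichlet_posterior(counts_b)
distance = bhattacharyya_distance(posterior_a, posterior_b)
print(distance)
```

The distance is zero when the two parameter vectors coincide and grows as the distributions diverge, so it can serve as a similarity score between samples summarized by Dirichlet posteriors.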

Notes

  1. http://semanticweb.cs.vu.nl/lop

  2. http://www.icc-ccs.org/

  3. The code is available at http://trustingwebdata.org/books/URSW_III/DP.zip.

Acknowledgments

This research was partially supported by the Data2Semantics Media project in the Dutch national program COMMIT.

Author information

Correspondence to Davide Ceolin.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Ceolin, D., van Hage, W.R., Fokkink, W., Schreiber, G. (2014). Uncertainty Estimation and Analysis of Categorical Web Data. In: Bobillo, F., et al. Uncertainty Reasoning for the Semantic Web III. URSW 2012, URSW 2011, URSW 2013. Lecture Notes in Computer Science, vol 8816. Springer, Cham. https://doi.org/10.1007/978-3-319-13413-0_14

  • DOI: https://doi.org/10.1007/978-3-319-13413-0_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-13412-3

  • Online ISBN: 978-3-319-13413-0

  • eBook Packages: Computer Science, Computer Science (R0)
