Information Networks Mining and Analysis

Yu, Philip S.

doi:10.1007/978-3-642-20291-9_1

Philip S. Yu

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6612)

Included in the following conference series:

Asia-Pacific Web Conference


Abstract

With the ubiquity of information networks and their broad applications, there have been numerous studies on the construction, online analytical processing, and mining of information networks in multiple disciplines, including social network analysis, the World Wide Web, database systems, data mining, machine learning, and networked communication and information systems. Given the great demand for research in this direction, there is a need to understand methods for the analysis of information networks drawn from multiple disciplines. In this talk, we present various issues and solutions in the scalable mining and analysis of information networks. These include data integration, data cleaning, and data validation in information networks, as well as summarization, OLAP, and multidimensional analysis. Finally, we illustrate how to apply network analysis techniques to solve the classical frequent item-set mining problem in a more efficient top-down fashion.

More specifically, on data integration, data cleaning, and data validation, we discuss two problems concerning the correctness of information on the web. The first is Veracity, i.e., conformity to truth, which addresses how to find true facts among the large amount of conflicting information that various web sites provide on many subjects. We present a general framework for the Veracity problem that exploits the relationships between web sites and their information: a web site is trustworthy if it provides many pieces of true information, and a piece of information is likely to be true if it is provided by many trustworthy web sites. The second problem is object distinction, i.e., how to distinguish different people or objects that share identical names. This is a nontrivial task, especially when only very limited information is associated with each person or object. We present a general object distinction methodology that combines two complementary measures of relational similarity, set resemblance of neighbor tuples and random walk probability, and that analyzes subtle linkages effectively.
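The mutual dependence between site trustworthiness and fact confidence described above can be illustrated with a small fixed-point iteration. The following is only a minimal sketch under stated assumptions, not the framework presented in the talk: the function name estimate_veracity, the update rules (a noisy-or combination of provider trust, normalized across conflicting values of the same subject), the initial trust of 0.5, and the toy data are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from math import prod


def estimate_veracity(claims, iterations=10, initial_trust=0.5):
    """Iteratively score sites and facts (illustrative update rules only).

    claims maps each site to a dict {subject: value}; two different values
    for the same subject are treated as conflicting facts.
    """
    trust = {site: initial_trust for site in claims}

    # Index which sites provide each (subject, value) fact.
    providers = defaultdict(set)
    for site, facts in claims.items():
        for subject, value in facts.items():
            providers[(subject, value)].add(site)

    confidence = {}
    for _ in range(iterations):
        # A fact scores high if at least one trustworthy site provides it.
        raw = {
            fact: 1.0 - prod(1.0 - trust[s] for s in sites)
            for fact, sites in providers.items()
        }
        # Conflicting values for one subject compete: normalize per subject.
        totals = defaultdict(float)
        for (subject, _value), score in raw.items():
            totals[subject] += score
        confidence = {
            fact: (score / totals[fact[0]] if totals[fact[0]] > 0 else 0.0)
            for fact, score in raw.items()
        }
        # A site is as trustworthy as the average confidence of its facts.
        trust = {
            site: (
                sum(confidence[(s, v)] for s, v in facts.items()) / len(facts)
                if facts
                else initial_trust
            )
            for site, facts in claims.items()
        }
    return trust, confidence


# Hypothetical toy data: three sites agree, one site disagrees on both subjects.
example_claims = {
    "site_a": {"author": "X", "year": "2001"},
    "site_b": {"author": "X", "year": "2001"},
    "site_c": {"author": "X"},
    "site_d": {"author": "Y", "year": "1999"},
}
example_trust, example_confidence = estimate_veracity(example_claims)
```

On this toy input the dissenting site ends up with lower trust than the sites that corroborate one another, which is the qualitative behavior the principle predicts; the exact numbers depend entirely on the assumed update rules.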

OLAP (On-Line Analytical Processing) is an important notion in data analysis. There exists a similar need to deploy graph analysis from different perspectives and with multiple granularities. However, traditional OLAP technology cannot handle such demands because it does not consider the links among individual data tuples. Here, we examine a novel graph OLAP framework, which presents a multi-dimensional and multi-level view over graphs. We also look into different semantics of OLAP operations, and discuss two major types of graph OLAP: informational OLAP and topological OLAP.
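A multi-level view over a graph can be illustrated by a roll-up that merges nodes into coarser groups and aggregates the edges between them. This is a minimal sketch under assumptions, not the framework from the talk: the function name roll_up, the weighted-edge representation, the sum aggregate, and the author-to-institution hierarchy are hypothetical. Because it changes the node set, this kind of operation is arguably closest to what the abstract calls topological OLAP.

```python
from collections import Counter


def roll_up(edges, group_of):
    """Aggregate an undirected weighted graph to a coarser granularity.

    edges maps (u, v) pairs to weights; group_of maps each node to its group.
    Edges whose endpoints fall in the same group become a self-loop on it.
    """
    coarse = Counter()
    for (u, v), weight in edges.items():
        # Sort the endpoints so (A, B) and (B, A) share one undirected key.
        key = tuple(sorted((group_of[u], group_of[v])))
        coarse[key] += weight
    return dict(coarse)


# Hypothetical co-authorship counts between four authors.
coauthor_edges = {
    ("ann", "bob"): 3,
    ("ann", "cat"): 1,
    ("bob", "dan"): 2,
    ("cat", "dan"): 4,
}
# Hypothetical hierarchy level: author -> institution.
institution = {"ann": "inst_A", "bob": "inst_A", "cat": "inst_B", "dan": "inst_B"}

institution_view = roll_up(coauthor_edges, institution)
```

The same function can be applied again with an institution-to-country mapping to move one level higher, which is the sense in which the view is multi-level.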

We next examine summarization of massive graphs. We will use the connectivity problem of determining the minimum-cut between any pair of nodes in the network to illustrate the need for graph summarization. The problem is well solved in the classical literature for memory resident graphs. However, large graphs may often be disk resident, and such graphs cannot be efficiently processed for connectivity queries. We will discuss edge and node sampling based approaches to create compressed representations of the underlying disk resident graphs. Since the compressed representations can be held in main memory, they can be used to derive efficient approximations for the connectivity problem.
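The idea of answering connectivity queries on a compressed, memory-resident sample can be sketched as follows. This is an illustration under assumptions rather than the sampling scheme discussed in the talk: it uses plain uniform edge sampling with a hypothetical rate p, computes the minimum cut on the sample with a simple augmenting-path max-flow over unit capacities, and rescales by 1/p to obtain a rough estimate. The function names are hypothetical.

```python
import random
from collections import defaultdict, deque


def sample_edges(edges, p, seed=0):
    """Keep each edge independently with probability p (uniform sampling)."""
    rng = random.Random(seed)
    return [edge for edge in edges if rng.random() < p]


def min_cut(edges, s, t):
    """Minimum s-t cut of an undirected unit-capacity graph via max-flow."""
    if s == t:
        raise ValueError("s and t must be different nodes")
    cap = defaultdict(lambda: defaultdict(int))
    for u, v in edges:
        cap[u][v] += 1
        cap[v][u] += 1
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow  # No augmenting path left: flow equals the min cut.
        # Push one unit of flow back along the path that was found.
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= 1
            cap[v][u] += 1
            v = u
        flow += 1


def approximate_min_cut(edges, s, t, p, seed=0):
    """Estimate the s-t min cut from a p-sample, rescaled by 1/p."""
    return min_cut(sample_edges(edges, p, seed), s, t) / p
```

With p = 1.0 the sample is the full graph and the estimate is exact; smaller values of p trade accuracy for a representation that is more likely to fit in main memory.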

Finally, we examine how to apply information network analysis techniques to frequent item-set mining. Almost all state-of-the-art algorithms follow a bottom-up approach that grows patterns step by step, and hence cannot mine very long patterns in a large database. Here we focus on mining the top-k long maximal frequent patterns, because long patterns are in general the more interesting ones. Unlike traditional bottom-up strategies, the network approach works in a top-down manner. It pulls large maximal cliques from a pattern graph constructed after some fast initial processing, and directly uses these large maximal cliques as promising candidates for long frequent patterns. A separate refinement stage is then applied to transform these candidates into true maximal patterns.
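The top-down idea can be sketched on a toy database. The abstract does not specify how the pattern graph is built or how refinement works, so the choices below are assumptions made for illustration: items are nodes, two items are linked when the pair meets the minimum support, maximal cliques are enumerated with the basic Bron-Kerbosch recursion, and the refinement stage is reduced to a plain support check on each candidate. All function names are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def pattern_graph(transactions, min_support):
    """Link two items when the pair is frequent (assumed construction)."""
    items = sorted({item for t in transactions for item in t})
    adj = {item: set() for item in items}
    for a, b in combinations(items, 2):
        if support({a, b}, transactions) >= min_support:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def top_k_long_patterns(transactions, min_support, k):
    """Take the k largest cliques as candidates and keep the frequent ones."""
    adj = pattern_graph(transactions, min_support)
    candidates = sorted(maximal_cliques(adj), key=lambda c: (-len(c), sorted(c)))
    results = []
    for candidate in candidates[:k]:
        count = support(candidate, transactions)
        if count >= min_support:  # Simplified stand-in for refinement.
            results.append((candidate, count))
    return results


# Hypothetical toy database of six transactions.
toy_db = [
    frozenset("abcd"),
    frozenset("abcd"),
    frozenset("abcde"),
    frozenset("ef"),
    frozenset("ef"),
    frozenset("ae"),
]
longest = top_k_long_patterns(toy_db, min_support=2, k=1)
```

A clique in this graph only guarantees that every pair of its items is frequent, not that the whole set is, which is why a separate check (and, in a fuller treatment, a genuine refinement step) is needed before a candidate is reported as a pattern.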


Author information

Authors and Affiliations

University of Illinois at Chicago, Chicago, Illinois, USA
Philip S. Yu


Editor information

Editors and Affiliations

School of Information, Renmin University of China, 100872, Beijing, China
Xiaoyong Du
LFCS, School of Informatics, University of Edinburgh, 10 Crichton Street, EH8 9AB, Edinburgh, Scotland, UK
Wenfei Fan
School of Software, Tsinghua University, Room 819, Main Building, 100084, Beijing, China
Jianmin Wang
Computer School, Wuhan University, Luojiashan Road, 430072, Wuhan, Hubei, China
Zhiyong Peng
School of Information Technology and Electrical Engineering, The University of Queensland, QLD 4072, St. Lucia, Australia
Mohamed A. Sharaf


About this paper

Cite this paper

Yu, P.S. (2011). Information Networks Mining and Analysis. In: Du, X., Fan, W., Wang, J., Peng, Z., Sharaf, M.A. (eds) Web Technologies and Applications. APWeb 2011. Lecture Notes in Computer Science, vol 6612. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20291-9_1


DOI: https://doi.org/10.1007/978-3-642-20291-9_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20290-2
Online ISBN: 978-3-642-20291-9
eBook Packages: Computer Science, Computer Science (R0)
