Data from Multiple Web Sources: Crawling, Integrating, Preprocessing, and Designing Applications

Batista, Natércia A.; Brandão, Michele A.; Pinheiro, Michele B.; Dalip, Daniel H.; Moro, Mirella M.

doi:10.1007/978-3-030-35102-1_8

Abstract

Data from the Web are increasingly heterogeneous and unstructured, which poses challenges for data crawling, integration, and preprocessing. Some studies are “data oriented,” i.e., they are developed to address a problem raised by the available data, so their results are restricted to those data. In contrast, many problems arise before one even identifies what data a specific study needs, and multiple data sources are often required. This chapter covers such problems with definitions, current solutions, possible issues, and future work. Specifically, the first issue in dealing with data coming from the Web is to define the crawling strategy, which can be classified according to its period and how it is started. The second issue is to define a strategy for integrating data from different sources to provide a uniform view for users or applications, and to store the data in a way that allows efficient querying. One possibility is to collect data from each source and store them separately for later integration; another is to store all data in a single location in an integrated fashion as each collection is performed. The third issue is data preprocessing, which takes place before or after data integration and involves handling missing and duplicate data, normalization, data veracity, etc. Overall, this chapter addresses these three issues in an integrated way, with a focus on practical and research questions.
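The integration and preprocessing issues named in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's method: the record fields (`email`, `name`, `city`), the choice of the e-mail address as the matching key, and the rule that earlier sources take precedence are all assumptions made only for this example.

```python
from typing import Dict, Iterable, List, Optional

Record = Dict[str, Optional[str]]


def normalize(record: Record) -> Record:
    """Trim whitespace, map empty strings to None, lowercase the e-mail key."""
    cleaned: Record = {}
    for field, value in record.items():
        if isinstance(value, str) and value.strip():
            cleaned[field] = value.strip()
        else:
            cleaned[field] = None  # treat blank or absent values as missing
    if cleaned.get("email"):
        cleaned["email"] = cleaned["email"].lower()
    return cleaned


def integrate(*sources: Iterable[Record]) -> List[Record]:
    """Merge records that share an e-mail (hypothetical key).

    Duplicates collapse into one record; a value that is missing in an
    earlier source is filled from a later source when available.
    """
    merged: Dict[str, Record] = {}
    for source in sources:
        for raw in source:
            rec = normalize(raw)
            key = rec.get("email")
            if key is None:
                continue  # no key to match on; a real pipeline would log this
            target = merged.setdefault(key, {})
            for field, value in rec.items():
                if target.get(field) is None:
                    target[field] = value
    return list(merged.values())


# Two made-up sources describing overlapping people.
source_a = [{"email": " Ana@Example.org ", "name": "Ana", "city": None}]
source_b = [
    {"email": "ana@example.org", "name": "", "city": "Belo Horizonte"},
    {"email": "bob@example.org", "name": "Bob", "city": None},
]
unified = integrate(source_a, source_b)
```

Here the sources are integrated as they are read, which corresponds to the "single location" option described above; storing each source separately and calling `integrate` later would correspond to the other option.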

Notes

1. Open Knowledge Foundation: https://okfn.org/.
2. Brazilian Open Data Portal: http://dados.gov.br.
3. data.gov: https://www.data.gov.
4. European Data Portal: http://europeandataportal.eu.
5. CKAN: https://ckan.org.
6. Socrata: https://socrata.com.
7. GeoServer: http://geoserver.org.
8. MapServer: https://mapserver.org.
9. OGC: http://www.opengeospatial.org.
10. FOAF: http://xmlns.com/foaf/spec/.
11. W3C: https://www.w3.org/standards/semanticWeb/ontology.
12. https://www.w3.org/TR/rdf11-primer/#section-graph-syntax.
13. LOD Cloud: https://lod-cloud.net/.
14. HTML Microdata: https://www.w3.org/TR/microdata/.
15. Plain Old Semantic HTML: http://microformats.org/wiki/posh.
16. Microformats wiki: http://microformats.org/wiki/Main_Page.
17. Web Services Architecture: https://www.w3.org/TR/ws-arch/.
18. Difference between ETL and data integration: https://www.passionned.com/is-data-integration-becoming-the-new-etl/. Accessed on February 4, 2019.
19. In graph theory, a clique is a set of vertices in a graph where each vertex is connected to all others by an edge, so it is a fully connected graph.
20. Density is the ratio between the number of edges in the graph and the maximal number of edges.
21. Project Apoena: http://bit.ly/proj-apoena.
22. Lab CSX: http://www.labcsx.dcc.ufmg.br.
23. Piim-Lab: http://piim-lab.decom.cefetmg.br.
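
The definitions in notes 19 and 20 can be made concrete with a short Python sketch. It assumes an undirected simple graph (no self-loops, no parallel edges), so the maximal number of edges on n vertices is n(n-1)/2; the notes themselves do not state this, and the function names are chosen only for this example.

```python
from itertools import combinations
from typing import FrozenSet, Hashable, Iterable, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def edge_set(edges: Iterable[Edge]) -> Set[FrozenSet[Hashable]]:
    """Undirected, deduplicated edges; self-loops are dropped (assumption)."""
    return {frozenset((u, v)) for u, v in edges if u != v}


def density(n_vertices: int, edges: Iterable[Edge]) -> float:
    """Number of edges divided by the maximal number of edges, n(n-1)/2."""
    max_edges = n_vertices * (n_vertices - 1) // 2
    if max_edges == 0:
        return 0.0  # convention chosen here for graphs with fewer than 2 vertices
    return len(edge_set(edges)) / max_edges


def is_clique(vertices: Iterable[Hashable], edges: Iterable[Edge]) -> bool:
    """True if every pair of the given vertices is joined by an edge."""
    present = edge_set(edges)
    return all(frozenset(pair) in present
               for pair in combinations(set(vertices), 2))


# A triangle a-b-c with one extra vertex d attached to c.
graph_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
triangle_is_clique = is_clique(["a", "b", "c"], graph_edges)
whole_graph_density = density(4, graph_edges)  # 4 of 6 possible edges
```

Under these assumptions a clique is exactly a vertex set whose induced subgraph has density 1.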

Acknowledgements

The research that resulted in the writing of this chapter was funded by CAPES, CNPq and FAPEMIG.

Author information

Authors and Affiliations

Federal University of Minas Gerais (UFMG), Minas Gerais, Brazil
Natércia A. Batista, Michele B. Pinheiro & Mirella M. Moro
Federal Institute of Minas Gerais (IFMG), Belo Horizonte, Brazil
Michele A. Brandão
Federal Center of Technological Education of Minas Gerais (CEFET-MG), Minas Gerais, Brazil
Daniel H. Dalip

Corresponding author

Correspondence to Michele A. Brandão.

Editor information

Editors and Affiliations

Department of Computer Science, Federal University of Rio Grande do Sul, Porto Alegre, Rio Grande do Sul, Brazil
Valter Roesler
Department of Computer Science, Federal University of Juiz de Fora, Juiz de Fora, Minas Gerais, Brazil
Eduardo Barrére
Department of Computer Science, Federal University of Santa Catarina, Florianopolis, Santa Catarina, Brazil
Roberto Willrich

About this chapter

Cite this chapter

Batista, N.A., Brandão, M.A., Pinheiro, M.B., Dalip, D.H., Moro, M.M. (2020). Data from Multiple Web Sources: Crawling, Integrating, Preprocessing, and Designing Applications. In: Roesler, V., Barrére, E., Willrich, R. (eds) Special Topics in Multimedia, IoT and Web Technologies. Springer, Cham. https://doi.org/10.1007/978-3-030-35102-1_8

DOI: https://doi.org/10.1007/978-3-030-35102-1_8
Published: 03 March 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-35101-4
Online ISBN: 978-3-030-35102-1
eBook Packages: Computer Science (R0)