Abstract
Data from the Web are increasingly heterogeneous and unstructured, which poses challenges for data crawling, integration, and preprocessing. Many studies are “data oriented”: they are designed to address a problem raised by data that are already available, so their results are restricted to those data. In contrast, several problems arise before one even identifies what data a specific study needs, and multiple data sources are often required. This chapter covers such problems, with definitions, current solutions, possible issues, and future work. Specifically, the first issue in dealing with data from the Web is to define the crawling strategy, which can be classified according to its period and how it is started. The second issue is to define a strategy for integrating data from different sources, so that users or applications have a uniform view, and for storing them in a way that allows efficient querying. One option is to collect data from each source and store them separately for later integration; another is to store all data in a single location, integrating them as each collection is performed. The third issue is data preprocessing, which takes place before or after data integration and involves handling missing and duplicate data, normalization, data veracity, etc. Overall, this chapter addresses these three issues in an integrated way, with a focus on practical and research questions.
Notes
- 1. Open Knowledge Foundation: https://okfn.org/.
- 2. Brazilian Open Data Portal: http://dados.gov.br.
- 3. data.gov: https://www.data.gov.
- 4. European Data Portal: http://europeandataportal.eu.
- 5. CKAN: https://ckan.org.
- 6. Socrata: https://socrata.com.
- 7. GeoServer: http://geoserver.org.
- 8. MapServer: https://mapserver.org.
- 9.
- 10. FOAF: http://xmlns.com/foaf/spec/.
- 11.
- 12.
- 13. LOD Cloud: https://lod-cloud.net/.
- 14. HTML Microdata: https://www.w3.org/TR/microdata/.
- 15. Plain Old Semantic HTML: http://microformats.org/wiki/posh.
- 16. Microformats wiki: http://microformats.org/wiki/Main_Page.
- 17. Web Services Architecture: https://www.w3.org/TR/ws-arch/.
- 18. Difference between ETL and data integration: https://www.passionned.com/is-data-integration-becoming-the-new-etl/. Accessed on February 4, 2019.
- 19. In graph theory, a clique is a set of vertices in a graph where each vertex is connected to all others by an edge, so it is a fully connected graph.
- 20. Density is the ratio between the number of edges in the graph and the maximal number of edges.
- 21. Project Apoena: http://bit.ly/proj-apoena.
- 22. Lab CSX: http://www.labcsx.dcc.ufmg.br.
- 23. Piim-Lab: http://piim-lab.decom.cefetmg.br.
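The clique and density definitions given in the notes can be illustrated with a minimal Python sketch. It assumes an undirected simple graph (no self-loops, no parallel edges), so the maximal number of edges for n vertices is n(n-1)/2; the notes do not state this assumption, and the function names and the toy edge list are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def density(num_vertices, edges):
    """Ratio between the number of edges and the maximal number of edges.

    Assumption: undirected simple graph, so the maximum is n * (n - 1) / 2.
    """
    if num_vertices < 2:
        return 0.0
    # Treat (u, v) and (v, u) as the same edge and ignore self-loops.
    unique_edges = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    max_edges = num_vertices * (num_vertices - 1) // 2
    return len(unique_edges) / max_edges


def is_clique(vertices, edges):
    """True if every pair of the given vertices is connected by an edge."""
    edge_set = {frozenset(edge) for edge in edges}
    return all(frozenset(pair) in edge_set for pair in combinations(vertices, 2))


# Toy graph (hypothetical data): a triangle a-b-c plus a pendant vertex d.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]

triangle_is_clique = is_clique(["a", "b", "c"], edges)
whole_graph_is_clique = is_clique(["a", "b", "c", "d"], edges)
graph_density = density(4, edges)
```

Under these assumptions, a set of vertices forms a clique exactly when the subgraph restricted to those vertices has density 1.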
Acknowledgements
The research that resulted in the writing of this chapter was funded by CAPES, CNPq and FAPEMIG.
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Batista, N.A., Brandão, M.A., Pinheiro, M.B., Dalip, D.H., Moro, M.M. (2020). Data from Multiple Web Sources: Crawling, Integrating, Preprocessing, and Designing Applications. In: Roesler, V., Barrére, E., Willrich, R. (eds) Special Topics in Multimedia, IoT and Web Technologies. Springer, Cham. https://doi.org/10.1007/978-3-030-35102-1_8
DOI: https://doi.org/10.1007/978-3-030-35102-1_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-35101-4
Online ISBN: 978-3-030-35102-1
eBook Packages: Computer Science (R0)