
1 Introduction

Throughout history, the collapse of structures has been one of the main causes of material and human losses when earthquakes occur. This is mainly due to the use of low-quality materials, deficiencies in construction processes and non-compliance with building standards, among other causes [1].

Data Mining (DM) supports the diagnosis and analysis of structural performance. Given the characteristics of the data collected after an earthquake, the most widely used DM technique is Association Rules (AR) [2]. An example of its application to earthquakes is presented by Martínez-Álvarez et al. [3], who apply descriptive techniques to obtain Quantitative AR (QAR) and use a regression method (the M5P algorithm) to predict earthquake occurrence based on the relationship between frequency and magnitude, in order to observe the variation of earthquakes. The QAR obtained showed that earthquakes of moderate magnitude (between 3.5 and 4.4) occur after short time intervals, while for high magnitudes (earthquakes of magnitude 4.4 to 6.2) the time intervals, in relation to frequency and magnitude, decreased significantly.

In a similar way, Galán Montaño F. [4] uses DM techniques to describe the behavior of earthquakes according to their magnitude. In that study, a genetic algorithm capable of finding frequent patterns is used to obtain behavior models of the time series according to earthquake occurrence. The results show that before an earthquake of magnitude greater than 4.5 occurs, there is a high probability that an earthquake of magnitude 4.4 will occur.

On September 7th, 2017, an earthquake of magnitude 8.2 was registered in Oaxaca, Mexico, which caused damage to 57,621 homes, 1,988 schools, 102 cultural buildings and 104 public buildings [5]. A few days later, on September 19th, 2017, another earthquake of magnitude 7.1 occurred with its epicenter in Puebla, damaging more than 150 thousand homes in Oaxaca, Chiapas, Guerrero, Puebla, Morelos and the State of Mexico, with an estimated repair cost of up to 38,150 million Mexican pesos [6].

After these two earthquakes, 12,931 schools were registered with damage in the education sector: 577 require total reconstruction, 1,847 require partial reconstruction and the remainder have minor damage. The repair cost was estimated at around 13,650 million Mexican pesos [7]. Other affected structures were historical and culturally valuable buildings, such as the archaeological zone of Chiapa de Corzo, the Zócalo of Mexico City and the National Museum of Art, among others, whose repair has an estimated cost of 8,000 million Mexican pesos [8]. In this paper we apply association rule mining to estimate the repair cost required to reconstruct school buildings damaged by the earthquakes of September 7th and 19th, using the data provided by the Fund of Budget Transparency (the Mexican FONDEN).

It is important to note that the databases obtained during the earthquakes of September 7th and 19th are the first of their kind collected in Mexico. For structural engineering purposes the databases are not fully complete, but the data they contain are valuable for carrying out, for the first time, an earthquake engineering study based on DM. One of the main contributions of this work is to explore the use of DM in earthquake engineering. The rules obtained are valuable because they describe the distribution of earthquake damage costs as a function of some basic structural characteristics of the buildings. These rules are specific to the earthquakes of September 7th and 19th in Mexico and to the studied structures. Currently, there is no parallel structural engineering study that would allow the rules to be validated.

2 Association Rules

AR is a technique for discovering interesting associations or correlations in a transactional data set, where the rows represent the transactions and the columns the items [9]. Let \(I=\{i_1,i_2,\ldots,i_n\}\) be a set of n attributes or items and \(D=\{t_1,t_2,\ldots,t_m\}\) a set of transactions in a data set. Every transaction \(t_i \in D\) has a unique identifier Tid and contains a subset of the items, \(t_i \subseteq I\) [10].

An association rule can be defined as an implication of the form \(X \Rightarrow Y\), where X is the antecedent and Y is the consequent of the rule. For example, the rule \(\lbrace Bread, Cheese \rbrace \Rightarrow \lbrace Ham \rbrace\) means that when Bread and Cheese occur, Ham also occurs. To validate the quality of the rules and the probability that they reflect real relationships, two of the most widely used metrics are [11]: Support, the probability P that the set of items appears in the transactions, \(support(X\Rightarrow Y)=P(X\cup Y)\); and Confidence, the fraction of transactions containing X in which Y also appears, \(confidence(X \Rightarrow Y) = \frac{support(X \cup Y)}{support(X)}\).
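As a concrete illustration of these two metrics, the following minimal Python sketch computes the support and confidence of the example rule over a small, purely hypothetical list of transactions:

```python
# Toy illustration of support and confidence for {Bread, Cheese} => {Ham}.
# The transactions below are hypothetical and only serve to ground the formulas.
transactions = [
    {"Bread", "Cheese", "Ham"},
    {"Bread", "Cheese"},
    {"Bread", "Ham"},
    {"Cheese", "Ham", "Milk"},
    {"Bread", "Cheese", "Ham", "Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"Bread", "Cheese"}, {"Ham"}
sup_xy = support(X | Y, transactions)        # support(X => Y) = P(X U Y)
conf_xy = sup_xy / support(X, transactions)  # confidence(X => Y)

print(f"support = {sup_xy:.2f}, confidence = {conf_xy:.2f}")
# support = 0.40 (2 of 5 transactions contain all three items) and
# confidence = 0.67 (3 transactions contain Bread and Cheese, 2 of those also contain Ham)
```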

There are different algorithms for obtaining AR; in this paper we use the Apriori algorithm and the PSO-GES metaheuristic.

2.1 Apriori Algorithm

The Apriori algorithm is one of the first methods developed for association rule mining and is currently one of the most widely used. Apriori consists of two stages [11]: first, the algorithm identifies all the frequent itemsets, and then it converts them into AR (see Algorithm 1).

Algorithm 1. Pseudocode of the Apriori algorithm.
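Since Algorithm 1 is given as pseudocode, the following self-contained sketch illustrates the two stages in Python; it is a didactic implementation for small datasets, not the optimized library used later in the experiments of Sect. 3.3:

```python
from itertools import combinations

def apriori_rules(transactions, min_support=0.4, min_confidence=0.6):
    """Minimal two-stage Apriori: (1) find frequent itemsets, (2) derive rules."""
    n = len(transactions)
    support = lambda items: sum(items <= t for t in transactions) / n

    # Stage 1: level-wise generation of frequent itemsets.
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = list(level)
    while level:
        # Join k-itemsets that differ by one item to form (k+1)-candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c) >= min_support]
        frequent.extend(level)

    # Stage 2: convert each frequent itemset into rules X => Y that satisfy
    # the minimum confidence threshold.
    rules = []
    for itemset in (f for f in frequent if len(f) > 1):
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                confidence = support(itemset) / support(antecedent)
                if confidence >= min_confidence:
                    rules.append((set(antecedent), set(itemset - antecedent),
                                  support(itemset), confidence))
    return rules

# Toy usage with the example items from Sect. 2 (hypothetical transactions).
transactions = [{"Bread", "Cheese", "Ham"}, {"Bread", "Cheese"}, {"Bread", "Ham"},
                {"Cheese", "Ham", "Milk"}, {"Bread", "Cheese", "Ham", "Milk"}]
for X, Y, sup, conf in apriori_rules(transactions):
    print(f"{X} => {Y}  (support={sup:.2f}, confidence={conf:.2f})")
```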

2.2 PSO-GES

PSO-GES (see Algorithm 2) is a metaheuristic that generates quality AR with relatively low execution times. The algorithm consists of two stages: in the first stage, the dataset is transformed into a binary matrix and non-frequent items are eliminated. In the second stage, rule mining is carried out through a Guided Exploration Strategy, in which only items that positively influence the fitness of a rule are added during the particle evolution process. This is done by quickly estimating, from the summary matrix, the support and confidence values of the rule represented by a new particle [9].
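The PSO-GES implementation itself is proprietary (see Footnote 3), but the idea behind its first stage and its fast metric estimation, i.e., representing the data as a binary matrix so that the support and confidence of a candidate rule can be obtained from simple column operations, can be illustrated with the following NumPy sketch (the matrix and thresholds are hypothetical; this is not the authors' code):

```python
import numpy as np

# Hypothetical binary matrix: rows = transactions, columns = items (1 = item present).
D = np.array([[1, 1, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 1, 1],
              [1, 1, 1, 0]])

# Stage 1: eliminate non-frequent items (columns whose support is below the threshold).
min_support = 0.4
item_support = D.mean(axis=0)          # support of each single item
D = D[:, item_support >= min_support]  # keep only frequent columns

# Fast support/confidence estimate for a candidate rule X => Y encoded by column indices.
def rule_metrics(D, antecedent_cols, consequent_cols):
    has_x = D[:, antecedent_cols].all(axis=1)
    has_xy = has_x & D[:, consequent_cols].all(axis=1)
    support = has_xy.mean()
    confidence = has_xy.sum() / max(has_x.sum(), 1)
    return support, confidence

print(rule_metrics(D, antecedent_cols=[0, 1], consequent_cols=[2]))
# (0.4, 0.666...) for this toy matrix
```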

3 Experimental Set-Up

3.1 Database Description

The dataset used in this work corresponds to the school infrastructure affected by the September 2017 earthquakes in Mexico (ES17M dataset)Footnote 1. ES17M consists of 19,194 records and 76 features and was designed by structural experts in seismic risk analysis [1]. From the total features in ES17M, only those most relevant to seismic risk analysis were chosen, leaving 36 of the 76 available features (Table 1). The last four rows (funding source) have six values, including the cost, total amount of attention, amount exercised and description, which gives a total of 36 items.

Algorithm 2. Pseudocode of the PSO-GES algorithm.
Table 1. Summary of the main database characteristics

3.2 Preprocessing Data

The following procedure was performed in order to properly format the ES17M dataset as an input for the Apriori and PSO-GES algorithms:

1. Data standardization. All data were transformed into categorical data, represented by integer numbers from 0 to 9. For example, cost was categorized into ranges such as category 1 for amounts of 0 to 1,000 and category 2 for 1,001 to 10,000, among others. Strings such as "no aplica" ("not applicable") were replaced by the categorical number 0. Strings with similar meanings were clustered; for example, "piso", "pisos" and "superficie" ("floor", "floors", "surface") were included in the same category, represented by a single integer.

2. Structural feature generation. These features were obtained from PDF documents and JPG or PNG images of official documents. The distance to the epicenter was computed, as well as its relationship with the proximity to geological fault lines.

3. Discretization and transactional dataset. The dataset was transformed into a binary matrix, where a value of 1 represents the presence of an item and 0 its absence; the items were those obtained in step 1. The final ES17M (transactional) dataset was generated by keeping, for each record, only the numbers of the columns in which an item is present (a sketch of steps 1 and 3 follows this list).
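As a rough illustration of steps 1 and 3, the following pandas sketch discretizes a cost column into categorical codes, one-hot encodes the result into a binary matrix and keeps only the indices of the columns present in each record; the column names, bin edges and string values are hypothetical and do not reproduce the actual ES17M categories:

```python
import numpy as np
import pandas as pd

# Hypothetical fragment of the raw data (columns and values are illustrative only).
raw = pd.DataFrame({
    "cost": [850, 5200, 120000, 0],
    "roof": ["losa de concreto", "no aplica", "techo ligero", "losa de concreto"],
})

# Step 1: standardize into categorical integer codes.
cost_bins = [-1, 0, 1000, 10000, float("inf")]  # hypothetical bin edges
raw["cost_cat"] = pd.cut(raw["cost"], bins=cost_bins, labels=[0, 1, 2, 3]).astype(int)
raw["roof_cat"] = raw["roof"].map({"no aplica": 0,
                                   "losa de concreto": 1,
                                   "techo ligero": 2})

# Step 3: discretized data -> binary (one-hot) matrix -> transactional dataset,
# keeping only the column numbers where an item is present.
binary = pd.get_dummies(raw[["cost_cat", "roof_cat"]].astype(str))
mat = binary.to_numpy().astype(int)
transactions = [list(np.flatnonzero(row)) for row in mat]
print(transactions[0])  # column indices of the items present in the first record
```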

3.3 Algorithms Configuration

A Python libraryFootnote 2 was used to run the Apriori algorithm, and PSO-GES was executed with the authors' proprietary codeFootnote 3. The minimum support threshold was 0.00002 for both algorithms and the confidence threshold was 0.75. The additional PSO-GES parameters were set as presented in Ref. [9]: a population of 20 particles, constants \(c_1 = c_2 = 2\), inertia \(w = 1\), 10 epochs and one transaction per partition (\(K = 1\)).
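The specific library is indicated in Footnote 2; as one plausible way of reproducing this configuration for the Apriori side, the mlxtend implementation accepts the same two thresholds. The following is a hedged sketch with hypothetical transactions, not the authors' script:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical transactional records (in the real experiment these come from ES17M).
transactions = [["dist_0_50km", "constr_masonry_slab", "damage_moderate"],
                ["dist_51_100km", "constr_light_roof", "damage_minor"],
                ["dist_0_50km", "constr_masonry_slab", "damage_moderate"]]

# One-hot encode the transactions into a boolean matrix.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Thresholds reported in the paper: minimum support 0.00002 and minimum confidence 0.75.
frequent = apriori(onehot, min_support=0.00002, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.75)
print(rules[["antecedents", "consequents", "support", "confidence"]].head())
```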

4 Results

Because the databases used do not provide enough information to formally conduct a comprehensive and detailed study of seismic risk and behavior, the analysis focused on the relationship between the repair costs, the type of construction and variables such as the epicentral distance, the type of damage and the seismic zone. For the experiments, the execution of the algorithms was constrained so that the resulting AR contain the epicentral distance and the type of construction in the antecedent, and the type of damage, the damage detail, the total repair cost and the seismic zone in the consequent. Of the rules produced by both algorithms, those generated by the PSO-GES algorithm are presented, since they provided better AR for the seismic analysis.
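As an illustration of how such a constraint can be imposed by post-filtering the mined rules, the following sketch assumes the rules are available as (antecedent, consequent, support, confidence) tuples, as in the sketch of Sect. 2.1, and that items of interest can be recognized by hypothetical name prefixes (the prefixes and example rules below are illustrative only and do not correspond to the actual ES17M encoding):

```python
# Hypothetical mined rules in (antecedent, consequent, support, confidence) form.
mined_rules = [
    ({"dist_0_50km", "constr_masonry_slab"}, {"damage_moderate", "zone_C"}, 0.01, 0.82),
    ({"dist_0_50km", "school_level_primary"}, {"cost_high"}, 0.02, 0.78),
]

# Illustrative prefixes for the item groups allowed on each side of the rule.
ANTECEDENT_PREFIXES = ("dist_", "constr_")
CONSEQUENT_PREFIXES = ("damage_", "detail_", "cost_", "zone_")

def only(itemset, prefixes):
    """True if every item in the set starts with one of the given prefixes."""
    return all(str(item).startswith(prefixes) for item in itemset)

# Keep only rules whose antecedent/consequent match the seismic-analysis constraint.
seismic_rules = [
    (X, Y, sup, conf)
    for X, Y, sup, conf in mined_rules
    if only(X, ANTECEDENT_PREFIXES) and only(Y, CONSEQUENT_PREFIXES)
]
print(seismic_rules)  # only the first hypothetical rule satisfies the constraint
```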

4.1 Earthquake of September 7th, 2017

From the AR obtained (Table 2), it is possible to determine that most of the schools affected by the earthquake are located in seismic zone C (an intermediate zone, where earthquakes are not so frequent) and present moderate damage, mainly to the structure and finishes. The bar graph in Fig. 1 shows the relationship between the repair costs, the epicentral distance and the type of construction. In this graph we can see that the greatest damage occurred in schools with masonry walls and concrete slabs. It can also be observed that, in general, repair costs decrease as the schools are farther from the earthquake epicenter; thus, at a shorter epicentral distance, repair costs are greater than those of schools located at a greater epicentral distance.

Table 2. Earthquake association rules of September 7th 2017

4.2 Earthquake of September 19th, 2017

Table 3 shows the most important AR obtained with the data from the September 19th earthquake. The AR show that most of the affected schools are located in seismic zones C and B (intermediate zones, where earthquakes are not recorded so frequently) and presented moderate and severe damage, mainly to the structure, finishes and outdoor areas. Figure 2 shows a bar graph relating the distance from the earthquake epicenter to the damaged schools with the repair cost and the type of structure. In this case we can see that the greatest damage occurred in schools located at an epicentral distance of 101 to 150 km. As for the type of construction, the most affected schools were those built with masonry walls and light technology.

It is important to note that, in contrast to the AR obtained from the September 7th earthquake, the highest repair costs are not found at the shortest epicentral distances. In this case, the highest costs occurred at intermediate epicentral distances. This is explained by the fact that Mexico City is located precisely at this epicentral distance (100 to 150 km), so, having a large number of schools, the repair costs were higher. From the seismic point of view, this is explained by the soil characteristics in certain areas of Mexico City, where the seismic waves can be amplified. In practical terms, the accelerations produced in these areas of Mexico City are comparable to those that occur at sites much closer to the earthquake epicenter.

Fig. 1. Ranges of estimated damage costs caused by the earthquake of September 7, 2017, as a function of the epicentral distance and the type of construction.

Table 3. Earthquake Association Rules of September 19th 2017

5 Discussion

Historically, the availability of public data on the damage caused by earthquakes in Mexico was almost null until 2017. The difficulty of field surveys, data truthfulness, and capture and processing times are some of the causes that prevented this information from being obtained in past decades. However, given the advances in technology and the wide use of ICTs, the records of the damage caused by the earthquakes of September 7th and 19th, 2017 constitute an invaluable source of information for studying seismic behavior, impact and risk. It is still pending to homogenize these databases and rethink the type of data that is collected, in order to be able to carry out more formal studies of the behavior and seismic risk of constructions. Nevertheless, having the databases used in this work is already a great advance, because other researchers commonly rely on simulations or use only data such as the latitude and longitude of the earthquake, its magnitude, depth and location. Moreover, this kind of study can be applied to other sectors such as health, housing and cultural heritage, among others, provided that the structure of the data is the same as that used in this case study.

In this sense, one of the main contributions of the research presented in this paper is the provision of a standardized knowledge base for machine learning and DM algorithms, made available to the scientific community for its exploitation and study. With the initial results shown in this paper and the support of an expert in structural engineering, it was possible to give an overview of the behavior of the repair costs of the schools affected by these earthquakes as a function of the epicentral distance and the type of construction. In this way, it was found that the largest damage was produced by the earthquake of September 19th, whose highest reconstruction costs exceeded $43,000,000, while the costs for the earthquake of September 7th were below this amount. It was also possible to identify that, for the earthquake of September 7th (with an epicenter on the Oaxaca coast), the damage at greater epicentral distances was lower than the costs recorded for schools located closest to the earthquake.

Fig. 2. Ranges of estimated damage costs caused by the earthquake of September 19, 2017, as a function of the epicentral distance and the type of construction.

On the other hand, in the data of the September 19th earthquake, it was observed that the repair costs increased for distances between 100 km and 150 km. This behavior is consistent with existing seismological models, which recognize an amplification of the accelerations in certain areas of Mexico City, located within this range of epicentral distances; this explains the increase in repair costs. Regarding the type of construction, it was possible to establish that structures with masonry walls and light roofs were the ones that presented the greatest damage in both earthquakes. The most frequent type of damage corresponded to damage to the structure and finishes of the schools.

6 Conclusions and Future Works

With the results obtained, it is possible to conclude that DM can be a useful tool for seismic risk analysis, since it was capable of finding relations among different variables related to the studied earthquakes and structures. These variables were the distance from the earthquake epicenter to the schools, the type of structure (materials), the type of damage and the repair costs. It is important to emphasize that the quality of the AR was established with a confidence degree equal to or greater than 75%. The study performed here focused on obtaining a description of the damage caused by the earthquakes of September 7th and 19th to the affected schools.

The open lines of study are initially oriented towards studying the seismic risk of constructions a priori, that is, before an earthquake occurs; prediction is therefore being worked on to determine the repair cost of a school with certain characteristics and located at a given epicentral distance if an earthquake of a certain magnitude occurs. We are also interested in mining association rules for different seismic risk analysis scenarios, taking specific values for the type of damage, seismic zone, federative entity or other parameters of interest. Likewise, we contemplate replicating the preprocessing strategy with other types of buildings, such as hospitals and factories, where the main challenge is to include the human losses that unfortunately occurred.