1 Introduction

Performance analysis and tuning of parallel applications is becoming an increasingly complicated task, even for expert developers, because of the growing heterogeneity and complexity of current HPC systems. Performance problems in such systems may be produced by several different, and sometimes hard to relate, causes, which makes it difficult to find a way to solve them. Logically, this difficulty is exacerbated when the performance analysis and tuning process is carried out automatically and dynamically during the application execution.

Identifying performance problems requires gathering the appropriate metrics to find the causes of the bottleneck. At the processor level, hardware performance counters are a powerful source of information. This mechanism provides metrics about the utilization of different system resources, such as the access pattern to the memory hierarchy, the executed instructions and their type, etc.

The main hypothesis of this work is that, at the processor level, the values of the performance counters can be used to identify and characterize a parallel region at execution time. This set of values can be defined as the signature of a parallel region. This signature can be used at a later time to identify which kind of region the application is executing and to apply the appropriate tuning strategy depending on the behaviour described by the signature.

In the case of OpenMP applications, hardware performance counters can be a good way to identify which resources are being stressed and to find possible solutions to improve performance [6]. We believe that hardware performance counters, such as cache misses, cycles per instruction, number of instructions executed, and others, can be used to identify and describe the execution of a parallel region.

However, current processors include a large number of hardware performance counters. For example, the Intel® i7 7700 includes up to 170 different counters, but only a few can be recorded simultaneously. Consequently, getting the values of all available counters for every parallel region can be costly or even unfeasible.

In this paper, we propose a method to reduce the number of hardware performance counters needed to characterize regions of OpenMP parallel applications at execution time with the help of counter multiplexing. This methodology is based on (i) correlation analysis to find redundancy in the metrics provided by different counters, and (ii) principal component analysis to show that the signature composed of the values of a set of hardware performance counters can be used to characterize different parallel regions.

The remainder of this work is organized as follows. Section 2 introduces the mechanisms and techniques that are used in the proposed methodology. Next, Sect. 3 describes the methodology, which is the main contribution of this paper. Then, Sect. 4 shows the experimentation conducted to assess the methodology. Section 5 discusses relevant related work. Finally, Sect. 6 concludes this work and outlines future lines of work.

2 Background

This section introduces the mechanisms and techniques used in this paper to obtain the metrics to compute the signature for characterizing parallel regions and to reduce the number of hardware performance counters needed for computing this signature.

2.1 Hardware Performance Counters

Hardware performance counters are a set of special-purpose registers built into the processor to store the counts of hardware-related activities within the system, such as branch operations (branches taken or not, successfully predicted or not, etc.), memory accesses, cache misses, cycles stalled, instructions executed, and other metrics.

There are factors, such as the number of available special-purpose registers, that limit the number and groups of hardware performance counters that can be read at the same time. To overcome these limitations and collect the values of more counters, the application can be executed multiple times or counter multiplexing can be applied [3].

On the one hand, if the application is executed multiple times to measure hardware counters by groups, the measurement accuracy is high but the total execution time is multiplied by the number of executions needed. In the case of applications with long execution time, this approach is not feasible because of the required time. Moreover, this approach cannot be applied in the case of dynamic tuning as the performance problems have to be detected and solved at run-time.

On the other hand, this limitation can be overcome by multiplexing the usage of the counter registers over time (timesharing) among a large number of performance events. This approach has the advantage of executing the application once, but introduces some overhead due to counter swapping and recording. In addition, the metrics’ precision is reduced because the final value of each counter is estimated using the partial values obtained in each time interval.
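As a simple illustration of this estimation (a minimal sketch with made-up numbers, not PAPI's actual multiplexing implementation), a counter that was scheduled on the hardware registers for only a fraction of the region's execution is scaled up proportionally to the time during which it was not being counted:

```python
# Illustrative extrapolation of a multiplexed counter (hypothetical values).
partial_count = 1_200_000   # events observed while the counter was scheduled
counted_time = 0.025        # seconds the counter was actually being counted
total_time = 0.100          # total execution time of the measured region

# The final value is estimated from the partial count and the sampled fraction.
estimated_total = partial_count * (total_time / counted_time)
print(f"estimated events: {estimated_total:,.0f}")  # 4,800,000
```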

The most widely used tool for reading hardware performance counters is the Performance Application Programming Interface (PAPI) [5]. It provides an easy way to access hardware performance counters and allows for application profiling with counter multiplexing. In addition, it has been integrated [2], among many other tools, into MATE [11], a dynamic analysis and tuning environment that we plan to use in the near future to implement a tuning strategy relying on a counter-based application characterization.

2.2 Principal Component Analysis

In some cases, it can be difficult to obtain visual information from a set of observations of a large number of, possibly correlated, variables. Principal Component Analysis (PCA) is an analysis method that applies an orthogonal transformation to convert these observations into a set of linearly uncorrelated values called principal components. The first component explains the greatest possible fraction of the data variability, the second component the second greatest fraction, and so on. In this way, the components that explain smaller fractions of the data variability can be ignored, thus reducing the data dimensionality [9].

Consequently, this method projects the data in a new coordinate system that highlights its variability and allows for eliminating the less informative dimensions, facilitating the exploration of this data.
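As a reference for how this reduction is typically computed (a minimal sketch using scikit-learn on synthetic data; it is not the code used in this work), the fraction of variance explained by each component can be inspected to decide how many dimensions to keep:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic observations: 500 samples of 10 partially correlated variables.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))
noise = 0.1 * rng.normal(size=(500, 7))
data = np.hstack([base, base @ rng.normal(size=(3, 7)) + noise])

pca = PCA()
projected = pca.fit_transform(data)

# Fraction of the data variability explained by each principal component.
print(pca.explained_variance_ratio_)
# Cumulative variance of the leading components: the remaining dimensions
# can be ignored if this value is already close to 1.
print(np.cumsum(pca.explained_variance_ratio_)[1])
```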

2.3 Linear Correlation Analysis

Linear correlation analysis is a statistical technique for measuring the relationship or connection between two numerical, continuous variables.

This analysis finds a pair of linear transformations where the correlation coefficient between the variables is maximized [1].

The output of the correlation analysis is a correlation coefficient in the range \([-1,1]\). There are three perfect scenarios depending on the value of the correlation coefficient:

  • Correlated (value 1). The two variables are in a perfect increasing linear relationship.

  • Not correlated (value 0). There is no linear relationship between the two variables.

  • Anti-correlated (value \(-1\)). There is a perfect decreasing linear relationship.

In the case of two variables with a perfect linear relationship, be it increasing or decreasing, the value of one variable can be calculated if the appropriate linear transformation is applied to the value of the other variable.
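The following minimal sketch (NumPy, synthetic data) illustrates the three scenarios on counter-like series; the variable names and values are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
instructions = rng.integers(1_000, 10_000, size=100).astype(float)

cycles = 2.5 * instructions + 400        # perfect increasing linear relationship
branches = rng.random(100)               # unrelated variable
stalls = -0.5 * instructions + 20_000    # perfect decreasing linear relationship

print(np.corrcoef(instructions, cycles)[0, 1])    # 1.0  (correlated)
print(np.corrcoef(instructions, branches)[0, 1])  # close to 0 (not correlated)
print(np.corrcoef(instructions, stalls)[0, 1])    # -1.0 (anti-correlated)
```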

3 Methodology

In this section, based on the mechanisms and techniques explained in Sect. 2, we propose a methodology for reducing the number of hardware performance counters used to characterize OpenMP parallel regions.

Figure 1 shows a schematic representation of this methodology, which consists of the following steps:

  1. Hardware performance data collection (Sect. 3.1). In this step, the data to analyze is obtained and saved in a database.

  2. Data exploration (Sect. 3.2). PCA is used to check if the data can be classified visually.

  3. Hardware performance counter reduction (Sect. 3.3). Correlation analysis is applied and variables with a high correlation coefficient are discarded. Then, we go back to step 2 to validate if the space reduction still allows for correctly characterizing the parallel region.

Fig. 1. Reduction of hardware performance counter space.

The advantages of eliminating redundancy and, hence, reducing the number of variables, are:

  • Higher hardware counter measuring precision. If there are fewer hardware counters to measure, fewer groups are created for multiplexing, resulting in more measuring time for each group.

  • Improved learning accuracy and reduced overfitting potential [12]. In machine learning and data mining, models tend to require more input data to avoid overfitting as the number of variables increases.

  • Lower computational cost. Collecting fewer variables reduces the overhead generated by the data collection and the time required for the analysis.

3.1 Step 1: Hardware Performance Data Collection

We have decided to use PAPI’s preset events because these hardware performance counters are typically available in processors for multiple platforms. Therefore, in the first place, the available preset events are obtained with the papi_avail command.

Next, groups of hardware counters are created taking into account the maximum number of events that can be read at the same time in one processor and if they can be accessed simultaneously. The command papi_event_chooser is used to check the compatibility of each group of events.

A set of code templates, representing different parallel region structures, has been developed with the objective of gathering data for a wide range of OpenMP parallel regions representative of real cases.

Each created group of hardware counters is measured across multiple executions of these templates using different combinations of compilation flags and input data sizes (the template's configuration). In this way, we gather data for different object code translations and memory access patterns associated with the same parallel region structure. The total number of executions for generating the performance database can be calculated using expression (1).

$$\begin{aligned} executions = created\_groups \times data\_sizes \times flag\_combinations \times repetitions \end{aligned}$$
(1)

Each variable in the database should be normalized before the exploration and reduction steps in order to facilitate data visualization and future usage of machine learning techniques.

We have adjusted the values of each variable in the range [0,1], dividing each recorded value by the maximum value of the corresponding variable.
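A minimal sketch of this normalization, assuming the database is held as a pandas DataFrame with one column per counter (the column names and values below are only illustrative):

```python
import pandas as pd

# Hypothetical raw measurements: one row per execution, one column per counter.
db = pd.DataFrame({
    "PAPI_TOT_INS": [1.2e9, 3.4e9, 2.1e9],
    "PAPI_L2_TCM": [4.0e6, 9.5e6, 6.2e6],
})

# Scale every counter into [0, 1] by dividing by its maximum recorded value.
normalized = db / db.max()
print(normalized)
```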

In summary, the result of this step is a database containing the normalized data obtained from executing the templates.

3.2 Step 2: Data Exploration

We use PCA for visualizing data and validating the reduction of hardware performance counters done with correlation analysis.

PCA is applied to the normalized data resulting from step 1, which produces a new data set where variables have been transformed into principal components. With this transformation, we can check how much variance of the data is represented by each principal component and determine the minimum dimensionality needed to visualize the data without losing significant information.

With the help of PCA's dimensionality reduction, we can plot the new data representation and easily check if the resulting data from the execution of each code template is visually distinguishable from the others. In addition, if a new point is inserted into the plot, it should be easy to identify which code template and template configuration the new point belongs to.

Moreover, PCA may also hint at relationships between different hardware counters. Analyzing the weights of each variable in a principal component may indicate that some counters contribute evenly to the component (if they have similar weights), which could mean that those variables are related. Consequently, special attention shall be given to these hardware counters in the reduction step.

Summarizing, in this step we obtain a new representation of the data with fewer dimensions. The adequate visualization of this representation indicates whether the available data can be used to distinguish between different parallel region templates. In addition, PCA may also hint at counters that are likely to be correlated.
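A possible sketch of this exploration step is shown below (scikit-learn and matplotlib, with synthetic data standing in for the normalized database from step 1; the column and template names are illustrative):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Synthetic stand-in for the normalized database: counter columns in [0, 1]
# plus a 'template' label identifying the parallel region template.
rng = np.random.default_rng(2)
frames = []
for name, center in [("copy", 0.2), ("scale", 0.5), ("sum", 0.8)]:
    block = np.clip(center + 0.05 * rng.normal(size=(100, 6)), 0, 1)
    frames.append(pd.DataFrame(block).assign(template=name))
normalized = pd.concat(frames, ignore_index=True)

counters = normalized.drop(columns=["template"])
pca = PCA(n_components=2)
components = pca.fit_transform(counters)
print("variance explained by PC1+PC2:", pca.explained_variance_ratio_.sum())

# Plot the 2-D projection coloured by template to check visual separability.
for name in normalized["template"].unique():
    mask = (normalized["template"] == name).to_numpy()
    plt.scatter(components[mask, 0], components[mask, 1], label=name, s=5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.show()
```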

3.3 Step 3: Hardware Performance Counter Reduction

If two hardware performance counters (variables) are completely correlated, one of them can be considered redundant [14] and can be discarded.

Therefore, this step consists of performing a linear correlation analysis over the normalized data produced in step 1. This analysis produces a square symmetric matrix with the correlation coefficients between every pair of counters. From this matrix, we will assume that, in general, variables with a correlation coefficient close to 1 are linearly dependent and can be considered for discarding.

With the results obtained from the correlation analysis, we will check whether a logical relationship can be established between counters with high correlation. We verify which hardware counters are accessed by the two events and analyze whether they describe the same behaviour. For example, if the analysis tells us that L1 cache misses is highly correlated with branch instructions, this correlation is not logical and both counters are preserved; but if it tells us that double-precision operations and double-precision instructions are correlated, a logical relationship can be established and one of the two counters can be discarded.

Finally, if any counters have been discarded, the corresponding columns of the database are eliminated, generating a new database with a smaller number of variables. In this case, we go back to step 2 using the reduced database. Otherwise, if no counters have been discarded, the current database is considered to be the smallest set of data characterizing all the executions of the considered templates.
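A sketch of this reduction step is given below (pandas, synthetic data; the counter names, values, and the 0.95 threshold are illustrative assumptions). It builds the correlation matrix and lists the highly correlated pairs that are then inspected by hand for a logical relationship:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
ref_cyc = rng.random(200)
# Synthetic normalized counters; TOT_CYC is built as a linear function of REF_CYC.
db = pd.DataFrame({
    "REF_CYC": ref_cyc,
    "TOT_CYC": 0.9 * ref_cyc,
    "L2_STM": rng.random(200),
    "BR_TKN": rng.random(200),
})

corr = db.corr().abs()      # square symmetric matrix of correlation coefficients
threshold = 0.95            # illustrative cut-off for "high" correlation

candidates = [
    (a, b, round(corr.loc[a, b], 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
# Each listed pair is reviewed manually; a counter is discarded only when a
# logical relationship between the two events can be established.
print(candidates)  # e.g. [('REF_CYC', 'TOT_CYC', 1.0)]
```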

4 Experimentation

This section presents the results obtained using the proposed methodology on a specific set of templates. In addition, to show that these results effectively characterize the considered code regions, the values of the reduced set of counters are used for training a neural network for recognizing parallel region templates independently of the template’s configuration.

We have used the parallel regions included in the STREAM benchmark [10] as the set of templates for our experimentation because they approximate the behaviour of many memory-bound real OpenMP applications.

STREAM has four patterns with different numbers of operations and memory access patterns:

  • COPY. One vector is copied into another, there are no arithmetic operations involved, just one read and one store.

  • SCALE. The multiplication of the elements of a vector by a scalar is stored into another vector. There is one multiplication, one read and one store.

  • SUM. The addition of two vectors is stored in a third vector. There is one addition, two reads and one store.

  • TRIAD. It combines SUM and SCALE, adding a vector multiplied by a scalar to another vector. There are two operations (addition and multiplication), two reads and one store.

The hardware used in the experimentation is a DELL T7500. This machine has two Xeon E5645 processors with six multi-threaded cores per processor. Its memory hierarchy is composed of a 256 KB L2 cache and a 12 MB L3 cache in each processor, and 96 GB of main memory.

Step 1 of the proposed methodology indicates that we must obtain the preset events for the target processor and determine the valid groups. PAPI reports 58 available preset events for the Xeon E5645. The measurable event types and the number of counters for each type are the following:

[figure a: measurable preset event types and the number of counters of each type]

Next, we must execute all the combinations of the selected templates (4), created groups of counters (12), data sizes (from 3KB to 4.5GB, using 56 different sizes), compiler flags (O0 and O2), and number of repetitions (1,000); normalize the results; and build the performance database. According to expression 1, there are 1,344,000 executions for each template, which are used to build the 448,000 entries (58 columns each) of the performance database.

Fig. 2. Correlation matrix and PCA with the full list of hardware counters.

Then, we can proceed with step 2 of the methodology and apply PCA to the normalized database. Figure 2(b) shows the visualization of the data for the first and second principal components, which explain more than 89% of the data’s variance. The different STREAM templates can be distinguished even in this two-dimensional plot, indicating that our main hypothesis is true for this set of templates.

Using the PCA results to get hints about counter significance, we realized that the events related to the instruction cache (18) depend more on the code generated by the compiler than on the behaviour of the application. This allows us to make a first reduction of the number of columns of the performance database to 40. After discarding these counters, the PCA showed a small improvement in the proportion of the variance explained by the first principal components.

Table 1 shows the cumulative variance explained by the principal components for all available hardware counters (58) and the results removing those related to the instruction cache.

Table 1. Comparison of the cumulative variance before and after removing hardware counters related to instruction cache.

Next, we can go to step 3 of the methodology and perform a linear correlation analysis on the normalized database. Figure 2(a) shows the correlation matrix where darker points indicate a stronger correlation between a pair of hardware counters. Based on this matrix, we analyze the strongest correlations and decide which counters can be discarded.

For example, we discarded the hardware counter total cycles (TOT_CYC) because it is completely correlated to reference cycles (REF_CYC). We decided to keep reference cycles as it uses a reference clock instead of the clock of the CPU which can change depending on features such as Intel’s turbo boost.

We have also discarded different hardware counters that access the same resource. This is the case of the single-precision vectorization (VEC_SP) and double-precision vectorization (VEC_DP) counters, which read the same register that counts the number of SIMD instructions.

In other cases, we found that some events were the sum or combination of other events. For example, the branch instructions counter (BR_INS) is the sum of the conditional and unconditional branch instruction counters (BR_CN and BR_UCN, respectively), so the former can be discarded.

Summarizing, after this analysis, we end up with 20 hardware performance counters, distributed in the following way:

[figure b: the 20 remaining hardware performance counters grouped by type]

After completing the three steps of the methodology, we go back to step 2 because the number of counters has been significantly reduced. This means that PCA must be applied to the new database to show that the remaining counters still characterize the considered templates. Figure 3(b) shows the visualization of the data for the first and second principal components, which explain more than 88% of the data variance. It can be seen that the templates can still be clearly distinguished using this reduced set of counters.

Table 2 shows that the data variance explained by the first principal components is similar to the one obtained when considering the whole set of counters.

Table 2. Cumulative variance with the reduced list of hardware performance counters.

Finally, we perform a new linear correlation analysis (step 3) to decide if the set of counters can be further reduced.

Fig. 3. Correlation matrix and PCA analysis with the reduced list of hardware counters.

Figure 3(a) shows the obtained correlation matrix, where some dark points still indicate strong correlations between pairs of counters. However, after analyzing them, no logical relationship could be established between the corresponding counters. For example, L2 store misses (L2_STM) is not logically related to branches taken (BR_TKN), so both counters are kept despite being highly correlated.

Consequently, since no new counters have been discarded, the performance database including 20 counters is regarded as the smallest set of data that characterizes the considered templates.

Our motivation for proposing a methodology to find a reduced number of hardware performance counters was based on the fact that current processors include a significant number of counters and measuring all of them at execution time can be costly or even unfeasible. Now, we want to illustrate this claim using the experiment described previously.

On the one hand, for several templates, multiplexing the full list of hardware performance counters led to erroneous values (sometimes negative ones) because the execution time was not long enough; so, basically, it is not feasible to measure all the counters using the considered templates.

On the other hand, in the case of the reduced list of hardware counters, we were able to assess the precision of the measured metrics and the overhead for obtaining them. The overhead for regions with an execution time of around a few seconds is up to 10 milliseconds, and for regions with an execution time of less than a second it is up to 4 milliseconds. This overhead includes the time for setting up the methods to count events, and the time for collecting the counters and multiplexing the groups of counters. As for the precision, for execution times of less than a second, the precision of multiplexing groups of counters is in some cases low: the accuracy is between 90% and 99%, and unconditional branches are not properly estimated. For longer executions (execution times higher than one second), the accuracy increases to more than 99%.

Finally, the main hypothesis behind our work is that a parallel region can be characterized by the signature composed by the values of a set of hardware performance counters. The results of PCA seem to corroborate it, but we want to add more evidence for this claim using the results of the presented experiment.

To do so, we have trained a simple artificial neural network with one hidden layer using the database with 20 counters produced by applying the proposed methodology. This database has been divided into two subsets, one for training the network (432,000 entries) and another for validating it (the remaining 16,000 entries). The validation set is built with the entries corresponding to 2 of the 56 different input data sizes, i.e., \(4\,(\text{templates}) \times 2\,(\text{compiler flags}) \times 2\,(\text{data sizes}) \times 1{,}000\,(\text{repetitions})\).

After training the network for ten epochs, it achieves an accuracy of up to 99.98% on the validation set. These results are relevant for two reasons: on the one hand, they provide the evidence we were looking for; on the other hand, they hint that the signatures of parallel regions can be used in machine learning techniques.
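For reference, a minimal sketch of such a classifier is shown below (scikit-learn's MLPClassifier on synthetic data standing in for the 20-counter signatures; the hidden layer size and the other hyper-parameters are assumptions, not the configuration used in the experiment):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 20-counter signatures, labelled with their template.
rng = np.random.default_rng(4)
X = np.vstack([np.clip(c + 0.05 * rng.normal(size=(1000, 20)), 0, 1)
               for c in (0.2, 0.4, 0.6, 0.8)])
y = np.repeat(["copy", "scale", "sum", "triad"], 1000)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1,
                                                  random_state=0)

# A single hidden layer, as described in the paper; its size (32) is assumed.
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=10)
clf.fit(X_train, y_train)
print("validation accuracy:", clf.score(X_val, y_val))
```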

5 Related Work

Several proposals share our objective of characterizing code regions although, in most cases, this characterization is aimed at detecting phases in the execution of a program to determine when the behaviour of the application changes.

Bhattacharyya et al. [4] characterized phases in cloud applications using execution snapshots. Each snapshot has information regarding sets of functions in the thread dumps, the program's memory usage, and the use of the CPU. PCA is used to detect outliers and identify when there is a different phase in the program's execution.

Ziedan et al. [15] also identified and classified phases in the application, in this case, their proposal tracks changes in the L2 cache access pattern. This methodology creates a Cache Access Signature Vector with information about accessed positions and their intensity for each interval. One interval is defined by a fixed number of instructions.

Fang et al. [8] generated signatures of the execution to detect phases. The signatures include information regarding cache miss rates, branch miss rates and IPC. Phases are classified using their signature and comparing it to a signature table in order to find if there is a new phase or the phase was executed before.

Chetsa et al. [7] detected application phases using execution vectors. These vectors include information about hardware performance counters, transmitted network bytes and disk usage. This methodology uses only general purpose counters to avoid redundancy (retired instructions, L3 cache references and misses, branch instructions and branch misses).

The methodologies described above sample the application in blocks of instructions to find changes in the behaviour of the application, while in our case we want to classify parallel code regions that have been identified in the code.

Another approach to generating signatures was developed by Wong et al. [13] for message passing applications. The execution of the application is divided into blocks depending on communication, instead of instructions, and the signature is generated using the communications (patterns and volume) and the computational time. In this case, hardware performance counters are not used and the methodology is designed only for message passing applications.

6 Conclusion and Future Work

Taking as the starting point of this work the hypothesis that a parallel region can be characterized by the values of a set of hardware counters (region signature), we have developed a methodology to reduce the variables (counters) of this set in order to be able to measure them at execution time with adequate precision.

The proposed methodology, based on PCA and linear correlation analysis, has been tested using a limited set of representative OpenMP templates on a specific machine with 58 preset counters. This evaluation has shown that (i) the number of counters included in the signature can be reduced following the methodology steps; (ii) the reduced set can be measured at execution time using counter multiplexing, while measuring the full set was unfeasible; and (iii) the resulting performance database can be used to identify the templates with high accuracy.

Currently, we are extending the set of templates with new parallel region code patterns and also working on strategies for automatically and dynamically identifying and solving performance problems associated with these regions.