1 Introduction

Adapting data allocations and structures to the way data is used is a key optimization for parallel architectures. Changing the data layout can enhance spatial data locality and reduce memory consumption, with a large impact on code performance. Combined with instruction rescheduling and loop nest transformations, layout restructuring has a strong impact on vectorization and may lead to a better use of the cache hierarchy, through temporal and spatial locality. Data restructuring is in general a global optimization, requiring interprocedural analysis, and in languages such as C, possible aliases hamper the scope of transformations. When considering combined data layout and control-flow transformations, dependence analysis further limits the applicability of the methods. Finally, due to the complexity of the memory hierarchy, the performance impact of a data structure change is difficult to assess. To illustrate this difficulty, the simple choice between an array of structures (AoS) and a structure of arrays (SoA) is highly dependent on the use of the structure. Depending on data locality, it may be beneficial to use the SoA version when using a single field at a time, or the AoS version when using multiple fields together (such as a complex number, for instance). For a parallel code, an Array of Structures of Vectors/Arrays may have to be considered, resulting in portability issues and unacceptable program complexity for the human programmer [18].
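The AoS/SoA trade-off can be made concrete with a flat-offset model. The sketch below (plain Python; the two-field element is a hypothetical example, not taken from the applications studied later) shows why sweeping a single field favors SoA:

```python
# Flat-offset model of the AoS / SoA choice: each of N elements has two
# fields f in {0, 1} (a hypothetical two-field structure, for illustration).
N = 8

def offset_aos(i, f):
    # Array of Structures: the fields of element i are interleaved in memory.
    return i * 2 + f

def offset_soa(i, f):
    # Structure of Arrays: one contiguous array per field.
    return f * N + i

# Sweeping a single field over all elements:
aos_stream = [offset_aos(i, 0) for i in range(N)]  # stride 2
soa_stream = [offset_soa(i, 0) for i in range(N)]  # stride 1, SIMD-friendly
```

Accessing both fields of one element together reverses the verdict: in AoS the two offsets are adjacent, while in SoA they are N cells apart.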

Several works have studied data layout restructuring for specific applications [13, 22] and for stencils [10]. In a recent work [1] we proposed a framework to analyze binary codes and to formulate user-targeted hints about SIMDization potentials and hindrances. These hints provide the user with possible strategies to remove SIMDization hurdles, such as code transformations or data restructuring. However, this preliminary work conducted a qualitative analysis only, thus lacking an estimation of the transformation gains. In [8], we proposed a more quantitative approach, detecting simple arrays and structures from execution traces and suggesting promising data layout transformations.

This paper proposes a novel approach for data restructuring. A formalization of data structures and of their transformations is described, independently of any control-flow or rescheduling optimization. We show that this framework can be used on memory traces in order to provide a quick assessment of the potential gains (or lack thereof) to be expected from some transformations. For this purpose, we show how to set up mock-up executions for an application, in order to evaluate the impact of a transformation without the need to actually change the whole data structures or re-execute the whole application. This approach is evaluated on two real applications parallelized with OpenMP, combining restructuring and vectorization. The contributions proposed in this paper are the following:

  • Description of data structure layouts and their transformations, independently of control-flow optimizations;

  • Generation of mock-up codes with restructured layouts;

  • Performance evaluation of mock-ups, with and without SIMDization.

The paper is organized as follows: Sect. 2 presents two motivating examples with sub-optimal data layout. Section 3 describes a method for finding an initial multidimensional layout matching a trace, and the possible transformations. Section 4 presents the evaluation methodology. The experimental results are discussed in Sect. 5 and Sect. 6 presents related work.

2 Motivating Examples

From a user perspective, abstract data types correspond to algorithmic requirements, but choosing the actual data layout requires taking compiler, runtime support and architectural constraints into consideration. We illustrate this gap with the data layouts chosen for the two following applications. In the cardiac wave simulation [23], the hotspot of the OpenMP version of the application uses a large 4D array to store the whole data structure, as shown in Fig. 1. The first two dimensions have a starting index of 1, creating unnecessary gaps between lines. The third dimension is used as a structure with numbered fields, and the fourth dimension has a spatial locality issue, since it is indexed with the parity of the computation step (to keep only the previous computation results). While reordering dimensions here is not very complex, the ordering and locality choices for the last dimension depend on the computation itself and on the architecture.
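The effect of these layout choices on access strides can be sketched as follows. The dimension sizes and orderings below are illustrative assumptions, not the actual declarations of the simulation code:

```python
# Hedged sketch of the cardiac-wave layout issue: dimension sizes and
# orderings are assumptions for illustration, not the application's code.
NI, NJ, F, P = 4, 4, 3, 2   # grid lines/columns, numbered fields, parity

def off_orig(i, j, f, p):
    # Original-style 4D array: i, j start at 1 (index 0 is wasted space),
    # field and parity are the innermost dimensions.
    return ((i * (NJ + 1) + j) * F + f) * P + p

def off_new(i, j, f, p):
    # Restructured: 0-based indices, parity and field hoisted outward,
    # so that sweeping j for a fixed field and parity is unit-stride.
    return ((p * F + f) * NI + (i - 1)) * NJ + (j - 1)

# Stride when sweeping j for one field and one parity:
stride_orig = off_orig(1, 2, 0, 0) - off_orig(1, 1, 0, 0)   # F * P = 6
stride_new  = off_new(1, 2, 0, 0) - off_new(1, 1, 0, 0)     # 1
```

The restructured offset function is one of several candidates; as noted above, the best placement of the parity dimension depends on the computation and the architecture.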

Fig. 1.
figure 1

Two examples of codes needing data layout restructuring. In the Cardiac wave simulation, the 4-D array datarr is used as an array of structures. For the QCD simulation, all elements are complex double values. The space iterated by the outer loop is a 4-D linearized space and the indirection used for U accesses the white elements of a 4-D checkerboard.

Fig. 2.
figure 2

Example trace for Qiral for array U accesses, simplified 2D version for conciseness. Each color in the map represents one line of the trace. (Color figure online)

The second example considered is a Lattice QCD application, based on the ETMC simulation [2]. The hotspot of the application performs several matrix-vector computations. Each matrix is described as an element of a large array, U. The space iterated by iL is a 4D linearized space. In this 4D space, only the white elements of a checkerboard are accessed, through an indirection array. Deciding how to restructure this array, and whether it is worthwhile to get rid of the indirection, is important for the code performance. This example is difficult to analyze statically and would require additional information from the user. An analysis based on traces, on the contrary, captures the regularity of the accesses, in spite of the indirection.

3 Layout Description and Transformations

We give here a formal description for layouts and rules for transforming them.

3.1 Data Layout Description

Data structures are considered as any combination of arrays and structures, of any length. A layout is the description of this structure and of the elements that are accessed in it. A layout can be defined only for a limited code fragment. A syntactic memory access expression in the code defines a set of memory address values. This set can be denoted as \(base + I\) where base is the base address and I is a set of non-negative integers. All addresses are within a range \([base, base + d - 1]\) where d is the diameter of I. The set of offsets I can be represented by a layout function \(S_{I,d}\), characterizing I:

$$S_{I,d}:\begin{array}[t]{rcl} [0, d-1] &\rightarrow & \{0,1\}\\ x & \rightarrow & 1 \text{ if } x \in I,\ 0 \text{ otherwise} \end{array}$$

\(S_{I,d}\) is called a structure layout. If \(I=[0, d-1]\) (all elements are accessed), \(S_{[0,d-1],d}\) is more specifically called an array layout, denoted \(A_d\). Note that these notions of arrays and structures may not correspond to the structures actually occurring in the source code. To build multidimensional data structures, we define the product operator \(\otimes\) and the sum \(\oplus\) on layout functions \(L_1\) and \(L_2\):

$$ \begin{array}{lr} L_1 \otimes L_2: \begin{array}[t]{rcl} I_1\times I_2 & \rightarrow &\{0,1\} \\ x,y & \rightarrow & L_1(x) * L_2(y) \end{array} & L_1 \oplus L_2: \begin{array}[t]{rcl} I & \rightarrow &\{0,1\}\\ x & \rightarrow & L_1(x) + L_2(x) \end{array} \end{array} $$

For the product, the two layout functions \(L_1\) and \(L_2\) may have two different domains, \(I_1\) and \(I_2\). For the sum, the domains of the two functions must be the same, and the \(+\) operation is a saturated addition between integers. With this notation, an Array of Structures corresponds to the combination of the two types of layout, described by \(A_{d'} \otimes S_{I,d}\) for some values of \(d, d'\) and I. The formal description corresponds to the intuitive representation of the data. The same factorization identities exist with \(\oplus\) and \(\otimes\) as with integers. Some simplifications are possible between expressions involving both operators:

$$(S_{I,d} \otimes L) \oplus (S_{J,d} \otimes L) = S_{I\cup J,d} \otimes L, \quad (L \otimes S_{I,d}) \oplus (L \otimes S_{J,d}) = L \otimes S_{I\cup J,d}.$$
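These definitions can be made executable. The sketch below (plain Python; a layout is represented by its accessed-offset set and diameter, with \(\otimes\) linearized row-major, matching the interpretation used by the rewriting rules of Sect. 3.2) checks the simplification identity above on a small instance:

```python
# Executable model of the layout formalism: a layout is a pair
# (accessed offsets, diameter).
def S(I, d):               # structure layout S_{I,d}
    return (frozenset(I), d)

def A(d):                  # array layout A_d = S_{[0,d-1],d}
    return S(range(d), d)

def product(L1, L2):       # L1 ⊗ L2, linearized row-major: offset i*d2 + j
    (I1, d1), (I2, d2) = L1, L2
    return S({i * d2 + j for i in I1 for j in I2}, d1 * d2)

def lsum(L1, L2):          # L1 ⊕ L2: union of accessed offsets (saturated +)
    (I1, d1), (I2, d2) = L1, L2
    assert d1 == d2        # same domain required for the sum
    return S(I1 | I2, d1)

# Identity: (S_{I,d} ⊗ L) ⊕ (S_{J,d} ⊗ L) = S_{I∪J,d} ⊗ L
L = A(4)
lhs = lsum(product(S({0}, 2), L), product(S({1}, 2), L))
rhs = product(S({0, 1}, 2), L)
assert lhs == rhs
```

The same model shows that \(S_{[0,1],2} \otimes A_4\) covers all offsets of an 8-cell range, i.e. it equals \(A_8\).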

3.2 Finding the Initial Multidimensional Layout

In the general case, the memory accesses are given as flat, linearized addresses. The objective of this section is to find the different multidimensional layouts used in the code fragment considered. On the source code, finding whether two memory accesses correspond to the same array region corresponds to an alias analysis, and delinearization can be used in some simple cases to retrieve the multidimensional structure associated with the addresses. Because indirections or complex operations can be involved in the address computation, as shown in the two codes given as motivating examples, we propose in this paper to resort to memory traces. The code fragment is executed and all memory accesses generate a trace. This trace is compacted on-the-fly with the NLR method [12] in order to find possible recurring stride patterns. The following rewriting system transforms a flat layout into a multidimensional layout:

$$\begin{aligned} S_{I, m}&\rightarrow ~ S_{J,n}\otimes A_p \text{ if } I = \{ j*p + k,\ j\in J,\ k\in [0,p-1]\} \end{aligned}$$
(1)
$$\begin{aligned} S_{I,m}&\rightarrow ~ A_n \otimes S_{J,p} \text{ if } I = \{ k*p + j,\ j\in J,\ k\in [0,n-1]\} \end{aligned}$$
(2)
$$\begin{aligned} S_{n*I+p,\,m*n}&\rightarrow ~ S_{I,m} \otimes S_{\{p\}, n} \text{ if } p<n \end{aligned}$$
(3)

with \(n*I+p = \{ n*i+p,\ i \in I\}\). The first rule corresponds to the case where the initial layout is a structure of arrays, the second to an array of structures, and the third is the general case, where two structure layouts have been linearized. The initial multidimensional layout is found by applying these rules iteratively until convergence. The rewriting system is confluent and terminating. Termination comes from the diminishing sizes of the rewritten structures; we assume that array layouts are not rewritten. Confluence, which entails that the rules can be applied in any order, results from the fact that there is only one way to rewrite any given part of the addresses.
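As an illustration, the side condition of Rule 2 can be checked mechanically. The helper below is a hypothetical sketch, not the actual implementation: it tries to rewrite a flat \(S_{I,m}\) as \(A_n \otimes S_{J,p}\) for a given inner diameter p:

```python
def try_rule2(I, m, p):
    """Sketch of Rule (2): rewrite S_{I,m} into A_n ⊗ S_{J,p} if I is the
    offsets of the first p-cell row, repeated with period p over n rows.
    Returns (n, J) on success, None otherwise."""
    if m % p:
        return None
    n = m // p
    J = {x for x in I if x < p}               # accessed offsets of row 0
    if I == {k * p + j for k in range(n) for j in J}:
        return (n, frozenset(J))              # found A_n ⊗ S_{J,p}
    return None

# Interleaved pairs (e.g. real/imag) used out of a 4-double cell, 4 rows:
I = {k * 4 + j for k in range(4) for j in (0, 1)}
assert try_rule2(I, 16, 4) == (4, frozenset({0, 1}))
```

In the full system, candidate periods p come from the stride patterns found by the NLR compaction, and the rules are applied repeatedly until no rewrite applies.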

We apply the previous algorithm to restructure the trace given in Fig. 2. The trace, given as a for..loop enumerating addresses, is a simplified version of the memory accesses of matrix U (2D case, only the first statement, no outer dimension). The following initial structure corresponds to the set of values accessed by the trace:

figure a

Applying Rule 3, then merging the first two lines and the last two, then applying Rule 2 and finally Rule 1 leads to the formulation on the right. This corresponds to an AoSoAoS: an array alternating even and odd lines. Even lines have 256 elements that are structures of 4 doubles, using only the first 2; odd lines have 256 elements of 4 doubles, using only the last 2. This is represented in Fig. 2.
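The accessed set of this AoSoAoS layout can be enumerated directly. The sketch below uses reduced sizes (2 line pairs and 4 elements per line instead of 256) for brevity; it confirms that exactly half of the doubles, in a checkerboard pattern, are touched:

```python
# Enumerate the accessed offsets of the AoSoAoS layout derived above,
# with reduced sizes for brevity (2 line pairs, 4 elements per line).
NPAIR, NELEM, CELL = 2, 4, 4               # each element is 4 doubles
even, odd = {0, 1}, {2, 3}                 # fields used on even / odd lines

offsets = set()
line_sz = NELEM * CELL
for pair in range(NPAIR):
    for e in range(NELEM):
        for f in even:                     # even line of the pair
            offsets.add((2 * pair) * line_sz + e * CELL + f)
        for f in odd:                      # odd line of the pair
            offsets.add((2 * pair + 1) * line_sz + e * CELL + f)

# Exactly 2 of the 4 doubles of every element are touched: a checkerboard.
assert len(offsets) == NPAIR * 2 * NELEM * 2
```

Removing the unused half of each element (Rule 7 of the next section) compacts this set into a dense layout, which is the basis of the checkerboard compression evaluated later.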

3.3 Transformations

We define layout transformations as rewriting rules that apply to layouts described in the previously defined formalism. For these rules, rules applying to structures S are assumed not to apply to arrays. When the accessed set I is clear from context, \(S_{I,d}\) is simply written \(S_d\). \(\#I\) denotes the number of elements in I:

$$\begin{aligned} L \otimes L'&\rightarrow L' \otimes L \end{aligned}$$
(4)
$$\begin{aligned} A_{n * m}&\rightarrow A_{n} \otimes A_{m} \end{aligned}$$
(5)
$$\begin{aligned} S_{I, n} \otimes S_{J, m}&\rightarrow S_{I^{\prime}, n * m},~\text{if } \#I^{\prime}= \#I \times \#J \end{aligned}$$
(6)
$$\begin{aligned} S_{I, d}&\rightarrow S_{I^{\prime},d^{\prime}},~\text{if } \#I= \#I^{\prime},\ d \le d^{\prime} \end{aligned}$$
(7)

Rule 4 permutes two layouts, and Rule 5 cuts an array into two arrays. Rule 6 merges two structure layouts, and the last one, Rule 7, removes unused elements in a structure. In a layout expression composed of different terms in a \(\oplus\), all terms of the sum at the same position must be rewritten with the same rule, since they correspond to the same sub-structure. All transformations preserve the number of elements accessed in the layouts.
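Each rule also induces a remapping of linearized indices. For Rule 4, permuting \(L_1 \otimes L_2\) (diameters \(d_1, d_2\)) into \(L_2 \otimes L_1\) maps offset \(i \cdot d_2 + j\) to \(j \cdot d_1 + i\); this is the classic AoS-to-SoA swap. A minimal sketch:

```python
def permute_index(x, d1, d2):
    # Rule (4) index remapping: L1 ⊗ L2 -> L2 ⊗ L1.
    # Old linearized offset x = i*d2 + j becomes j*d1 + i.
    i, j = divmod(x, d2)
    return j * d1 + i

# AoS of 3 elements with 2 fields -> SoA of 2 field-arrays of 3 elements:
new = [permute_index(x, 3, 2) for x in range(6)]
assert new == [0, 3, 1, 4, 2, 5]
```

The remapping is a bijection on \([0, d_1 d_2 - 1]\), which is one way to see that the transformation preserves the number of elements.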

3.4 Exploring Layouts

The previous rewriting system generates a finite but potentially large number of layouts, so we propose a strategy to limit the exploration. Rule 5 is applied at most once, to split an array for SIMDization purposes; one of the created arrays is then permuted into the rightmost position of the term, in order to create a possible vector of elements. Rule 7 is applied whenever possible. Rule 6 simplifies code generation by fusing contiguous dimensions; this rule is only applied at the end of the rewriting. To further reduce exploration, we guide the generation with layout patterns. For instance, SIMDization requires that the layout end with an array, so only terms matching the regular expression \(*\otimes A\) are considered. On the two motivating examples, we look for layouts of the form \(*\otimes A\) or \(*\otimes A\otimes S_c\) (with \(S_c\) the structure corresponding to complex numbers). This leads to the layouts presented in the following table:

figure b

The preconditioned version of QIRAL has a detected 4D checkerboard pattern, here expressed in concise form, and v is the size of a SIMD vector. Checkerboard compression leads to the same transformations, only with L half the size. The QIRAL excerpt corresponds to the code presented as the motivating example, while the application includes a larger scope of code.

4 Transformation Evaluation

This section deals with the quantified part of the user feedback we provide. The idea is to estimate the potential speedup of transformations in order to help the user make a choice for data restructuring.

4.1 Principle of Mock-Up Evaluation

We propose an evaluation methodology that explores a set of different layout transformations. Because these transformations are based on the values collected by memory traces, the generated transformed codes are in general not semantically equivalent to the initial code outside of the trace. However, they can serve as performance mock-ups: the idea is to measure possible performance gains of the application by executing the mock-ups. To preserve the application execution conditions, the mock-up is executed in the context of the application. A checkpoint/restart technique is used for this purpose: assuming the user knows the hotspot of the application, the original binary code is patched with a checkpoint right before the hotspot and then run until the checkpoint is reached. This checkpoint generates an execution context, used both for capturing the trace and for running and evaluating the mock-ups. The binary code is instrumented in order to collect the memory trace and restarted from this context. Then several layout transformations are applied to the initial code, generating new versions of the code that are restarted from the same context. As the checkpoint/restart mechanism preserves the memory addresses in use, the addresses and sizes of layouts captured in the trace can be reused in the mock-up codes; we rely on this property for generating data layout copies and the transformed codes. Our approach does not, however, preserve the hotspot's cache state. Cache warm-up may be a solution to this issue, but goes beyond the scope of this paper. Mock-ups are stopped when the control leaves the hotspot, and the timing is measured at this point. For checkpoint/restart, we resort to the BLCR library [9].

4.2 Automatic Mock-Up Generation Technique

Mock-ups are generated at compile time, as library functions. A mock-up corresponds to the initial hotspot, with different memory accesses and address computations; the rest of the computation corresponds to the original code. Before executing the mock-up, the new data layout has to be created and the data copied. This copy-in operation is guided by the trace information. Since the objective is to optimize the hotspot performance, the copies are pushed away from the kernel to minimize their impact and avoid cache pollution due to the copy itself. We choose to move the copy up to the beginning of the function when applicable, the limit being the last write on the array we want to restructure; this is determined automatically by trace inspection. The sequence of transformation rules applied to the initial layout also corresponds to transformations on the iterators of these structures. The copy codes are simple loops changing one layout, with one iterator, into another. For indexing data in the computation code, the control is kept unchanged: new scalar iterators are created in order to map each previous index to the new index. For this, the trace provides, for each individual assembly instruction, the sequence of addresses accessed, and this sequence of indices is transformed into the corresponding sequence of indices in the new layout. The binary code is parsed with the MAQAO tool [3], and the modified code of the mock-up is generated in a C file, using inline assembly. The advantage of this approach is to rely on the compiler for an optimized register allocation for all the new induction variables added for indexing, and for removing dead code. For instance, the loads corresponding to the indirection are removed when reindexing the data structure in a simpler way. The generated code is only valid within the scope of the values collected by the trace.
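The shape of a generated copy-in loop can be sketched as follows. This is an illustrative model only (the actual copy code is C emitted from the binary, and the sizes below are hypothetical); it shows a single iterator walking the old layout and depositing each element at its remapped position:

```python
# Sketch of a mock-up copy-in loop: one iterator over the old layout,
# here for an AoS -> SoA remapping (sizes are illustrative).
N, F = 4, 2                                   # elements, fields per element
old_data = [10 * i + f for i in range(N) for f in range(F)]  # AoS image

new_data = [None] * (N * F)
for x in range(N * F):                        # single iterator, old layout
    i, f = divmod(x, F)                       # old offset -> (element, field)
    new_data[f * N + i] = old_data[x]         # (field, element) -> new offset

assert new_data == [0, 10, 20, 30, 1, 11, 21, 31]
```

In the real mock-up, the same index remapping is applied to the address stream of each assembly instruction, so the computation code indexes the new layout without changing its control flow.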

4.3 Combining Layout Restructuring with SIMDization

Data restructuring is a SIMDization-enabling transformation, as data can be placed contiguously to fill a vector. We perform SIMDization whenever dependences allow it, impacting the control (loops) of the hotspot. From the trace analysis, we build a dependence graph that determines whether some arrays can be vectorized. We rely on MAQAO for this analysis [1], as well as for the detection of loop structures and loop counters. The generated vectorized loop has a trip count shorter by a factor equal to the architecture vector size; this trip count is retrieved from the memory traces. All instructions involving the initial data structure have to be replaced by their vectorized counterparts, including loads and stores. Some compiler optimizations can be untangled, such as partial loads, which are replaced by a single packed load operation. Reductions are detected through dependence graph analysis and replaced using horizontal operations. We detect read-only arrays and constants and unpack them. However, our SIMDization step from binary code to assembly code (inline assembly) is still fragile and essentially applies only a straightforward vectorization scheme.

5 Experimental Results

The objective of this section is to show how relevant the speedup hints are, in the sense that they provide useful advice to the programmer. To do so, we compare our mock-up speedups with the actual performance observed when restructuring the C code by hand, using the layouts defined in Sect. 3.4. All experiments are conducted on an Intel(R) Xeon(R) CPU E5-2650 2 GHz 2*8-core processor with SSE2 features, using the icc 15.0.0 and gcc 5.3.1 compilers, both with the -O3 flag.

Lattice QCD: Figure 3 shows the performance of both mock-ups and hand-transformed codes for the loop nest in Fig. 1.(b). The hand-tuned code focuses only on restructuring the layout and does not perform explicit SIMDization. It appears that gcc does not vectorize the code when handling complex data types and performs poorly even compared to the non-vectorized mock-ups. For the code without preconditioning (left graph), all mock-ups predict a performance improvement for each of the four transformations presented, with an average relative error of 16% compared to the hand-tuned codes. For AoSoA-cplx, the mock-up under-estimates performance: icc optimizes the complex multiply and the loads/stores, outperforming the naive SIMDization of the mock-up. Similar conclusions hold for the code with even/odd preconditioning. For the whole multithreaded hotspot function, the manually restructured version resorts to intrinsics, as the compilers do not manage auto-vectorization. Predictions for SoA and AoSoA are reliable, with an average relative error of 4% as shown in Fig. 4, since the mock-up SIMDization performs close to the user-restructured code. With a packed thread policy and hyper-threading disabled, the multithreaded context does not disrupt the mock-up prediction, since the code is parallel and compute-bound.

Fig. 3.
figure 3

Lattice QCD benchmark without preconditioning (left), with even/odd preconditioning (right) speedup, single thread.

Fig. 4.
figure 4

Lattice QCD application restructuring+SIMD speedup with respect to thread number.

2D Cardiac Wave Propagation Simulation: The hotspot here is not initially vectorized, but it is successfully vectorized after data layout restructuring; consequently, no intrinsics are used in the hand-tuned codes. We study the impact of layout restructuring on performance on two different datasets, corresponding to two different layout sizes. Speedups obtained after restructuring are shown in Fig. 5 for Dataset-256. With restructuring only, the mock-ups exhibit a speedup of \(2.4\times \) on average; the additional gain of SIMDization is around \(2\times \). The mock-up predictions are on average 9% too optimistic in this experiment. This over-estimation is explained by a cache warm-up effect: in the mock-ups, the data copy loads the data into the cache right before the hotspot, whereas no such "prefetch" occurs in the application. With Dataset-512, the input size is multiplied by a factor of 4. In this configuration, the restructuring gain is dramatically higher than before, increasing with the number of threads and reaching roughly \(14\times \) with 8 threads, as the application manages to take full advantage of all private L2 caches. The prediction remains consistently slightly over-optimistic, as the cache may be warmer before the mock-up kernel executes than in the real application, while still being accurate, with an average relative error as low as 5%.

Fig. 5.
figure 5

2D Wave propagation application restructuring+SIMDization speedup on dataset-256 (left) or on Dataset-512 (right)—with respect to reference using respectively equal number of threads—average relative error is 9% ± 8% (left), 5% ± 2% (right)

6 Related Work

Many modern languages, in particular object-oriented languages, propose a layer of abstraction between data types and the data layout in memory (hierarchical arrays, C++ libraries). However, few works propose to restructure existing data in codes written in C or Fortran. This abstraction layer is also provided by libraries, hiding in particular the complexity of AoSoA layouts with SIMDization from the user (Cyme [6], Boost.SIMD or Kokkos [5], to name a few). The StructSlim profiler [16] helps programmers with data restructuring through structure splitting. For GPUs, the copy is performed at transfer time and the data layout change is also performed at this step [20]; the code analysis is performed statically, on OpenCL for instance. The same approach has been explored for heterogeneous architectures [15], assessing affinity between fields, clustering fields, and devising multi-phase AoS vs. SoA data layouts and transforms. \(VP^3\) [24] is a tool for pinpointing SIMDization-related performance bottlenecks. It tries to predict performance gains obtained by changing the memory access pattern or instructions, but it does not propose high-level restructuring. Similarly, ArrayTool [14] can be used to regroup arrays to gain locality, but there is no deeper change in data layouts. Annotations and specific data layout optimizations with compiler support have been proposed by Sharma et al. [19]; this source-to-source transformation requires describing the desired array interleaving in a separate file. Similarly, the array unification described by Kandemir [11] and Inter-Array Data Regrouping [4] propose to merge different arrays at compile time in order to gain locality. The POLCA semantics-aware transformation toolchain is a Haskell framework offering numerous transformation operators using programmer-inserted pragma annotations [21].
None of these approaches provides an assessment of the performance gains to guide the user's restructuring or hint generation, and these compile-time approaches cannot handle indirections. An approach to find a good layout using profile information has been proposed in [17], but it relies on simulation to test the layouts and does not address vectorization or AoS transformations. Delinearization is the first analysis required on the compiler side in order to restructure a layout; parametric delinearization, for some particular codes, has been proposed by Grosser et al. [7]. Specifically for stencil codes, using the polyhedral model, Henretty et al. [10] propose a complete restructuring of the layout for SIMDization; this would not apply to the Lattice QCD code with the even/odd preconditioning (indirection). Compared to the authors' previous work [8], the work presented in this paper gives a more general framework for the recognition of complex data layouts and a systematic exploration of data layouts. The code generation and SIMDization are achieved automatically for a given transformation.

7 Conclusion

We have presented in this paper an original contribution for assessing the performance impact of data layout restructuring. The layout transformations, based on profile information and described by a rewriting system, can be shown and explained to the user, from the initial layout to the transformed one. These transformations can then be applied and explored directly on a binary code, automatically generating a new binary code. A set of different restructurings has been combined with SIMDization, and the evaluation has been conducted on two applications, with different parameters (input size, preconditioning) and different numbers of threads. The results show that the performance prediction of mock-up restructuring is reliable compared to a hand-tuned transformation and SIMDization (below 5% average relative error).