1 Introduction

Data corruption in high-performance computing (HPC) can be devastating. This is especially true for scientific and national security applications, where computationally complex multi-physics simulations may be the only practical way to experiment. Data corruption in this context can introduce errors in the calculation that go unnoticed, causing an incorrect result to be accepted.

Thus, it is important to understand the nature and behavior of the physical phenomena that cause faults, and the interactions of these faults with applications. This is usually done by measuring their rates, and then identifying the impacts they have on running applications. Many studies investigate the impact of silent data corruption (SDC) on application behavior and correctness by using simulation-based fault injection studies to support their conclusions [5, 7].

In contrast to this experimental approach, this paper is based on theory. We develop a first-principles mathematical approach that analyzes the impact of data corruption on low-level operations such as multiplication, within higher-level applications such as matrix multiplication. We then validate these models using experimental results. This theoretical approach strengthens our fault injection work, and provides an exemplar of the development of analytic models of masked errors on general integer operations. This work is complementary to some emerging work done recently on floating point numbers [4].

The genesis of this work was a fault-injection study on fault-tolerant matrix multiplication. Because of the way we injected the faults, all injections should have resulted in observable errors. Instead, we were surprised to find that in some of the trials, the output was completely correct. The fault injector was mature and well-validated, and the code we were testing was simple; still, we examined both suites carefully and found no bugs. When we finally turned to the mathematics beneath the code, we found the unexpected theoretical resilience of the multiplication operator that we discuss here. The model we developed matches the error statistics seen in the study, so this mathematical resilience completely explains the unanticipated result.

The contributions of this work are as follows:

  • We present an analytical approach to fault analysis and resilience that is theoretically justifiable and explains experimental results. This kind of approach can be used to better estimate resilience and actual error rates.

  • We use this approach to develop an analytical model to calculate the resilience inherent to the integer multiplication operation.

  • We validate this model through exhaustive and sampled experimentation, and demonstrate that the theoretical model and targeted experimentation exactly match each other, and that together, they closely match the observed behavior in a micro-benchmark under soft-error injection.

  • We quantify a “built-in” resilience that may be inherent to many mathematical operations, using integer multiplication as one example. These characteristics are interesting because they provide a measure of resilience that is not coupled with any hardware or software-based fault tolerance mechanisms.

  • We present a masking behavior that can influence the process of analysis.

2 Theory of Multiplicative Resilience

In this section, we discuss our theoretical model for resilience of integer multiplication. The concept is quite simple: if a is multiplied by b, and b is a multiple of \(2^k\), the multiplication essentially invokes a left shift on a by k bits, and any faults in the leftmost k bits of a disappear. Simple as this is, it was unanticipated, and was not our initial approach to analyzing the unexpectedly correct results in our experiment. We have not found this effect mentioned in the literature.

Throughout, we use the term “wordlength” to describe the length of an integer native to the architecture, and the term “bitlength” to describe the minimum number of bits needed to represent a given integer. We do not show the proofs in this short paper, for the sake of brevity.

2.1 Multiplicative Resilience from Overflow Implementation

On every n-bit machine we have tested, integer multiplication overflow is implemented as truncation to n bits. If \(a\times {b}\) does not overflow, faults are benign exactly when the least significant n bits of the faulty result are the same as the least significant n bits of \(a\times {b}\).

For example, consider the 8-bit multiplication \(4\times 16\). The correct result is 64. If there happened to be a single bitflip in the \(6^{th}\) bit of the first factor, the product becomes \((2^6+4)\times {16} = 1088\). Truncating this result to the lowest 8 bits gives 64, and the correct answer is returned.
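This example can be reproduced directly. The sketch below (ours, in Python) emulates n-bit truncating overflow with a bitmask, since Python integers are arbitrary-precision and never overflow on their own:

```python
# Emulate n-bit truncating overflow: keep only the low n bits of the product.
def mul_trunc(a, b, n=8):
    return (a * b) & ((1 << n) - 1)

a, b = 4, 16
correct = mul_trunc(a, b)        # 64
faulty_a = a ^ (1 << 6)          # flip bit 6 of a: 4 becomes 68
faulty = mul_trunc(faulty_a, b)  # (68 * 16) mod 256 = 1088 mod 256 = 64
print(correct, faulty)           # prints "64 64": the fault is masked
```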

This happens whenever \(b=b'\times {2^s}\) with \(b'\) odd and the bitflips on a fall only within the top s bit positions: the entire error is then removed when the overflow is truncated, and the result is correct. This means that benign behavior of multiplicative faults depends only on the power of 2 dividing b and on where the bitflips occur in a.

This is a mathematical property of multiplication. The only machine contingency that affects this result is the treatment of overflow, which is undefined. Every machine we have tested implements overflow by truncating, and this permits the multiplicative resilience we describe in this paper.

Multiplicative behavior upon overflow is undefined in the C/C++ specification [3]. Although there are some non-standard, compiler-specific options that can in some cases alter the behavior of overflow, these are often ignored. Additionally, run-time checks can be incorporated to detect the possibility of overflow prior to and after instruction execution; however, this functionality is in most cases compiler- and OS-specific, and is almost never enabled by default. Furthermore, this type of protection can be costly, as it involves inserting additional instructions after every operation that may result in overflow, which can lower run-time performance [3], an unwanted characteristic in HPC.

This illustrates how the implementation of undefined behavior might be used to affect the resilience of a given operation. For example, an implementation might choose to recover the most significant bits upon overflow. This would not give the resilience to high-bit faults described here, but might give other forms of resilience.

Proposition 1

Let multiplication overflow be implemented via truncation.

Let n be the wordlength, let a be an n-bit integer having some number k of bitflips, and let b be an n-bit integer such that \(2^j\) is the largest power of 2 dividing b, so \(b=2^j\times b'\) and \(b'\) is odd.

Let h be the bit of smallest index in a (where indices start at 0) that is flipped.

Then \(a\times {b}\) gives the same answer pre- and post-bitflip if and only if \(h\ge n-j\).
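Proposition 1 can be verified exhaustively for a small wordlength. The following sketch (ours) checks the criterion \(h\ge n-j\) against truncated multiplication for n = 8:

```python
from itertools import combinations

def check_proposition_1(n=8, k=2):
    """Check: (a ^ flips) * b == a * b (mod 2^n) iff the lowest flipped
    bit index h satisfies h >= n - j, where 2^j is the largest power
    of 2 dividing b."""
    mask = (1 << n) - 1
    for b in range(1, 1 << n):
        j = (b & -b).bit_length() - 1            # largest power of 2 dividing b
        for flips in combinations(range(n), k):
            h = min(flips)                       # smallest flipped bit index
            flip_mask = sum(1 << f for f in flips)
            for a in range(0, 1 << n, 37):       # stride keeps runtime small
                same = ((a ^ flip_mask) * b) & mask == (a * b) & mask
                assert same == (h >= n - j)
    return True
```

The assertion holds for every a because the induced error \((a\oplus flips)-a\) is an odd multiple of \(2^h\), so its product with b vanishes modulo \(2^n\) exactly when \(h+j\ge n\).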

In the next two corollaries, we consider the case where overflow does not occur in the original multiplication \(a\times {b}\). We do this as these corollaries are concerned with correctness, and the calculation will not be correct if there is overflow.

Corollary 1

Let the conditions on a, b, h and the bitflips be as in Proposition 1, and let \(a\times {b}\) give the same answer pre- and post-bitflip. If \(a\times {b}\) does not produce overflow, then \(a_{bitflipped}\times {b}\) will be correct.

Corollary 2

If b is odd, \(a_{bitflipped}\times {b}\) will always be in error.

If b is even, \(a_{bitflipped}\times {b}\) may be correct, if the faults in a fall into the right bits and \(a\times {b}\) does not produce overflow.

2.2 Probability of Benign Faults on a Uniform Fault Model

We calculate the probability that faults occurring before an integer multiply are benign, where an arbitrary number k of faults following a uniform distribution are injected into the n-bit signed integers a or b.

Let \(0\le {a}<{2^m}\) and \(0\le {b}<{2^m}\), for some number of bits m, with \(m<{n}\). We permit faults to fall anywhere in the n available bits. A special case of this calculation is when a and b vary over all n-bit signed integers, so \(m=n-1\).

We note that Proposition 1 may be applied to an arbitrary fault model (including one resulting from a data protection scheme with some SDC). Given the fault rate (protected or not) for a particular hardware or the number of undetected faults possible for a particular ECC scheme, the combinatorial calculation may be performed for that rate or number of faults.

General Probability of Benign Faults. To calculate the overall probability of a benign result after k bitflips, we count the number of benign results after k faults and divide by the total number of \((a, b)\) pairs having k faults:

$$ \frac{\text {benign }(a,b,faults)}{\text {total }(a,b,faults)} $$

We count the total number of \((a,b\ne 0,faults)\) triples by calculating a sum over the set of \(b\ne 0\), i.e., b ranging from 1 to \(2^m-1\). For each \((a, b)\) pair, there are \(\left( {\begin{array}{c}n\\ k\end{array}}\right) \) ways of choosing k faults on n bits.

Let \(num\_benign\_a_b\) be the number of a corresponding to each \(b\ne 0\) having zeros in the j least significant bits, for a given set of k faults. This number differs according to the conditions imposed on a and b (i.e., whether overflow is allowed in their product). We leave \(num\_benign\_a_b\) as a general variable, and obtain

$$\begin{aligned} \frac{\left( {\begin{array}{c}n\\ k\end{array}}\right) 2^m+\sum \limits _{j=k}^{m-1}\left( {\begin{array}{c}j\\ k\end{array}}\right) \sum \limits _{i=0}^{2^{(m-j-1)}-1}num\_benign\_a_b}{\left( {\begin{array}{c}n\\ k\end{array}}\right) 2^m+\sum \limits _{b=1}^{2^m-1}\left( {\begin{array}{c}n\\ k\end{array}}\right) num\_a_b} \end{aligned}$$
(1)

Formula (1) is the general expression of the probability of a benign result upon k bitflips. We now distinguish between the cases where \(a\times b\) is permitted to produce overflow, and where \(a\times b\) is not.

Probability of Benign Faults When Overflow is Permitted. Since overflow is permitted, the multiply calculation will not get the correct answer if it does overflow. For this case, then, we do not look for correctness per se, but merely ask that the multiply get the same answers in the faulty and fault-free calculations.

\(num\_a_b\) is the total number of a corresponding to each \(b\ne 0\) and set of k faults. We do not care whether \(a\times b\) overflows, so all a are eligible, and \(num\_a_b\) is \(2^m\).

\(num\_benign\_a_b\) is the number of a corresponding to each \(b\ne 0\) having zeros in the j least significant bits and set of k faults. Again, there are \(2^m\) of these.

Substituting \(2^m\) for \(num\_a_b\) and \(num\_benign\_a_b\) in Formula (1), we obtain

$$ \frac{\left( {\begin{array}{c}n\\ k\end{array}}\right) 2^m+\sum \limits _{j=k}^{m-1}\left( {\begin{array}{c}j\\ k\end{array}}\right) \sum \limits _{i=0}^{2^{(m-j-1)}-1}2^m}{\left( {\begin{array}{c}n\\ k\end{array}}\right) 2^m+\sum \limits _{b=1}^{2^m-1}\left( {\begin{array}{c}n\\ k\end{array}}\right) 2^m} $$

This reduces to

$$\begin{aligned} \frac{\left( {\begin{array}{c}n\\ k\end{array}}\right) +\sum \limits _{j=k}^{m-1}{(\left( {\begin{array}{c}j\\ k\end{array}}\right) {2^{(m-j-1)}}})}{\left( {\begin{array}{c}n\\ k\end{array}}\right) {2^{m}}} \end{aligned}$$
(2)
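Formula (2) is cheap to evaluate and can be cross-checked against direct enumeration for small parameters; a sketch (ours):

```python
from itertools import combinations
from math import comb

def p_benign_overflow(n, m, k):
    """Formula (2): probability of a benign result, overflow permitted."""
    num = comb(n, k) + sum(comb(j, k) * 2 ** (m - j - 1) for j in range(k, m))
    return num / (comb(n, k) * 2 ** m)

def p_benign_brute(n, m, k):
    """Enumerate every (a, b, fault-set) triple and count benign results."""
    mask = (1 << n) - 1
    benign = total = 0
    for b in range(1 << m):
        for a in range(1 << m):
            for flips in combinations(range(n), k):
                fm = sum(1 << f for f in flips)
                total += 1
                benign += ((a ^ fm) * b) & mask == (a * b) & mask
    return benign / total
```

For example, both functions give 19/128 for n = 8, m = 4, k = 1, and \(p\_benign\_overflow(8, 7, 1)\) reproduces the 1/n special case derived later in this section.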

Proposition 2

Let integers a and b of wordlength n and bitlength m be multiplied, with overflow permitted, and let k faults occur in one of the integers. Then the probability of masked-benign results in this integer multiply increases as wordlength n decreases. In other words, the resilience of integer multiply with overflow permitted increases with shorter wordlength.

Probability of Benign Faults When Overflow is Not Permitted. In this case, a and b are chosen so that \(a\times b\) does not overflow. This is the meaningful case, since overflow multiply values are incorrect.

This restriction means that \(a\times b\le 2^{n-1}-1\), or \(a\le \frac{2^{n-1}-1}{b}\). We have the additional restriction that both a and b are less than \(2^m\). We find values for \(num\_a_b\) and \(num\_benign\_a_b\) and substitute them into Formula (1) to obtain the probability in the no-overflow case.

We obtain the following values for \(num\_a_b\) and \(num\_benign\_a_b\):

$$ num\_a_b=\text {min}(2^{m},\lfloor \frac{2^{(n-1)}-1}{b}\rfloor +1) $$
$$ num\_benign\_a_b=\text {min}(2^{m},\lfloor \frac{2^{(n-1)}-1}{2^j\times {(2i+1)}}\rfloor +1) $$

and obtain an overall probability in the case of no overflow of

$$\begin{aligned} \frac{\left( {\begin{array}{c}n\\ k\end{array}}\right) 2^{m} + \sum \limits _{j=k}^{m-1} ({\left( {\begin{array}{c}j\\ k\end{array}}\right) {\sum \limits _{i=0}^{2^{(m-j-1)}-1}{\text {min}(2^{m},\lfloor \frac{2^{(n-1)}-1}{2^j\times {(2i+1)}}\rfloor +1)}}}) }{\left( {\begin{array}{c}n\\ k\end{array}}\right) {(2^m+\sum \limits _{b=1}^{2^m-1}{\text {min}(2^{m},\lfloor \frac{2^{(n-1)}-1}{b}\rfloor +1))}}} \end{aligned}$$
(3)
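Formula (3) can be evaluated the same way; a sketch (ours) mirroring the min/floor terms above:

```python
from math import comb

def p_benign_no_overflow(n, m, k):
    """Formula (3): probability of a correct result when a and b are
    restricted so that a * b fits in the n-bit signed range."""
    cap = 2 ** (n - 1) - 1

    def num_a(b):                      # number of admissible a for this b
        return min(2 ** m, cap // b + 1)

    num = comb(n, k) * 2 ** m + sum(
        comb(j, k) * sum(num_a(2 ** j * (2 * i + 1))
                         for i in range(2 ** (m - j - 1)))
        for j in range(k, m))
    den = comb(n, k) * (2 ** m + sum(num_a(b) for b in range(1, 2 ** m)))
    return num / den
```

Consistent with implication (2) of Sect. 2.3, this value exceeds that of Formula (2) for the same parameters (e.g., 285/1792 versus 19/128 for n = 8, m = 4, k = 1).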

Conjecture 1

Let integers a and b of wordlength n and bitlength m be multiplied, with overflow not permitted, and let k faults occur in one of the integers. Then the probability of masked-benign results in this integer multiply increases as wordlength n decreases. In other words, the resilience of integer multiply with overflow not permitted increases with shorter wordlength.

All experimental evidence we have generated supports this conjecture. The conjecture is clearly true in the case where \(m\le {\lfloor {\frac{n}{2}}\rfloor }\), for wordlengths \(\ge {n}\): this reduces to the overflow case described in Proposition 2, since the product of two integers of bitlength m has at most \(2m\le n\) bits, and so cannot overflow.

Fig. 1.

Results from the analytical model when a and b are chosen such that overflow will not occur in \(a*b\), as well as when overflow is allowed to occur (calculated only up to a bitlength of 27 for a and b, because of the large amount of time needed to compute the summations for larger bitlengths). Note the relatively high percentage of multiplicative resilience when the bitlength of the factors is small compared to the architectural width.

Special Case: 1 Bitflip with Overflow Permitted. We calculate this case explicitly because it reduces to such a simple form. Setting \(m=n-1\) and \(k=1\) gives the following formula directly from Formula (2).

$$\begin{aligned} \frac{n+\sum \limits _{j=1}^{n-2}{j\times {2^{(n-j-2)}}}}{n\times {2^{(n-1)}}} = \frac{1}{n} \end{aligned}$$
(4)
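A quick numeric check (ours) confirms the reduction to 1/n:

```python
# Evaluate the left-hand side of (4) for several wordlengths n; each value
# equals 1/n exactly, since the numerator reduces to 2^(n-1) and all
# quantities involved are exactly representable as floats.
for n in (4, 8, 16, 32):
    lhs = (n + sum(j * 2 ** (n - j - 2) for j in range(1, n - 1))) \
          / (n * 2 ** (n - 1))
    assert lhs == 1 / n
```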

2.3 Implications

The theoretical results have three implications: (1) There is non-trivial resilience occurring naturally in multiply, contingent only on overflow handling, (2) Resilience is better when overflow is not permitted on the product of a and b, and (3) Improved resilience is obtained in some cases by using the smallest wordlength possible for a variable (experimental evidence supports this in all cases).

Figure 1 shows the resilience of multiply when a and b have been screened to preclude overflow, as well as when overflow is allowed. The results for the two sets of data are the same up to the point where the bitlength of a and b is about half the wordlength. Beyond that point, when overflow is allowed, the resilience continues to decrease, converging to the value obtained when the bitlength of a and b spans the entire wordlength.

3 Experimental Verification via Exhaustive Multiplication Testing

We verify the theoretical model described in Sect. 2 by simply multiplying integers after bitflips, and calculating the percentage of time the product is correct. Where possible, we exhaustively compute all combinations of \(a*b\), for every bit-flip that can occur in a. When it is too time-consuming to accomplish this, we sample from the overall space of all possible combinations of \(a*b\).

3.1 Exhaustive Search

Although the analytic model discussed in Sect. 2 is parameterized to handle a configurable number of bit-flips (k) and arbitrary-sized a's and b's, in this section we focus on 32-bit integer multiplication and single bit-flips to a. We have chosen these parameters because they match the situation in our micro-benchmark, discussed in Sect. 4. Note that it is not necessary to flip bits in b, since integer multiplication is commutative.

In the first strategy, “Exhaustive”, we loop across all possible combinations of \(a*b\), where \(0\le a,b\le 2^i-1\), and where i can be chosen to cover the ranges of a and b that are of interest for a particular application under consideration. Then, for each product, we loop across all possible single-bit bit-flips and count the number of times that the uncorrupted product is equal to the product that results under fault injection. Using this count, we compute the percentage of combinations that result in masked benign errors.
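The “Exhaustive” strategy can be sketched as follows (our reconstruction; the 32-bit truncation is emulated with a mask, since Python integers are arbitrary-precision):

```python
N = 32                   # wordlength of the emulated machine integer
MASK = (1 << N) - 1

def exhaustive_benign_pct(i):
    """Percentage of (a, b, flipped-bit) combinations whose faulty product,
    truncated to N bits, equals the fault-free product."""
    benign = total = 0
    for b in range(1 << i):
        for a in range(1 << i):
            good = (a * b) & MASK
            for h in range(N):                 # every single bit-flip in a
                total += 1
                benign += ((a ^ (1 << h)) * b) & MASK == good
    return 100.0 * benign / total
```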

3.2 Sampled Search

The above “Exhaustive” strategy works well, provided that the ranges of a and b are small enough that the total number of combinations is not prohibitively expensive to compute. For the machine used in this research, the “cut-off” point for 32-bit integers occurs when i is chosen to be larger than roughly 17 bits. As such, we were interested in a sampled approach that would allow us to verify the theoretical model for larger values of i; we refer to this strategy as “Sampled”. It is similar to the prior strategy in that we loop across all possible values of b, but for each of these values we randomly choose a, \(0\le a \le 2^i-1\), and then randomly flip one of a's 32 bits. The number of trials across a, \(a\_trials\), is chosen to be large enough to give an adequate sampling of a, without the need to iterate across all possible a as is done in the “Exhaustive” approach.
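A sketch of the “Sampled” strategy (ours; the default trial count and seed are illustrative choices, not values from the study):

```python
import random

N = 32                   # wordlength of the emulated machine integer
MASK = (1 << N) - 1

def sampled_benign_pct(i, a_trials=1000, seed=0):
    """Loop over every b; for each, sample a uniformly and flip one
    randomly chosen bit of a."""
    rng = random.Random(seed)
    benign = total = 0
    for b in range(1 << i):
        for _ in range(a_trials):
            a = rng.randrange(1 << i)
            h = rng.randrange(N)      # random single bit-flip position in a
            total += 1
            benign += ((a ^ (1 << h)) * b) & MASK == (a * b) & MASK
    return 100.0 * benign / total
```

Because the loop over b is exhaustive and a is sampled uniformly per b, the estimate converges to the same probability as the exhaustive count.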

3.3 Comparison of Experimentation to Theoretical Model

We show in Table 1 a set of experimental results and theoretical calculations for single bitflips on 32-bit integers. When a and b have bitlength greater than 17, we sampled as above; otherwise we exhaustively tested every \((a, b, bitflip)\) combination possible. Note that the results for bits 14 to 22 were identical to those for bits 13 and 23, and were therefore omitted from the table in the interest of space.

Table 1. The correspondence between theoretical calculations and experimental results for 32-bit wordlength. Entries represent the percentage of benign bitflips. Experimental and theoretical values are an exact match where we tested exhaustively (bitlength up to 17), and are a close match where we sampled (bitlength greater than 17).

As seen in the table, the theoretical model exactly matches the experimental data from the exhaustive testing, and matches the sampled data very closely. Wherever we conducted an exhaustive experiment, the experimental probabilities matched the theoretical calculation exactly; where we sampled, for larger wordlengths and bitlengths, the results were extremely close.

The theoretical formula is a double summation depending on bitlength and has high time complexity, so it can be hard to calculate for large bitlengths. Entries in the table represented as “–” denote these missing calculations. Sampling and multiplying (a, b) pairs is a Monte Carlo method of approximating the theoretical calculations, and in practice converges quickly to a usable solution. The experiment columns thus adequately represent the percentage where we did not do the full theoretical calculation.

4 Experimental Verification via Matrix Multiplication Micro-Benchmark

We verify multiplicative resilience in another way by running a common fault-tolerant matrix multiplication algorithm [2, 8, 9] (ABFT-MM). We injected faults into randomly selected multiplication operands as the inner-products are computed, and observed how often faults were actually caught and corrected. This also allows us to observe the results of multiplicative resilience on more complex code similar to algorithms run in the field.

We provide the results of extensive experimentation under random fault injection and compare these results to those predicted by the theoretical model, as well as to those obtained via the sampled multiplication tests discussed in Sects. 2 and 3.

4.1 ABFT-MM Experimentation and Results

We present the results of studying an integer implementation of ABFT MM in the presence of injected faults using the F-SEFI fault injector [6], based on QEMU [1]. In this benchmark, we targeted the fault injection at the traditional triply-nested for-loop structure of the dense matrix multiply. Specifically, we used F-SEFI to corrupt a single randomly-chosen bit in the output of a randomly-chosen IMUL operation in the algorithm's inner-product calculation.

Our benchmark included roughly 60,000 matrix multiplication trials within F-SEFI. In each of these trials, both input matrices A and B were randomly generated, with each element e satisfying \(0\le e \le 2^{i}-1\) for \(3\le i\le 7\), giving a bitlength of \(i+1\) for signed integers. These ranges were chosen to illustrate the impact of increasing bitlengths while remaining capped at 127, so as to guarantee that overflow would not occur during the dot-product calculation for matrices of size \(45\times 45\) using 32-bit signed integer datatypes.

Table 2. Masked Benign error rates

After each execution, the result of the ABFT MM's error detection and correction on the faulty product \(C'\) was compared to a pre-computed “golden” answer for the matrix product, to determine whether the algorithm had successfully detected and corrected any errors in \(C'\).

The results in Table 2 also allow us to compare the empirical results from ABFT-MM under fault injection in F-SEFI with data derived from the theoretical model and the other experimental models discussed in Sects. 2 and 3.2. These results show that under fault injection, the percentage of benign errors in ABFT-MM closely matches that predicted by the theoretical model.

5 Preliminary Results on Other Operations

5.1 Experimental Results

We have experimentally demonstrated that this masked benign behavior is not unique to integer multiplication. Using the exhaustive search strategy, we explore the resilience of integer division and modulo, and contrast these with our prior findings for integer multiplication, as shown in Fig. 2. It should be noted that while multiplication is commutative, division and modulo are not; as such, for each operation a OP b we provide experimental results both where the bitflip is in a and where the bitflip is in b, as well as an average across these two results.

Fig. 2.

Results from an exhaustive search for 16-bit a and b space, testing every possible bitflip on a or b as specified, for integer multiplication, division and modulo. Note the relatively high percentage of resilience across each of these operations.

These experimental results are provocative, and strongly suggest that this masked benign behavior is present in operations beyond integer multiplication. With this in mind, we intend to explore other integer and floating point arithmetic operations, from both an experimental and theoretical point of view, in our future work.

5.2 Partial Theory of Division Resilience

We present here some preliminary results on the resilience of the integer division operator. We examine the case where only the numerator experiences bitflips, and show experimental verification of the propositions and experimental support for the conjectures. Again, we do not present the proofs of these, in the interest of brevity.

Experimental evidence shows that integer division is much more resilient than integer multiply. Future work will include a complete demonstration as shown earlier for the integer multiplication operator.

This gives a path forward for the case of the division operator, and demonstrates that the mathematical approach to resilience of arithmetic operators, carried out completely for multiplication, may be extended to other operators.

Division Resilience When Numerator Has an Arbitrary Number of Bitflips. As a proof of concept, we consider the case where only the numerator experiences bitflips.

Proposition 3

Let b be an n-bit integer equal to \(2^k\) for some k, and let a be an integer such that \(0\le a<2^n\). Assuming equal probability for any combination of bitflips, the overall percentage of correct answers on \(\frac{a}{b}\) when a experiences some number of bitflips is \(\frac{2^k}{2^n}=\frac{1}{2^{n-k}}\).

Conjecture 2

Let b be an n-bit integer, not necessarily a power of 2, and let a be an integer such that \(0\le a<2^n\). Assuming equal probability for any combination of bitflips, then the overall percentage of correct answers on integer divide when bitflips occur only on a is close to \(\frac{b}{2^n}\).

Proposition 4

Let b be an n-bit integer equal to \(2^k\) for some k, and let a be an integer such that \(0\le a<2^n\). Then the overall percentage of correct answers on \(\frac{a}{b}\) when a experiences m bitflips is \(\frac{\left( {\begin{array}{c}k\\ m\end{array}}\right) }{\left( {\begin{array}{c}n\\ m\end{array}}\right) }\).
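Proposition 4 (and, summing over the number of flips, Proposition 3) can be checked by enumeration; a sketch (ours) for small n:

```python
from itertools import combinations

def div_correct_fraction(n, k, m):
    """Fraction of (a, flip-set) combinations for which a // 2**k is
    unchanged after flipping m distinct bits of a."""
    correct = total = 0
    for a in range(1 << n):
        for flips in combinations(range(n), m):
            fm = sum(1 << f for f in flips)
            total += 1
            correct += (a ^ fm) >> k == a >> k
    return correct / total
```

The quotient is unchanged exactly when every flipped bit lies below bit k, giving \(\binom{k}{m}/\binom{n}{m}\); e.g., 3/28 for n = 8, k = 3, m = 2.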

Corollary 3

Let b be an n-bit integer equal to \(2^k\) for some k, and let a be an integer such that \(0\le a<2^n\). Let p(i) be the probability that i faults occur in a given fault model. Then the overall probability of correct results using that fault model on integer divide when bitflips occur only on a is \(\sum _{i=0}^{n} p(i)\frac{\left( {\begin{array}{c}k\\ i\end{array}}\right) }{\left( {\begin{array}{c}n\\ i\end{array}}\right) }\).

6 Conclusion

Providing an analytic model for the impact that soft errors have on low-level operations forms the basis for establishing confidence in any injection-based empirical study. In this work, we have established such a model for integer multiplication and have begun investigating other integer operations. We have shown a non-trivial resilience inherent in the integer multiplication operator, and have demonstrated experimentally that this resilience increases as wordlength gets shorter. Our analytic models are clearly supported by extensive experimentation, which also shows promise for other integer operators as well.