1 Introduction and Related Work

Efficient arithmetic operations in finite fields are important in many applications, including coding theory, computer algebra systems, information theory, number theory, and elliptic curve cryptosystems (ECC) [1]. Multiplication over GF(2^m) is the basic field operation most frequently encountered in these applications. Multipliers with different bases of representation, e.g., polynomial basis, normal basis, and dual basis, have been realized for various applications. However, polynomial basis multipliers are more efficient and more widely used than multipliers based on the other two bases.

Numerous hardware architectures have been proposed for polynomial-basis finite-field multiplication over GF(2^m) [2–10, 12–25]. In terms of design style, the hardware architectures can be classified into two basic forms. The first form is the systolic or semi-systolic architecture and the second form is the nonsystolic architecture. Nonsystolic designs mostly aim to reduce the number of partial products to realize multipliers with the least hardware and shorter latency [2–11]. On the other hand, the systolic designs in [12, 18–28] possess advantages over the nonsystolic ones due to their regularity, modularity, simplicity of the processing elements (PEs), local interconnections, and high throughput rates [29]. We explore several semi-systolic architectures by converting GF(2^m) multiplication into an iterative algorithm using systematic linear and nonlinear techniques that combine affine and nonlinear task scheduling with assignment of tasks to processors. The nonlinear techniques discussed here allow the designer to control the processor workload, the processor word width, and the inter-processor communication.

The paper is organized as follows: Section 2 discusses finite field multiplication over GF(2^m) based on irreducible trinomials. Section 3 discusses converting field multiplication into an iterative algorithm using Progressive Multiplier Reduction (PMR). Section 4 presents a systematic technique to parallelize the PMR iterative multiplication algorithm using linear and nonlinear data scheduling and projection techniques. Section 5 discusses the design space exploration for the PMR iterative multiplication algorithm. Section 6 discusses the complexity of the proposed designs and compares them to previous work. Finally, Section 7 provides the conclusions of this work.

2 Problem Formulation

The National Institute of Standards and Technology (NIST) recommended five irreducible field polynomials for ECC over GF(2^m) [30]. Two of these polynomials are trinomials: Q(x) = x^233 + x^73 + 1 and Q(x) = x^409 + x^87 + 1. This motivated several semi-systolic implementations using these polynomials [6–8, 25, 28]. The field polynomial has the form:

$$ Q(x) = x^{m}+x^{k}+1 $$
(1)

Assuming α is a root of Q(x), the two field elements A and B to be multiplied are represented by the polynomials:

$$ A = \sum\limits_{h=0}^{m-1} a_{h} \ \alpha^{h} \quad \text{and} \quad B = \sum\limits_{g=0}^{m-1} b_{g} \ \alpha^{g} $$
(2)

where a_h, b_g ∈ GF(2) for 0 ≤ h, g < m. The reduced product C will be m bits long:

$$\begin{array}{@{}rcl@{}} C &=& A\times B = \left [\sum\limits_{h=0}^{m-1} \sum\limits_{g=0}^{m-1} a_{h} \ b_{g} \ \alpha^{h+g} \right ]\ \text{mod} Q(\alpha)\\ &=& \sum\limits_{g=0}^{m-1} \ c_{g} \ \alpha^{g} \end{array} $$
(3)

It is not practical to perform the modulo operation on the polynomial in Eq. 3 whose degree is 2m − 2. Since the modulo operation is distributive, we can write (3) as:

$$ C = \sum\limits_{g=0}^{m-1} b_{g} \left [\ \alpha^{g} A \quad \text{mod } Q(\alpha) \right ] \quad = \sum\limits_{g=0}^{m-1} C_{g} $$
(4)

We note from Eq. 4 that each partial product is a polynomial:

$$ C_{g} = b_{g} \alpha^{g} A \quad \text{mod } Q(\alpha) $$
(5)

It is not practical to perform the reduction in Eq. 4 or Eq. 5 in one step. An attractive approach is to iteratively perform the reduction operation on the different powers of the multiplier, α^g A mod Q(α), as will be explained in the following section.
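To make the decomposition of Eq. 4 concrete, the schoolbook product of Eq. 3 followed by a single reduction modulo Q(x) can be sketched in a few lines of Python. This is our own illustrative model (bit j of an integer holds the coefficient of α^j; the function names are not from any library):

```python
def gf2m_reduce(v, m, k):
    """Reduce a GF(2) polynomial (bits of an integer) modulo Q(x) = x^m + x^k + 1."""
    q = (1 << m) | (1 << k) | 1
    for d in range(v.bit_length() - 1, m - 1, -1):
        if (v >> d) & 1:
            v ^= q << (d - m)    # clear bit d, substituting x^m = x^k + 1
    return v

def gf2m_mult(a, b, m, k):
    """Schoolbook product of Eq. 3: carry-less multiply, then one reduction."""
    p = 0
    for g in range(m):
        if (b >> g) & 1:
            p ^= a << g          # partial product b_g * alpha^g * A, unreduced
    return gf2m_reduce(p, m, k)
```

The unreduced product has degree up to 2m − 2, which is exactly why the one-shot reduction above is replaced by the progressive reduction of the next section.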

3 Progressive Multiplier Reduction (PMR) Technique

We convert (4) into an iteration using increasing powers of α based on Algorithm 1.

Algorithm 1 (pseudocode listing)

A^i is given by:

$$ A^{i} = \sum\limits_{j=0}^{m-1} {a^{i}_{j}} \alpha^{j} $$
(6)

And α A^i is written as:

$$ \alpha A^{i} = \sum\limits_{j=1}^{m} a^{i}_{j-1} \alpha^{j} $$
(7)

Using Eq. 1, we can write:

$$ \alpha^{m} = \alpha^{k} +1 \quad \text{mod } Q(\alpha) $$
(8)

Substituting (8) in Eq. 7 effectively accomplishes the reduction step and we get:

$$ \alpha A^{i} \ \text{mod } Q(\alpha) = a^{i}_{m-1}\left (\alpha^{k}+1\right ) + \sum\limits_{j=1}^{m-1} a^{i}_{j-1} \alpha^{j} $$
(9)

The above equation ensures that the reduction step in Eq. 9 produces a polynomial A^{i+1} with a degree less than m. We modify Algorithm 1 to operate at the bit level as shown in Algorithm 2. In this algorithm, a_j represents the j-th bit of the operand A and c_j represents the j-th bit of the final product C. Also, the terms \({a^{i}_{j}}\) and \({c^{i}_{j}}\) represent the j-th bit of the operand A and partial product C at iteration i, respectively.
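Since the listing of Algorithm 2 appears only as a figure, the bit-level iteration it implements (Eqs. 4–9) can be sketched as follows. This is our own Python model of the iteration, not the authors' exact pseudocode:

```python
def pmr_multiply(a_bits, b_bits, m, k):
    """Bit-level PMR multiplication over GF(2^m) with Q(x) = x^m + x^k + 1.
    a_bits, b_bits: lists of m bits; index j holds the coefficient of alpha^j."""
    a = list(a_bits)                 # A^0 = A
    c = [0] * m                      # product accumulator, cleared
    for i in range(m):
        for j in range(m):           # accumulate partial product b_i * A^i
            c[j] ^= b_bits[i] & a[j]
        f = a[m - 1]                 # feedback bit from the MSB of A^i
        a = [0] + a[:-1]             # multiply A^i by alpha (shift up)
        a[0] ^= f                    # reduce using alpha^m = alpha^k + 1 (Eq. 8)
        a[k] ^= f
    return c
```

The feedback bit f and the two XORs at positions 0 and k are exactly the reduction of Eq. 9, and reappear later as the feedback signal routed to PE_0 and PE_k in the hardware designs.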

4 Parallelizing the PMR Technique

The operations in Steps 6–13 of Algorithm 2 define the iterative algorithm that implements (4). The second author developed systematic techniques to parallelize iterative algorithms that allow for exploring all possible systolic arrays and optimizing the performance according to certain specifications [29]. Early techniques represented the dependence among pairs of variables as a dependence graph (DG) and had several problems: (a) they were confined to simple two-dimensional (2D) algorithms such as matrix-vector multiplication, and it becomes very difficult to deal with high-dimensionality algorithms or with algorithms that contain many variables; (b) using the DG gives few options for developing possible scheduling algorithms.

Algorithm 2 (pseudocode listing)

4.1 Study of Algorithm Variables

Algorithm 2 has two indices i and j whose ranges define a set of points in a convex hull 𝔻 in the 2-D integer space, i.e. 𝔻 ⊂ ℤ² [29]. The algorithm has two input variables A and B; two intermediate variables A^i and C^i; and one output variable C. The input bits \({a^{0}_{j}}\) are shown at the top row of Fig. 1. Bit b_i is used only at row i, as indicated by the horizontal lines in Fig. 1.

Figure 1

Dependence graph of the PMR algorithm for m = 7 and k = 4.

The intermediate variable A^i is updated using iteration Steps 8 and 9, as indicated by the diagonal lines in Fig. 1. The bits of the intermediate variable C^i are updated using Step 10, as indicated by the vertical lines in Fig. 1. The arrows indicate the direction of data flow between the nodes at each time step. The final product bits for output variable C are obtained at the bottom of the graph. Notice that successive reduction steps are represented by the feedback lines obtained from the most significant (right-most) bit of A^i, as indicated by the dashed red lines.

4.2 Scheduling Function Design for PMR Technique

We use an affine scheduling function such that point \(\mathbf {p} =[i j]^{t} \in \mathbb {D}\) is assigned a time value n(p) given by:

$$\begin{array}{@{}rcl@{}} n(\mathbf{p}) &=& \mathbf{s\ p} -\gamma \end{array} $$
(10)
$$\begin{array}{@{}rcl@{}} & = & i \alpha + j W-\gamma \end{array} $$
(11)

where s = [α W] is the scheduling vector and γ is a scalar constant. The scheduling function assigns a time index value to each node in the graph, so the data moving between the nodes are now governed by a time relationship. The scheduling function converts the dependence graph 𝔻 into a directed acyclic graph (DAG).

Based on the data flow in Fig. 1, we have two restrictions on our choice of s. The iterative calculation of \({a^{i}_{j}}\) implies that the task at point [i + 1, j + 1] must be executed after the task at point [i, j]. This restriction can be written as

$$ [\begin{array}{cc}\alpha & W \end{array}] [\begin{array}{cc}i+1 & j+1 \end{array}]^{t} >[\begin{array}{cc}\alpha & W \end{array}][\begin{array}{cc}i & j \end{array}]^{t} $$
(12)

This results in a condition on the components of s:

$$ \alpha+W > 0 $$
(13)

Another restriction on timing is due to the feedback in Fig. 1: the task at point [i + 1, 0] can only proceed after the task at point [i, m − 1] has been evaluated:

$$ [\begin{array}{cc}\alpha & W \end{array}] [\begin{array}{cc}i+1 & 0 \end{array}]^{t} >[\begin{array}{cc}\alpha & W \end{array}] [\begin{array}{cc}i & m-1 \end{array}]^{t} $$
(14)

This results in another inequality:

$$ \alpha > (m-1) W $$
(15)

Based on Eqs. 13 and 15 we have two simple scheduling functions as discussed in the following paragraphs.

The first scheduling function s_1 is a linear affine mapping that assigns a time value to each point p ∈ 𝔻 of Fig. 1:

$$ n(\mathbf{p}) = \mathbf{s_{1}\ p} - \gamma_{1}= i $$
(16)
$$ \mathbf{s}_{1} = [\begin{array}{cc}1&0 \end{array} ] $$
(17)
$$ \gamma_{1} = 0 $$
(18)

Figure 2 shows the node timing for the PMR algorithm using the scheduling function s_1 for m = 7 and k = 4. Note that the time index n is identical to the i-axis index value, so iteration i and time index value n are related by:

$$ n = i $$
(19)

The grey boxes indicate equitemporal regions where all nodes in a region execute at the same time. The numbers on the right of the figure indicate the times. The PMR technique requires m time steps to complete when s_1 is used.

Figure 2

Node timing for the PMR algorithm using the scheduling function s_1 for m = 7 and k = 4.

Using s_1, the workload per time step is m, which depends on the size of the polynomial being processed. In that sense, we are unable to control the workload per time step using linear affine scheduling.

The second timing function s_2 controls the workload per time step and the number of time steps required to complete the multiplication operation. As an indirect benefit, it also controls the number of processing elements. The nonlinear scheduling function has the form:

$$\begin{array}{@{}rcl@{}} n(\mathbf{p}) &=& \mathbf{s_{2}\ p} -\gamma_{2} \\ &=& i \left \lceil \frac{m}{T} \right \rceil -\left \lfloor\frac{j+\mu_{2}}{T}\right \rfloor -\gamma_{2} \end{array} $$
(20)
$$\begin{array}{@{}rcl@{}} \mathrm{\textbf{s}}_{2} &=& \left [\begin{array}{cc} \left \lceil \frac{m}{T} \right \rceil &-\left \lfloor\frac{.+\mu_{2}}{T}\right \rfloor \end{array} \right ] \end{array} $$
(21)
$$\begin{array}{@{}rcl@{}} \mu_{2} &=& T \left \lceil \frac{m}{T} \right \rceil-m \end{array} $$
(22)
$$\begin{array}{@{}rcl@{}} \gamma_{2} &=& -\left \lfloor\frac{m-1+\mu_{2}}{T}\right \rfloor \end{array} $$
(23)

where T is the number of tasks to be executed in one time step, ⌈·/T⌉ and ⌊·/T⌋ denote the ceiling and floor functions, respectively, and the dot is a placeholder for the argument.

Figure 3 shows the node timing function s_2 when m = 7, k = 4 and T = 3. The number of tasks executed at each time step is not the same when m is not an integer multiple of T. The time index n in Eq. 20 now depends on the values of both the i and j indices.

Figure 3

Node timing for the PMR algorithm using the nonlinear scheduling function s_2 for m = 7, k = 4 and T = 3. The nodes in blue indicate padding the multiplier bits to make m′ an integer multiple of T.

Figure 3 shows how the dependence graph of Fig. 1 is converted to a DAG when s_2 is used. The grey areas indicate nodes having the same time index; the time index is indicated within each area. The number of tasks executed at each time step is made constant when m is increased to m′, an integer multiple of T:

$$ m^{\prime} = T \left \lceil \frac{m}{T} \right \rceil = m + \mu_{2} $$
(24)

For the case when m = 7 and T = 3, we get m′ = 9 as shown in the figure. The figure also shows that we chose to pad the LSB bits of A and C to obtain the augmented polynomials A′ and C′, respectively. For example, the case when m = 7 and T = 3 yields μ_2 = 2 and the multiplier has the form:

$$\begin{array}{@{}rcl@{}} A^{\prime} &=& \left [ \begin{array}{*6{c}} a^{\prime}_{0} & a^{\prime}_{1} & a^{\prime}_{2} & a^{\prime}_{3} & {\cdots} & a^{\prime}_{m^{\prime}-1} \end{array}\right ] \end{array} $$
(25)
$$\begin{array}{@{}rcl@{}} &=&\left [ \begin{array}{*6{c}} 0 & 0 & a_{0} & a_{1} & {\cdots} & a_{m-1} \end{array}\right ] \end{array} $$
(26)
$$\begin{array}{@{}rcl@{}} m^{\prime} &=& m+\mu_{2} \end{array} $$
(27)
$$\begin{array}{@{}rcl@{}} a_{j} &=& a^{\prime}_{j+\mu_{2}} \qquad 0 \leq j < m \end{array} $$
(28)

Padding on the left involves the LSB bits and leaves the MSB position unchanged, making it easy to identify the location of the MSB. This is useful since the MSB is responsible for generating the feedback signal f. However, the bits corresponding to locations 0 and k in A have now shifted in A′. Likewise, the product polynomial is padded with μ_2 bits on the left to get:

$$\begin{array}{@{}rcl@{}} C^{\prime} &=& \left [ \begin{array}{*6{c}} c^{\prime}_{0} & c^{\prime}_{1} & c^{\prime}_{2} & c^{\prime}_{3} & {\cdots} & c^{\prime}_{m^{\prime}-1} \end{array}\right ] \end{array} $$
(29)
$$\begin{array}{@{}rcl@{}} &=&\left [ \begin{array}{*6{c}} 0 & 0 & c_{0} & c_{1} & {\cdots} & c_{m-1} \end{array}\right ] \end{array} $$
(30)
$$\begin{array}{@{}rcl@{}} c_{j} &=& c^{\prime}_{j+\mu_{2}} \qquad 0 \leq j < m \end{array} $$
(31)

We chose this particular form of nonlinear scheduling function for the following reasons:

  1. The number of nodes processed at a given time is fixed and equals T. The value of μ_2 determines how many more dummy nodes are needed to increase the number of nodes from m to m′, which is an integer multiple of T.

  2. The feedback signal f is obtained from the MSB of A′, which corresponds to the rightmost nodes in Fig. 3.

  3. Based on Eq. 21, the feedback signal is updated at times n = i⌈m/T⌉, with i = 0, 1, ⋯.

  4. Based on Eq. 21, the feedback signal will be supplied to node 0 at times:

    $$ n= i \left \lceil \frac{m}{T} \right \rceil -\left \lceil\frac{\mu_{2}}{T}\right \rceil -\gamma_{2} \quad\text{ with}\quad i=0, 1, {\cdots} $$
    (32)

    and to node k at times:

    $$ n= i \left \lceil \frac{m}{T} \right \rceil -\left \lceil\frac{k+\mu_{2}}{T}\right \rceil -\gamma_{2} \quad\text{ with}\quad i=0, 1, \cdots $$
    (33)

Using this nonlinear scheduling function, we are now able to control the workload per time step; in Fig. 3 it equals T = 3. The multiplication will require m⌈m/T⌉ time steps to complete. This is longer than with the scheduling function s_1, but when coupled with the projection vectors defined in the next section, it results in very practical and scalable designs that are better suited for embedded applications with limited processor resources.
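The behavior of s_2 can be checked numerically. The sketch below (our own illustration) evaluates Eq. 20 on the augmented index j′ = j + μ_2 and tallies how many nodes fall in each time step:

```python
from math import ceil
from collections import Counter

def s2_time(i, jp, m, T):
    """Time index of Eq. 20, evaluated on the augmented index jp = j + mu_2."""
    mu2 = T * ceil(m / T) - m                  # Eq. 22
    gamma2 = -((m - 1 + mu2) // T)             # Eq. 23
    return i * ceil(m / T) - jp // T - gamma2

m, T = 7, 3
m_aug = T * ceil(m / T)                        # m' = 9 (Eq. 24)
times = Counter(s2_time(i, jp, m, T) for i in range(m) for jp in range(m_aug))
# every time step executes exactly T tasks, over m * ceil(m/T) steps
```

With the μ_2 dummy nodes included, each of the m⌈m/T⌉ time steps carries exactly T tasks, which is the constant-workload property claimed above.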

4.3 Projection Function Design for PMR Method

In Section 4.2 we discussed how to associate a time index with each point (task) in the dependence graph. In this section we discuss how to assign a processor to each node in the dependence graph. We can use the techniques proposed in [29] to derive a linear affine task projection. Assume two points in the DAG lie along the projection direction d such that

$$ \mathbf{p}_{2} = \mathbf{p}_{1} + e \mathbf{d} $$
(34)

where e is some constant. These two points will be mapped to the same processor if we make the projection direction d a null vector of the projection matrix P [29]. In other words, we can write:

$$ \mathbf{Pd} = \mathbf{0} $$
(35)

Typically we should ensure that s d ≠ 0. So the choice of a scheduling vector has implications for the choice of projection vectors.

A point p ∈ 𝔻 will be projected to point \(\overline {\mathbf {p}}\) in the processor array space using the affine projection operation

$$ \overline{\mathbf{p}} = \mathbf{Pp}- \delta $$
(36)

where P is a rank-deficient projection matrix and δ is a scalar constant that adjusts the processor indices to start at 0. Reference [29] places one restriction on the projection directions, namely:

$$ \mathrm{\textbf{s}} \mathrm{\textbf{d}} \neq 0 $$
(37)

which ensures that a processor is not required to perform several calculations in the same time step and that all processors are well utilized by working at each time step.

Most of the time, we will be seeking one-dimensional processor arrays to implement an algorithm. Since our algorithm is two-dimensional, P reduces to a row vector and the product in Eq. 36 yields a scalar value for the processor index to which the point maps. Generalization to processor arrays of higher dimensions is beyond the scope of this work.

The following subsections illustrate the design space exploration for the PMR technique through the different choices of scheduling functions and projection directions. Table 1 shows three projection directions associated with scheduling function s_1.

Table 1 Projection vectors associated with the scheduling function s 1.

Table 2 shows two projection directions associated with scheduling function s_2.

Table 2 Projection vectors associated with the scheduling function s 2.

We note that we used both simple linear affine projection directions and complex-looking nonlinear projection directions. The latter choice results in simple hardware for the processor array, where the feedback signal is easily extracted from the processor with the highest index, as will be explained in the sequel.

5 Design Space Exploration for PMR Technique

The following subsections discuss the different designs associated with each choice of s and d that are listed in Tables 1 and 2.

5.1 Design #1: Using s_1 and d_11

A point \(\mathbf {p} = [i\ j]^{t} \in \mathbb {D}\) is mapped by the projection matrix P_11 = [0 1] onto the point:

$$ \overline{\mathbf{p}} = \mathbf{P}_{11}\mathbf{p}-\delta_{11}= j $$
(38)

All nodes in a column map to a single PE but execute at different time steps. Figure 4 shows the hardware details for Design #1. Figure 4a shows the semi-systolic array design when m = 7 and k = 4. Communication between adjacent PEs requires only a one-bit line for transmitting bits \({a^{i}_{j}}\), while the partial product bits \({c^{i}_{j}}\) are stored locally. Multiplicand bits b_i, 0 ≤ i < m, are broadcast to the PEs and the updated multiplier bits \({a^{i}_{j}}\) are propagated between the PEs, as shown in Fig. 4a. The feedback signal f is obtained from the output of PE_{m−1} at each clock cycle and is fed back to the two processors PE_0 and PE_k.

Figure 4

Design 1 when m = 7 and k = 4. a semi-systolic array. b PE_j details when j ≠ 0, k. c PE_j details when j = 0 or k. Boxes labeled D are 1-bit flip-flops with clear and load control inputs.

Figure 4b shows the details of PE_j when j ≠ 0, k. Figure 4c shows the details of PE_j when j = 0 or k; an extra XOR gate is needed to process the feedback signal f.

We summarize the operation of each PE_j (0 ≤ j < m) for Design #1:

  1. At time n = 0, the lower flip-flops in Figs. 4b and 4c, which accumulate the product bits C, are cleared.

  2. At time n = 0, also, the MUXes M accept the multiplier bits A through their upper inputs.

  3. At time n ≥ 0, the input multiplicand bit b_i is broadcast to all PEs.

  4. At time n > 0, the signals \({a^{i}_{j}}\) are pipelined by setting the MUX M to accept the lower input.

  5. The feedback signal f is obtained from the output of PE_{m−1} and is fed back to the two processors PE_0 and PE_k.

  6. At time n = m − 1, the output product C is produced, where bit c_j is obtained from PE_j.

5.2 Design #2: Using s_1 and d_12

Here we increase the workload of each processor from one bit to W bits while preserving one-bit interprocessor communication. A point \(\mathbf {p} = [i\ j]^{t} \in \mathbb {D}\) will be mapped, using the projection matrix \(\mathbf {P}_{12} = [\begin {array}{cc} 0& \left \lfloor (.\ + \mu _{12})/W\right \rfloor \end {array}]\), to PE x and assigned bit y in that PE. The indices x and y are given by:

$$ x =\left \lfloor \frac{j+\mu_{12}}{W} \right \rfloor +\delta_{12}, \qquad y = (j+\mu_{12}) \ \text{mod}~W $$
(39)

where

$$ \mu_{12} = W\left \lceil \frac{m}{W} \right \rceil -m\quad \text{ and} \quad \delta_{12} = 0 $$
(40)
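As a quick numerical check of Eqs. 39 and 40, the bit-to-PE assignment for m = 7 and W = 2 can be tabulated. This is an illustrative sketch of ours; the helper name is not from the paper:

```python
from math import ceil

def design2_map(j, m, W):
    """Eq. 39: map multiplier bit j to (PE index x, bit slot y) within the PE."""
    mu12 = W * ceil(m / W) - m       # Eq. 40 pad amount
    return (j + mu12) // W, (j + mu12) % W

m, W = 7, 2
mapping = [design2_map(j, m, W) for j in range(m)]
# the m bits fill ceil(m/W) = 4 PEs of W bits each; slot 0 of PE 0 is the pad slot
```

The pad slot left by μ_12 sits at the LSB end, consistent with the left-padding convention used for s_2 in Section 4.2.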

Figure 5 shows the hardware details for Design #2. Figure 5a shows the semi-systolic array design when m = 7, k = 4 and W = 2. Figure 5b shows the details of a PE that does not use the feedback signal. Figure 5c shows the details of a PE that uses the feedback signal.

Figure 5

Design 2 when m = 7, k = 4 and W = 2. a semi-systolic array. b PE_j details when the feedback signal is not used. c PE_j details when the feedback signal is used. Boxes labeled D are 1-bit flip-flops with clear and load control inputs and boxes labeled M are MUXes.

The operation steps of each PE of Design #2 are similar to those of Design #1, except that the feedback signal f is routed to bit y of PE x when x and y satisfy either of the following two conditions:

$$ x= 0 \quad \text{and}\quad y = \mu_{12} $$
(41)

or

$$ x =\left \lfloor \frac{k+\mu_{12}}{W} \right \rfloor \quad \text{and}\quad y = (k+\mu_{12}) \text{mod} W $$
(42)

5.3 Design #3: Using s_1 and d_13

A point \(\mathbf {p} = [i\ j]^{t} \in \mathbb {D}\) will be mapped by the projection matrix P_13 = [1 −1] onto the point

$$ \overline{\mathbf{p}} = \mathbf{P}_{13}\mathbf{p} -\delta_{13}= i-j $$
(43)

The resulting processor array corresponding to the projection matrix P_13 consists of 2m − 1 PEs, but only m PEs are active at a given time step. To improve PE utilization, we reduce the number of processors using the nonlinear mapping operator:

$$ \overline{\mathbf{p}} = {\mathbf{P}}_{13}{\mathbf{p}} -\delta_{13} \quad \text{mod}~m $$
(44)

The activity of the processors is illustrated in Fig. 6 where the numbers inside the circles indicate the PE index.

Figure 6

Processor activity for Design #3 when m = 7 and k = 4.
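The mod-m folding of Eq. 44 can be verified directly: under s_1, row i executes at time n = i, and the folded index (i − j) mod m assigns the m active nodes of that row to m distinct PEs. A small check of ours:

```python
m = 7
for i in range(m):                          # time step n = i under s_1
    pes = [(i - j) % m for j in range(m)]   # Eq. 44 with delta_13 = 0
    # all m PEs are busy at every time step, with no assignment conflicts
    assert sorted(pes) == list(range(m))
```

This is why the folded array achieves full PE utilization, in contrast to the 2m − 1 PE array before folding.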

The processor array is shown in Fig. 7a. Figure 7b shows the PE details.

Figure 7

Design #3 when m = 7 and k = 4. a semi-systolic array. b PE details. Boxes labeled D are 1-bit flip-flops with clear and load control inputs and boxes labeled M are MUX’s.

We summarize the operation of each PE_j (0 ≤ j < m) for Design #3:

  1. At time n = 0, the FF at the bottom of Fig. 7b is cleared.

  2. At time n = 0, MUX M_1 of PE_j is set to accept the upper input to load the multiplier bit \({a^{0}_{k}}\) such that:

    $$k = (m-j) \ \text{mod}~m $$

  3. At time n > 0, the input multiplicand bit b_i is broadcast to the PEs.

  4. At time n > 0, PE_j will broadcast the feedback signal f, with:

    $$ j = n+1, \quad 0 \leq n < m-1 $$
    (45)

  5. At time n > 0, PE_j will set MUX M_3 to read the feedback signal f when either of the following conditions is satisfied:

    $$ j = n $$
    (46)

    or

    $$ j = (k+n-1) \ \text{mod}~m $$
    (47)

  6. At time n = m − 1, PE_j will produce product bit c_k such that:

    $$ k = m-j-1 $$
    (48)

5.4 Design #4: Using s_2 and d_21

A point \(\mathbf {p} = [i j]^{t} \in \mathbb {D}\) will be mapped by the projection matrix \(\mathbf {P}_{21} = \left [\begin {array}{cc} 0& \left (.\ +\mu _{2} \right ) \text {mod}~T\end {array}\right ]\) onto the point:

$$ \overline{\mathbf{p}} = \mathbf{P}_{21}\mathbf{p} -\delta_{21}=\left (j\ +\mu_{2} \right ) \text{mod}~T $$
(49)

where

$$ \mu_{2} = T \left\lceil m/T \right \rceil-m, \quad \text{and} \quad\delta_{21} = 0 $$
(50)

Figure 8 shows the hardware details for Design #4. Figure 8a shows the semi-systolic array design when m = 7, k = 4 and T = 3. The number of PEs is T. Therefore the combination of nonlinear scheduling and projection functions allows us to control both the workload per time step and the number of PEs required.

Figure 8

Design #4 when m = 7, k = 4 and T = 3. a Processor array. b PE details when the feedback signal is not needed. c PE details when the feedback signal is needed. M is a MUX; FIFO is an \(m^{\prime }/T\)-bit FIFO.
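A small check of Eq. 49 (our own sketch) confirms that the augmented multiplier bits spread evenly over the T PEs, m′/T bits each, matching the FIFO depth of Fig. 8:

```python
from math import ceil
from collections import Counter

m, T = 7, 3
mu2 = T * ceil(m / T) - m                        # Eq. 50: 2 pad bits
m_aug = m + mu2                                  # m' = 9
# Eq. 49 evaluated on the augmented index jp = j + mu2
load = Counter(jp % T for jp in range(m_aug))
# T PEs, each storing m'/T bits of A' and C' in its two FIFOs
```

Each PE therefore needs FIFO buffers of depth m′/T = ⌈m/T⌉, independent of how the field size m relates to T.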

Figure 8b shows the details of a PE_j that does not require the feedback signal f. Figure 8c shows the details of a PE_j that requires the feedback signal f.

We note that the number of bits processed by each PE is ⌈m/T⌉ = m′/T, where m′ is given in Eq. 24. Each PE operates on one bit at each clock cycle. For the case of m = 7 and T = 3, each PE will need to store three bits for A′ and C′ as shown by the two sets of FIFO buffers in Fig. 8b or Fig. 8c.

We summarize the operation of each PE_j (0 ≤ j < T) for Design #4:

  1. For the first m′/T time steps (i.e. 0 ≤ n < m′/T), MUX M_1 is set to accept the upper input corresponding to the augmented multiplier polynomial A′. PE_j will accept bit \({a^{\prime }}^{0}_{k}\) at time n such that:

    $$\begin{array}{@{}rcl@{}} j &=& k \ \text{mod}~T \quad 0 \leq k < m^{\prime} \end{array} $$
    (51)
    $$\begin{array}{@{}rcl@{}} n &=& m^{\prime}/T - \left \lfloor k/T\right \rfloor -1 \end{array} $$
    (52)

    These bits will be loaded into FIFO_a.

  2. For the first m′/T time steps, also, MUX M_3 is set to accept the zero input to load the m′/T bits of FIFO_c with zero values.

  3. For times n ≥ m′/T, FIFO_a is set to accept the lower input corresponding to the pipelined input \(a^{i-1}_{j-1}\).

  4. For times n ≥ m′/T, the input multiplicand bit b_i is broadcast to all PEs at time n, where:

    $$ i = \left \lfloor \frac{n}{T} \right \rfloor $$
    (53)

  5. For times n ≥ m′/T, MUX M_3 is set to accept the FIFO_c output.

  6. PE_j uses the feedback signal f at time n when j and n satisfy either of the two conditions:

    $$ j= \mu_{2}\quad \text{and}\quad n \ \text{mod}~T = m^{\prime}/T -1 $$
    (54)

    or

    $$ j= (k+\mu_{2}) \ \text{mod}~T \quad \text{and}\quad n \ \text{mod} ~T = 0 $$
    (55)

  7. The augmented output product C′ is available at times n satisfying the inequalities:

    $$ mm^{\prime}/ T - T \leq n <mm^{\prime}/ T $$
    (56)

5.5 Design #5

In this design we use nonlinear scheduling and projection operations but the projection operation now uses a two-level nonlinear operation to give us more freedom in choosing the time to complete the algorithm, the number of processors and the word width of each processor.

This is accomplished by the nonlinear projection function d 22 in Table 2. A point \(\mathbf {p} = [i j]^{t} \in \mathbb {D}\) will be mapped by the projection matrix \( \mathbf {P}_{22} = [\begin {array}{cc} 0& \left \lfloor \frac {\left (.\ +\mu _{2} \right ) \text {mod}~T}{W} \right \rfloor \end {array}] \) onto the point:

$$ \overline{\mathbf{p}} = \mathbf{P}_{22}\mathbf{p} -\delta_{22}=\left \lfloor \frac{\left (j\ +\mu_{2} \right ) \text{mod}~T}{W} \right \rfloor $$
(57)

where μ_2 = T⌈m/T⌉ − m and δ_22 = 0.

The resulting processor array consists of T/W PEs and each PE processes W bits at a time. We assume here that T is an integer multiple of W. The details of each bit of a PE are similar to those shown in Fig. 8b or Fig. 8c. The operation of the processor array is similar to Design #4, and so is its area complexity, but the time complexity differs. This design is practical in a multiprocessor system where the number of processors is already fixed.
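The two-level operator of Eq. 57 can be sketched the same way. Here T = 4 and W = 2 are hypothetical values chosen only so that T is an integer multiple of W (with m = 7 as before); the helper name is ours:

```python
from math import ceil

def design5_pe(j, m, T, W):
    """Eq. 57: fold augmented bit index j + mu_2 onto one of T/W PEs of W bits."""
    mu2 = T * ceil(m / T) - m
    return ((j + mu2) % T) // W

m, T, W = 7, 4, 2
pes = [design5_pe(j, m, T, W) for j in range(m)]
# only T/W = 2 PEs are used, each handling W = 2 adjacent bit slots
```

The outer mod-T fold is the Design #4 mapping; the extra floor division by W is what groups W adjacent bit slots into one wider PE.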

5.6 Comparing the Proposed Designs

We provide a summary of the advantages and disadvantages of the five proposed designs of Section 5 in Table 3. Based on this table, we conclude that Design #5 is the optimum design in terms of its ability to control both the number of PEs and the number of bits processed by each PE. This design is adaptable for implementation in software multithreaded systems or hardware semi-systolic array systems.

Table 3 Summary of the advantages and disadvantages of the five proposed designs in Section 5.

6 Complexity Comparison

The area and delay complexities of the five proposed designs can be determined from Figs. 4, 5, 7 and 8. Table 4 compares the proposed designs to the closest competitors [9–11, 13–15, 25, 27, 28, 31, 32] in terms of area (gates, multiplexers, and flip-flops), latency, and critical path delay.

Table 4 Comparison between different finite field multipliers.

In Table 4 we have:

  1. T_A is the AND gate delay

  2. T_MUX is the MUX delay

  3. T_N is the NAND gate delay

  4. T_X is the XOR gate delay

  5. M_1 = W⌈m/W⌉m

  6. M_2 = ⌈m/W⌉(W² + 2W)

  7. M_3 = W⌈m/W⌉

  8. M_4 = m² + m − 1

  9. M_5 = W⌈m/W⌉(m + 1)

  10. M_6 = (W − 1)m + (W² + W)/2

  11. \(M_{7} = \sqrt {mW}(2+m) + W\)

  12. \(M_{8} = m^{2}+m-2\sqrt {m}\)

  13. F_1 = 2⌈m/W⌉W + (2W + 1)⌈m/W⌉

  14. F_2 = 7m + m⌈log m⌉ + 3

  15. F_3 = (5/2)m² + (1/2)m + 7

  16. F_4 = m + (3T + 1)⌈m/T⌉

  17. τ_1 = T_A + T_X

  18. τ_2 = T_A + 2T_X

  19. τ_5 = T_N + T_X

  20. τ_6 = T_A + (⌈log₂ W⌉ + 1)T_X

  21. τ_7 ≈ 3T_A + ⌈log₂ W⌉T_X

  22. τ_8 = T_MUX + T_A + 2T_X

  23. τ_9 = T_MUX + T_X

  24. The designs of Meher [25] have a number of inverters equal to the number of NAND gates. These inverters are not shown in the table.

Design #1 in Fig. 4 requires m PEs; each PE consists of one AND gate, one XOR gate, one MUX, and two flip-flops, except two PEs that have an extra XOR gate. The output is obtained after m clock cycles. The critical path delay is τ_8 = T_MUX + T_A + 2T_X.

Design #2 in Fig. 5 requires ⌈m/W⌉ PEs; each PE consists of W AND gates, W XOR gates, W MUXes, and 2W flip-flops, except two PEs that have an extra XOR gate. The output is obtained after m clock cycles. The critical path delay is τ_8 = T_MUX + T_A + 2T_X.

Design #3 in Fig. 7 requires m PEs; each PE consists of one AND gate, two XOR gates, two MUXes, and two flip-flops, except two PEs that have an extra XOR gate. The output is obtained after m clock cycles. The critical path delay is τ_9 = T_MUX + T_X.

Design #4 in Fig. 8 requires T PEs, where each PE consists of one AND gate, one XOR gate, two MUXes, and two FIFO buffers of ⌈m/T⌉ flip-flops each. Two PEs have an extra XOR gate and a MUX. The output is obtained after m⌈m/T⌉ clock cycles. The critical path delay is τ_8 = T_MUX + T_A + 2T_X.

Design #5 requires T/W PEs, where each PE consists of W AND gates, W XOR gates, 2W MUXes, and 2W FIFO buffers of ⌈m/T⌉ flip-flops each. Two PEs require an extra XOR gate and a MUX. The output is produced after a latency of ⌈m(m/T)/W⌉ clock cycles. The critical path delay is τ_8 = T_MUX + T_A + 2T_X. This design is practical in a multiprocessor system where the number of processors is already fixed.

In this table, the designs of Katti [13], Lee [14], Lee [15], Orlando [27], Gebali [28], Jain [33], Xie [35], and Designs #1, #3, and #4 are implemented using bit-level systolic and semi-systolic architectures. Meher [25] proposed systolic structures with different numbers of bits per processor. The design of Talapatra [34] is implemented using a digit-level systolic structure. Morales [9] has two different non-systolic structures; the first is bit-level and the second is digit-level. Moreover, the designs of Morales [10] and Sarmadi [11] are implemented using digit-level non-systolic structures. The proposed Designs #2 and #5 are implemented using digit-level semi-systolic structures. The designs of Sarmadi [11], Orlando [27], Gebali [28] and Designs #4 and #5 are also called scalable designs because they use a fixed-size core multiplier and do not need to change the core when m changes; they only reuse the core multiplier.

We described some of the most efficient and recent designs of Table 4 in VHDL at the register-transfer level and synthesized them to the gate level for a field size of m = 233, digit size W = 4, T = 16 and ω = 3 using a 0.18 μm, 1.8 V standard-cell CMOS technology. We used the Synopsys synthesis tools package version 2005.09-SP2 for logic synthesis and power analysis. All synthesis results were obtained under typical operating conditions (1.8 V, 25 °C). Simulations were performed with Mentor Graphics ModelSim SE 6.0a. Table 5 compares the ASIC implementations of the different finite field multipliers. In this table, the column entitled “Latency” gives the total number of clock cycles required to complete a single multiplication operation. The column entitled “Area” gives the area of the multipliers as the number of gate equivalents, and the column entitled “Clock frequency” gives the speed of the multiplier, while “Multiplication delay” and “Power” give, respectively, the total time and power consumption required by the multiplier to complete a single operation. The throughput rate was calculated using the synthesis results in order to measure the degree of optimization achieved in each multiplier.

Table 5 Performance comparison between ASIC implementations of different finite field multipliers for m = 233, T = 16 and W = 4 and ω = 3.

From this table, we notice that the proposed scalable designs (Designs #4 and #5) have lower area (56.8 %–94.6 %) and power consumption (55.2 %–84.2 %) than all compared designs except the scalable design of Gebali [28], which makes them very suitable for applications with tight restrictions on area and power consumption. On the other hand, scalable Design #5 has significantly higher throughput than all other scalable designs (73.8 %–80.1 %), including the design of Gebali [28], but significantly lower throughput than all non-scalable designs. Among the proposed designs, Design #3 achieves reasonable throughput with moderate area and power consumption.

7 Summary

We discussed a powerful technique for exploring parallelism in a given algorithm. Two basic operations are necessary to explore parallelism. The first is scheduling the tasks, which converts the dependence graph (DG) of the variables in 𝔻 into a directed acyclic graph (DAG). The second is task projection, which assigns tasks to processors. Nonlinear scheduling and projection operations were also discussed. As a working example, we used finite field multiplication over GF(2^m) based on irreducible trinomials. In this work, we derived the dependence graph for the iterative algorithm and from it found two valid linear affine timing (scheduling) functions. One scheduling function is associated with the three simplest possible projection vectors and the other with only two projection vectors. Therefore, we have five semi-systolic array designs for design exploration. ASIC implementations of the proposed designs and of some previously published competitive designs show that the proposed scalable designs have lower area (56.8 %–94.6 %) and power consumption (55.2 %–84.2 %) compared to all designs except the scalable design of Gebali [28], while one of the proposed scalable designs has significantly higher throughput (73.8 %–80.1 %) than all other scalable ones. This makes the proposed designs suited to embedded applications that require low power consumption and moderate speed.