1 Introduction and Related Work

Efficient arithmetic operations in finite fields are important in many applications, including coding theory, computer algebra systems, information theory, number theory, and elliptic curve cryptosystems (ECC) [1]. Multiplication over GF(2^m) is the basic field operation most frequently encountered in these applications. Multipliers with different bases of representation, e.g., polynomial basis, normal basis, and dual basis, have been realized for various applications. However, polynomial basis multipliers are more efficient and more widely used than multipliers based on the other two bases.

Numerous hardware architectures have been proposed for polynomial-basis finite-field multiplication over GF(2^m) [2–10, 12–25]. In terms of design style, the hardware architectures can be classified into two basic forms. The first form is the systolic or semi-systolic architecture and the second form is the nonsystolic architecture. Nonsystolic designs mostly aim to reduce the number of partial products to realize multipliers with the least hardware and shorter latency [2–11]. On the other hand, the systolic designs in [12, 18–28] possess advantages over the nonsystolic ones due to their regularity, modularity, simplicity of the processing elements (PEs), local interconnections, and high throughput rates [29]. We explore several semi-systolic architectures by converting GF(2^m) multiplication into an iterative algorithm using systematic linear and nonlinear techniques that combine affine and nonlinear task scheduling with assignment of tasks to processors. The nonlinear techniques discussed here allow the designer to control the processor workload, the processor word width, and the inter-processor communication.

The paper is organized as follows: Section 2 discusses finite field multiplication over GF(2^m) based on irreducible trinomials. Section 3 discusses converting field multiplication into an iterative algorithm using Progressive Multiplier Reduction (PMR). Section 4 presents a systematic technique to parallelize the PMR iterative multiplication algorithm using linear and nonlinear data scheduling and projection techniques. Section 5 discusses the design space exploration for the PMR iterative multiplication algorithm. Section 6 discusses the complexity of the proposed designs and compares them to previous work. Finally, Section 7 provides the conclusions of this work.

2 Problem Formulation

The National Institute of Standards and Technology (NIST) recommended five irreducible field polynomials for ECC over GF(2^m) [30]. Two of these polynomials are trinomials: Q(x) = x^233 + x^73 + 1 and Q(x) = x^409 + x^87 + 1. This motivated several semi-systolic implementations using these polynomials [6–8, 25, 28]. The field polynomial has the form:

$$ Q(x) = x^{m}+x^{k}+1 $$
(1)

Assuming α is a root of Q(x), the two field elements A and B to be multiplied are represented by the polynomials:

$$ A = \sum\limits_{h=0}^{m-1} a_{h} \ \alpha^{h} \quad \text{and} \quad B = \sum\limits_{g=0}^{m-1} b_{g} \ \alpha^{g} $$
(2)

where a_h, b_g ∈ GF(2) for 0 ≤ h, g < m. The reduced product C will be m bits long:

$$\begin{array}{@{}rcl@{}} C &=& A\times B = \left [\sum\limits_{h=0}^{m-1} \sum\limits_{g=0}^{m-1} a_{h} \ b_{g} \ \alpha^{h+g} \right ]\ \text{mod} Q(\alpha)\\ &=& \sum\limits_{g=0}^{m-1} \ c_{g} \ \alpha^{g} \end{array} $$
(3)

It is not practical to perform the modulo operation on the polynomial in Eq. 3 whose degree is 2m − 2. Since the modulo operation is distributive, we can write (3) as:

$$ C = \sum\limits_{g=0}^{m-1} b_{g} \left [\ \alpha^{g} A \quad \text{mod } Q(\alpha) \right ] \quad = \sum\limits_{g=0}^{m-1} C_{g} $$
(4)

We note from Eq. 4 that each partial product is a polynomial:

$$ C_{g} = b_{g} \alpha^{g} A \quad \text{mod } Q(\alpha) $$
(5)

It is not practical to perform the reduction in Eq. 4 or Eq. 5 in one step. An attractive approach is to iteratively perform the reduction operation on the different powers of the multiplier, α^g A mod Q(α), as will be explained in the following section.
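To make the decomposition of Eq. 4 concrete, the schoolbook product of Eq. 3 followed by a single reduction modulo Q(x) can be sketched in a few lines of Python. This is our own illustrative model (bit j of an integer holds the coefficient of α^j; the function names are not from any library):

```python
def gf2m_reduce(v, m, k):
    """Reduce a GF(2) polynomial (bits of an integer) modulo Q(x) = x^m + x^k + 1."""
    q = (1 << m) | (1 << k) | 1
    for d in range(v.bit_length() - 1, m - 1, -1):
        if (v >> d) & 1:
            v ^= q << (d - m)    # clear bit d, substituting x^m = x^k + 1
    return v

def gf2m_mult(a, b, m, k):
    """Schoolbook product of Eq. 3: carry-less multiply, then one reduction."""
    p = 0
    for g in range(m):
        if (b >> g) & 1:
            p ^= a << g          # partial product b_g * alpha^g * A, unreduced
    return gf2m_reduce(p, m, k)
```

The unreduced product has degree up to 2m − 2, which is exactly why the one-shot reduction above is replaced by the progressive reduction of the next section.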

3 Progressive Multiplier Reduction (PMR) Technique

We convert (4) into an iteration using increasing powers of α based on Algorithm 1.

Algorithm 1 (pseudocode listing)

A^i is given by:

$$ A^{i} = \sum\limits_{j=0}^{m-1} {a^{i}_{j}} \alpha^{j} $$
(6)

And α A^i is written as:

$$ \alpha A^{i} = \sum\limits_{j=1}^{m} a^{i}_{j-1} \alpha^{j} $$
(7)

Using Eq. 1, we can write:

$$ \alpha^{m} = \alpha^{k} +1 \quad \text{mod } Q(\alpha) $$
(8)

Substituting (8) in Eq. 7 effectively accomplishes the reduction step and we get:

$$ \alpha A^{i} \ \text{mod } Q(\alpha) = a^{i}_{m-1}\left (\alpha^{k}+1\right ) + \sum\limits_{j=1}^{m-1} a^{i}_{j-1} \alpha^{j} $$
(9)

The above equation ensures that the reduction step in Eq. 9 produces a polynomial A^{i+1} with a degree less than m. We modify Algorithm 1 to operate at the bit level as shown in Algorithm 2. In this algorithm, a_j represents the j-th bit of the operand A and c_j represents the j-th bit of the final product C. Also, the terms \({a^{i}_{j}}\) and \({c^{i}_{j}}\) represent the j-th bit of the operand A and partial product C at iteration i, respectively.
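Since the listing of Algorithm 2 appears only as a figure, the bit-level iteration it implements (Eqs. 4–9) can be sketched as follows. This is our own Python model of the iteration, not the authors' exact pseudocode:

```python
def pmr_multiply(a_bits, b_bits, m, k):
    """Bit-level PMR multiplication over GF(2^m) with Q(x) = x^m + x^k + 1.
    a_bits, b_bits: lists of m bits; index j holds the coefficient of alpha^j."""
    a = list(a_bits)                 # A^0 = A
    c = [0] * m                      # product accumulator, cleared
    for i in range(m):
        for j in range(m):           # accumulate partial product b_i * A^i
            c[j] ^= b_bits[i] & a[j]
        f = a[m - 1]                 # feedback bit from the MSB of A^i
        a = [0] + a[:-1]             # multiply A^i by alpha (shift up)
        a[0] ^= f                    # reduce using alpha^m = alpha^k + 1 (Eq. 8)
        a[k] ^= f
    return c
```

The feedback bit f and the two XORs at positions 0 and k are exactly the reduction of Eq. 9, and reappear later as the feedback signal routed to PE_0 and PE_k in the hardware designs.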

4 Parallelizing the PMR Technique

The operations in Steps 6–13 of Algorithm 2 define the iterative algorithm that implements (4). The second author developed systematic techniques to parallelize iterative algorithms that allow for exploring all possible systolic arrays and optimizing the performance according to certain specifications [29]. Early techniques represented the dependence among pairs of variables as a dependence graph (DG) and had several problems: (a) they were confined to simple two-dimensional (2D) algorithms such as matrix-vector multiplication, and it becomes very difficult to deal with high-dimensionality algorithms or with algorithms that contain many variables; (b) using the DG gives few options for developing possible scheduling algorithms.

Algorithm 2 (pseudocode listing)

4.1 Study of Algorithm Variables

Algorithm 2 has two indices i and j whose ranges define a set of points in a convex hull 𝔻 in the 2-D integer space, i.e. 𝔻 ⊂ ℤ² [29]. The algorithm has two input variables A and B; two intermediate variables A^i and C^i; and one output variable C. The input bits \({a^{0}_{j}}\) are shown at the top row of Fig. 1. Bit b_i is used only at row i, as indicated by the horizontal lines in Fig. 1.

Figure 1

Dependence graph of the PMR algorithm for m = 7 and k = 4.

The intermediate variable A^i is updated using iteration Steps 8 and 9, as indicated by the diagonal lines in Fig. 1. The bits of the intermediate variable C^i are updated using Step 10, as indicated by the vertical lines in Fig. 1. The arrows indicate the direction of data flow between the nodes at each time step. The final product bits for output variable C are obtained at the bottom of the graph. Notice that successive reduction steps are represented by the feedback lines obtained from the most significant (right-most) bit of A^i, as indicated by the dashed red lines.

4.2 Scheduling Function Design for PMR Technique

We use an affine scheduling function such that point \(\mathbf {p} =[i j]^{t} \in \mathbb {D}\) is assigned a time value n(p) given by:

$$\begin{array}{@{}rcl@{}} n(\mathbf{p}) &=& \mathbf{s\ p} -\gamma \end{array} $$
(10)
$$\begin{array}{@{}rcl@{}} & = & i \alpha + j W-\gamma \end{array} $$
(11)

where s = [α W] is the scheduling vector and γ is a scalar constant. The scheduling function assigns a time index value to each node in the graph, so the data moving between the nodes are now governed by a time relationship. The scheduling function converts the dependence graph 𝔻 into a directed acyclic graph (DAG).

Based on the data flow in Fig. 1, we have two restrictions on our choice of s. The iterative calculation of \({a^{i}_{j}}\) implies that the task at point [i + 1, j + 1] must be executed after the task at point [i, j]. This restriction can be written as

$$ [\begin{array}{cc}\alpha & W \end{array}] [\begin{array}{cc}i+1 & j+1 \end{array}]^{t} >[\begin{array}{cc}\alpha & W \end{array}][\begin{array}{cc}i & j \end{array}]^{t} $$
(12)

This results in a condition on the components of s:

$$ \alpha+W > 0 $$
(13)

Another restriction on timing is due to the feedback in Fig. 1: the task at point [i + 1, 0] can only proceed after the task at point [i, m − 1] has been evaluated:

$$ [\begin{array}{cc}\alpha & W \end{array}] [\begin{array}{cc}i+1 & 0 \end{array}]^{t} >[\begin{array}{cc}\alpha & W \end{array}] [\begin{array}{cc}i & m-1 \end{array}]^{t} $$
(14)

This results in another inequality:

$$ \alpha > (m-1) W $$
(15)

Based on Eqs. 13 and 15 we have two simple scheduling functions as discussed in the following paragraphs.

The first scheduling function s_1 is a linear affine mapping that assigns a time value to each point p ∈ 𝔻 of Fig. 1:

$$ n(\mathbf{p}) = \mathbf{s_{1}\ p} - \gamma_{1}= i $$
(16)
$$ \mathbf{s}_{1} = [\begin{array}{cc}1&0 \end{array} ] $$
(17)
$$ \gamma_{1} = 0 $$
(18)

Figure 2 shows the node timing for the PMR algorithm using the scheduling function s_1 for m = 7 and k = 4. Note that the time index n is identical to the i-axis index value, so iteration i and time index value n are related by:

$$ n = i $$
(19)

The grey boxes indicate equitemporal regions where all nodes in a region execute at the same time. The numbers on the right of the figure indicate the times. The PMR technique requires m time steps to complete when s_1 is used.

Figure 2

Node timing for the PMR algorithm using the scheduling function s_1 for m = 7 and k = 4.

Using s_1, the workload per time step is m, which depends on the size of the polynomial being processed. In that sense, we are unable to control the workload per time step using linear affine scheduling.

The second timing function s_2 controls the workload per time step and the number of time steps required to complete the multiplication operation. As an indirect benefit, it also controls the number of processing elements. The nonlinear scheduling function has the form:

$$\begin{array}{@{}rcl@{}} n(\mathbf{p}) &=& \mathbf{s_{2}\ p} -\gamma_{2} \\ &=& i \left \lceil \frac{m}{T} \right \rceil -\left \lfloor\frac{j+\mu_{2}}{T}\right \rfloor -\gamma_{2} \end{array} $$
(20)
$$\begin{array}{@{}rcl@{}} \mathrm{\textbf{s}}_{2} &=& \left [\begin{array}{cc} \left \lceil \frac{m}{T} \right \rceil &-\left \lfloor\frac{.+\mu_{2}}{T}\right \rfloor \end{array} \right ] \end{array} $$
(21)
$$\begin{array}{@{}rcl@{}} \mu_{2} &=& T \left \lceil \frac{m}{T} \right \rceil-m \end{array} $$
(22)
$$\begin{array}{@{}rcl@{}} \gamma_{2} &=& -\left \lfloor\frac{m-1+\mu_{2}}{T}\right \rfloor \end{array} $$
(23)

where T is the number of tasks to be executed in one time step, ⌈·/T⌉ and ⌊·/T⌋ denote the ceiling and floor functions, respectively, and the dot is a placeholder for the argument.

Figure 3 shows the node timing function s_2 when m = 7, k = 4 and T = 3. The number of tasks executed at each time step is not the same when m is not an integer multiple of T. The time index n in Eq. 20 now depends on the values of both the i and j indices.

Figure 3

Node timing for the PMR algorithm using the nonlinear scheduling function s_2 for m = 7, k = 4 and T = 3. The nodes in blue indicate padding the multiplier bits to make m′ an integer multiple of T.

Figure 3 shows how the dependence graph of Fig. 1 is converted to a DAG when s_2 is used. The grey areas indicate nodes having the same time index; the time index is indicated within each area. The number of tasks executed at each time step is made constant when m is increased to m′, an integer multiple of T:

$$ m^{\prime} = T \left \lceil \frac{m}{T} \right \rceil = m + \mu_{2} $$
(24)

For the case when m = 7 and T = 3, we get m′ = 9 as shown in the figure. The figure also shows that we chose to pad the LSB bits of A and C to obtain the augmented polynomials A′ and C′, respectively. For example, the case when m = 7 and T = 3 yields μ_2 = 2 and the multiplier has the form:

$$\begin{array}{@{}rcl@{}} A^{\prime} &=& \left [ \begin{array}{*6{c}} a^{\prime}_{0} & a^{\prime}_{1} & a^{\prime}_{2} & a^{\prime}_{3} & {\cdots} & a^{\prime}_{m^{\prime}-1} \end{array}\right ] \end{array} $$
(25)
$$\begin{array}{@{}rcl@{}} &=&\left [ \begin{array}{*6{c}} 0 & 0 & a_{0} & a_{1} & {\cdots} & a_{m-1} \end{array}\right ] \end{array} $$
(26)
$$\begin{array}{@{}rcl@{}} m^{\prime} &=& m+\mu_{2} \end{array} $$
(27)
$$\begin{array}{@{}rcl@{}} a_{j} &=& a^{\prime}_{j+\mu_{2}} \qquad 0 \leq j < m \end{array} $$
(28)

Padding on the left involves the LSB bits and leaves the MSB position unchanged, making it easy to identify the location of the MSB. This is useful since the MSB is responsible for generating the feedback signal f. However, the bits corresponding to locations 0 and k in A have now shifted in A′. Likewise, the product polynomial is padded with μ_2 bits on the left to get:

$$\begin{array}{@{}rcl@{}} C^{\prime} &=& \left [ \begin{array}{*6{c}} c^{\prime}_{0} & c^{\prime}_{1} & c^{\prime}_{2} & c^{\prime}_{3} & {\cdots} & c^{\prime}_{m^{\prime}-1} \end{array}\right ] \end{array} $$
(29)
$$\begin{array}{@{}rcl@{}} &=&\left [ \begin{array}{*6{c}} 0 & 0 & c_{0} & c_{1} & {\cdots} & c_{m-1} \end{array}\right ] \end{array} $$
(30)
$$\begin{array}{@{}rcl@{}} c_{j} &=& c^{\prime}_{j+\mu_{2}} \qquad 0 \leq j < m \end{array} $$
(31)

We chose this particular form of nonlinear scheduling function for the following reasons:

  1. The number of nodes processed at a given time is fixed and equals T. The value of μ_2 determines how many more dummy nodes are needed to increase the number of nodes from m to m′, which is an integer multiple of T.

  2. The feedback signal f is obtained from the MSB of A′, which corresponds to the rightmost nodes in Fig. 3.

  3. Based on Eq. 21, the feedback signal is updated at times n = i⌈m/T⌉, with i = 0, 1, ⋯.

  4. Based on Eq. 21, the feedback signal will be supplied to node 0 at times:

    $$ n= i \left \lceil \frac{m}{T} \right \rceil -\left \lceil\frac{\mu_{2}}{T}\right \rceil -\gamma_{2} \quad\text{ with}\quad i=0, 1, {\cdots} $$
    (32)

    and to node k at times:

    $$ n= i \left \lceil \frac{m}{T} \right \rceil -\left \lceil\frac{k+\mu_{2}}{T}\right \rceil -\gamma_{2} \quad\text{ with}\quad i=0, 1, \cdots $$
    (33)

Using this nonlinear scheduling function, we are now able to control the workload per time step; in Fig. 3 it equals T = 3. The multiplication will require m⌈m/T⌉ time steps to complete. This is longer than with the scheduling function s_1, but when coupled with the projection vectors defined in the next section, it results in very practical and scalable designs that are better suited for embedded applications with limited processor resources.
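The behavior of s_2 can be checked numerically. The sketch below (our own illustration) evaluates Eq. 20 on the augmented index j′ = j + μ_2 and tallies how many nodes fall in each time step:

```python
from math import ceil
from collections import Counter

def s2_time(i, jp, m, T):
    """Time index of Eq. 20, evaluated on the augmented index jp = j + mu_2."""
    mu2 = T * ceil(m / T) - m                  # Eq. 22
    gamma2 = -((m - 1 + mu2) // T)             # Eq. 23
    return i * ceil(m / T) - jp // T - gamma2

m, T = 7, 3
m_aug = T * ceil(m / T)                        # m' = 9 (Eq. 24)
times = Counter(s2_time(i, jp, m, T) for i in range(m) for jp in range(m_aug))
# every time step executes exactly T tasks, over m * ceil(m/T) steps
```

With the μ_2 dummy nodes included, each of the m⌈m/T⌉ time steps carries exactly T tasks, which is the constant-workload property claimed above.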

4.3 Projection Function Design for PMR Method

In Section 4.2 we discussed how to associate a time index with each point (task) in the dependence graph. In this section we discuss how to assign a processor to each node in the dependence graph. We can use the techniques proposed in [29] to derive a linear affine task projection. Assume two points in the DAG lie along the projection direction d such that

$$ \mathbf{p}_{2} = \mathbf{p}_{1} + e \mathbf{d} $$
(34)

where e is some constant. These two points will be mapped to the same processor if we make the projection direction d a null vector of the projection matrix P [29]. In other words, we can write:

$$ \mathbf{Pd} = \mathbf{0} $$
(35)

Typically we should ensure that s d ≠ 0. So the choice of a scheduling vector has implications for the choice of projection vectors.

A point p ∈ 𝔻 will be projected to point \(\overline {\mathbf {p}}\) in the processor array space using the affine projection operation

$$ \overline{\mathbf{p}} = \mathbf{Pp}- \delta $$
(36)

where P is a rank-deficient projection matrix and δ is a scalar constant that adjusts the processor indices to start at 0. Reference [29] places one restriction on the projection directions, namely:

$$ \mathrm{\textbf{s}} \mathrm{\textbf{d}} \neq 0 $$
(37)

which ensures that a processor is not required to perform several calculations in the same time step and that all processors are well utilized by working at each time step.

Most of the time, we will be seeking one-dimensional processor arrays to implement an algorithm. Since our algorithm is two-dimensional, P reduces to a row vector and the product in Eq. 36 yields a scalar value for the processor index to which the point maps. Generalization to processor arrays of higher dimensions is beyond the scope of this work.

The following subsections illustrate the design space exploration for the PMR technique through the different choices of scheduling functions and projection directions. Table 1 shows three projection directions associated with scheduling function s_1.

Table 1 Projection vectors associated with the scheduling function s 1.

Table 2 shows two projection directions associated with scheduling function s_2.

Table 2 Projection vectors associated with the scheduling function s 2.

We note that we used both simple linear affine projection directions and complex-looking nonlinear projection directions. The latter choice results in simple hardware for the processor array, where the feedback signal is easily extracted from the processor with the highest index, as will be explained in the sequel.

5 Design Space Exploration for PMR Technique

The following subsections discuss the different designs associated with each choice of s and d that are listed in Tables 1 and 2.

5.1 Design #1: Using s_1 and d_11

A point \(\mathbf {p} = [i\ j]^{t} \in \mathbb {D}\) is mapped by the projection matrix P_11 = [0 1] onto the point:

$$ \overline{\mathbf{p}} = \mathbf{P}_{11}\mathbf{p}-\delta_{11}= j $$
(38)

All nodes in a column map to a single PE but execute at different time steps. Figure 4 shows the hardware details for Design #1. Figure 4a shows the semi-systolic array design when m = 7 and k = 4. Communication between adjacent PEs requires only a one-bit line for transmitting bits \({a^{i}_{j}}\), while the partial product bits \({c^{i}_{j}}\) are stored locally. Multiplicand bits b_i, 0 ≤ i < m, are broadcast to the PEs and the updated multiplier bits \({a^{i}_{j}}\) are propagated between the PEs, as shown in Fig. 4a. The feedback signal f is obtained from the output of PE_{m−1} at each clock cycle and is fed back to the two processors PE_0 and PE_k.

Figure 4

Design 1 when m = 7 and k = 4. a semi-systolic array. b PE_j details when j ≠ 0, k. c PE_j details when j = 0 or k. Boxes labeled D are 1-bit flip-flops with clear and load control inputs.

Figure 4b shows the details of PE_j when j ≠ 0, k. Figure 4c shows the details of PE_j when j = 0 or k; an extra XOR gate is needed to process the feedback signal f.

We summarize the operation of each PE_j (0 ≤ j < m) for Design #1:

  1. At time n = 0, the lower flip-flops in Figs. 4b and 4c, which accumulate the product bits C, are cleared.

  2. At time n = 0, also, the MUXes M accept the multiplier bits A through their upper inputs.

  3. At time n ≥ 0, the input multiplicand bit b_i is broadcast to all PEs.

  4. At time n > 0, the signals \({a^{i}_{j}}\) are pipelined by setting the MUX M to accept the lower input.

  5. The feedback signal f is obtained from the output of PE_{m−1} and is fed back to the two processors PE_0 and PE_k.

  6. At time n = m − 1, the output product C is produced, where bit c_j is obtained from PE_j.

5.2 Design #2: Using s_1 and d_12

Here we increase the workload of each processor from one bit to W bits while preserving one-bit interprocessor communication. A point \(\mathbf {p} = [i\ j]^{t} \in \mathbb {D}\) will be mapped, using the projection matrix \(\mathbf {P}_{12} = [\begin {array}{cc} 0& \left \lfloor (.\ + \mu _{12})/W\right \rfloor \end {array}]\), to PE x and assigned bit y in that PE. The indices x and y are given by:

$$ x =\left \lfloor \frac{j+\mu_{12}}{W} \right \rfloor +\delta_{12}, \qquad y = (j+\mu_{12}) \ \text{mod}~W $$
(39)

where

$$ \mu_{12} = W\left \lceil \frac{m}{W} \right \rceil -m\quad \text{ and} \quad \delta_{12} = 0 $$
(40)
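As a quick numerical check of Eqs. 39 and 40, the bit-to-PE assignment for m = 7 and W = 2 can be tabulated. This is an illustrative sketch of ours; the helper name is not from the paper:

```python
from math import ceil

def design2_map(j, m, W):
    """Eq. 39: map multiplier bit j to (PE index x, bit slot y) within the PE."""
    mu12 = W * ceil(m / W) - m       # Eq. 40 pad amount
    return (j + mu12) // W, (j + mu12) % W

m, W = 7, 2
mapping = [design2_map(j, m, W) for j in range(m)]
# the m bits fill ceil(m/W) = 4 PEs of W bits each; slot 0 of PE 0 is the pad slot
```

The pad slot left by μ_12 sits at the LSB end, consistent with the left-padding convention used for s_2 in Section 4.2.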

Figure 5 shows the hardware details for Design #2. Figure 5a shows the semi-systolic array design when m = 7, k = 4 and W = 2. Figure 5b shows the details of a PE that does not use the feedback signal. Figure 5c shows the details of a PE that uses the feedback signal.

Figure 5

Design 2 when m = 7, k = 4 and W = 2. a semi-systolic array. b PE_j details when the feedback signal is not used. c PE_j details when the feedback signal is used. Boxes labeled D are 1-bit flip-flops with clear and load control inputs and boxes labeled M are MUXes.

The operation steps of each PE of Design #2 are similar to those of Design #1, except that the feedback signal f is routed to bit y of PE x when x and y satisfy either of the following two conditions:

$$ x= 0 \quad \text{and}\quad y = \mu_{12} $$
(41)

or

$$ x =\left \lfloor \frac{k+\mu_{12}}{W} \right \rfloor \quad \text{and}\quad y = (k+\mu_{12}) \text{mod} W $$
(42)

5.3 Design #3: Using s_1 and d_13

A point \(\mathbf {p} = [i\ j]^{t} \in \mathbb {D}\) will be mapped by the projection matrix P_13 = [1 −1] onto the point

$$ \overline{\mathbf{p}} = \mathbf{P}_{13}\mathbf{p} -\delta_{13}= i-j $$
(43)

The resulting processor array corresponding to the projection matrix P_13 consists of 2m − 1 PEs, but only m PEs are active at a given time step. To improve PE utilization, we reduce the number of processors using the nonlinear mapping operator:

$$ \overline{\mathbf{p}} = {\mathbf{P}}_{13}{\mathbf{p}} -\delta_{13} \quad \text{mod}~m $$
(44)

The activity of the processors is illustrated in Fig. 6 where the numbers inside the circles indicate the PE index.

Figure 6

Processor activity for Design #3 when m = 7 and k = 4.
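The mod-m folding of Eq. 44 can be verified directly: under s_1, row i executes at time n = i, and the folded index (i − j) mod m assigns the m active nodes of that row to m distinct PEs. A small check of ours:

```python
m = 7
for i in range(m):                          # time step n = i under s_1
    pes = [(i - j) % m for j in range(m)]   # Eq. 44 with delta_13 = 0
    # all m PEs are busy at every time step, with no assignment conflicts
    assert sorted(pes) == list(range(m))
```

This is why the folded array achieves full PE utilization, in contrast to the 2m − 1 PE array before folding.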

The processor array is shown in Fig. 7a. Figure 7b shows the PE details.

Figure 7

Design #3 when m = 7 and k = 4. a semi-systolic array. b PE details. Boxes labeled D are 1-bit flip-flops with clear and load control inputs and boxes labeled M are MUX’s.

We summarize the operation of each PE_j (0 ≤ j < m) for Design #3:

  1. At time n = 0, the FF at the bottom of Fig. 7b is cleared.

  2. At time n = 0, MUX M_1 of PE_j is set to accept the upper input to load the multiplier bit \({a^{0}_{k}}\) such that:

    $$k = (m-j) \ \text{mod}~m $$

  3. At time n > 0, the input multiplicand bit b_i is broadcast to the PEs.

  4. At time n > 0, PE_j will broadcast the feedback signal f, with:

    $$ j = n+1, \quad 0 \leq n < m-1 $$
    (45)

  5. At time n > 0, PE_j will set MUX M_3 to read the feedback signal f when either of the following conditions is satisfied:

    $$ j = n $$
    (46)

    or

    $$ j = (k+n-1) \ \text{mod}~m $$
    (47)

  6. At time n = m − 1, PE_j will produce product bit c_k such that:

    $$ k = m-j-1 $$
    (48)

5.4 Design #4: Using s_2 and d_21

A point \(\mathbf {p} = [i j]^{t} \in \mathbb {D}\) will be mapped by the projection matrix \(\mathbf {P}_{21} = \left [\begin {array}{cc} 0& \left (.\ +\mu _{2} \right ) \text {mod}~T\end {array}\right ]\) onto the point:

$$ \overline{\mathbf{p}} = \mathbf{P}_{21}\mathbf{p} -\delta_{21}=\left (j\ +\mu_{2} \right ) \text{mod}~T $$
(49)

where

$$ \mu_{2} = T \left\lceil m/T \right \rceil-m, \quad \text{and} \quad\delta_{21} = 0 $$
(50)

Figure 8 shows the hardware details for Design #4. Figure 8a shows the semi-systolic array design when m = 7, k = 4 and T = 3. The number of PEs is T. Therefore the combination of nonlinear scheduling and projection functions allows us to control both the workload per time step and the number of PEs required.

Figure 8

Design #4 when m = 7, k = 4 and T = 3. a Processor array. b PE details when the feedback signal is not needed. c PE details when the feedback signal is needed. M is a MUX; FIFO is an \(m^{\prime }/T\)-bit FIFO.
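A small check of Eq. 49 (our own sketch) confirms that the augmented multiplier bits spread evenly over the T PEs, m′/T bits each, matching the FIFO depth of Fig. 8:

```python
from math import ceil
from collections import Counter

m, T = 7, 3
mu2 = T * ceil(m / T) - m                        # Eq. 50: 2 pad bits
m_aug = m + mu2                                  # m' = 9
# Eq. 49 evaluated on the augmented index jp = j + mu2
load = Counter(jp % T for jp in range(m_aug))
# T PEs, each storing m'/T bits of A' and C' in its two FIFOs
```

Each PE therefore needs FIFO buffers of depth m′/T = ⌈m/T⌉, independent of how the field size m relates to T.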

Figure 8b shows the details of a PE_j that does not require the feedback signal f. Figure 8c shows the details of a PE_j that requires the feedback signal f.

We note that the number of bits processed by each PE is ⌈m/T⌉ = m′/T, where m′ is given in Eq. 24. Each PE operates on one bit at each clock cycle. For the case of m = 7 and T = 3, each PE will need to store three bits for A′ and C′ as shown by the two sets of FIFO buffers in Fig. 8b or Fig. 8c.

We summarize the operation of each PE_j (0 ≤ j < T) for Design #4:

  1. For the first m′/T time steps (i.e. 0 ≤ n < m′/T), MUX M_1 is set to accept the upper input corresponding to the augmented multiplier polynomial A′. PE_j will accept bit \({a^{\prime }}^{0}_{k}\) at time n such that:

    $$\begin{array}{@{}rcl@{}} j &=& k \ \text{mod}~T \quad 0 \leq k < m^{\prime} \end{array} $$
    (51)
    $$\begin{array}{@{}rcl@{}} n &=& m^{\prime}/T - \left \lfloor k/T\right \rfloor -1 \end{array} $$
    (52)

    These bits will be loaded into FIFO_a.

  2. For the first m′/T time steps, also, MUX M_3 is set to accept the zero input to load the m′/T bits of FIFO_c with zero values.

  3. For times n ≥ m′/T, FIFO_a is set to accept the lower input corresponding to the pipelined input \(a^{i-1}_{j-1}\).

  4. For times n ≥ m′/T, the input multiplicand bit b_i is broadcast to all PEs at time n, where:

    $$ i = \left \lfloor \frac{n}{T} \right \rfloor $$
    (53)

  5. For times n ≥ m′/T, MUX M_3 is set to accept the FIFO_c output.

  6. PE_j uses the feedback signal f at time n when j and n satisfy either of the two conditions:

    $$ j= \mu_{2}\quad \text{and}\quad n \ \text{mod}~T = m^{\prime}/T -1 $$
    (54)

    or

    $$ j= (k+\mu_{2}) \ \text{mod}~T \quad \text{and}\quad n \ \text{mod} ~T = 0 $$
    (55)

  7. The augmented output product C′ is available at times n satisfying the inequalities:

    $$ mm^{\prime}/ T - T \leq n <mm^{\prime}/ T $$
    (56)

5.5 Design #5

In this design we use nonlinear scheduling and projection operations but the projection operation now uses a two-level nonlinear operation to give us more freedom in choosing the time to complete the algorithm, the number of processors and the word width of each processor.

This is accomplished by the nonlinear projection function d 22 in Table 2. A point \(\mathbf {p} = [i j]^{t} \in \mathbb {D}\) will be mapped by the projection matrix \( \mathbf {P}_{22} = [\begin {array}{cc} 0& \left \lfloor \frac {\left (.\ +\mu _{2} \right ) \text {mod}~T}{W} \right \rfloor \end {array}] \) onto the point:

$$ \overline{\mathbf{p}} = \mathbf{P}_{22}\mathbf{p} -\delta_{22}=\left \lfloor \frac{\left (j\ +\mu_{2} \right ) \text{mod}~T}{W} \right \rfloor $$
(57)

where μ_2 = T⌈m/T⌉ − m and δ_22 = 0.

The resulting processor array consists of T/W PEs and each PE processes W bits at a time. We assume here that T is an integer multiple of W. The details of each bit of a PE are similar to those shown in Fig. 8b or Fig. 8c. The operation of the processor array is similar to Design #4, and so is its area complexity, but the time complexity differs. This design is practical in a multiprocessor system where the number of processors is already fixed.
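The two-level operator of Eq. 57 can be sketched the same way. Here T = 4 and W = 2 are hypothetical values chosen only so that T is an integer multiple of W (with m = 7 as before); the helper name is ours:

```python
from math import ceil

def design5_pe(j, m, T, W):
    """Eq. 57: fold augmented bit index j + mu_2 onto one of T/W PEs of W bits."""
    mu2 = T * ceil(m / T) - m
    return ((j + mu2) % T) // W

m, T, W = 7, 4, 2
pes = [design5_pe(j, m, T, W) for j in range(m)]
# only T/W = 2 PEs are used, each handling W = 2 adjacent bit slots
```

The outer mod-T fold is the Design #4 mapping; the extra floor division by W is what groups W adjacent bit slots into one wider PE.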

5.6 Comparing the Proposed Designs

We provide a summary of the advantages and disadvantages of the five proposed designs of Section 5 in Table 3. Based on this table, we conclude that Design #5 is the optimum design in terms of its ability to control both the number of PEs and the number of bits processed by each PE. This design is adaptable for implementation in software multithreaded systems or hardware semi-systolic array systems.

Table 3 Summary of the advantages and disadvantages of the five proposed designs in Section 5.

6 Complexity Comparison

The area and delay complexities of the five proposed designs can be determined from Figs. 4, 5, 7 and 8. Table 4 compares the proposed designs to the closest competitors [9–11, 13–15, 25, 27, 28, 31, 32] in terms of area (gates, multiplexers, and flip-flops), latency, and critical path delay.

Table 4 Comparison between different finite field multipliers.

In Table 4 we have:

  1. T_A is the AND gate delay

  2. T_MUX is the MUX delay

  3. T_N is the NAND gate delay

  4. T_X is the XOR gate delay

  5. M_1 = W⌈m/W⌉m

  6. M_2 = ⌈m/W⌉(W² + 2W)

  7. M_3 = W⌈m/W⌉

  8. M_4 = m² + m − 1

  9. M_5 = W⌈m/W⌉(m + 1)

  10. M_6 = (W − 1)m + (W² + W)/2

  11. \(M_{7} = \sqrt {mW}(2+m) + W\)

  12. \(M_{8} = m^{2}+m-2\sqrt {m}\)

  13. F_1 = 2⌈m/W⌉W + (2W + 1)⌈m/W⌉

  14. F_2 = 7m + m⌈log m⌉ + 3

  15. F_3 = (5/2)m² + (1/2)m + 7

  16. F_4 = m + (3T + 1)⌈m/T⌉

  17. τ_1 = T_A + T_X

  18. τ_2 = T_A + 2T_X

  19. τ_5 = T_N + T_X

  20. τ_6 = T_A + (⌈log₂ W⌉ + 1)T_X

  21. τ_7 ≈ 3T_A + ⌈log₂ W⌉T_X

  22. τ_8 = T_MUX + T_A + 2T_X

  23. τ_9 = T_MUX + T_X

  24. The designs of Meher [25] have a number of inverters equal to the number of NAND gates. These inverters are not shown in the table.

Design #1 in Fig. 4 requires m PEs; each PE consists of one AND gate, one XOR gate, one MUX, and two flip-flops, except two PEs that have an extra XOR gate. The output is obtained after m clock cycles. The critical path delay is τ_8 = T_MUX + T_A + 2T_X.

Design #2 in Fig. 5 requires ⌈m/W⌉ PEs; each PE consists of W AND gates, W XOR gates, W MUXes, and 2W flip-flops, except two PEs that have an extra XOR gate. The output is obtained after m clock cycles. The critical path delay is τ_8 = T_MUX + T_A + 2T_X.

Design #3 in Fig. 7 requires m PEs; each PE consists of one AND gate, two XOR gates, two MUXes, and two flip-flops, except two PEs that have an extra XOR gate. The output is obtained after m clock cycles. The critical path delay is τ_9 = T_MUX + T_X.

Design #4 in Fig. 8 requires T PEs, where each PE consists of one AND gate, one XOR gate, two MUXes, and two FIFO buffers of ⌈m/T⌉ flip-flops each. Two PEs have an extra XOR gate and a MUX. The output is obtained after m⌈m/T⌉ clock cycles. The critical path delay is τ_8 = T_MUX + T_A + 2T_X.

Design #5 requires T/W PEs, where each PE consists of W AND gates, W XOR gates, 2W MUXes, and 2W FIFO buffers of ⌈m/T⌉ flip-flops each. Two PEs require an extra XOR gate and a MUX. The output is produced after a latency of ⌈m(m/T)/W⌉ clock cycles. The critical path delay is τ_8 = T_MUX + T_A + 2T_X. This design is practical in a multiprocessor system where the number of processors is already fixed.

In this table, the designs of Katti [13], Lee [14], Lee [15], Orlando [27], Gebali [28], Jain [33], Xie [35], and Designs #1, #3, and #4 are implemented using bit-level systolic and semi-systolic architectures. Meher [25] proposed systolic structures with different numbers of bits per processor. The design of Talapatra [34] is implemented using a digit-level systolic structure. Morales [9] has two different non-systolic structures; the first is bit-level and the second is digit-level. Moreover, the designs of Morales [10] and Sarmadi [11] are implemented using digit-level non-systolic structures. The proposed Designs #2 and #5 are implemented using digit-level semi-systolic structures. The designs of Sarmadi [11], Orlando [27], Gebali [28] and Designs #4 and #5 are also called scalable designs because they use a fixed-size core multiplier and do not need to change the core when m changes; they only reuse the core multiplier.

We described some of the most efficient and recent designs of Table 4 in VHDL at the register-transfer level and synthesized them to the gate level for a field size of m = 233, digit size W = 4, T = 16 and ω = 3 using a 0.18 μm, 1.8 V standard-cell CMOS technology. We used the Synopsys synthesis tools package version 2005.09-SP2 for logic synthesis and power analysis. All synthesis results were obtained under typical operating conditions (1.8 V, 25 °C). Simulations were performed with Mentor Graphics ModelSim SE 6.0a. Table 5 compares the ASIC implementations of the different finite field multipliers. In this table, the column entitled “Latency” gives the total number of clock cycles required to complete a single multiplication operation. The column entitled “Area” gives the area of the multipliers as the number of gate equivalents, and the column entitled “Clock frequency” gives the speed of the multiplier, while “Multiplication delay” and “Power” give, respectively, the total time and power consumption required by the multiplier to complete a single operation. The throughput rate was calculated using the synthesis results in order to measure the degree of optimization achieved in each multiplier.

Table 5 Performance comparison between ASIC implementations of different finite field multipliers for m = 233, T = 16 and W = 4 and ω = 3.

From this table, we notice that the proposed scalable designs (Designs #4 and #5) have lower area (56.8 %–94.6 %) and power consumption (55.2 %–84.2 %) than all compared designs except the scalable design of Gebali [28], which makes them very suitable for applications with tight restrictions on area and power consumption. On the other hand, scalable Design #5 has significantly higher throughput than all other scalable designs (73.8 %–80.1 %), including the design of Gebali [28], but significantly lower throughput than all non-scalable designs. Among the proposed designs, Design #3 achieves reasonable throughput with moderate area and power consumption.

7 Summary

We discussed a powerful technique for exploring parallelism in a given algorithm. Two basic operations are necessary to explore parallelism. The first is scheduling the tasks, which converts the dependence graph (DG) of the variables in 𝔻 into a directed acyclic graph (DAG). The second is task projection, which assigns tasks to processors. Nonlinear scheduling and projection operations were also discussed. As a working example, we used finite field multiplication over GF(2^m) based on irreducible trinomials. In this work, we derived the dependence graph for the iterative algorithm and from it found two valid linear affine timing (scheduling) functions. One scheduling function is associated with the three simplest possible projection vectors and the other with only two projection vectors. Therefore, we have five semi-systolic array designs for design exploration. ASIC implementations of the proposed designs and of some previously published competitive designs show that the proposed scalable designs have lower area (56.8 %–94.6 %) and power consumption (55.2 %–84.2 %) compared to all designs except the scalable design of Gebali [28], while one of the proposed scalable designs has significantly higher throughput (73.8 %–80.1 %) than all other scalable ones. This makes the proposed designs suited to embedded applications that require low power consumption and moderate speed.