1 Introduction

SRAMs are widely used in many applications, e.g. as caches, due to their fast access speed, but they also contribute significantly to area cost and power consumption. Particularly in system-on-chip design, dedicated SRAMs with optimized architectures and circuits are often applied to achieve low power. Optimizing such dedicated SRAMs for the lowest possible power at a given performance is a challenging task because of the complexity of the design space. An attractive approach is to perform a quantitative optimization based on cost models. Such cost models not only support the optimization process but also allow for early cost estimation in the system conception phase. The focus of this study is on the power cost, considering the increasingly significant role of the power consumption of SRAMs.

From the low-power perspective, SRAMs with hierarchical architecture, as described e.g. in [14], are very attractive choices. A quantitative optimization approach for the hierarchical-architecture SRAMs deserves further research.

Figure 1 provides a block diagram of such a hierarchical-architecture SRAM of 2 K words with a 45-bit wordlength. Apart from the timing control circuits, the memory matrix and the address decoders dominate the total power consumption. These two components exhibit a large design space regarding the underlying architecture and circuits. In the hierarchical architecture the memory matrix is typically organized in 2^m (m = 3) columns. Each column includes a local timing generator, a bit-cell column and a local wordline decoder. A bit-cell column consists of 2^n (n = 4) local blocks and a local block is subdivided into 2^u (u = 4) words vertically. Thereby, the long bitlines and wordlines are both divided into global and local lines to reduce the switched capacitances. Moreover, the use of local sense amplifiers further reduces the power consumption by decreasing the signal swing on the long interconnects. Furthermore, the bit-cell column can employ various efficient circuits, such as assist circuits [1], stable bit cells with a bit-interleaved technique [2] and pre-charge schemes [3, 5]. Apparently a large design space exists for selecting hardware-efficient architectures and the underlying circuits. Especially for SRAMs with different capacities and features, time-to-market constraints make this design space difficult to explore. Therefore, there is a strong demand for a cost model by which architecture parameters, local circuits and power reduction techniques can be characterized and quantitatively analyzed.

Fig. 1. A block diagram of a conventional on-chip SRAM with a hierarchical architecture

Many available power cost models were investigated with respect to hierarchical-architecture SRAMs using various low-power circuits and techniques. As illustrated in Table 1, the widely used CACTI tool [6] targets large caches in the context of microprocessors. It focuses on microprocessor caches (including cache coherency techniques) and offers fewer possibilities for choices of circuits and techniques [13, 5]. For on-chip SRAMs CACTI is found to be very inaccurate due to its incompatibility with specific circuits and techniques, so that it cannot be used as an optimization or design guiding tool. The power model in [7] cannot be used for SRAMs with low-power architectures containing divided bitline and divided wordline structures. Moreover, it cannot deal with a variety of efficient specific circuits. In [8] only the traditional subdivided bitline structure is discussed without considering other possible low-power architectures or circuits. Moreover, LSAs are not considered in that model, although they contribute significantly to the leakage power. The approach in [9] requires a complex reference design of a whole SRAM whose characterization is time-consuming, which makes it impractical for designers at the early design stage. Moreover, neither LSAs in a hierarchical architecture nor energy-efficient circuits for local blocks are included. The power model in [10] discusses a binary tree SRAM based on the approach in [7], which makes it similarly inappropriate. Moreover, the energy consumed by the long interconnects of the binary tree organization is not properly included in the total energy consumption. Hence, these models cannot help designers in making quantitative decisions about architecture and circuits.

Table 1. Investigated estimation approaches

A predictive and mature design flow for on-chip SRAMs must simultaneously consider energy (E), area (A) and speed (T) properties. For this reason a cost estimation environment is under development, serving as an efficient design aid and a quantitative optimization tool. Figure 2 sketches the overall flowchart of this environment. The SRAM specifications are given by the capacity in number of words and the wordlength. The first task is to determine the optimal architecture and the most effective circuit techniques. A hierarchical architecture is taken as the subject to be analyzed and optimized. The architectures are specified by the partitioning parameters (m, n, u). Here, parameter m is the column address decoder width, which determines the number of columns (M = 2^m), and n is the row address decoder width, which determines the number of rows (N = 2^n) in a memory matrix. Parameter u is the unit address decoder width, which determines the number of cells (U = 2^u) in a column unit. On the whole, the three parameters define the partitioning and organization of a hierarchical-architecture SRAM with a total capacity of M·N·U = 2^a words, addressed by a address bits, with wordlength w. The decomposition and combination possibilities of the three parameters are explored for selecting the optimal architecture partitioning.
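
To make the exploration of the partitioning parameters concrete, the following sketch (Python, illustrative only; the function name and the example capacity are assumptions and not part of the environment itself) enumerates all (m, n, u) triples satisfying m + n + u = a:

```python
# Minimal sketch: enumerate all hierarchical partitionings (m, n, u)
# for an SRAM with 2**a words. Names and ranges are illustrative only.

def partitioning_candidates(a):
    """Yield all (m, n, u) with m + n + u == a and m, n, u >= 1."""
    for m in range(1, a - 1):
        for n in range(1, a - m):
            u = a - m - n
            if u >= 1:
                yield m, n, u

if __name__ == "__main__":
    a = 11  # 2**11 = 2 K words, as in the Fig. 1 example
    for m, n, u in partitioning_candidates(a):
        M, N, U = 2 ** m, 2 ** n, 2 ** u
        print(f"m={m} n={n} u={u} -> {M} block columns x {N} blocks x {U} words")
```

Each generated triple corresponds to one candidate organization that is subsequently evaluated by the cost model.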

Fig. 2. Overall flowchart of the cost estimation environment

Besides the partitioning choices, the design space also contains a wide choice of energy-efficient circuit techniques. These circuit techniques, such as the ones related to the bit cells, assist circuits and local sense amplifiers (LSA), add overhead and complexity to the overall design. The consumed energy is difficult to evaluate and estimate since it depends on the context in which these circuits are used. Their non-standard features can no longer be characterized using the available models, which only cover standard features. Hence a quantitative benchmarking tool is needed to assess these circuits and techniques. Accordingly, their respective power consumption should be estimated and compared for different configurations so that the optimal one can be selected. A pre-characterization methodology is employed for capturing their distinct features and quantifying the analysis. With these two sets of inputs, partitioning parameters and specific circuit techniques, a design-space exploration is carried out, while building a cost (A, T, E) model for on-chip SRAMs. In this way a Pareto-optimization is carried out by trading off the three costs (A, T, E). Finally the design decisions regarding architecture and circuits are made.

The idea of a cost-model based pre-characterization of elementary components (i.e. logic gates and bit cells) is further explained in Fig. 3. Given the specification parameters (a, w), all possibilities of the partitioning parameters (m, n, u) are generated and then evaluated by the presented cost model. An ATE cost database is built by circuit simulation of extracted component netlists, which depends on the use case (e.g. V_dd, V_bias), technology corners, temperature, frequency and gate sizes.

Fig. 3. Overview of an ATE cost model

  • The area can be estimated by a parameterized floor plan estimation and an accumulation of elementary width and height values (e.g. W_components, H_components).

  • In the energy model, the elementary energy values of basic circuits (e.g. E_gate) are accumulated according to the partitioning parameters and switching activity probabilities. The interconnect energy is estimated from the wire length and the energy per unit length (E_wire_unit). Finally the total energy consumption is derived by an accumulation of the elementary energies (e.g. E_bl) of all used circuit components; a minimal sketch of this accumulation is given after this list.

  • The speed can also be derived by using the cost database (e.g. t_slope, W_wire, C_load, C_input) and an accumulation of elementary delays along the critical path involving long resistance-capacitance interconnects.
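
As a minimal illustration of the accumulation principle in the energy item above, the following sketch sums pre-characterized elementary energies weighted by component counts and switching activities; all names and numeric values are placeholders rather than characterized data:

```python
# Illustrative sketch of the ATE energy accumulation: elementary energies
# from a pre-characterization database are weighted by component counts and
# switching probabilities. All values below are placeholders, not real data.

ELEMENTARY_ENERGY_FJ = {   # hypothetical database entries, in fJ
    "E_gate": 1.2,
    "E_bl": 4.0,
    "E_wire_unit": 0.08,    # per micrometre of wire
}

def total_energy_fj(counts, activities, wire_length_um):
    """Accumulate elementary energies: sum(count * activity * E_elem) + wire energy."""
    e = 0.0
    for name, count in counts.items():
        e += count * activities.get(name, 1.0) * ELEMENTARY_ENERGY_FJ[name]
    return e + wire_length_um * ELEMENTARY_ENERGY_FJ["E_wire_unit"]

print(total_energy_fj({"E_gate": 64, "E_bl": 8}, {"E_gate": 0.25}, 500.0))
```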

In this contribution, we focus on the energy cost model and the corresponding optimization approach. In the proposed model, only a few necessary basic circuit components (i.e. basic bit cells) need to be characterized. These circuit components can be verified by a number of Monte-Carlo simulations to ensure robustness. Moreover, the pre-characterization, including simulation time and model building, only requires a couple of hours, and afterwards the estimation results can be acquired in a few minutes. Therefore, the total effort of the cost model is much lower compared to a complete reference design.

2 Hierarchical Architecture

A conventional hierarchical low-power SRAM organization including address decoders and memory matrices is illustrated in Fig. 4. It is partitioned into 2^(m+n) blocks which are organized as an array of M = 2^m block columns and N = 2^n block rows, where m and n depend on the decomposition of the column and row address decoders. The outputs of these two pre-decoders, 2^m column select (CS) and 2^n row select (RS) signals, are further decoded by NOR or NAND gates to generate the 2^(m+n) Blockact signals for selecting a specific block. Each block is composed of U = 2^u words placed vertically and includes w column units. The column unit is considered as the basic unit of the memory matrix; it includes 2^u cells and one LSA. A unit decoder and the row decoder generate 2^(u+n) global wordlines (GWL) at the output to access the selected row in the block. Afterwards, the GWLs and Blockact signals are further decoded into 2^(m+n+u) local wordlines (LWL) to access the selected word. In this way, bitline and wordline capacitances are reduced to meet the low-power requirement. Also, the introduction of LSAs reduces the voltage swing on the global bitlines.

Fig. 4. Hierarchical-architecture SRAM organization comprising the DWL structure

As discussed before, the memory matrix is organized into M = 2^m block columns. In these block columns various complex but efficient circuits are utilized. In Fig. 5, a block column is shown to exemplify these specific circuits. It is composed of a local timing-control signal generator, a LWL decoder and a bit-cell column. In the local timing-control signal generator, the global timing pulse signals are combined with the Blockact signals to generate local timing pulse signals for the N = 2^n blocks. Similarly, in the LWL decoder, the GWL signals running through the whole memory matrix are combined with the Blockact signals to generate LWL signals for each word in the bit-cell column. The bit-cell column is instantiated with high-threshold-voltage bit cells with reverse back-biased long channels, an equalizer pre-charge scheme, read/write assist circuits and a wide local sense amplifier [5]. A block in a bit-cell column is composed of w column units. Each column unit includes U = 2^u 6T cells, assist circuits and a LSA.

Fig. 5. Specific circuits found in a column of a hierarchical-architecture SRAM macro

The choice of m and w defines the parasitic wordline capacitances and the wordline structure. Either a non-divided wordline (non-DWL) or a DWL structure can be selected according to the capacity and wordlength. The parameter n defines the bitline hierarchy and therewith affects the global and local bitline capacitances. In particular, charging and discharging the bitlines contributes significantly to the overall power consumption. The number of cells U = 2^u in one column unit determines to a large extent the minimum energy consumption for one operation. The parameters n and u must be carefully selected to trade off the least frequent use of LSAs against the minimum switched capacitances.

3 Partitioning Impact Analysis

SRAMs typically include two major contributors to power consumption: the address decoders and the memory matrices. In the hierarchical architecture (Fig. 4), the way of dividing and combining the address decoders determines how the memory matrix is partitioned into sub-blocks. A probabilistic estimation approach is employed for estimating the switching activity and power consumption of the address decoder, in particular regarding whether or not a divided wordline structure is used. The memory matrices, including complex assist and periphery circuits, which consume a large portion of the power, are also modeled and characterized. Four basic circuit templates and a power estimation method are proposed to extract and describe the architecture and circuit characteristics of the hierarchical architecture. The specific circuits used within the four circuit templates can be altered without changing the estimation approach itself. Various power reduction techniques, e.g. the precharge schemes in [3, 5] and the circuit techniques in [1, 2, 5], can be pre-characterized and benchmarked in the same configurations, which makes this model very appropriate for customized SRAM designs.

For an SRAM with a address bits, comprising a capacity of 2^a words with a wordlength w, a flow chart of the power estimation model is shown in Fig. 6. The two portions constituting the power model are the address decoder and the memory matrix. For the address decoder, the a address bits are divided into three sections (m, n, u), which are decoded by three pre-decoders: the column, row and unit decoder. In the 2nd stage, a block decoder combines the RS and CS signals to generate the Blockact signals. A word-row decoder uses the RS and US signals to generate the GWL signals. In the 3rd stage, a word decoder uses both Blockact and GWL signals to produce the LWL signals. These three decoders are all composed of NAND or NOR gates arranged in a matrix-like select circuit. The sum of the power consumed by the pre-decoders and the matrix-like select circuits provides an estimate of the total power of the whole address decoder. For the memory matrix, the parameters (m, n, u) are used for quantifying and analyzing the bitline and wordline structure. Since the parameters represent the number of sub-modules and determine the final SRAM architecture, their respective impact is analyzed and attributed to the components of the memory matrix. The parameter m determines the capacitances of the horizontal global wordlines and the number of pass transistors used as column selectors. The number of Blockact and GWL signals affects the capacitances of the vertical global bitlines (VGBL) and GWLs respectively. Finally, the choice of u has an impact on the power consumption of the local bitlines (LBL) of the accessed cells, the assist circuits and the LSAs. Therefore, a quantitative analysis of the dependency relations between the combinations of (m, n, u) and all the power contributors is made. Given the specifications a and w, all possible partitioning parameters (m, n, u) are evaluated by estimating the respective power consumption. Finally, the optimal parameter selection is saved for fulfilling different ATE design requirements, such as minimum power consumption.
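
The mapping from the partitioning parameters to the structural quantities named above can be summarized in a small helper; the sketch below (illustrative, with assumed names) reproduces the signal counts 2^(m+n), 2^(u+n) and 2^(m+n+u) introduced earlier:

```python
# Sketch of how the partitioning parameters translate into the structural
# quantities named in the text; the dataclass and field names are illustrative.

from dataclasses import dataclass

@dataclass
class Partitioning:
    m: int  # column decoder width -> M = 2**m block columns
    n: int  # row decoder width    -> N = 2**n block rows
    u: int  # unit decoder width   -> U = 2**u words per column unit

    def counts(self):
        M, N, U = 2 ** self.m, 2 ** self.n, 2 ** self.u
        return {
            "block_columns": M,
            "block_rows": N,
            "words_per_unit": U,
            "Blockact_signals": 2 ** (self.m + self.n),
            "GWL_signals": 2 ** (self.u + self.n),
            "LWL_signals": 2 ** (self.m + self.n + self.u),
        }

print(Partitioning(m=3, n=4, u=4).counts())
```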

Fig. 6. Partitioning parameter possibilities and their impact on SRAM components

4 Power Model of Address Decoder

4.1 Basic Circuits of Address Decoder

As shown in Fig. 6, the address decoder includes three pre-decoders and three distributed decoders. The three pre-decoders can be decoders with either a large or a small fan-in, depending on their number of inputs. The other three intermediate decoders are regarded as matrix-like select circuits, which are composed of logic gates distributed in a matrix. A probabilistic method is employed for modeling the underlying switching activities of these logic gates, by which the transition power consumption of the matrix-like select circuit is estimated. A large fan-in decoder is composed of a matrix-like select circuit and two small fan-in decoders. Therefore, if the energies associated with small fan-in decoders and basic gates are available, the energy of the three pre-decoders and the three distributed decoders can be derived by the probabilistic method. Also, a realistic topology estimation approach is used to estimate the wire capacitances and area for different (m, n, u).

A circuit pre-characterization database is built in the pre-characterization phase, which includes the related configuration regarding the use case (VDDH, VDDL), process corners, temperature and frequency. The database can be acquired in a short time since the complexity of the basic circuits is much lower than that of the overall SRAM. Moreover, such a pre-characterization approach is also convenient for estimating the static power dissipation. Small fan-in decoders are usually very flexible and customized regarding their layouts and transistors, so these basic circuits were simulated based on extracted netlists. Dynamic energy, static power, input capacitances and areas are listed in Table 2 for the TT corner, 25°C, 400 MHz and 0.9 V supply in a 40-nm CMOS technology. The dynamic energy figures were obtained from random-input power simulations. Other corners were evaluated as well, but only TT corner numbers are reported in this chapter. The static power values were determined by power simulations at different frequencies and an approximately linear extrapolation of the results to f = 0.

Table 2. Characterization database of basic decoders (TT, 25°C, 400 MHz, VDDH = 0.9 V)

4.2 Switching Activity

Besides the energy of the basic decoders (Table 2), the matrix-like select circuits formed by NAND or NOR gates also contribute significantly to the total power consumption. Such circuits typically act as distributed decoders and are located among the memory matrices. As illustrated in Fig. 7, a distributed decoder composed of NOR gates is arranged in R rows and C columns. When another address is accessed, not only do the corresponding gates switch, but the other gates in the affected row and column are also charged and then discharged. As different transitions of each gate lead to different amounts of consumed energy, the corresponding energy of each NOR gate for each of its transition cases must be estimated separately. Additionally, for the overall matrix-like circuit composed of NOR gates, four switching cases (Fig. 7) exist. For each switching case the energy and switching probability are derived.

Fig. 7. Four switching cases and their switching probabilities in a distributed decoder

In Table 3, a subset of the database for the consumed energy of the basic NOR gates is shown for each possible input transition. E.g., the transition 00 → 01 of a two-input NOR gate is denoted by the decimal equivalents 0 → 1. Hence, a transition 10 → 01 is denoted by 21 and its switching energy by \( E_{NOR}^{21} \), and so on. In particular, the static power of the four “no transition” situations (00, 11, 22, 33) is also included in Table 3, where the total energy is obtained for a frequency of 400 MHz. A NAND matrix-like select circuit can be pre-characterized and estimated in the same way.

Table 3. Energy of NOR gate for all possible input transition possibilities (TT, 25°C, 400 MHz, VDDH = 0.9 V, VDDL = 0.3 V)

For a distributed decoder with (R × C) NOR gates, four possible switching cases exist. Case1 means no switching of the selected column or row. Case2 means a switch of the selected column within the same row. Case3 means a switch within the same column. Case4 means a switch from one gate to another gate located in a different row and a different column. In order to elaborate the switching details and their energy distribution, Case2 is exemplified in four steps as shown in Fig. 7. Since the switching happens within the same row, one cross point in the matrix is selected and another one in the same row is unselected. (a) Hence, the selected NOR gate switches from 1 to 0 and another, unselected, one switches from 0 to 1. (b) Horizontally in the selected row, (C-2) NOR gates do not switch and their inputs stay at 1. (c) Vertically, (R-1) NOR gates switch from 2 to 3, which means they are discharged in the relevant column. Also, (R-1) NOR gates switch from 3 to 2, which means they are charged in the other column. (d) The remaining NOR gates do not switch and stay at 3. Using the transition energies listed in Table 3, the respective energies of the four switching cases are derived as

$$ E_{DecMatrix}^{Case1} = E_{NOR}^{00} + \left( {R - 1} \right) \cdot E_{NOR}^{22} + \left( {C - 1} \right) \cdot E_{NOR}^{11} + \left( {R - 1} \right) \cdot \left( {C - 1} \right) \cdot E_{NOR}^{33} $$
(1)
$$ E_{DecMatrix}^{Case2} = E_{NOR}^{01} + E_{NOR}^{10} + \left( {R - 1} \right) \cdot \left( {E_{NOR}^{23} + E_{NOR}^{32} } \right) + \left( {C - 2} \right) \cdot E_{NOR}^{11} + \left( {R - 1} \right) \cdot \left( {C - 2} \right) \cdot E_{NOR}^{33} $$
(2)
$$ E_{DecMatrix}^{Case3} = E_{NOR}^{02} + E_{NOR}^{20} + \left( {C - 1} \right) \cdot \left( {E_{NOR}^{13} + E_{NOR}^{31} } \right) + \left( {R - 2} \right) \cdot E_{NOR}^{22} + \left( {R - 2} \right) \cdot \left( {C - 1} \right) \cdot E_{NOR}^{33} $$
(3)
$$ \begin{aligned} E_{DecMatrix}^{Case4} =& E_{NOR}^{03} + E_{NOR}^{12} + E_{NOR}^{21} + \left( {R - 2} \right) \cdot \left( {E_{NOR}^{23} + E_{NOR}^{32} } \right) + E_{NOR}^{30} + \left( {C - 2} \right) \hfill \\ &\cdot \left( {E_{NOR}^{13} + E_{NOR}^{31} } \right) + \left( {R - 2} \right) \cdot \left( {C - 2} \right) \cdot E_{NOR}^{33}. \hfill \\ \end{aligned} $$
(4)

In particular, the four cases occur with different probabilities, which are associated with the number of rows and columns. These probabilities may also depend on the way the memory is used in an application, but in the context of this chapter the focus is set on random accesses. Assuming a random address access pattern for the SRAM, the probabilities are derived as follows and the dynamic energy of the matrix-like circuit is estimated as

$$ \begin{aligned} E_{matrix} \left( {R,C} \right) =& E_{DecMatrix}^{Case1} \cdot \frac{1}{R \cdot C} + E_{DecMatrix}^{Case2} \cdot \frac{C - 1}{R \cdot C} + E_{DecMatrix}^{Case3} \cdot \frac{R - 1}{R \cdot C} \\ & + E_{DecMatrix}^{Case4} \cdot \left( {1 - \frac{1}{R \cdot C} - \frac{R - 1}{R \cdot C} - \frac{C - 1}{R \cdot C}} \right). \end{aligned} $$
(5)

The equation was verified for several different combinations of rows and columns and shows a 5 % estimation error compared to extracted-netlist simulation results.
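
Read together, Eqs. (1)-(5) amount to a probability-weighted sum over the four switching cases. The following sketch implements them directly; the transition-energy lookup contains placeholder numbers, not the characterized values of Table 3:

```python
# Sketch of Eqs. (1)-(5): dynamic energy of an (R x C) NOR matrix decoder
# under random accesses. E[(i, j)] is the energy of input transition i -> j
# (decimal-coded as in Table 3); the numbers here are placeholders.

E = {(0, 0): 0.1, (1, 1): 0.1, (2, 2): 0.1, (3, 3): 0.05,
     (0, 1): 1.0, (1, 0): 1.0, (0, 2): 1.0, (2, 0): 1.0,
     (2, 3): 0.6, (3, 2): 0.6, (1, 3): 0.6, (3, 1): 0.6,
     (0, 3): 1.2, (3, 0): 1.2, (1, 2): 1.2, (2, 1): 1.2}

def e_matrix(R, C):
    case1 = E[0, 0] + (R - 1) * E[2, 2] + (C - 1) * E[1, 1] + (R - 1) * (C - 1) * E[3, 3]
    case2 = E[0, 1] + E[1, 0] + (R - 1) * (E[2, 3] + E[3, 2]) \
            + (C - 2) * E[1, 1] + (R - 1) * (C - 2) * E[3, 3]
    case3 = E[0, 2] + E[2, 0] + (C - 1) * (E[1, 3] + E[3, 1]) \
            + (R - 2) * E[2, 2] + (R - 2) * (C - 1) * E[3, 3]
    case4 = E[0, 3] + E[3, 0] + E[1, 2] + E[2, 1] + (R - 2) * (E[2, 3] + E[3, 2]) \
            + (C - 2) * (E[1, 3] + E[3, 1]) + (R - 2) * (C - 2) * E[3, 3]
    p1 = 1.0 / (R * C)        # same address accessed again
    p2 = (C - 1) / (R * C)    # switch within the same row
    p3 = (R - 1) / (R * C)    # switch within the same column
    p4 = 1.0 - p1 - p2 - p3   # switch to a different row and column
    return case1 * p1 + case2 * p2 + case3 * p3 + case4 * p4

print(e_matrix(R=4, C=16))
```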

4.3 Energy Cost Related to Interconnects

As technology keeps shrinking, the role of interconnects in the total power budget becomes increasingly significant. In particular, interconnects incur large capacitive loads in the dense SRAM layout. As described in Fig. 1, the 1st stage pre-decoders and the 2nd stage decoders are typically placed around the memory matrix. The 3rd stage LWL decoders are distributed into the block columns. Hence, the aspect ratios of the LWL decoder, the local timing circuits and the bit-cell column must be considered together. For estimating the associated interconnect lengths, a floor plan containing the dominating memory matrix and the address decoder must be determined in advance.

Since different layout floor plans result in different wire and coupling capacitances, two typical placements are considered as possible layout organizations. In the reference floor plans, the matrix-like circuit is always much larger than the other two sub-blocks. Therefore, the more compact topology exhibiting the smaller area is selected. As shown in Fig. 8, a horizontal placement leads to different interconnect lengths compared to a vertical placement. For a large fan-in decoder, the two pre-decoders and the matrix-like select circuit are placed in both ways for evaluation purposes. For the given floor plans the total area, the interconnect lengths and the wire capacitances are estimated and compared. By assessing which placement is more compact for the overall floor plan, one of the two arrangements is selected.

Fig. 8. Two empirical placement orientations for wire capacitance estimation

The heights and widths of the two pre-decoders are denoted as (H_pre1, W_pre1) and (H_pre2, W_pre2), which are obtained from the pre-characterization given in Table 2. The height and width of the basic gates (H_Gate, W_Gate), such as NOR or NAND gates, are also available. Given the two placement possibilities, the height and width of the required wiring can be derived for the horizontal (H_h, W_h) and vertical (H_v, W_v) placements respectively.

$$ H_{h} = H_{Gate} \cdot R,\quad W_{h} = W_{Gate} \cdot C + W_{Pre1} + W_{Pre2} $$
(6)
$$ H_{v} = H_{Gate} \cdot R + H_{Pre1} + H_{Pre2} ,\quad W_{v} = W_{Gate} \cdot C $$
(7)

As the wire lengths in the two sub-blocks are much shorter than in the matrix-like circuit, the following criterion is used to select the more compact topology. If Eq. (8) holds, the placement should be horizontal, otherwise a vertical placement is applied. Subsequently, the wire lengths can be estimated by counting the gates and taking their individual sizes in the selected placement into account. The two floor plans can be used either for a large fan-in decoder or for a memory matrix and its surrounding circuits. For a global SRAM floor plan (Fig. 1), the two pre-decoders are replaced by a LWL decoder and a local control timing generator, and the matrix-like select circuit is replaced by a memory matrix column. Thereby, the floor plan of a block column is determined. The decision procedure to estimate the global wire capacitances is similar.

$$ \left| {H_{Pre1} + H_{Pre2} - H_{Gate} \cdot R} \right|\,<\,\left| {W_{Pre1} + W_{Pre2} - W_{Gate} \cdot C} \right| $$
(8)

Considering the switching activities of the relevant wires, the energies for switching the interconnects in the two possible topology scenarios are estimated as

$$ \begin{aligned} E_{h} \left( {R,C} \right) =& V_{dd}^{2} \cdot 0.08 \cdot \left[ {W_{h} \cdot \left( {C - 1} \right) + H_{h} \cdot \left( {R - 1} \right)} \right]/\left( {R \cdot C} \right) \\ & + V_{dd}^{2} \cdot 0.08 \cdot \left( {W_{h} + H_{h} } \right) \cdot \left( {R \cdot C - R - C - 2} \right)/\left( {R \cdot C} \right) \end{aligned} $$
(9)
$$ \begin{aligned} E_{v} \left( {R,C} \right) =& V_{dd}^{2} \cdot 0.08 \cdot \left[ {W_{v} \cdot \left( {C - 1} \right) + H_{v} \cdot \left( {R - 1} \right)} \right]/\left( {R \cdot C} \right) \\ & + V_{dd}^{2} \cdot 0.08 \cdot \left( {W_{v} + H_{v} } \right) \cdot \left( {R \cdot C - R - C - 2} \right)/\left( {R \cdot C} \right) \end{aligned} $$
(10)

The wire capacitance per unit length of a metal wire is assumed to be 0.08 fF/µm, an appropriate value for a 40-nm technology. This value is re-evaluated and modified when coupling capacitances exist in very dense layouts. Moreover, under the assumption that only the column decoder switches and the row decoder does not, the switched capacitance is determined only by the width (W_h) with a switching probability of (C-1)/(R·C). In case only the row decoder switches and the column decoder does not, the switched capacitance, considering its switching probability, is equal to 0.08·H_h·(R-1)/(R·C). If both row and column decoders switch, the switched capacitances are computed considering both width and height. To summarize, for the two typical floor plans the wire lengths and capacitances are estimated, which leads to a decision regarding which floor plan has to be assumed.
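
The placement decision of Eq. (8) and the interconnect energies of Eqs. (9) and (10) can be combined into one small routine, as sketched below; the gate and pre-decoder dimensions are placeholder values, only the 0.08 fF/µm wire capacitance is taken from the text:

```python
# Sketch of the floor-plan decision and interconnect energy of Sect. 4.3
# (Eqs. (6)-(10)). Dimensions are placeholder values in micrometres.

C_WIRE = 0.08   # assumed wire capacitance per unit length, fF/um
VDD = 0.9       # supply voltage, V

def wire_energy(R, C, h_gate, w_gate, h_pre1, w_pre1, h_pre2, w_pre2):
    # Eq. (6): horizontal placement dimensions
    Hh, Wh = h_gate * R, w_gate * C + w_pre1 + w_pre2
    # Eq. (7): vertical placement dimensions
    Hv, Wv = h_gate * R + h_pre1 + h_pre2, w_gate * C
    # Eq. (8): pick the more compact arrangement
    horizontal = abs(h_pre1 + h_pre2 - h_gate * R) < abs(w_pre1 + w_pre2 - w_gate * C)
    H, W = (Hh, Wh) if horizontal else (Hv, Wv)
    # Eqs. (9)/(10): switched interconnect energy under random accesses
    e = VDD ** 2 * C_WIRE * (W * (C - 1) + H * (R - 1)) / (R * C)
    e += VDD ** 2 * C_WIRE * (W + H) * (R * C - R - C - 2) / (R * C)
    return ("horizontal" if horizontal else "vertical"), e

print(wire_energy(R=16, C=4, h_gate=1.0, w_gate=0.8,
                  h_pre1=6.0, w_pre1=10.0, h_pre2=6.0, w_pre2=10.0))
```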

4.4 Verification of Address Decoder Estimation Model

For low-power SRAMs with large capacities and long wordlengths, a DWL structure is inevitably superior to a non-DWL structure, because in the non-DWL structure long wordlines suffer from the half-select problem and numerous bitlines are needlessly precharged. As shown in Fig. 9, in a DWL structure the block row decoder which generates the 2^n RS signals is used twice instead of only once as in the non-DWL structure. Also, an extra distributed GWL decoder is used in a DWL structure, which brings additional area cost. But the benefit of a DWL structure is that switching occurs within a smaller memory matrix and thereby the total energy is significantly reduced. Therefore this is a tradeoff between the power and area of the address decoder and the memory matrix.

Fig. 9. Divided wordline (DWL) structure for the address decoder

For the energy estimation of the DWL address decoder, large fan-in decoders with (n + u) inputs can be handled by a nested calculation using the smaller fan-in decoder data from Table 2 and the relevant matrix-like select circuits. The energies of the distributed decoders in the 2nd and 3rd stage are estimated by the approach described above. The dynamic energy and static power figures are derived as

$$ \begin{aligned} E_{dyc\_tot} =& E_{dyc} \left( n \right) + E_{dyc} \left( u \right) + E_{matrix} \left( {N,U} \right) + E_{wire} \left( {N,U} \right) + E_{dyc} \left( m \right) + E_{matrix} \left( {M,N} \right) \\ & + E_{wire} \left( {M,N} \right) + E_{matrix} \left( {N \cdot U,M} \right) + E_{wire} \left( {N \cdot U,M} \right) \end{aligned} $$
(11)
$$ \begin{aligned} P_{sta\_tot} =& P_{sta} \left( n \right) + P_{sta} \left( u \right) + P_{sta} \left( m \right) + P_{sta\_matrix} \left( {N,U} \right) \\ & + P_{sta\_matrix} \left( {M,N} \right) + P_{sta\_matrix} \left( {N \cdot U,M} \right) \end{aligned} $$
(12)

The dynamic energies of the three pre-decoders are represented by E_dyc(n), E_dyc(u) and E_dyc(m). The parameters m, n and u denote the input widths of the three decoders respectively. These energies can be acquired from Table 2 (optionally in combination with small fan-in decoders and matrix-like circuits). For the second stage, the energy figures for the word-row and block decoders are given by E_matrix(N,U) + E_wire(N,U) and E_matrix(M,N) + E_wire(M,N), where N = 2^n and U = 2^u represent the numbers of rows and columns of the matrix-like circuit. Note that in the 3rd stage a matrix (N·U, M) is applied instead of a matrix (N·U, M·N), since every GWL signal only needs the 2^m Blockact signals to select the word in that column, cf. Fig. 9. An address decoder in a non-DWL structure does not use GWLs; therefore, the energies E_matrix(M, N) and E_wire(M, N) are not counted into the total energy.
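
A possible composition of the total dynamic decoder energy according to Eq. (11) is sketched below; the pre-decoder energies are placeholders standing in for Table 2 entries, and e_matrix/e_wire stand for the models of Sects. 4.2 and 4.3:

```python
# Sketch of Eq. (11): total dynamic energy of a DWL address decoder, composed
# from pre-decoder energies plus matrix-like decoder and wire energies.

E_PREDEC = {3: 10.0, 4: 14.0, 5: 19.0}   # hypothetical E_dyc(width) in fJ

def decoder_energy_dwl(m, n, u, e_matrix, e_wire):
    M, N, U = 2 ** m, 2 ** n, 2 ** u
    e = E_PREDEC[n] + E_PREDEC[u] + E_PREDEC[m]      # three pre-decoders
    e += e_matrix(N, U) + e_wire(N, U)               # word-row decoder
    e += e_matrix(M, N) + e_wire(M, N)               # block decoder
    e += e_matrix(N * U, M) + e_wire(N * U, M)       # distributed LWL decoder
    return e

# Usage with trivial stand-ins for the matrix and wire models:
print(decoder_energy_dwl(3, 4, 4,
                         e_matrix=lambda r, c: 0.05 * r * c,
                         e_wire=lambda r, c: 0.01 * (r + c)))
```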

Figure 10 shows the simulated versus the estimated energy of a 1 K non-DWL address decoder and a 4 K DWL address decoder. The energy breakdowns are compared accordingly. In particular, the energy associated with the wiring capacitance is quite low compared to the other components. This is explained by the short global interconnect length and insignificant coupling. For larger address decoders and denser layouts, the energy contribution of the interconnects can no longer be neglected. It can be seen that the 4 × 1024 NOR matrix circuit dominates the overall power of the 4 K DWL decoder. The comparison indicates that the estimation errors of the address decoder power model are less than 10 %.

Fig. 10. Simulation vs. estimation energy for a 1 K non-DWL and a 12-to-4 K DWL decoder

5 Power Model of the Memory Matrix

The contribution of the memory matrix to the total memory access energy is dominated by the cycle-based pre-charge and discharge of the long bitlines. For low-power memory matrix designs, assist circuits, bit cells and pre-charge schemes span a large design space, which complicates the power modeling. Their complex features significantly influence the layout placement and the switched capacitances. Accordingly, the total energy cannot be computed by directly accumulating their respective individual energies. Additionally, the use of LSAs as in [1] results in a low voltage swing on the global bitlines and a high voltage swing on the local bitlines. The complexity of multiple supply voltages at a larger scale makes it more difficult to estimate the power consumption. As before, the variable partitioning parameters (m, n, u and w) result in different numbers of access gates and different parasitic capacitances due to different wire lengths. Another challenge is that read, write and standby operations must be considered separately, including the hierarchical bitline structure and the memory cell toggling state. In order to address these issues, four circuit templates are proposed to act as black boxes for the pre-characterization. In this way a database depending on the use case (VDDH and VDDL), technology corners, temperature and the characteristics of gates (width) and wires is generated. Finally, the elementary energies of the assist circuits, bit cells and vertical global bitlines are separated by our estimation approach. Combined with the partitioning parameters, the power consumed by the overall memory matrix is estimated accurately. The leakage power is estimated in a similar way.

5.1 Four Circuit Templates

Four circuit templates based on the circuits given in [5] are presented as basic circuit elements for characterizing the complex assist circuits and specific bit cells. As shown in Fig. 11, a single cell circuit template is presented first to separate the elementary energies of the multi-cell circuits. Its dynamic energy consists of contributions from the local bitline of each cell (E_lbl), the local wordline (E_lwl) and the periphery circuits, including the precharge circuits (E_pre), the read/write assist transistors (E_ren/E_wen) and the LSA (E_lsa). For the pre-characterization, a customized layout of the single cell circuit template is drawn. This way the dynamic energy (E_1) and static power (P_static1) are obtained by extracted-netlist simulation:

Fig. 11. A single cell circuit template

$$ E_{1} = E_{pre} + E_{ren} + E_{lwl} + E_{lsa} + E_{lbl} . $$
(13)

For separating the elementary energy of the local bitline of each cell (E_lbl), a column unit circuit template is drawn in Fig. 12. Its dynamic energy (E_2) and static power (P_static2) are also obtained from extracted-netlist simulation. In the same way its total energy (E_2) is decomposed into several elementary energies. By taking the LSA and periphery circuits apart, the energy for 1 to 8 cells is linearly interpolated. Thereby, the energy consumed by the local bitline of each cell (E_lbl) is derived as

Fig. 12. A column unit circuit template

$$ E_{2} = E_{pre} + E_{ren} + E_{lwl} + E_{lsa} + 8 \cdot E_{lbl} $$
(14)
$$ E_{lbl} = \left( {E_{2} - E_{1} } \right)/7 $$
(15)

For further separating the elementary energy of the periphery circuits, a row unit circuit template is designed as shown in Fig. 13. In the same manner this fraction is separated from E_3 and E_2.

Fig. 13. A row unit circuit template

$$ E_{3} = 8 \cdot \left( {E_{pre} + E_{ren} + E_{lbl} } \right) + E_{lwl} + E_{lsa} $$
(16)
$$ E_{pre} + E_{ren} + E_{lwl} + E_{lsa} = \left( {E_{3} - E_{2} } \right)/7 . $$
(17)

Similarly, a column circuit template is created in Fig. 14, by which the elementary energy consumed by the vertical global bitlines (E_vgbl) is separated.

Fig. 14. A column circuit template

$$ E_{4} = E_{pre} + E_{ren} + E_{lwl} + E_{lsa} + 8 \cdot E_{lbl} + 7 \cdot E_{vgbl} $$
(18)
$$ E_{vgbl} = \left( {E_{4} - E_{2} } \right)/7 . $$
(19)

Since E_1…E_4 and P_static1…P_static4 are pre-characterized by simulating the extracted netlists of the four circuit templates, the elementary energy values E_lbl, E_pre+ren+lwl+lsa and E_vgbl can be derived. This way the dynamic energies for read and write operations and the static power of the four circuit templates are obtained. It is assumed that a toggle condition occurs for each write operation. As before, the simulation configuration is TT corner, 25°C, 400 MHz and 0.9 V supply voltage in a 40-nm CMOS technology. The voltage swing of the VGBL pair was chosen to be 300 mV to guarantee robust operation. The estimation approach is the same for other technology corners, but the pre-characterization must be adapted based on a Monte-Carlo simulation.
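
The separation of the elementary energies from the four template energies, Eqs. (13)-(19), reduces to three differences divided by seven, since the templates differ by seven instances of the respective component. A minimal sketch with placeholder values:

```python
# Sketch of Eqs. (13)-(19): the elementary energies are separated from the
# pre-characterized template energies E1..E4 (placeholder values in fJ).
# The templates differ by seven instances of each component, hence /7.

def separate_elementary(E1, E2, E3, E4):
    e_lbl = (E2 - E1) / 7      # Eq. (15): local bitline energy per cell
    e_periph = (E3 - E2) / 7   # Eq. (17): E_pre + E_ren + E_lwl + E_lsa
    e_vgbl = (E4 - E2) / 7     # Eq. (19): vertical global bitline segment
    return e_lbl, e_periph, e_vgbl

print(separate_elementary(E1=12.0, E2=19.0, E3=110.0, E4=40.0))
```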

Read Operation. As mentioned before, read and write operations are studied separately due to their different characteristics. For a read operation of a hierarchical SRAM using LSAs, the bitline/wordline capacitances and the LSA, along with the read/write assist and precharge circuits, are the main energy consuming components. The dynamic energies E_lbl and E_lwl are the sums of the energies due to the wiring capacitance itself and the capacitances attached to the memory cells. In addition, the energy consumed by the static components of the unselected memory matrices is attributed to the dynamic energy, comprising a significant portion. The static power values (P_static1 … P_static4) of the four circuit templates can be acquired using the same approach as before for separating the dynamic energy of each component. In particular, the pass transistors acting as column selectors are also included in the model. The static power of the global sense amplifiers (GSA) and the pass transistors is obtained by multiplying their count with the static power of two simple circuits: a GSA circuit and a pass transistor circuit. As a consequence, according to the parameters of the memory matrix defined above for a hierarchical architecture, the standby power of a column block can be estimated as

$$ \begin{aligned} P_{static\_col} =& w \cdot \left( {U \cdot N \cdot P_{lbl\_static} + P_{gsa\_pass\_static} } \right) + w \cdot \left( {P_{pre\_static} + P_{ren\_static} + P_{lwl\_static} + P_{lsa\_static} } \right) \\ =& w \cdot \left( {U \cdot N \cdot \left( {P_{static2} - P_{static1} } \right)/7 + P_{gsa\_pass\_static} } \right) + w \cdot \left( {P_{static3} - P_{static2} } \right)/7 \end{aligned} $$
(20)

Finally, the overall dynamic energy of reading a bit from the memory matrix is estimated. The partitioning parameters (m, n, u) are converted into the numbers (M = 2^m, N = 2^n, U = 2^u) of partitioned components in the memory matrix. The total energy is calculated by a parameterized accumulation of the elementary energies (E_lbl, E_pre+ren+lwl+lsa, E_vgbl, E_gsa, E_pass). In particular, the energy of the unselected parts of the memory matrices is calculated independently and then added to the total dynamic energy.

$$ \begin{aligned} E_{read\_bit} =& w \cdot U \cdot E_{lbl} + E_{vgbl} \cdot \left( {N - 1} \right) + E_{gsa} + M \cdot w \cdot E_{pass} \\ & + \left( {E_{pre} + E_{ren} + E_{lwl} + E_{lsa} } \right) + \left( {M - 1} \right) \cdot P_{static\_col} /f \\ =& w \cdot U \cdot \left( {E_{2} - E_{1} } \right)/7 + \left( {N - 1} \right) \cdot \left( {E_{4} - E_{2} } \right)/7 + E_{gsa} + M \cdot w \cdot E_{pass} \\ & + \left( {E_{3} - E_{2} } \right)/7 + \left( {M - 1} \right) \cdot P_{static\_col} /f . \end{aligned} $$
(21)

Write Operation. For the write operation, the method to separate the dynamic write energy of each component is similar, but it uses the pre-characterized write energies in Table 4. In particular, a separate toggle state is not considered here because the write energy is pre-characterized assuming a toggle event for each write. A toggle does not occur in every write, and its corresponding energy could be estimated with a toggling probability using a similar approach as in [2]. As a result, the write cycle energy can be calculated as follows.

Table 4. Energy of four circuit templates (TT, 25°C, 400 MHz, VDDH = 0.9 V, VDDL = 0.3 V)
$$ \begin{aligned} E_{write\_bit} =& w \cdot U \cdot E_{lbl}^{'} + \left( {E_{pre}^{'} + E_{wen}^{'} + E_{lwl}^{'} + E_{lsa}^{'} } \right) + \left( {N - 1} \right) \cdot E_{vgbl}^{'} + \left( {M - 1} \right) \cdot P_{standby\_col} /f \\ =& w \cdot U \cdot \left( {E_{2}^{'} - E_{1}^{'} } \right)/7 + \left( {E_{3}^{'} - E_{2}^{'} } \right)/7 + \left( {N - 1} \right) \cdot \left( {E_{4}^{'} - E_{2}^{'} } \right)/7 + \left( {M - 1} \right) \cdot P_{standby\_col} /f . \end{aligned} $$
(22)
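
Equations (21) and (22) can then be assembled as sketched below, with the standby power of a column block from Eq. (20) supplied as an input; all numbers are placeholders, and the column static power is given in fW so that dividing by the frequency yields fJ:

```python
# Sketch of Eqs. (21) and (22): read and write energy per access, assembled
# from the separated elementary energies and the partitioning (M, N, U, w).
# Elementary energies in fJ, column static/standby power in fW, f in Hz.

def read_energy(w, M, N, U, e_lbl, e_periph, e_vgbl, e_gsa, e_pass,
                p_static_col, f):
    return (w * U * e_lbl                  # local bitlines of the accessed block
            + (N - 1) * e_vgbl             # vertical global bitline segments
            + e_gsa + M * w * e_pass       # global SA and column-select passes
            + e_periph                     # precharge, assist, LWL, LSA
            + (M - 1) * p_static_col / f)  # static energy of unselected columns

def write_energy(w, M, N, U, e_lbl_w, e_periph_w, e_vgbl_w, p_standby_col, f):
    return (w * U * e_lbl_w + e_periph_w
            + (N - 1) * e_vgbl_w
            + (M - 1) * p_standby_col / f)

print(read_energy(w=8, M=8, N=16, U=16, e_lbl=1.0, e_periph=13.0, e_vgbl=3.0,
                  e_gsa=5.0, e_pass=0.2, p_static_col=2e9, f=400e6))
```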

5.2 Verification of Memory Matrix Model

Several memory matrices have been simulated in a 40-nm CMOS technology to validate the model equations above. In Fig. 15 the dynamic power breakdown of a 64-word, 8-bit memory matrix is shown. It can be observed that the energy of the local bitlines dominates the power consumption in a memory matrix compared to the other circuits.

Fig. 15. Dynamic power components for a memory matrix (64 words × 8 bit)

Figure 16 shows a comparison of simulation and estimation data for four memory matrices of different capacities (64, 128, 256, 1 K). Assuming a read access operation, the four extracted netlists were simulated using the same configuration. As shown in Table 5, the dynamic energies are compared to the estimated data and the differences are below 10 %. For the leakage power, the same comparisons are performed and the estimation errors are also below 10 %.

Fig. 16. Estimation and simulation data comparison for memory matrices with four capacities

Table 5. Model estimation errors for the four capacities (TT, 25°C, 400 MHz, VDDH = 0.9 V, VDDL = 0.3 V)

To further demonstrate the accuracy, four memory matrices with a fixed number of 64 words and wordlengths of 8, 16, 32 and 64 are implemented. As shown in Fig. 17, the estimation data are comparable to the extracted-netlist simulation data. Both for the dynamic energy and the leakage power the estimation error remains below 10 %, as listed in Table 6.

Fig. 17. Estimation and simulation data comparison for memory matrices with four wordlengths

Table 6. Model estimation errors for the four wordlengths (TT, 25°C, 400 MHz, VDDH = 0.9 V, VDDL = 0.3 V)

6 Optimization Results

The power model created in this work takes all the dominating power contributors of on-chip SRAMs into account, including the address decoder, memory cells and assist circuits, local and global sense amplifiers, driver circuits and interconnect capacitances. Figure 18 shows the power model applied to estimate the power consumption of SRAMs for various capacities and wordlengths. The minimum dynamic read power data are given because read operations are more frequent in caches. The model is applicable for capacities ranging from 16 to 1 M words and four wordlengths (8, 16, 32, 64). The figure illustrates how the DWL and hierarchical LSA architecture affects the read power of SRAMs as a function of different capacities and wordlengths. Moreover, it indicates the different contributions of the address decoder and the memory matrix to the dynamic read power.

Fig. 18. Dynamic energy vs. address bits and wordlengths for two architectures. The yellow bottom part and the remainder of each bar represent the contributions of the address decoder and the memory matrix respectively (Color figure online).

In addition, the power model can be used for optimizing a specific SRAM by determining the optimal parameter combination. As discussed before, many possibilities exist for partitioning the memory matrix and the corresponding address decoder, given the three partitioning parameters (m, n, u), as well as many options for the circuit implementations. Depending on the optimization criteria, parameter combinations are picked from all possible implementation options. Note that the impact of process variations on the leakage power is included in the power model.

A Pareto-optimization is performed by considering the silicon area and read power of the different partitioning parameters. Figure 19 shows how this approach is used to optimize a 1 K Byte SRAM for a power and area tradeoff. In the scatter plot, ten architectures presenting relatively good area and power figures are picked from all the generated architectures. Four Pareto-optimal implementations are marked, of which two architectures deliver low area and the other two deliver low read power. Depending on the user's requirements a selection can be made, for instance the green point delivering a favorable area/power tradeoff.
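
The Pareto selection itself can be illustrated with a few lines of code; the candidate architectures and their (area, power) values below are made-up examples, not results of the model:

```python
# Illustrative sketch of the Pareto selection over (area, read power)
# candidates produced by the model; the candidate tuples are fictitious.

def pareto_front(candidates):
    """Keep candidates not dominated in both area and power (smaller is better)."""
    front = []
    for name, area, power in candidates:
        dominated = any(a <= area and p <= power and (a, p) != (area, power)
                        for _, a, p in candidates)
        if not dominated:
            front.append((name, area, power))
    return front

candidates = [("m2n3u5", 1.00, 0.80), ("m3n3u4", 0.85, 0.95),
              ("m3n4u3", 0.90, 0.70), ("m4n3u3", 1.10, 0.65)]
print(pareto_front(candidates))
```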

Fig. 19. Area cost vs. read power tradeoff for a 1 K-Byte SRAM

Figure 20 shows the corresponding area and power breakdown for the same ten possible architectures of Fig. 19. The contributions of the address decoder and the memory matrix to the overall area and read power are shown and analyzed quantitatively. Between the worst-case and best-case solutions, differences of up to 41 % in area and 62 % in power are observed.

Fig. 20. Contribution of address decoder and memory matrix to area cost and read power for the 10 possible architectures of a 1 K Byte SRAM

7 Conclusion

In this chapter, a new method for the power optimization of on-chip SRAMs comprising a hierarchical architecture was described. The method is based on a power model including various energy-efficient circuits and techniques. The introduction of the probabilistic estimation approach and the use of circuit templates provide quantified switching activities and pre-characterized customized circuits respectively. At the same time, the hierarchical architecture with its many partitioning choices is defined by the partitioning parameters. The power model is verified by a variety of extracted-netlist simulations and consistently exhibits good accuracy.

As a quantitative parameter optimization tool, this approach allows a fast and accurate power estimation of SRAMs of various capacities and wordlengths. For a hierarchical-architecture SRAM, the impact of the partitioning and circuit selections on power and area was evaluated. The optimal architecture and circuits can be identified quickly and accurately, which leads to an SRAM specification with an achievable and attractive power consumption and silicon area. Moreover, this approach allows an easy tradeoff between area and power for meeting different design requirements. Furthermore, the power model can also be employed as a customized benchmark for comparing various local circuits using the same architecture. Finally, this approach can easily be extended to other CMOS technologies due to its circuit templates and switching activity analysis.