An Asynchronous Linear Predictive Analyzer

1997

Linear predictive analysis is a standard technique in modern digital speech processing. This makes it an interesting implementation area for asynchronous design. We present an asynchronous speed-independent circuit implementation of a linear predictive analysis system. The implementation is built around a program ROM into which the algorithm is encoded. The design process is carried out using the action systems formalism as the development tool. As the result we get an efficient and logically highly reliable system with a potential for low power consumption. We present various block diagrams of the resulting composition and show the details of a set of selected controllers.

An Asynchronous Linear Predictive Analyzer

Juha Plosila, University of Turku, Dept. of Applied Physics, Lab. of Electronics and Information Technology, FIN-20014 Turku, Finland
Tiberiu Seceleanu, Turku Centre for Computer Science (TUCS), Lemminkäisenkatu 14, FIN-20520 Turku, Finland

TUCS Technical Report No 142, November 1997. ISBN 952-12-0095-2, ISSN 1239-1891

Keywords: linear predictive analysis, speech compression, asynchronous circuits, action systems, implementation

TUCS Research Group: Programming Methodology Research Group

1 Introduction

Linear predictive analysis [11, 16] is a powerful speech analysis technique with which the basic speech parameters, such as pitch, formants, spectra, and vocal tract area functions, can be reliably and accurately estimated. Hence, linear prediction is the basic method behind the modern speech coding and compression techniques used, for instance, in digital mobile phones. The method provides extremely accurate estimates of the speech parameters at good computation speed. Linear predictive analysis is based on the idea that a speech sample can be approximated, or predicted, as a linear combination of past speech samples.
A unique set of predictor coefficients can be determined by minimizing the sum of the squared differences between the actual speech samples in a finite time frame and the corresponding estimates obtained by linear prediction. This minimization problem can be solved using several approaches. A common and reliable scheme is the autocorrelation method, where the predictor coefficients are obtained by computing a set of autocorrelation coefficients. What makes linear prediction an attractive target for asynchronous techniques is the fact that it is a common factor in present speech compression methods. In this paper, we present an asynchronous implementation of a system that computes linear predictive analysis using the autocorrelation method. As a specification language we use the action systems formalism [2], which allows us to derive the target circuit in a stepwise manner within a mathematical framework, the refinement calculus [1, 4]. The logical correctness of the design is preserved throughout the derivation, from the initial specification to the final detailed description which is implemented as a network of circuit elements. Consequently, the design process yields a logically highly reliable implementation. In this paper, the emphasis is on the implementation itself rather than on the details of the derivation. Basically, the derivation flow follows the guidelines presented in our previous work on a pipelined microprocessor [13]. We thereby provide more evidence that our approach is suitable for asynchronous design. The resulting circuit contains a 46-word program ROM (PROM) for the involved algorithms and a 2-stage pipeline. The control logic is mainly speed-independent, but the completion signals for the data-path components are generated via matched delays [6]. This is a compromise that requires careful timing analysis of the data path but keeps the hardware overhead reasonable.
We believe that asynchronous techniques, because of their potential for low power consumption with relatively good performance, are well suited for speech processing applications, especially those used in battery-operated devices. Our estimations show that our design, even though it contains only a minimal pipeline structure, has a high throughput capability, indicating that the idle periods of the system are long compared to the active periods. This, in turn, indicates potential for low-power behavior. Furthermore, our PROM-based system is easy to upgrade: the other algorithms needed in a speech compression method can be merged into the system basically by expanding the PROM and the data path resources, without changing the control logic.

Overview of the paper. We proceed as follows. In sections 2-3 we briefly introduce the linear predictive analysis basics and the action systems framework. The initial specification of the circuit is given in section 4. The guidelines of the decomposition process are discussed in section 5. In sections 6 and 7, we describe the operation of the different functional blocks of the final composition and show the program ROM codes of the involved algorithms. The system performance issues are discussed in section 8. We end with some concluding remarks in section 9.

2 Linear prediction

The following is based on the comprehensive studies on linear prediction in [11, 16]. The speech production mechanism, including glottal excitation, vocal tract response, and sound radiation, can be modelled by a time-varying digital filter whose system function H(z) is of the form

    H(z) = \frac{S(z)}{U(z)} = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}    (1)

where U(z) and S(z) are the z-transforms of the excitation u(n) and the speech samples s(n), respectively. The parameter G, in turn, is a gain factor.
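In the time domain, the all-pole model of Eq. 1 amounts to feeding the excitation through a recursive loop. The following sketch is our own illustration (not part of the report); the coefficient values are purely hypothetical:

```python
def all_pole_filter(u, a, G):
    """Synthesis filter H(z) of Eq. 1 as a difference equation:
    s(n) = sum_{k=1..p} a[k] * s(n-k) + G * u(n)."""
    s = []
    for n, un in enumerate(u):
        acc = G * un
        for k, ak in enumerate(a, start=1):  # a[0] holds a_1, etc.
            if n - k >= 0:
                acc += ak * s[n - k]
        s.append(acc)
    return s

# A single-pole toy example: an impulse decays geometrically.
out = all_pole_filter([1.0, 0.0, 0.0, 0.0], a=[0.5], G=1.0)
# out == [1.0, 0.5, 0.25, 0.125]
```

With p = 10, as chosen later in the report, the same loop models the vocal tract response used in the analysis.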
We can write for the sequences s(n) and u(n) the simple difference equation

    s(n) = \sum_{k=1}^{p} a_k s(n-k) + G u(n)    (2)

where

    \hat{s}(n) = \sum_{k=1}^{p} a_k s(n-k)    (3)

is called a linear predictor of order p with the coefficients a_k. The prediction error e(n), also known as the residual, is defined as

    e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} a_k s(n-k)    (4)

This error is the output of a system whose transfer function is

    A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k}    (5)

The prediction error filter A(z) is known as the inverse filter, because the synthesis filter H(z), defined in Eq. 1, can be written as

    H(z) = \frac{G}{A(z)}    (6)

The total squared prediction error, which represents the energy of the error sequence e(n), is defined as

    E = \sum_n e^2(n) = \sum_n \left( s(n) - \sum_{k=1}^{p} a_k s(n-k) \right)^2    (7)

The predictor coefficients a_k are determined by minimizing E, setting

    \frac{\partial E}{\partial a_i} = 0,  1 \le i \le p    (8)

yielding the following set of equations, also known as the normal equations:

    \sum_{k=1}^{p} a_k \sum_n s(n-k) s(n-i) = \sum_n s(n) s(n-i),  1 \le i \le p    (9)

In other words, we have p equations from which the p unknown coefficients a_k, 1 \le k \le p, can be solved.

Autocorrelation method. The autocorrelation function R(i) of the speech sequence s(n) is given as

    R(i) = \sum_{n=-\infty}^{\infty} s(n) s(n+i)    (10)

By assuming that the minimization interval is infinite, -\infty < n < \infty, and observing that R(i) is an even function, we can reduce the equations 9 to

    \sum_{k=1}^{p} a_k R(i-k) = R(i),  1 \le i \le p    (11)

In practice, however, we can process only finite segments of the sequence s(n), and hence s(n) is windowed using a window function w(n), yielding the new sequence s_w(n):

    s_w(n) = s(n) w(n)  for 0 \le n \le N-1,  and 0 otherwise    (12)

where N is the width of the window, or the frame size. The windowing reduces Eq. 10 to

    R(i) = \sum_{n=0}^{N-1-i} s_w(n) s_w(n+i),  i \ge 0    (13)

A very efficient method for solving the coefficients a_i from Eq. 11 is Durbin's recursive procedure, which can be written as follows:

    E_0 := R(0)
    for i = 1 to p:
        k_i := ( R(i) - \sum_{j=1}^{i-1} a_j^{(i-1)} R(i-j) ) / E_{i-1}
        a_i^{(i)} := k_i
        for j = 1 to i-1:
            a_j^{(i)} := a_j^{(i-1)} - k_i a_{i-j}^{(i-1)}
        E_i := (1 - k_i^2) E_{i-1}
    for j = 1 to p:
        a_j := a_j^{(p)}    (14)

Here the intermediate quantities k_i are called the reflection coefficients, which can be used to construct the lattice-form versions of the direct-form filters H(z) and A(z) introduced above. In fact, the lattice form is preferable because of its better stability properties, even though it requires more computation than the simpler direct form.

3 Action systems

The action systems formalism is based on an extended version of the guarded command language of Dijkstra [7]. The statements of this language include assignment, sequential composition, assertion, conditional choice and iteration, and are defined using weakest precondition predicate transformers. A comprehensive study of this formalism can be found for example in [2, 3]. The action systems framework in asynchronous design is treated in [13, 14, 15].

Actions. An action A is a guarded command of the form < g → S >, where g, the guard, is a boolean condition, and S, the body, is any statement in our language. The action A is said to be enabled when the guard is true, and disabled otherwise. If g is invariantly true, we often write the action A simply as < S >. We also use the following constructs:

- Choice: The action A1 [] A2 tries to choose an enabled action from A1 and A2, the choice being nondeterministic when both are enabled.
- Sequential composition: A1 ; A2 first behaves as A1 if this is enabled, then as A2, which can be enabled by A1 or by another action outside the sequential composition. Sequencing forces A1 and A2 to be exclusive.
- Parallel composition: A1 || A2 first behaves as the choice A1 [] A2, then as A2, if A1 was selected, or as A1, if A2 was selected. Each action is executed once.
- Quantified instantiation: The notation [ * i = 1..n : A_i ], where * is either [], ;, or ||, is defined to be equivalent to A1 * ... * An. The scope of a constructor is indicated with parentheses, for example A ; ((B [] C) || D).
Action systems. An action system A has the form:

    sys A (g) ::
    |[ var l ;
       init g, l := g0, l0 ;
       do A1 [] ... [] Am od
    ]|

where g and l are lists of identifiers initialized to g0 and l0, respectively. The identifiers l are the local variables, visible only within A. The identifiers g, in turn, are the global variables of A, visible to other action systems as well. The local and global variables are assumed to be distinct. The actions Ai of A are allowed to refer to all of the state variables, consisting of the local and global variables. The actions are considered atomic, i.e., if an action is selected for execution, it will be completed without interference. Therefore two actions that do not have any read-write conflicts can be executed in any order, or simultaneously. Hence, we can model parallel programs with action systems, taking the view of interleaving action executions.

Parallel composition. Consider two action systems A and B:

    sys A (gA) ::                        sys B (gB) ::
    |[ var lA ;                          |[ var lB ;
       init gA, lA := gA0, lA0 ;            init gB, lB := gB0, lB0 ;
       do A1 [] ... [] Am od                do B1 [] ... [] Bn od
    ]|                                   ]|

where lA ∩ lB = ∅, and the initializations of the global variables gA ∩ gB in the systems A and B are consistent with each other. The parallel composition of A and B, denoted A || B, is the action system C:

    sys C (gA ∪ gB) ::
    |[ var lA, lB ;
       init gA, gB, lA, lB := gA0, gB0, lA0, lB0 ;
       do (A1 [] ... [] Am) [] (B1 [] ... [] Bn) od
    ]|

Thus, parallel composition combines the state spaces of the constituent action systems, keeping the local variables lA and lB distinct. The reactive components A and B interact with each other via the global variables that are referenced in both components. Termination of computation is a global property of the composed action system C.
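The interleaving execution model described above can be made concrete with a small interpreter: repeatedly pick some enabled action nondeterministically and run its body atomically until no action is enabled. This is our own illustration, not part of the report; the dictionary-based state and the seeded generator are assumptions for the sketch:

```python
import random

def run(state, actions, rng=random.Random(0)):
    """Interpret an action system: each action is a (guard, body) pair.
    Execute the body of a nondeterministically chosen enabled action,
    atomically, until no action is enabled; then return the final state."""
    while True:
        enabled = [body for guard, body in actions if guard(state)]
        if not enabled:
            return state
        rng.choice(enabled)(state)

# Example: a one-action system that moves x into y one unit at a time.
def move(s):
    s["x"] -= 1
    s["y"] += 1

final = run({"x": 3, "y": 0}, [(lambda s: s["x"] > 0, move)])
# final == {"x": 0, "y": 3}
```

Termination here corresponds to the global property mentioned above: the run ends only when every guard in the composed system is false.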
4 Specification of the analyzer

In this section, we present the formal specification of an asynchronous linear predictive analysis system which uses the autocorrelation method with Durbin's recursion. Note that the analyzer is viewed here as a standalone system with output operations of its own, but in a practical speech compression method the analysis is part of a bigger whole. The input of the analyzer is a continuous 8 kHz stream of speech samples. The system outputs the windowed samples and the computed reflection coefficients for further processing.

System parameters. In order to write the initial description of our linear predictive analyzer, we must first select an appropriate window function w(n), window width N, and predictor order p.

- For windowing we choose the commonly used Hamming window [17]:

      w(n) = 0.56 - 0.46 \cos\left( \frac{2\pi n}{N-1} \right)    (15)

- The frame size N is set to 256, which is a somewhat more challenging value than the 160 of GSM [10]. Hence, the frame duration is 32 ms, assuming that the sample rate of the incoming speech sequence is 8 kHz.

- The parameter p is set to 10, which can be considered an optimal value [11] and will be used, for example, in the future GSM [9]. In the conventional GSM system [10] the order of the predictor is only 8, but increasing it by 2 makes the analysis more accurate and the quality of the synthesized speech better.

Formal specification. First we define a set of array types needed in the specification:

    type sblk = array [0..255] of real
         rblk = array [0..10]  of real
         ablk = array [0..19]  of real
         kblk = array [0..9]   of real

The initial specification is the parallel composition A || Env of the analyzer chip A and its abstract environment Env, where A itself is a composition of three individual systems (see Fig. 1):

    A = Ip || Lpa || Op

    sys Ip (ip, lpa : chan; si : real; s : sblk) ::
    |[ var t : bool; sa, sb : sblk ;
       init ip, lpa, t := ack, ack, false ;
       do [ ; i = 0..255 :
              < ip = req → if t → sa[i] := si [] ¬t → sb[i] := si fi ; ip := ack > ] ;
          < lpa = ack → if ¬t → s := sa [] t → s := sb fi ; t, lpa := ¬t, req >
       od ]|

Figure 1: Block diagram of the specification

    sys Lpa (lpa, o1, o2 : chan; s : sblk; k : kblk) ::
    |[ var R : rblk; a : ablk; E : real; abase0, abase1 : int ;
       init lpa, o1, o2 := ack, ack, ack ;
       do < lpa = req → WIN ; ACO > ;
          ( < o1 := req > || < DUR > ) ;
          < o1 = ack → o2 := req > ;
          < o2 = ack → lpa := ack >
       od ]|

    sys Op (o1, o2, op : chan; s : sblk; k : kblk; so : real) ::
    |[ init o1, o2, op := ack, ack, ack ;
       do < o1 = req → skip > ;
          [ ; i = 0..255 : < so := s[i] ; op := req > ; < op = ack → skip > ] ;
          < o1 := ack >
       [] < o2 = req → skip > ;
          [ ; i = 0..9 : < so := k[i] ; op := req > ; < op = ack → skip > ] ;
          < o2 := ack >
       od ]|

with

    WIN =  for j = 0 to 255 :
               s[j] := s[j] * ( 0.56 - 0.46 cos( 2\pi j / 255 ) )

    ACO =  for i = 0 to 10 :
               R[i] := \sum_{j=0}^{255-i} s[j] * s[j+i]

    DUR =  abase0, abase1, E := 0, 10, R[0] ;
           for i = 0 to 9 :
           ( k[i] := ( R[i+1] - \sum_{j=0}^{i-1} a[j+abase1] * R[i-j] ) / E ;
             a[i+abase0] := k[i] ;
             for j = 0 to i-1 :
                 a[j+abase0] := a[j+abase1] - k[i] * a[i-j-1+abase1] ;
             E := ( 1 - k[i]^2 ) * E ;
             abase0, abase1 := abase1, abase0 )

The environment Env is given as

    sys Env (ip, op : chan; si, so : real) ::
    |[ var s : real ;
       init ip, op := ack, ack ;
       do < ip = ack → si := si0, si0 ∈ real ; ip := req >
       [] < op = req → s, op := so, ack >
       od ]|

The operation of the above composition is the following. The environment outputs a sample si to A by sending a request through the channel ip. The input unit Ip then writes si into the array sa[0..255] (sb[0..255]), which models the first (second) input buffer, and sends an acknowledgement to Env. This procedure is performed 256 times to fill the input buffer. Then Ip activates computation of the windowed samples and reflection coefficients by sending a request to the computation unit Lpa through the channel lpa. At the same time it switches the input buffer from sa (sb) to sb (sa) by toggling the auxiliary boolean variable t, and starts to receive the next 256-sample frame.
Consequently, receiving a new frame sb (sa) and processing the current frame sa (sb) take place in parallel. Because there is a continuous 8 kHz data stream at the input, the obvious real-time constraint is that Lpa must be idle and ready for a new round whenever Ip reaches the end of a frame and wants to switch the input buffer. This happens every 32 ms. When Lpa receives a request from Ip, it starts to compute linear predictive analysis for the sample frame s, which in fact is a copy of either sa or sb, depending on the state of the variable t in Ip. After the execution of the windowing and autocorrelation procedures WIN and ACO, the output operation of the 256 windowed samples in s is activated by sending a request to the output unit Op through the channel o1, in parallel with the computation of the Durbin's procedure DUR. When both of these operations have been completed, Lpa sends a request to Op through the channel o2, activating the output procedure of the 10 reflection coefficients k[i]. Op sends an acknowledgement through o2 when this has been completed. Finally, after receiving an acknowledgement from Op, the computation unit Lpa sends an acknowledgement to Ip through lpa and is ready to receive the next frame s. The output unit Op receives requests from Lpa through o1 and o2. It sends, when requested, the windowed samples s[j] and the reflection coefficients k[i] to Env by communicating through the channel op. The output procedures use the variable so as the common output buffer.

Note that the procedure DUR in Lpa is a modified version of the algorithm 14. The main difference is that in DUR the need for storage has been minimized by (1) using a single variable E instead of an array and (2) using the one-dimensional array a, split into two swapping segments, instead of a two-dimensional array. Furthermore, because only the reflection coefficients k are needed, the final for-loop of 14 is omitted.
Also the boundaries of the iteration counter i have been changed for convenience. In DUR we have i = 0..9 instead of the i = 1..10 of the procedure 14.

5 Decomposition

The initial specification given in the previous section is stepwise refined into a parallel composition of more detailed and dedicated functional units. The control and data paths are separated from each other in this process. The extracted data path components include memory resources, a set of registers, and an arithmetic unit containing all basic functions, i.e., multiplication, division, addition, and subtraction. The control path, in turn, contains a set of controllers responsible for executing the involved algorithms, operating the data path components by asynchronous communication. The block diagram of the final composition is shown in Fig. 2, where each block represents an action system, or a parallel composition of several subsystems. The refinement flow resembles the one presented for a pipelined microprocessor in our previous work [13]. The reason for this is that we have chosen here an implementation where the windowing, autocorrelation, and Durbin's algorithms are encoded into a program ROM rather than directly into Tangram-style handshake logic [5], which would be possible in principle. For this, a 2-stage pipeline (fetch, execute) is constructed. Actually, the derivation is now quite straightforward, because we do not have any pipeline hazard situations to deal with as we did in the 5-stage pipeline derivation in [13]. The structure and operation of the composition in Fig. 2, as well as its circuit implementation, are discussed in the following sections 6 and 7.

6 Implementation

In this section, we explain how the system in Fig. 2 works, and how it is implemented as a digital circuit. The circuit implementation uses the 4-phase handshake protocol on the communication channels. Therefore, each channel variable of Fig.
2 must first be expanded by (at least) two boolean variables implementing the request and acknowledgement signals (req, ack). This transformation, also known as handshake expansion [12], can be performed within the refinement calculus by using an appropriate abstraction relation [15] such as

    (c = req ≡ req_c ∧ ¬ack_c) ∧ (c = ack ≡ ¬req_c ∨ ack_c)

where c denotes any communication variable in the composition in question. However, the details of this transformation process are out of the scope of this paper. Hence, we give below the resulting diagrams directly, without presenting any formal proofs.

Figure 2: Final block diagram of the analyzer

6.1 Input and output units

The input unit Ip, depicted in Fig. 3, contains two input buffers, the 256-word memory blocks Ram1a and Ram1b, corresponding to the arrays sa and sb of the initial specification in Sec. 4. Ip has its own counter for address generation and a toggle mechanism for switching between the memory blocks. The idea is that while one buffer is being filled during a 32 ms time frame, the other can be read and written freely by the systems Lpa and Op, which contain address counters of their own.

Figure 3: Block diagram of Ip

The output unit Op awaits requests from the analysis unit Lpa via two separate communication channels, as already explained in Sec. 4. The first request activates the output procedure of the windowed samples in Ram1a or Ram1b. The second request, in turn, activates the output procedure of the reflection coefficients computed in Lpa.
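The three procedures the circuit ultimately executes — WIN, ACO, and DUR of Sec. 4 — can be cross-checked in software. The following is our own transcription of the pseudocode (a sketch, not part of the design; the test frame at the end is an invented signal, and N = 256, p = 10 as chosen in Sec. 4):

```python
import math

N, P = 256, 10  # frame size and predictor order from Sec. 4

def win(s):
    """WIN: multiply the frame by the window of Eq. 15."""
    return [s[j] * (0.56 - 0.46 * math.cos(2 * math.pi * j / (N - 1)))
            for j in range(N)]

def aco(s):
    """ACO: autocorrelation coefficients R[0..P] of the frame (Eq. 13)."""
    return [sum(s[j] * s[j + i] for j in range(N - i)) for i in range(P + 1)]

def dur(R):
    """DUR: Durbin's recursion (14), returning the reflection coefficients."""
    E = R[0]
    a, k = [0.0] * P, [0.0] * P
    for i in range(P):                      # i = 0..9 stands for i = 1..10 in (14)
        k[i] = (R[i + 1] - sum(a[j] * R[i - j] for j in range(i))) / E
        a_new = a[:]
        a_new[i] = k[i]
        for j in range(i):
            a_new[j] = a[j] - k[i] * a[i - 1 - j]
        a = a_new
        E = (1.0 - k[i] ** 2) * E
    return k

# A deterministic, noisy test frame (illustrative only).
frame = [math.sin(0.3 * j) + 0.3 * math.sin(1.1 * j)
         + ((j * 37) % 101) / 101.0 - 0.5 for j in range(N)]
R = aco(win(frame))
k = dur(R)
```

For a valid autocorrelation sequence the reflection coefficients satisfy |k_i| ≤ 1, which is also what makes the lattice form mentioned in Sec. 2 attractive for stability checking.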
6.2 Analysis unit

The system Lpa carries out the computation of linear predictive analysis on a 256-sample frame. The involved algorithms — windowing, autocorrelation, and Durbin — are encoded into the 46-word, 38-bit program ROM block Prom, which gets its addresses from the loadable 6-bit program counter unit Pc. The memory structure of Prom is shown in Fig. 4, and the elements of a 38-bit instruction word in Table 1. In Sec. 7, the program code for each algorithm is presented in more detail.

Figure 4: Memory allocation of Prom (the WIN, ACO, and DUR code blocks within the 46-word, 38-bit ROM; the EOC and STOP flags)

Table 1: Elements of a program instruction

    Name    Bits  Description                    Name     Bits  Description
    EA      1     Enable addr. cntr A            SELRAM   1     Select RAM block
    LCA     1     Load/count addr. cntr A        RW       1     Read/write RAM
    UDA     1     Inc/dec addr. cntr A           SELROM   1     Select ROM block
    EB      1     Enable addr. cntr B            EADD     1     Enable adder
    LCB     1     Load/count addr. cntr B        SELF     1     Select adder func.
    UDB     1     Inc/dec addr. cntr B           EMUL     1     Enable multiplier
    ETOG    1     Enable RAM2 toggle             EDIV     1     Enable divider
    ECB     1     Enable base addr. reg.         EREG     4     Enable regs
    SELCB   2     Select base addr.              SELM     3     Select reg. inputs
    ERA     1     Enable offset reg.             CMP      1     Enable comparator
    SELFA   1     Select offset adder func.      SELCMP   2     Select cmp. inputs
    SELA    2     Select offset reg. input       SELJ     3     Select jump addr.
    ERO     1     Enable addr. reg.              EOC      1     Enable output control
    EMEM    1     Enable memory                  STOP     1     Stop computation
                                                 Total    38

Main controllers. The system Lpa contains five control blocks. The main controllers are Launch, Fetch_Ctrl, and Exec_Ctrl. Their job is to control the overall program flow. The two other controllers, Addr_Ctrl and Comp_Ctrl, are slaves of Exec_Ctrl, driving the data path components by handshake channels. Some of the building blocks needed in the circuit implementations of the controllers are introduced below in Fig. 5. The well-known C-element is not depicted, but the different asymmetric C-elements are shown, mainly because of their non-standard symbols.
The E- and R-elements are left-right devices which synchronize two 4-phase handshake cycles in certain ways. The E-element enhances the involved cycles by partly parallelizing their down-going parts. The R-element, in turn, releases the left-channel cycle to continue immediately after the right-channel request has been sent.

Figure 5: Circuit elements (the E-element, the R-element, and the up- and down-asymmetric C-elements with their symbols)

The formal action system specifications of the three main controllers are given below. The corresponding circuit diagrams are shown in Figs 6-8.

    sys Launch (lpa, pc, fc, oc, o1, o2 : chan; rstpc : bool) ::
    |[ init lpa, pc, fc, oc, o2, rstpc := ack, ack, ack, ack, ack, false ;
       do < lpa = req ∧ ¬rstpc → rstpc, fc := true, req > ;
          < oc = req → o1, oc := req, ack > ;
          < fc = ack ∧ o1 = ack → o2 := req > ;
          < o2 = ack → lpa := ack >
       od ]|

    sys Fetch_Ctrl (fc, pc, prom, ec, oc : chan; STOP, JUMP, EOC : bool) ::
    |[ init fc, pc, prom, ec, oc := ack, ack, ack, ack, ack ;
       do ( < fc = req ∧ ec = ack ∧ oc = ack ∧ ¬(STOP ∧ ¬JUMP) → pc := req > ;
            < pc = ack → prom := req > ;
            < prom = ack ∧ ¬bsy → if EOC → oc := req [] ¬EOC → skip fi ;
                                  ec := req > )
       [] < fc = req ∧ ec = ack ∧ oc = ack ∧ ¬bsy ∧ STOP ∧ ¬JUMP → fc := ack >
       od ]|

    sys Exec_Ctrl (ec, preg, cmpr, ac, cc : chan; CMP, bsy : bool) ::
    |[ init ec, preg, cmpr, ac, cc := ack, ack, ack, ack, ack ;
       do < ec = req → preg := req > ;
          < preg = ack → if CMP → cmpr := req [] ¬CMP → skip fi ;
                         ac, cc, bsy := req, req, true > ;
          < cmpr = ack → ec := ack > ;
          < ac = ack ∧ cc = ack → bsy := false >
       od ]|

Figure 6: Circuit diagram of Launch

Figure 7: Circuit diagram of Fetch_Ctrl

Launch is the topmost controller in Lpa.
It first enables initialization of Pc by setting the resettable flag rstpc, and then activates the fetch controller Fetch_Ctrl. After this, it awaits requests from Fetch_Ctrl to start the output unit Op two separate times. The 2-stage pipeline operation, containing the fetch and execute phases, is realized by one pipeline register Preg and two dedicated controllers, Fetch_Ctrl and Exec_Ctrl. Fetch_Ctrl, driven by Launch, takes care of sequentially activating Pc, fetching an instruction from Prom, and then starting the execution controller Exec_Ctrl. In the first round, when the flag rstpc has been set to true by Launch, Pc and the flag rstpc are reset. Otherwise Pc is either incremented or loaded with a jump address. Furthermore, if the instruction bit EOC is true, Fetch_Ctrl commands Launch to activate the output procedure of the windowed samples in Ram1a or Ram1b. This happens in parallel with the regular activation of Exec_Ctrl. The continuous fetch process is stopped when the instruction bit STOP in Preg is true, the flag JUMP has been set to false by the comparator, and Exec_Ctrl is no longer busy. Exec_Ctrl first loads the output of Prom into Preg and then activates in parallel the address and computation controllers Addr_Ctrl and Comp_Ctrl, sending an acknowledgement back to Fetch_Ctrl, which can then start the next instruction fetch. If the instruction bit CMP is true, the comparator, which sets the flag JUMP according to the result of the comparison between two values selected by dedicated bits of the instruction, is activated as well. In this case, the acknowledgement to Fetch_Ctrl is postponed until the comparison has been completed. If JUMP is set to true by the comparator, a jump address is loaded into Pc in the next fetch cycle. Otherwise Pc is incremented normally.

Figure 8: Circuit diagram of Exec_Ctrl
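The C-elements these controllers are built from behave as state-holding gates: the output goes high when all inputs are high, low when all are low, and otherwise holds its previous value. The sketch below is our own software model; the asymmetric variant follows one conventional semantics (the "+" input participates only in the rising transition), which is an assumption about the symbols of Fig. 5 rather than a transcription of them:

```python
def c_element(inputs, prev):
    """Muller C-element: 1 when all inputs are 1, 0 when all are 0,
    otherwise hold the previous output."""
    if all(inputs):
        return 1
    if not any(inputs):
        return 0
    return prev

def c_element_up(a, b_plus, prev):
    """Up-asymmetric C-element (assumed semantics): the '+' input b_plus
    only gates the rising edge; the output falls as soon as a falls."""
    if a and b_plus:
        return 1
    if not a:
        return 0
    return prev
```

For example, stepping `c_element` through the input sequence (0,0) → (1,0) → (1,1) → (0,1) yields outputs 0, 0, 1, 1: the output lags until both inputs agree, which is exactly the synchronizing behavior the handshake circuits rely on.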
Address and computation control. Both Addr_Ctrl and Comp_Ctrl are operated by a set of instruction bits stored in Preg. These control bits determine which handshake cycles are generated by the controllers. If a bit is true, the involved controller activates the corresponding communication cycle. If a bit is false, the handshake in question is skipped and an immediate response is given. The action system specifications of these controllers are given below. Note that the address controller is presented as the parallel composition of the control and data parts Addr_Ctrl.c and Addr_Ctrl.d. The block diagram of Addr_Ctrl and the circuit diagram of Comp_Ctrl are shown in Figs 9 and 10, respectively. We have that

    Addr_Ctrl = Addr_Ctrl.c || Addr_Ctrl.d

where

    sys Addr_Ctrl.c (ac, toggle, cnta, cntb, cb, rega, rego, ram, fwd : chan;
                     ETOG, EA, EB, ECB, ERA, ERO, EMEM : bool) ::
    |[ var b : bool ;
       init ac, toggle, cnta, cntb, cb, rega, rego, ram, fwd, b := ack, ..., ack, false ;
       do ( < ac = req → skip > ;
            ( < if ETOG → toggle := req [] ¬ETOG → skip fi > ||
              < if EA → cnta := req [] ¬EA → skip fi > ||
              < if EB → cntb := req [] ¬EB → skip fi > ) ;
            < toggle = ack ∧ cnta = ack ∧ cntb = ack → skip > ;
            ( < if ECB → cb := req [] ¬ECB → skip fi > ||
              < if ERA → rega := req [] ¬ERA → skip fi > ) ;
            < cb = ack ∧ rega = ack → b := true > )
       [] ( < (ac = req ∧ ¬ECB ∧ ¬ERA) ∨ b →
                if ERO → rego := req [] ¬ERO → skip fi > ;
            < rego = ack → if EMEM → ram := req [] ¬EMEM → skip fi > ;
            < ram = ack → fwd := req > ;
            < fwd = ack ∧ b → ac, b := ack, false > )
       od ]|

    sys Addr_Ctrl.d (toggle, cnta, cntb, cb, rega, rego : chan;
                     CLA, UDA, CLB, UDB, SELFA : bool; SELCB, SELA : int;
                     i, j, REGA, CB, ADR : int) ::
    |[ var t : bool ;
       init toggle, cnta, cntb, cb, rega, rego, t := ack, ..., ack, false ;
       do < toggle = req → t, toggle := ¬t, ack >
       [] < cnta = req → if CLA ∧ UDA → i := i + 1
                         [] CLA ∧ ¬UDA → i := i - 1
                         [] ¬CLA → i := 0 fi ; cnta := ack >
       [] < cntb = req → if CLB ∧ UDB → j := j + 1
                         [] CLB ∧ ¬UDB → j := j - 1
                         [] ¬CLB → j := 0 fi ; cntb := ack >
       [] < cb = req → if SELCB = 0 ∧ ¬t → CB := 0
                       [] SELCB = 0 ∧ t → CB := 1
                       [] SELCB = 1 → CB := 2
                       [] SELCB = 2 → CB := 3 fi ; cb := ack >
       [] < rega = req → if SELA = 0 → REGA := i
                         [] SELA = 1 → REGA := j
                         [] SELA = 2 → if SELFA → REGA := i + j
                                       [] ¬SELFA → REGA := i - j fi fi ; rega := ack >
       [] < rego = req → ADR, rego := (2^8 * CB) + REGA, ack >
       od ]|

and

    sys Comp_Ctrl (cc, fwd, mul, div, add, regy, rege, regs, regx, regz : chan;
                   EMUL, EDIV, EADD, EREG0, EREG1, EREG2, EREG3 : bool) ::
    |[ init cc, fwd, mul, div, add, regy, rege, regs, regx, regz := ack, ..., ack ;
       do < cc = req → skip > ;
          ( < if EMUL → mul := req [] ¬EMUL → skip fi > ||
            < if EDIV → div := req [] ¬EDIV → skip fi > ||
            < if EADD → add := req [] ¬EADD → skip fi > ) ;
          < fwd = req ∧ mul = ack ∧ div = ack ∧ add = ack → skip > ;
          ( < if EMUL → regy := req [] ¬EMUL → skip fi > ||
            < if EREG0 → rege := req [] ¬EREG0 → skip fi > ||
            < if EREG1 → regs := req [] ¬EREG1 → skip fi > ||
            < if EREG2 → regx := req [] ¬EREG2 → skip fi > ||
            < if EREG3 → regz := req [] ¬EREG3 → skip fi > ) ;
          < regy = ack ∧ rege = ack ∧ regs = ack ∧ regx = ack ∧ regz = ack →
              cc, fwd := ack, ack >
       od ]|

Addr_Ctrl contains two 8-bit address counters and an adder-subtractor. Furthermore, it has three registers for the base address (2 bits), the offset (8 bits), and the effective address (10 bits). It generates the memory addresses and controls access to the memory blocks Ram1a or Ram1b, Ram2, and Rom required by the three computation phases. The address bit configuration for each memory block is presented in Fig. 11. The static mode and selection bits for these memory units come from Preg, but the operation requests are sent by Addr_Ctrl. An address can be generated either by a single instruction or by several separate instructions, piece by piece. This makes it possible to compute a new address in parallel with accessing memories with the current address. Note that the outputs of the other address counter and the offset register are also used by the comparator in program jump control.
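The effective-address arithmetic above — a 2-bit base CB concatenated with the 8-bit offset REGA into a 10-bit address, ADR = 2^8 * CB + REGA — can be checked with a few lines. This is our own sketch, not part of the design:

```python
def effective_address(cb, rega):
    """Form the 10-bit effective address from the 2-bit base CB and the
    8-bit offset REGA, as in Addr_Ctrl.d: ADR = 2^8 * CB + REGA."""
    assert 0 <= cb < 4 and 0 <= rega < 256
    return (cb << 8) | rega  # identical to 2**8 * cb + rega for these ranges

# e.g. base 2, offset 0 selects the start of the third 256-word region
addr = effective_address(2, 0)
# addr == 512
```

Because the offset never exceeds 8 bits, the shift-or form and the multiply-add form coincide, which is why concatenation suffices in the hardware.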
Comp_Ctrl drives the functional unit composed of the register bank and the combinational arithmetic units, i.e., the multiplier, the divider, and the adder-subtractor. The register bank contains five individual data registers used as input and output buffers by the arithmetic units and memory blocks. It can be loaded simultaneously from two different resources. Again, the static selection bits for the registers and the required multiplexers come from Preg, but the operations are activated by the requests sent by Comp_Ctrl. The functional unit is depicted in Fig. 12.

Data memory usage. The windowing algorithm WIN uses Ram1a or Ram1b for both input and output. Furthermore, Rom contains the 256 window coefficients (Eq. 15) by which the original samples are multiplied. The autocorrelation phase ACO, in turn, reads the windowed samples from Ram1a or Ram1b and writes the resulting 11 autocorrelation coefficients into Ram2. The Durbin's recursion DUR uses the 41-word Ram2 for input and output. This memory block contains, as shown in Fig. 13, two swapping 10-word blocks a0 and a1 for the intermediate results, the final 10 reflection coefficients ki, and the mentioned autocorrelation coefficients Ri. Observe that because of the addressing scheme selected for Ram2, where the 2-bit base address and the corresponding 4-bit offset are concatenated as shown in Fig. 11, the
c 0 1 UDB, CLB UDA, CLA COUNTER j SELCB1 MUXT MUX1 MUX0 cntb toggle cb BASE ( CB ) fwd SELFA +/- A rega SELA MUXA rego REGA REGA ADR Address_Ctrl.d ADR Figure 9: Block diagram of 18 Addr Ctrl ackfwd req fwd req mul req cc ackcc E ackmul EMUL req div EDIV E ackdiv req add EADD E ackadd C C req regy E ackregy C req rege EREG 0 E ackrege req regs EREG 1 E ackregs req regx EREG 2 E ackregx req regz EREG 3 E Figure 10: Circuit diagram of 19 C omp Ctrl ackregz CB REGA 2 8 ADR 8 to 6 Ram1a/Ram1b/Rom to Ram2 Figure 11: Address bits Figure 12: Block diagram of the functional unit 20 0 9 16 a0 10 25 32 a1 10 41 48 58 ki Ri 10 11 Figure 13: Memory allocation of Ram2 R memory unit am2 contains actually more storage capacity than can be used. In other words, we have a 59-word memory array of which only 41 elements are effectively in use. We have preferred this straightforward approach even though it yields some unusable memory locations. A memory-saving but clumsier alternative would be to generate the effective address for am2 by summing the base and offset. R 7 Program ROM codes In this section, the PROM implementations of the procedures WIN, ACO, and DUR are presented. We show the operations included in each instruction. 
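Before turning to the ROM contents, the Ram2 addressing trade-off described above can be made concrete. The helper below is a hypothetical sketch, assuming the 2-bit base sits above the low 4 offset bits as in Fig. 11:

```python
# Sketch of the Ram2 addressing scheme: the 2-bit base CB is
# concatenated with a 4-bit offset, so no adder is needed, at the
# cost of holes at the top of each 16-word page (59 addressable
# words, only 41 in use).

def ram2_address(cb, offset):
    """Effective Ram2 address: base|offset concatenation."""
    assert 0 <= cb <= 0b11 and 0 <= offset <= 0b1111
    return (cb << 4) | offset

# Page bases match the memory map of Fig. 13.
assert ram2_address(0b00, 0) == 0    # a0: 0..9
assert ram2_address(0b01, 0) == 16   # a1: 16..25
assert ram2_address(0b10, 0) == 32   # ki: 32..41
assert ram2_address(0b11, 0) == 48   # Ri: 48..58
```

The memory-saving alternative mentioned above, adding base and offset instead of concatenating them, would pack the blocks contiguously but needs an extra adder in the address path.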
The following variables and special notations are used:

REGS : data register in the register bank
REGX : data register in the register bank
REGY : data register in the register bank
REGZ : data register in the register bank
REGE : data register in the register bank
i, j : address counters in Addr_Ctrl
t : toggling variable of the type bit in Addr_Ctrl (initialized to '0' in the initial system reset)
REGA : address offset register in Addr_Ctrl
CB : address base register in Addr_Ctrl
ADR : effective address register in Addr_Ctrl
RAM1 : RAM array in Ram1 of Lpa
RAM2 : RAM array in Ram2 of Lpa
ROM : ROM array in Rom of Lpa (window coefficients)
x, y := z : assign the value of z to both x and y
0bvalue : define 'value' to be a binary number
goto l : jump to the program line l
b1|b2 : concatenate the bit vectors b1 and b2 so that b1 (b2) is the most (least) significant block
eoc : enable the output process of the windowed samples
stop : exit the program

The mappings of the algorithms are shown in Tables 2–4. The first column contains the PROM addresses and the second column the contents of the instructions. Each instruction is a 38-bit word composed of the elements listed in Table 1.

Table 2: Mapping of the windowing procedure

Offset  Instruction
0   i, j := 0
1   REGA := j ; ADR := REGA
2   REGS, REGE := RAM1[ADR], ROM[ADR]
3   REGZ := REGE
4   REGX := REGS × REGZ ‖ (i := i + 1 ; REGA := i ; ADR := REGA ; REGE := ROM[ADR])
5   REGS, REGA := RAM1[ADR], j ; REGZ := REGE ‖ (i, ADR := i + 1, REGA ; RAM1[ADR] := REGX)
6   REGX := REGS × REGZ ‖ (j := j + 1 ; REGA := i ; ADR := REGA ; REGE := ROM[ADR])
7   REGS := RAM1[ADR] ‖ (i := (i + 1) mod 256 ; REGA := j ; ADR := REGA)
8   REGZ, RAM1[ADR] := REGE, REGX ‖ if 1 < i → goto 5 [] 1 ≥ i → skip fi
    "First line of ACO"

Table 3: Mapping of the autocorrelation procedure

Offset  Instruction
8   i, j := 0 ; CB, REGA := 0b11, j ; ADR := REGA
9   REGS, REGZ := RAM1[ADR]
10  REGX := REGS × REGZ ‖ (j := (j + 1) mod 256 ; REGA := j ; ADR := REGA)
11  REGS, REGZ := RAM1[ADR]
12  REGY := REGS × REGZ ‖ (j := j + 1 ; REGA := j ; ADR := REGA)
13  REGX := REGX + REGY ‖ REGS, REGZ := RAM1[ADR] ‖ if 1 < REGA → goto 11 [] 1 ≥ REGA → skip fi
14  REGA := i ; ADR := REGA ; RAM2[ADR] := REGX ‖ (i, j := i + 1, 0)
15  REGA := j + i ; ADR := REGA ; REGZ := RAM1[ADR]
16  REGA := j ; ADR := REGA ; REGS := RAM1[ADR]
17  REGX := REGS × REGZ ‖ (j := j + 1 ; REGA := j + i ; ADR := REGA)
18  REGZ := RAM1[ADR] ; REGA := j ; ADR := REGA
19  REGS := RAM1[ADR]
20  REGY := REGS × REGZ ‖ (j := j + 1 ; REGA := (j + i) mod 256 ; ADR := REGA)
21  REGX, REGZ := REGX + REGY, RAM1[ADR] ‖ if 1 < REGA → goto 18 [] 1 ≥ REGA → skip fi
22  (REGA := i ; ADR := REGA ; RAM2[ADR] := REGX) ‖ if i < 10 → goto 14 [] i ≥ 10 → skip fi
    "First line of DUR"

Table 4: Mapping of the Durbin's procedure

Offset  Instruction
22  ‖ (i, j := 0 ; CB, REGA := 0b11, i ; ADR := CB|REGA ; REGE := RAM2[ADR])
23  j := j + 1 ; REGA := j ; ADR := CB|REGA
24  REGX, REGS := RAM2[ADR] ‖ eoc
25  REGX, REGE := REGX / REGE ‖ (CB, REGA := 0b10, i ; ADR := CB|REGA)
26  RAM2[ADR], REGZ, CB := REGX, REGE, 0b0|t
27  REGY := REGS × REGZ ‖ (ADR := CB|REGA ; RAM2[ADR] := REGX)
28  CB, REGA := 0b11, i ; ADR := CB|REGA ; REGX := RAM2[ADR]
29  REGE := REGX - REGY ‖ (i, j := i + 1, j + 1 ; REGA := j ; ADR := CB|REGA) ; REGX, j := RAM2[ADR], 0
30  CB, REGA := 0b0|t, j ; ADR := CB|REGA ; REGS := RAM2[ADR]
31  CB, REGA := 0b11, i - j ; ADR := CB|REGA ; REGZ := RAM2[ADR]
32  REGY := REGS × REGZ ‖ (j := j + 1 ; REGA := j)
33  REGX := REGX - REGY ‖ if REGA < i → goto 30 [] REGA ≥ i → skip fi
34  REGX := REGX / REGE
35  t, j := ¬t, j - 1 ; CB, REGA := 0b0|t, j ; ADR := CB|REGA ; RAM2[ADR] := REGX
36  j := 0 ; CB, REGA := 0b10, i ; ADR := CB|REGA ; RAM2[ADR] := REGX
37  t := ¬t ; CB, REGA := 0b0|t, i - j ; ADR := CB|REGA ; REGS := RAM2[ADR]
38  REGY := REGS × REGZ ‖ (REGA := j ; ADR := CB|REGA ; REGX := RAM2[ADR])
39  REGX := REGX - REGY ‖ (t, j := ¬t, j + 1 ; REGA := j)
40  RAM2[ADR] := REGX ‖ if REGA < i → goto 37 [] REGA ≥ i → skip fi
41  REGX := 1.0 ‖ (i := i + 1 ; CB, REGA := 0b10, j ; ADR := CB|REGA ; REGS, REGZ := RAM2[ADR])
42  REGY := REGS × REGZ ‖ (i := i + 1 ; CB, REGA := 0b11, i ; ADR := CB|REGA)
43  REGS, i := REGX - REGY, i - 1
44  REGZ := REGE
45  REGE := REGS × REGZ ‖ if i < 10 → goto 29 [] i ≥ 10 → stop fi

Sample program lines

As an example, consider the program lines 31 – 33 in Table 4:

line 31 : CB, REGA := 0b11, i - j ; ADR := CB|REGA ; REGZ := RAM2[ADR]
line 32 : REGY := REGS × REGZ ‖ (j := j + 1 ; REGA := j)
line 33 : REGX := REGX - REGY ‖ if REGA < i → goto 30 [] REGA ≥ i → skip fi

The bit configurations of these three instructions are shown in Table 5, where the most essential bit values of each instruction are printed in bold, and 'X' denotes a 'don't care' value.

Table 5: Bit-map of the program lines 31 – 33

Bit      31  32  33
EAC      0   0   0
LCA      X   X   X
UCA      X   X   X
EB       0   1   0
LCB      X   0   X
UDB      X   1   X
ETOG     0   0   0
ECB      1   0   0
SELCB0   0   X   X
SELCB1   1   X   X
ERA      1   1   0
SELFA    0   X   X
SELA0    0   1   X
SELA1    1   0   X
ERO      1   0   0
EMEM     1   0   0
SELRAM   1   X   X
RW       1   X   X
SELROM   0   X   X
EADD     0   0   1
SELF     X   X   0
EMUL     0   1   0
EDIV     0   0   0
EREG0    0   0   0
EREG1    0   0   0
EREG2    0   0   1
EREG3    1   0   0
SELM0    0   X   0
SELM1    1   X   0
SELM2    X   X   0
CMP      0   0   1
SELCMP0  X   X   0
SELCMP1  X   X   1
SELJ0    X   X   0
SELJ1    X   X   0
SELJ2    X   X   1
EOC      0   0   0
STOP     0   0   0

The meaning of the bit configurations in Table 5 is the following.

line 31: We have that ECB = 0b1 and the selector SELCB1|SELCB0 is 0b10. This indicates that the constant 0b11 is loaded into the address base register CB of Addr_Ctrl. Since ERA = 0b1, the offset register REGA is loaded in parallel with the loading of CB.
Because the function selector SELFA is 0b0 and the input selector SELA1|SELA0 of REGA is 0b10, the adder-subtractor of Addr_Ctrl is activated in the subtraction mode and the difference i - j is assigned to REGA. After these parallel register assignments, the address generation process is completed by loading the concatenation of CB and REGA into the effective address register ADR, enabled by the bit ERO = 0b1. Then, because EMEM = 0b1 indicates a memory access operation, and because the RAM block selector SELRAM and the mode selector RW are 0b1 and 0b0, respectively, the memory array RAM2 of the unit Ram2 is read using the value of ADR as the address. The result of this memory fetch operation is stored into the register REGZ of the register bank, since EREG3 = 0b1 ('enable REGZ') and the input selector SELM2|SELM1|SELM0 of the register bank has the value 0bX10 ('pass RAM2 to REGS and REGZ').

line 32: Now EB = 0b1, indicating that the counter j of Addr_Ctrl is activated. Because the mode selector LCB|UDB is 0b01 ('count up'), an incrementation takes place. The new value of j is stored into the address offset register REGA, controlled by the enabling bit ERA = 0b1 and the input selector SELA1|SELA0 = 0b01. Since EMUL = 0b1, the multiplier is activated in parallel with the operations in the address controller. The registers REGS and REGZ of the register bank are the operands of the multiplication. The result of the multiplication, in turn, is stored into the register REGY by default. Basically, the register REGX or REGZ could also be used as the output buffer by setting the bit EREG2 or EREG3 to 0b1 and the input selector SELM2|SELM1|SELM0 to 0b001.

line 33: Since EADD = 0b1 and the mode selector SELF is 0b0, the adder-subtractor of the functional unit of the system Lpa is activated in the subtraction mode. The registers REGX and REGY of the register bank are the operands of the subtraction.
The comparison bit CMP is 0b1, indicating that an if-statement is executed. This takes place in parallel with the subtraction described above. The comparator input selector SELCMP1|SELCMP0 is 0b10, which means that the contents of the address offset register REGA of Addr_Ctrl and the value of the address counter i are compared. If REGA < i holds, the comparator sets the flag JUMP to 0b1. Because the jump address selector SELJ2|SELJ1|SELJ0 has the value 0b100, the constant 30 is loaded into the program counter if the flag JUMP was set. If it was not set, i.e., if the comparator found the value of REGA to be greater than or equal to i, the program counter is not loaded but incremented normally.

8 System performance

From the point of view of the involved algorithms themselves, the system Lpa is completely sequential, even though it would in principle be quite possible to process the component procedures concurrently. The sequential approach makes low power consumption with a moderate area cost possible, because the system architecture is relatively simple, i.e., it does not contain heavy pipelining or very complex control logic and arbitration. However, in addition to the pipelined fetch-execute process, the system is capable of performing parallel operations at the instruction execution level as well. For example, computing the next memory address, accessing the memory with the current address, and computing some arithmetic operation can all take place in parallel. The comparator, in turn, operates in parallel with the controllers Addr_Ctrl and Comp_Ctrl. Furthermore, the register bank has two multiplexers at its input, which makes simultaneous loading of two distinct registers possible. The input values can come either from the same or from different sources. The above features require a wide instruction word (38 bits, see Table 1), but they considerably improve performance.
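The 38 bit names tabulated in Table 5 can be read as a VLIW-style field list for the instruction word. The decoder below is a sketch only: the field names come from Table 5, but the assumption that their listing order is also the physical bit order is ours, not the report's.

```python
# The 38 single-bit instruction fields, in the order of Table 5.
FIELDS = [
    "EAC", "LCA", "UCA", "EB", "LCB", "UDB", "ETOG", "ECB",
    "SELCB0", "SELCB1", "ERA", "SELFA", "SELA0", "SELA1",
    "ERO", "EMEM", "SELRAM", "RW", "SELROM", "EADD", "SELF",
    "EMUL", "EDIV", "EREG0", "EREG1", "EREG2", "EREG3",
    "SELM0", "SELM1", "SELM2", "CMP", "SELCMP0", "SELCMP1",
    "SELJ0", "SELJ1", "SELJ2", "EOC", "STOP",
]

def decode(word):
    """Split a 38-bit instruction word into named single-bit fields
    (bit i of the word feeds field FIELDS[i]; ordering assumed)."""
    assert 0 <= word < 1 << 38
    return {name: (word >> i) & 1 for i, name in enumerate(FIELDS)}

assert len(FIELDS) == 38          # the full 38-bit word, no spare bits
```

That the 38 names of Table 5 account for every instruction bit illustrates the "minimal coding" choice: each control point has its own dedicated field, so no decoder logic sits between the PROM and the controllers.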
In fact, we could decrease the program ROM width by coding the instructions more efficiently and designing a dedicated instruction decoder for them. However, we have not considered this an advantageous trade-off. Instead, we have preferred a wider instruction word with minimal coding, because this makes instruction handling straightforward and efficient.

Performance estimation

Below we present a very simple performance estimation for the analysis unit Lpa based on the timing parameters specified in the ES2 0.7 µm CMOS process data sheets [8]. For this we must naturally fix the word size of the data variables, which were viewed in the above formal descriptions as real-type entities with infinite precision. The word size also plays an essential role in the filter stability issues. Here we assume that the input and output values of the system Lpa are 16-bit entities, while the computations within Lpa use 32-bit arithmetic. We get for the worst-case data path delay of Lpa about 0.533 ms. The autocorrelation algorithm clearly dominates: it alone covers about 86.3 % (0.460 ms) of the mentioned 0.533 ms. The windowing algorithm takes 8.8 % (0.047 ms) and the Durbin's procedure 4.9 % (0.026 ms) of the total time. The delay value 0.533 ms has been computed by assuming that the down-going parts of the 4-phase handshake cycles are interleaved in such a way that they do not effectively take any extra time in the data path. This is achieved by using E-elements in the control logic (see Figs. 6 – 8, 10, and 5) and asymmetric matched delays in the data path. Furthermore, we have taken into account that the program ROM fetch delay, which is about 50 ns/instruction, limits the maximum operation speed to 20 MIPS. The system cannot operate faster than this, no matter how quickly an instruction is actually executed. The control logic has an intrinsic delay which should be added to the data path delay.
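The dominance of the autocorrelation phase follows already from rough operation counts. The model below is a back-of-the-envelope sketch under our own assumptions (one multiply per windowed sample, N - i products per lag R_i, and on the order of P² Durbin updates plus P divisions); it lands near the measured 86.3 / 8.8 / 4.9 split, at roughly 88 %, 8 %, and 4 %.

```python
# Rough operation counts for one frame: N = 256 samples, order P = 10.
# Counts instructions-worth of work, not gate delays, so the split
# only approximates the measured timing shares.

N, P = 256, 10

win = N                                     # one multiply per sample
aco = sum(N - i for i in range(P + 1))      # R_i needs N - i multiply-adds
dur = P * P + P                             # ~P^2 updates + P divisions

total = win + aco + dur
for name, ops in (("WIN", win), ("ACO", aco), ("DUR", dur)):
    print(f"{name}: {ops} ops, {100 * ops / total:.1f} %")
```

The residual gap to the measured shares is plausibly the division, which is far slower than a multiply-add and occurs only in the Durbin phase.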
We can roughly estimate that the total figure (data path delay + control path delay) is not more than 1 ms, which means that the potential throughput of Lpa is about 256 kword/s. Because the input stream is only 8 kword/s, there is a large enough margin to execute the other procedures of a speech compression system, including for example interpolation, filtering of the windowed samples, inverse filtering, and encoding of the reflection coefficients and the residual [17]. Taking into account that a new sample frame is processed every 32 ms, the system Lpa is idle for approximately 31 ms per frame. This fact, together with the properties of the asynchronous operation mode itself, indicates a potential for low power consumption.

9 Conclusions

We have presented here an asynchronous processor-like implementation of the linear predictive analysis algorithm. We used the action systems formalism as the correctness-preserving specification and development tool. The resulting parallel composition of action systems with separate control and data flows enabled us to map the three involved algorithms, i.e., windowing, autocorrelation, and Durbin's recursion, onto a single hardware implementation. Although only some of the details were presented, the correspondence between the action system description and its circuit implementation is clear. The PROM-based architecture we have selected has good performance and is relatively easy to manage: the system can be provided with a new algorithm by expanding the PROM and adding, if needed, RAM resources for data and constants for the arithmetic units, comparator, and program counter. The controllers can basically remain the same. The emphasis was on the control logic. We will continue the work by concentrating on the practical details of data path design. This includes for example the actual delay-matching and the real-number representation and precision issues.
Then we are able to carry out a more accurate performance and power consumption analysis.

References

[1] R. J. R. Back. On the Correctness of Refinement Steps in Program Development. PhD thesis, Department of Computer Science, University of Helsinki, Helsinki, Finland, 1978. Report A–1978–4.
[2] R. J. R. Back and R. Kurki-Suonio. Decentralization of process nets with centralized control. In Proc. of the 2nd ACM SIGACT–SIGOPS Symp. on Principles of Distributed Computing, pages 131–142, 1983.
[3] R. J. R. Back and K. Sere. Stepwise refinement of action systems. Structured Programming, 12:17–30, 1991.
[4] R. J. R. Back and J. von Wright. Refinement calculus, part I: Sequential nondeterministic programs. In J. W. de Bakker, W.–P. de Roever, and G. Rozenberg, editors, Stepwise Refinement of Distributed Systems: Models, Formalisms, Correctness. Proceedings 1989, volume 430 of Lecture Notes in Computer Science, pages 42–66. Springer–Verlag, 1990.
[5] K. van Berkel. Handshake Circuits: an Asynchronous Architecture for VLSI Programming. International Series on Parallel Computation, Cambridge University Press, 1993.
[6] A. Davis and S. M. Nowick. Asynchronous circuit design: motivation, background and methods. In G. Birtwistle and A. Davis, editors, Asynchronous Digital Circuit Design, pages 1–49. Springer, 1995.
[7] E. W. Dijkstra. A Discipline of Programming. Prentice–Hall International, 1976.
[8] ES2 0.7 µm CMOS. Technology and design kit documentation, Europractise, 1996.
[9] GSM Enhanced Full Rate (EFR) 06.10, Version 0.2.
[10] GSM Recommendation 06.10: Full Rate Speech Encoding and Decoding.
[11] J. Makhoul. Linear prediction: a tutorial review. Proceedings of the IEEE, 63(4):561–580, 1975.
[12] A. J. Martin. Compiling communicating processes into delay-insensitive VLSI circuits. Distributed Computing, 1:226–234, 1986.
[13] J. Plosila and K. Sere. Action systems in pipelined processor design. In Proc. of the 3rd Int. Symp. on Advanced Research in Asynchronous Circuits and Systems, pages 156–166, 1997.
[14] J. Plosila, R. Rukšėnas, and K. Sere. Delay-Insensitive Circuits and Action Systems. TUCS Technical Report No 60, November 1996.
[15] J. Plosila, R. Rukšėnas, and K. Sere. Action Systems Synthesis of DI Circuits. Manuscript, 1997.
[16] L. R. Rabiner and R. W. Schafer. Digital Processing of Speech Signals. Prentice-Hall, 1978.
[17] R. Steele, editor. Mobile Radio Communications. Pentech Press, 1992.