Lec24 Exploration


Embedded System Design for Wireless Applications
Jan M. Rabaey
BWRC
University of California @ Berkeley
http://www.eecs.berkeley.edu/~jan

DAC 2000, Los Angeles

The Distributed Approach to Information Processing

Source: Richard Newton

The Smart Home

Security
Environment monitoring and control
Dense network of sensor and monitor nodes
Object tagging
Identification

Wireless in the Home

Source: IEEE Spectrum, December 99
The Changing Metrics

Power

Cost

Flexibility

Performance as a Functionality Constraint (“Just-in-Time Computing”)

The Wireless System Design Challenge

The Battery Limitation


• Projected energy per digital operation (2004): 50 pJ
• Lithium-Ion: 220 Watt-hours/kg ≈ 800 Joules/g
• At 50 pJ/operation: 10 teraOps/g!
– Equivalent to continuous operation at 100 MOPS for 30 hours (or average power dissipation of 6 mW)
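The battery arithmetic can be rechecked with a short sketch. It uses only the two inputs stated on the slide (220 Wh/kg and 50 pJ per operation) plus the 100 MOPS workload; the variable names are illustrative. The unrounded results come out somewhat more optimistic than the slide's rounded figures, which are of the same order of magnitude.

```python
# Back-of-the-envelope battery budget from the slide's stated inputs.
WH_PER_KG = 220.0          # lithium-ion energy density (slide)
ENERGY_PER_OP_J = 50e-12   # projected 2004 energy per digital operation (slide)
RATE_OPS_PER_S = 100e6     # continuous 100 MOPS workload (slide)

# 1 Wh = 3600 J and 1 kg = 1000 g
joules_per_gram = WH_PER_KG * 3600.0 / 1000.0
ops_per_gram = joules_per_gram / ENERGY_PER_OP_J

# Average power of the workload and how long one gram of battery sustains it
power_w = RATE_OPS_PER_S * ENERGY_PER_OP_J
hours_per_gram = joules_per_gram / power_w / 3600.0

print(f"energy density : {joules_per_gram:.0f} J/g")
print(f"operations     : {ops_per_gram:.2e} ops/g")
print(f"average power  : {power_w * 1e3:.1f} mW")
print(f"lifetime       : {hours_per_gram:.0f} h per gram")
```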

Some interesting numbers
• Energy cost of digital computation
– 1999 (0.25µm): 1pJ/op (custom) … 1nJ/op (µproc)
– 2004 (0.1µm): 0.1pJ/op (custom) … 100pJ/op (µproc)
• Factor 1.6 per year; Factor 10 over 5 years
• Assuming reconfigurable implementation: 1 pJ/op
• Energy cost of communication
– 1999 Bluetooth (2.4 GHz band, 10m distance)
• 1 nJ/bit transmission energy (thermal limit 30 pJ/bit)
• Overall energy: 170 nJ/bit reception / 150 nJ/bit transmission (!)
• Standby power: 300 µW
– 2004 Radio (10 m)
• Only minor reduction in transmission energy
• Reduce transceiver energy by at least a factor of 10-50
• Trade-off
– @10m: 5000 operations / transmitted bit
– @ 1m: 0.5 operations / transmitted bit
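The two trade-off points differ by a factor of 10^4 for a tenfold change in distance, which is what a fourth-power path-loss law would give. The sketch below assumes that exponent (the deck states an R^-4 loss on its communication-cost chart) and anchors the curve at the slide's 10 m point; the function name and defaults are illustrative.

```python
def ops_per_bit(distance_m, ref_distance_m=10.0, ref_ops=5000.0, path_loss_exp=4.0):
    """Number of digital operations costing as much energy as sending one bit.

    Anchored at the slide's 10 m figure and scaled with an assumed
    fourth-power path-loss exponent.
    """
    return ref_ops * (distance_m / ref_distance_m) ** path_loss_exp


for d in (1.0, 3.0, 10.0):
    print(f"{d:5.1f} m -> {ops_per_bit(d):8.2f} operations per transmitted bit")
```

At short range, local computation is therefore only worthwhile if it saves about one transmitted bit per operation; at 10 m, thousands of operations can be spent to avoid sending a single bit.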

The Implementation Opportunities


System-on-a-Chip
• Embedded applications where cost, performance, and energy are the real issues!
• DSP and control intensive
• Mixed-mode
• Combines programmable and application-specific modules

[Figure: SOC anno 2010. Blocks: 500 k gates FPGA + 1 Gbit DRAM; analog; multi-spectral imager; preprocessing; 64 SIMD processor array + SRAM; µC system + 2 Gbit DRAM; image conditioning, 100 GOPS; recognition]
The System-on-a-Chip Nightmare

“Femme se coiffant”
Pablo Ruiz Picasso
1940

The System-on-a-Chip Nightmare
The “Board-on-a-Chip” Approach

[Figure: system bus connecting DMA, CPU, DSP, memory controller, and MPEG blocks; a bridge to a peripheral bus with I, O, and C blocks; custom interfaces and control wires. Courtesy of Sonics, Inc]
The Wireless Challenge

[Figure: Data and Time Granularity]
Control: UI, Call Setup, Slot Allocation, Synchronization
Data: Data Acquisition, Data Encoding, Data Formatting, Radio Mod/Demod
Layers: Application, Network, MAC/Data Link, Physical + RF
Data granularity: source data, streams, packets, bits
Time granularity: sec, msec, µsec, nsec

The Software Radio


[Figure: antenna feeding an A/D converter, a DSP, and a D/A converter]

• Idea: Digitize (wideband) signal at antenna and use signal processing to extract desired signal
• Leverages advances in technology, circuit design, and signal processing
• Software solution enables flexibility and adaptivity, but at huge price in power and cost
• 16 bit A/D converter at 2.2 GHz dissipates 1 to 10 W
The Mostly Digital Radio

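[Figure: RF input (fc = 2 GHz) passes through an RF filter and LNA, is mixed with cos[2π(2GHz)t] and sin[2π(2GHz)t], and the I and Q branches are each sampled by an A/D at 50 MS/s for the digital baseband receiver. The analog/digital split and the chip boundary are marked]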

Architectural Choices
[Figure: architectural choices plotted as flexibility versus 1/efficiency. From least to most flexible: direct mapped hardware; dedicated hardware logic (satellite processors, MAC unit, address generator); reconfigurable processor; software programmable DSP; general purpose processor (µP with program memory)]
The Energy-Flexibility Gap
[Figure: energy efficiency in MOPS/mW (or MIPS/mW), log scale from 0.1 to 1000, versus flexibility (coverage)]
• Dedicated HW
• Reconfigurable processor/logic: Pleiades, 10-80 MOPS/mW
• ASIPs, DSPs: 2 V DSP, 3 MOPS/mW
• Embedded processors: SA110, 0.4 MIPS/mW
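The efficiency figures on this chart can be compared with the energy-per-operation numbers quoted earlier by a unit conversion: 1 MOPS/mW is 10^9 operations per joule, i.e. 1 nJ per operation. A minimal sketch (the dictionary labels are shorthand for the chart's data points):

```python
def pj_per_op(mops_per_mw):
    """Convert an efficiency in MOPS/mW into energy per operation in pJ."""
    ops_per_joule = mops_per_mw * 1e6 / 1e-3   # (ops/s) divided by (J/s)
    return 1e12 / ops_per_joule


chart_points = {
    "SA110 embedded processor": 0.4,
    "2 V DSP": 3.0,
    "Pleiades (low end)": 10.0,
    "Pleiades (high end)": 80.0,
}

for name, efficiency in chart_points.items():
    print(f"{name:26s} {efficiency:5.1f} MOPS/mW = {pj_per_op(efficiency):7.1f} pJ/op")
```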

System Optimization Hierarchy

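[Figure: two-level hierarchy. Network level: functional and performance requirements drive the network architecture, followed by performance analysis. Constraints are passed down to the node level, where functional and performance requirements drive the node architecture, followed by performance analysis]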
The fully programmable approach
• Flexible platform for experimentation on networking and protocol strategies
• Size: 3”x4”x2”
• Power dissipation < 2 W (peak)
• Multiple radio modules: Bluetooth, Proxim, …
• Collection of sensor and monitor cards
• Fully operational by late spring (including software support system)!

Digital Intercom: A Design Exercise in Communication/Component Based Design
• Known and tested specification of limited complexity allows focus on architectural implementation methodology
• Two-chip implementation leverages the separation between analog (RF) and digital design concerns
• Duration of exercise: 1 year (summer ‘00)

[Figure: basestation and mobiles. Up to 20 users per cell @ 64 kbit/sec per link; TDMA selected as MAC protocol]
Two-Chip Intercom
[Figure: two-chip partition. Legend: custom analog circuitry; mixed analog/digital; fixed logic; programmable logic; software running on processor. Chip 1 (analog RF): LNA, mixer, PLL, filters, Σ-∆ ADC, DAC. Chip 2: digital baseband processing and protocol]

Direct down-conversion front-end (Yee et al)

The Target Architecture


[Figure: target architecture. Multi-model analog RF and filters feed an A/D (analog/digital boundary). Fixed hardware: physical layer (bit level) accelerators for coding, timing recovery, correlators, and MUD. Programmable hardware: DSP core. Embedded µP: application (phone book, keypad, display), transport, ARQ, MAC]
Digital Baseband
[Figure: Simulink example: matched filter correlator. Stateflow example: receiver controller. Stage 1: floating point blocks. Stage 2: fixed point blocks]

Stage 1: Model components in MATHWORKS tools (Simulink/Stateflow) at appropriate granularities
Stage 2: Convert Simulink structural blocks (manually) to fixed point / bit-true
Stage 3: Map to HW and/or SW (automatic synthesis path: Simulink to HW - Simulink to Software)

Design Estimations (First order):
RF + ADC/DAC
  Transmit: 30 mW
  Receive: 70 mW
Digital (conservative)
  Transmit: 20 mW (100,000 transistors)
  Receive: 80 mW (700,000 transistors)

Physical layer timing analysis (from Simulink)

Abstracted Simulation Results Drive Protocol Design!
Estimates for the performance of the TCI Physical layer (Tool: Microsoft Excel)

Rates and durations
                   Rate (Hz)   Rate (MHz)   Duration (s)   Duration (us)
Chip               2.50E+07    25.00        4.00E-08       0.04
Symbol             8.06E+05    0.81         1.24E-06       1.24
Bit                1.61E+06    1.61         6.20E-07       0.62
Pilot symbol                                1.24E-06       1.24
Pilot sequence                              1.86E-05       18.60
PD                                          1.24E-05       12.40
DAT                                         4.712E-03      4712.00

Additional Calculations
Chips per Symbol                 31
Bits per Symbol                  2
Pilot sequence length            15
Channel coherency time           1.00E-01
BB clock coherency time (s)      5.00E-03
Max # sequential symbols         4.03E+03
Safety margin                    95.00%
Safe # sequential symbols        3.83E+03
PD (# of symbols)                10
DAT (# of symbols)               3800 (OK)

The transmit protocol will send a pilot sequence, some small number of dummy data bits (PD), another pilot sequence, and the real data bits (DAT), with the constraint that DAT < safe # sequential symbols.
TX = PS | PD | PS | DAT

Distance and time of flight
                                 meters      feet
Min distance                     5           16.40
Max distance                     10          32.81
Speed of light: 1 foot per ns (1.00E+09 feet per s)
                                 s           us
Min LOS time of flight           1.64E-08    0.02
Max LOS time of flight           3.28E-08    0.03
Max time of flight (suggested by Paul bwo Dennis)
                                 1.00E-07    0.10

Radio turn-around
                                                              s           us
(1) RX to TX transition until first DAT clock on transmitter  4.96E-05    49.60
(2) TXCLK on radio A until RXCLK on radio B                   2.58E-06    2.58
(3) TX on radio A, and RX on radio B, until first DAT2 RXCLK on radio B
                                                              5.22E-05    52.18
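The derived spreadsheet entries follow from a handful of inputs, and a short script can recompute them. The sketch below takes the chip rate, chips per symbol, bits per symbol, pilot length, baseband clock coherency time, safety margin, PD and DAT lengths, and the 2.58 µs radio latency from the sheet; the variable names are illustrative and the relationships are inferred from the numbers rather than stated on the slide.

```python
# Inputs copied from the spreadsheet
CHIP_RATE_HZ = 25e6
CHIPS_PER_SYMBOL = 31
BITS_PER_SYMBOL = 2
PILOT_LEN_SYMBOLS = 15
BB_CLOCK_COHERENCY_S = 5e-3
SAFETY_MARGIN = 0.95
PD_SYMBOLS = 10
DAT_SYMBOLS = 3800
RADIO_LATENCY_S = 2.58e-6   # TXCLK on radio A until RXCLK on radio B

# Derived rates and durations
symbol_s = CHIPS_PER_SYMBOL / CHIP_RATE_HZ
bit_rate_hz = BITS_PER_SYMBOL / symbol_s
pilot_seq_s = PILOT_LEN_SYMBOLS * symbol_s

# Longest burst the baseband clock can track, with the safety margin applied
max_symbols = BB_CLOCK_COHERENCY_S / symbol_s
safe_symbols = SAFETY_MARGIN * max_symbols
dat_ok = DAT_SYMBOLS < safe_symbols

# TX = PS | PD | PS | DAT: time until the first DAT clock, then add radio latency
preamble_s = pilot_seq_s + PD_SYMBOLS * symbol_s + pilot_seq_s
first_dat_rx_s = preamble_s + RADIO_LATENCY_S

print(f"symbol duration      {symbol_s * 1e6:8.2f} us")
print(f"bit rate             {bit_rate_hz / 1e6:8.2f} MHz")
print(f"pilot sequence       {pilot_seq_s * 1e6:8.2f} us")
print(f"safe symbol count    {safe_symbols:8.0f} (DAT fits: {dat_ok})")
print(f"first DAT clock (TX) {preamble_s * 1e6:8.2f} us")
print(f"first DAT clock (RX) {first_dat_rx_s * 1e6:8.2f} us")
```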

Physical Layer Design
[Figure: three design domains. RF and communications: comm algorithms, analog and RF. Digital baseband (physical layer): radio operating mode, radio controller, DSP. Protocol/network: application/architecture, protocol design]

Digital baseband bridges gap between RF/Comm and protocol/network

Physical to Protocol Interface


◆ Different tools
◆ Verification relying on co-simulation
◆ Interface design critical to ensuring final designs work together
  – Define small number of interface signals
  – Clearly specify behavior

[Figure: chip 2, with digital baseband processing (MATLAB) and protocol (VCC) joined by the physical to protocol interface signals: TX, TX_CLK, TX_DATA, RX, RX_CLK, RX_DATA, ON/OFF]
The Intercom Protocol Stack
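[Figure: intercom protocol stack. Inputs at the top: service requests and voice samples. User interface layer: UI and two Mulaw blocks. Transport layer: Transport. MAC layer: Filter, MAC. Data link layer: Transmit, Receive, Synchronization. Signals at the bottom: Tx_data, Tx/Rx, Rx_data]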

Refinement-based Protocol Design Methodology

A CFSM-based approach

Advantages
• Combines synchronous and asynchronous models
• Constrained model enables verification

[Figure: design flow. The system spec, written in the input language, is captured as a CFSM model, which is simulated, formally verified, and refined into software (C) and hardware (VHDL)]
Co-design Finite State machines
• Three-level hierarchy
– top level: asynchronous, partially ordered
(bounded buffer, non-blocking, single-read communication)
– middle level: synchronous FSM
(atomic event- and condition-based transition)
– bottom level: Synchronous DataFlow-like
(FSM provides tokens and selects active sub-network)

(from ee249: http://www-cad.eecs.berkeley.edu/Respep/Research/hsc/class/index.html)
POLIS/VCC Design Flow

* (from the VCC manual)

Describing the Behavior


Layer | C-code (lines) | State-transition diagram (states)
User Interface 100

Mulaw 100

Transport 300

MAC 270 42
Transmit 120 16
Receive 140 2
Synchronization 17

• CFSM
• VCC, Polis

Formal Verification
• System satisfies certain properties?
– System described in some formal mathematical languages (e.g.
Esterel)
– Properties written in some formal logic (e.g. LTL) or formal model (e.g.
Esterel)
• Property Verification
– Invariant (only one remote can send voice data in any time slot)
– Response (if a remote sends a request to the base station, then
eventually there is an acknowledgement)
– deadlock freedom
• Refinement Checking
– Does the (low-level) implementation conform with the (high-level)
specification?
(Do the mapped CFSMs function the same as the specification?)
• Mocha (Henzinger): Modularity in Model Checking

Example of Property Verification


Property: the remote returns to the disconnected state if the user presses the disconnect button (Disc).

AG (Disc → AF NotConn)

[Figure: Remote and Base station state machines. Result: ✖ NOT OK]
Why Does It Fail?
• Remote accepts Disc from the user even if it is not connected
• The remote then sends DiscReq and waits for an acknowledgement
• However, the base station ignores DiscReq if the remote is not registered, so the acknowledgement never arrives
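The failure can be reproduced on a toy model. The sketch below is not the Mocha model from the exercise: the state names (`Connected`, `NotConn`, `WaitAck`), the `registered` flag, and the transition functions are all hypothetical simplifications of the three bullets above. Because the toy system is deterministic once Disc has been pressed, AF NotConn can be checked by following the single path until it either reaches NotConn or revisits a state.

```python
def press_disc(remote, registered):
    """Remote accepts Disc in any state, sends DiscReq, and waits for an ack."""
    return ("WaitAck", registered)


def step(remote, registered):
    """One autonomous step of the remote / base-station pair (toy model)."""
    if remote == "WaitAck":
        if registered:
            # Base station acknowledges and de-registers the remote.
            return ("NotConn", False)
        # Base station ignores DiscReq from an unregistered remote.
        return ("WaitAck", False)
    return (remote, registered)


def eventually_not_conn(state):
    """Check AF NotConn along the single deterministic path from `state`."""
    seen = set()
    while state not in seen:
        if state[0] == "NotConn":
            return True
        seen.add(state)
        state = step(*state)
    return False   # looped without ever reaching NotConn


# AG (Disc -> AF NotConn): Disc may be pressed in either starting state.
starting_states = [("Connected", True), ("NotConn", False)]
results = {s: eventually_not_conn(press_disc(*s)) for s in starting_states}

for start, holds in results.items():
    print(start, "-> property holds" if holds else "-> counterexample")
```

In this sketch the counterexample is the unconnected remote: it leaves NotConn when Disc is pressed and then waits forever in `WaitAck`.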

Targeted Implementation Platform


[Figure: embedded processor, memory sub-system, configurable logic, baseband processing (physical layer), and programmable protocol stack, all attached to an interconnect network]

Benefit: Build library of computational and networking modules (and models)
Describing the Architecture
• Xtensa embedded CPU (Tensilica, Inc)
– Configurability allows designer to keep “minimal” hardware overhead
– ISA (compatible with 32 bit RISC) can be extended for software optimizations
– Fully synthesizable
– Complete HW/SW suite
• VCC modeling for exploration
– Requires mapping of “fuzzy” instructions of VCC processor model to real ISA
– Requires multiple models depending on memory configuration
– ISS simulation to validate accuracy of model

◆ Tensilica model in VCC
inst,LD,2       inst,MUL.c,9     inst,DIV.i,118
inst,LI,1       inst,MUL.s,10    inst,DIV.l,122
inst,ST,2       inst,MUL.i,18    inst,DIV.f,145
inst,OP.c,2     inst,MUL.l,22    inst,DIV.d,155
inst,OP.s,3     inst,MUL.f,45    inst,IF,5
inst,OP.i,1     inst,MUL.d,55    inst,GOTO,2
inst,OP.l,1     inst,DIV.c,19    inst,SUB,19
inst,OP.f,1     inst,DIV.s,110   inst,RET,21
inst,OP.d,6
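A processor model of this kind can be used for first-order software performance estimates. The sketch below assumes that the third field of each `inst,<name>,<n>` entry is a cycle count, which the slide does not state explicitly, and it uses a made-up instruction mix; only a subset of the table is copied.

```python
# Subset of the model entries; the numbers are ASSUMED to be cycles per instruction.
CYCLES = {
    "LD": 2, "LI": 1, "ST": 2,
    "OP.i": 1, "MUL.i": 18, "DIV.i": 118,
    "IF": 5, "GOTO": 2, "SUB": 19, "RET": 21,
}


def estimate_cycles(instruction_mix):
    """Sum count * cost over the 'fuzzy' instructions executed by a code fragment."""
    return sum(CYCLES[name] * count for name, count in instruction_mix.items())


# Hypothetical mix for one pass through a small loop body
loop_body = {"LD": 4, "OP.i": 6, "MUL.i": 1, "IF": 2, "ST": 1}
cycles = estimate_cycles(loop_body)

CLOCK_HZ = 11e6   # illustrative clock frequency
print(f"{cycles} cycles, {cycles / CLOCK_HZ * 1e6:.2f} us at {CLOCK_HZ / 1e6:.0f} MHz")
```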

Describing The Architecture


The On-Chip Network
Example: “The Silicon Backplane” (Sonics, Inc)
• Open Core Protocol™
• SiliconBackplane Agent™
• Guaranteed bandwidth arbitration

[Figure: DMA, CPU, DSP, MPEG, I, O, C, and MEM blocks attached to the backplane]
Describing the Architecture
◆ SONICS model in VCC
• Flexible bandwidth arbitration model
– TDMA slot map gives slot owner right of refusal
– Unowned/unused slots fall to round-robin arbitration
– Latency after slice granted is user-specified between 2-7 bus clock cycles

[Figure: initiator core and initiator agent (OCP) connected through the interconnect and arbiter to target agent and target core (OCP)]
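The two-level arbitration described above can be sketched as follows. This is not the Sonics implementation or the VCC model; the class, its fields, and the example slot map are hypothetical, and only the policy itself (owner first, then round-robin over the remaining requesters) is taken from the slide.

```python
class SlotArbiter:
    """TDMA slot map with round-robin fallback (illustrative sketch)."""

    def __init__(self, slot_map, initiators):
        self.slot_map = slot_map      # slot index -> owning initiator, or None
        self.initiators = initiators  # fixed order used for round-robin
        self.rr_next = 0              # where the next round-robin scan starts

    def grant(self, slot, requesting):
        """Return the initiator granted this slot, or None if nobody requests."""
        owner = self.slot_map[slot % len(self.slot_map)]
        if owner is not None and owner in requesting:
            return owner              # the owner exercises its right to the slot
        # Unowned slot, or owner declined: fall back to round-robin.
        n = len(self.initiators)
        for offset in range(n):
            index = (self.rr_next + offset) % n
            candidate = self.initiators[index]
            if candidate in requesting:
                self.rr_next = (index + 1) % n
                return candidate
        return None


arbiter = SlotArbiter(slot_map=["CPU", None, "DSP", None],
                      initiators=["CPU", "DSP", "DMA"])
grants = [
    arbiter.grant(0, {"CPU", "DMA"}),   # CPU owns slot 0 and is requesting
    arbiter.grant(1, {"CPU", "DMA"}),   # unowned slot: round-robin picks CPU
    arbiter.grant(2, {"DMA"}),          # owner DSP is idle: slot falls to DMA
    arbiter.grant(3, set()),            # nobody requests
]
print(grants)
```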

TCI Architecture

[Figure: ASIC and Tensilica Xtensa connected through the SiliconBackplane]
Exploring Architectural Mappings
[Figure: candidate mapping. Software on the processor: application, transport, Mu-law, MAC. ASIC: accelerators and the rest]

Processor Utilization - Estimation


[Figure: processor utilization versus clock frequency for an ARM at 1 MHz, 11 MHz, 200 MHz, and 2 GHz. Labeled utilization values: 32.7%, 5.46%, and 2.7%. Software loads considered: Mulaw, Transport, and User Interface, optionally with 0.5 MAC or 0.9 MAC. Annotations: peak performance, latency insensitive, RTOS overhead]
Implementation Fabrics for Protocols
A protocol = Extended FSM

[Figure: Intercom TDMA MAC. State machines (idle, RACH req, RACH akn, slotset, update) control a memory (read, write, R_ENA, W_ENA, BUF) holding Slot_Set_Tbl (2x16), with fields addr, slot_set <31:0>, Slot_no <5:0>, slot start, and pkt end]

Intercom TDMA MAC

Implementation alternatives
            ASIC          FPGA          ARM8
Power       0.26 mW       2.1 mW        114 mW
Energy      10.2 pJ/op    81.4 pJ/op    n*457 pJ/op
Ratio       1             8             >> 400

ASIC: 1 V, 0.25 µm CMOS process
FPGA: 1.5 V, 0.25 µm CMOS low-energy FPGA
ARM8: 1 V, 25 MHz processor; n = 13,000

Idea: Exploit model of computation: concurrent finite state machines, communicating through message passing
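The quoted ratio of 1 - 8 - >> 400 can be checked directly from the table. The sketch below only divides the tabulated power and energy figures; it makes no assumption about what n stands for in the ARM8 energy entry, so that entry is left out.

```python
power_mw = {"ASIC": 0.26, "FPGA": 2.1, "ARM8": 114.0}
energy_pj_per_op = {"ASIC": 10.2, "FPGA": 81.4}   # ARM8 entry depends on n, omitted

# Normalize everything to the ASIC implementation
power_ratio = {name: value / power_mw["ASIC"] for name, value in power_mw.items()}
energy_ratio = {name: value / energy_pj_per_op["ASIC"]
                for name, value in energy_pj_per_op.items()}

for name, ratio in power_ratio.items():
    print(f"{name:5s} power  ratio vs ASIC: {ratio:7.1f}")
for name, ratio in energy_ratio.items():
    print(f"{name:5s} energy ratio vs ASIC: {ratio:7.1f}")
```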

HW Mapping Experiment: STD to Std. Cell
Area Comparison – Manual versus Automated

[Figure: bar chart of # gates (0 to 3500) for PhySend. Series: Manual, SF2VHD, STD2V, and STD2C2CLD, each synthesized with Design Compiler]

HW Mapping Experiment: STD to FPGA


Area Comparison – Manual versus Automated

[Figure: bar chart of # gates (0 to 7000) for PhySend. Series: Manual, SF2VHD, STD2V, and STD2C2CLD, each synthesized with FPGA Express]
HW Mapping Experiment: STD to Flexible Imp.
Area Comparison - FPGA x PLD (Manual)

[Figure: bar chart of # gates (0 to 1400) for TCI CRC, TCI CRC+FSM, and PhysSend. Series: Xilinx FPGA, Altera PLD, CoolRunner]

HW Mapping Experiment: Flexible versus Fixed


Area Comparison – FPGA x Std.Cell (Manual)
[Figure: bar chart of # gates (0 to 1400) for TCI CRC, TCI CRC+FSM, and PhysSend. Series: Xilinx FPGA, Design Compiler]
HW Mapping Experiment: Power
Consumption
FPGA versus PLD

[Figure: bar chart of power consumption (0 to 70) for TCI CRC, TCI CRC+FSM, and PhysSend FSM. Series: LCA, MAX7000]

Hierarchy in System Optimization

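[Figure: two-level hierarchy. Network level: functional and performance requirements drive the network architecture, followed by performance analysis. Constraints are passed down to the node level, where functional and performance requirements drive the node architecture, followed by performance analysis]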
The Applications and Specs
The Obvious Choice: The Smart Home and Network Appliances

Security
Environment monitoring and control
Dense network of sensor and monitor nodes
Object tagging
Identification

System Requirements and Constraints

• Large numbers of nodes: between 0.05 and 1 nodes/m²
• Cheap (< $0.50) and small (< 1 cm³)
• Limited operation range of network: maximum 50-100 m
• Low data rates per node: 1-10 bits/sec average
– up to 10 kbit/sec in rare local connections to potentially support non-latency critical voice channel
• Crucial Design Parameter:
Spatial capacity (or density): 100-200 bits/sec/m²
The Software-Defined Radio

[Figure: FPGA, embedded µP, dedicated FSM, dedicated DSP, reconfigurable datapath]

System-Level Design Space Exploration

[Figure: a communication request (data type, BW, latency, BER) from source (xs,ys) to destination (xd,yd) is refined through the network layer (point-to-point, multi-hop, star), the media access layer (T-C-F-DMA), and the physical layer (band, modulation) into an implementation in hard- and software]

• Based on well-defined abstraction layers
• Step-wise refinement (partitioning, resource mapping and sharing) enables correctness verification
• Automatic synthesis of adaptive protocols in hard- and software
PicoRadio Energy Optimization
The Cost of Communication
Assumes R^-4 loss due to ground wave (@ 1 GHz)

[Figure: transmit power and transceiver power (-70 dBm to 90 dBm) versus distance (1 m to 10 km) at 10 Kbps]

Communicating over Long Distances


Multi-hop Networks

Example:
• 1 hop over 50 m: 1.25 nJ/bit
• 5 hops of 10 m each: 5 × 2 pJ/bit = 10 pJ/bit
• Multi-hop reduces transmission energy by 125! (assuming path loss exponent of 4)

[Figure: source and destination connected through intermediate hops; plot of the optimal number of hops needed for free space path loss, against log(β/α)]

But … network discovery and maintenance overhead
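The multi-hop example follows from scaling the single-hop energy with distance to the fourth power. The sketch below anchors the curve at the slide's 50 m figure (1.25 nJ/bit) and counts transmission energy only, ignoring the receiver, relay processing, and the discovery and maintenance overhead noted above; the function names are illustrative.

```python
def tx_energy_pj(distance_m, ref_distance_m=50.0, ref_energy_pj=1250.0, path_loss_exp=4.0):
    """Transmission energy per bit for one hop, scaled from the 50 m reference."""
    return ref_energy_pj * (distance_m / ref_distance_m) ** path_loss_exp


def multihop_energy_pj(total_distance_m, hops):
    """Total transmission energy per bit over equal-length hops."""
    return hops * tx_energy_pj(total_distance_m / hops)


single = multihop_energy_pj(50.0, 1)
five = multihop_energy_pj(50.0, 5)
print(f"1 hop : {single:7.1f} pJ/bit")
print(f"5 hops: {five:7.1f} pJ/bit")
print(f"saving: {single / five:.0f}x")
```

With a path loss exponent of 4, splitting a link into k equal hops saves a factor of k³ in transmission energy, which gives 125 for k = 5.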

[Figure: OPNET network simulator, showing the network model, node model, process model, and analysis viewer]

Comparing the approaches from an energy perspective
• Energy = Eb * Packet Size
• Reactive Routing good for rarely used routes
• Proactive Routing good for frequently used routes
• Need solution that is more adequate for problem at hand: class-based and location-based addressing.

[Figure: two bar charts of routing overhead (bytes) versus number of nodes (20, 33, 56) for DSDV and AODV. Left: discovering one route (0 to 4000 bytes). Right: normalized, discovering n routes (0 to 16000 bytes)]
Summary
• Low-energy design ascends to prime time
forced mainly by the last-meter problem
• System-on-a-Chip approach enables and demands
heterogeneous implementation strategies, sometimes involving
non-intuitive and innovative design platforms
• Design exploration over various fabrics and partitions has
dramatic impact on dominant metrics, such as energy and cost
• It requires orthogonalization of function and architecture,
supplemented with performance models (cost, time, energy)
• This methodology holds at all levels of the system hierarchy
