Schiavone Wosh2019 Tutorial
PULP
Integrated Systems Laboratory
13.06.2019
Near Sensor (aka Edge) Processing
✓ Smart Architecture
✓ Parallel Processing
✓ Power-saving Design
✓ Near-Threshold
✓ Low Power Technology
1 ÷ 3 GOPS
1 ÷ 30 mW
Idle: ~1µW
100 µW ÷ ~10 mW
Active: ~ 50mW
| |
PULPissimo Architecture
[Figure: PULPissimo SoC block diagram: RISC-V core and HWCE; L2 interleaved banks, L2 private banks and ROM; APB peripherals: GPIO, I2S, Pad Control, SPI M, DEBUG, INTC, CAMIF, µDMA, I2C, UART, EVENT, TIMER, SoC CTRL, CLK, SDIO]
| |
PULPissimo Architecture
• RISC-V based advanced microcontroller
– 512kB of L2 Memory
– 16kB of energy efficient latch-based memory (L2 SCM BANK)
– Camera Interface (up to 320x240@60fps)
– I2C, I2S (up to 4 digital microphones)
– SPI M
– Interrupt controller, Bootup ROM
• Autonomous IO DMA Subsystem (µDMA)
• Power management
– 2 low-power FLLs (IO, SoC)
| |
PULP Cluster Architecture
[Figure: PULP cluster block diagram: eight RI5CY cores, two shared FPUs, DMA, logarithmic interconnect, cluster bus, peripheral interconnect, event unit, timer]
| |
PULP Cluster Architecture
– Optimize cache usage
• Multi-Core event unit for barriers
transfers
| |
The RISC-V PULP cores
| |
Different Workload? Different core
Ariane
RI5CY+FPU
RI5CY
Zero-riscy
Micro-riscy
| |
RI5CY Processor: our workhorse core
• 4-stage pipeline
– RV32IMFCXpulp
– 70K GF22 nand2 equivalent gate (GE) + 30KGE for FPU
– Coremark/MHz 3.19
– Includes various extensions
• pSIMD
• Fixed point
• Bit manipulations
• HW loops
• Silicon Proven
– SMIC130, UMC65, TSMC55LP, TSMC40LP, GF22FDX
• NEW Floating Point Unit:
– Iterative DIV/SQRT (9 cycles)
– Parametrizable latency for MUL, ADD, SUB, Cast
– Single cycle load, store
https://github.com/pulp-platform/riscv
| |
RI5CY simplified pipeline
[Figure: RI5CY simplified pipeline: PC gen, align and decompress (IF); IF/ID register; decode, RF read, operands forwarding, jumps (ID); ID/EX register; EX with AGU, branches; EX/WB register; RF write]
| |
PULP Cores Memory Interface (1/2)
§ Request with Address (32bits) and request (1bit) signal
§ Byte Enable (BE) (4 bits): selects a byte, short or word memory transaction in case of Load/Store
§ Write Enable (WE) (1 bit)
§ wdata (32bits): data to write in case of store operations
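The request fields listed above can be sketched as a small C model. This is only an illustration: the struct `mem_req_t`, the helpers `byte_enable` and `make_store_req`, the mapping of BE bit i to byte lane i, and the placement of store data on the enabled lanes are all assumptions, not taken from the RTL.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical model of one request on the core data interface.
 * Field names follow the slide; the struct itself is illustrative. */
typedef struct {
    bool     req;    /* request valid (1 bit) */
    uint32_t addr;   /* 32-bit address */
    uint8_t  be;     /* 4-bit byte enable; bit i selects byte lane i (assumed) */
    bool     we;     /* write enable: true for a store, false for a load */
    uint32_t wdata;  /* data to write for stores */
} mem_req_t;

/* Byte-enable mask for an aligned access of `size` bytes (1, 2 or 4). */
static uint8_t byte_enable(uint32_t addr, unsigned size)
{
    uint8_t lanes = (uint8_t)((1u << size) - 1u);      /* 0x1, 0x3 or 0xF */
    return (uint8_t)((lanes << (addr & 3u)) & 0xFu);
}

/* Builds a store request; data is shifted onto the enabled lanes (assumed). */
static mem_req_t make_store_req(uint32_t addr, uint32_t value, unsigned size)
{
    mem_req_t r;
    r.req   = true;
    r.addr  = addr;
    r.be    = byte_enable(addr, size);
    r.we    = true;
    r.wdata = value << (8u * (addr & 3u));
    return r;
}
```

For example, a byte store to an address ending in 0x3 would enable only lane 3, while a word store enables all four lanes.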
| |
PULP Cores Memory Interface (2/2)
§ Back2Back Memory Transactions
| |
Xpulp Extensions: General Purpose Extensions 1
• DSP extensions
• General purpose
• ABS, CLIP/Saturation
• MIN, MAX
• MAC and MSU
• Fixed Point Support
• ADD and SUB with normalization and round
• MUL and MAC with normalization and round
Original RISC-V:
    add x4, x4, x5
    addi x4, x4, 1
    srli x4, x4, 1
DSP Ext:
    p.addRN x4, x4, x5, 1
§ Possibility to share some resources
§ ABS reuses the adder and comparator in the ALU
§ Clip adds a comparator but reuses adder and previous comparator
§ Normalization done by connecting adder output to the shifter
§ Round done by exploiting multi-operand adders
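As a rough illustration of what "normalization and round" and "clip" compute, here is a minimal C sketch. The function names are hypothetical. `add_round_norm_u` mirrors the add / addi / shift sequence shown for the original RISC-V code (unsigned, wrapping 32-bit arithmetic), and the signed clipping range used in `clip` is an assumption of this sketch.

```c
#include <stdint.h>

/* Add, round and normalize: (a + b + 2^(shift-1)) >> shift,
 * using wrapping unsigned arithmetic. For shift = 1 this matches
 * add, then add 1, then logical shift right by 1. */
static uint32_t add_round_norm_u(uint32_t a, uint32_t b, unsigned shift)
{
    uint32_t sum = a + b;
    if (shift > 0)
        sum += 1u << (shift - 1);
    return sum >> shift;
}

/* Saturate x to the signed range [-2^(bits-1), 2^(bits-1) - 1].
 * `bits` must be in 1..32. The range is an assumption of this sketch. */
static int32_t clip(int32_t x, unsigned bits)
{
    int32_t hi = (int32_t)((1u << (bits - 1)) - 1u);
    int32_t lo = -hi - 1;
    return x < lo ? lo : (x > hi ? hi : x);
}
```

In hardware the same result comes from one instruction; the sketch only shows the arithmetic, not the resource sharing described in the bullets.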
| |
Xpulp Extensions: packed-SIMD 1/4
• packed-SIMD extensions
• RISC-V reserved the “RVP” extensions but it is still an ongoing topic
• It also includes DSP extensions
• Unlike the “RVV” vector extensions, vectors are packed into the integer RF
• Reuses existing resources to gain performance with little overhead
• Targets embedded systems; RVV is for high performance
• pSIMD in 32-bit machines
• Vectors are either 4 8-bit elements or 2 16-bit elements
• pSIMD instructions
Computation: add, sub, shift, avg, abs, dot product
Compare: min, max, compare
Manipulate: extract, pack, shuffle
| |
Xpulp Extensions: packed-SIMD 2/4
• Same Register-file
• The instruction encodes how to interpret the content of the register
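To make the point about interpretation concrete, the sketch below adds the same two 32-bit register values once as four 8-bit lanes and once as two 16-bit lanes. The function names `add_4x8` and `add_2x16` are hypothetical stand-ins for byte and halfword vector adds; this is plain C emulation, not the hardware datapath.

```c
#include <stdint.h>

/* Lane-wise add treating each 32-bit word as four 8-bit elements. */
static uint32_t add_4x8(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t x = (a >> (8 * i)) & 0xFFu;
        uint32_t y = (b >> (8 * i)) & 0xFFu;
        r |= ((x + y) & 0xFFu) << (8 * i);   /* carry does not cross lanes */
    }
    return r;
}

/* Lane-wise add treating each 32-bit word as two 16-bit elements. */
static uint32_t add_2x16(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0xFFFFu;
    uint32_t hi = ((a >> 16) + (b >> 16)) & 0xFFFFu;
    return (hi << 16) | lo;
}
```

With a = 0x01FF01FF and b = 0x00010001 the byte view wraps each 0xFF lane to 0x00, while the halfword view carries into the upper byte of each lane, so the two results differ even though the register contents are identical.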
| |
Xpulp Extensions: packed-SIMD 4/4
§ Shuffle instructions
§ In order to use the vector unit the elements have to be aligned in the
register file
§ Shuffle recombines bytes from two registers into 1 register
§ pv.shuffle2.b rD, rA, rB (rB holds the mask bits)
rD{3} = ((rB[26]==0) ? rA : rD){rB[25:24]}
rD{2} = ((rB[18]==0) ? rA : rD){rB[17:16]}
rD{1} = ((rB[10]==0) ? rA : rD){rB[ 9: 8]}
rD{0} = ((rB[ 2]==0) ? rA : rD){rB[ 1: 0]}
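The byte selection written above can be emulated in a few lines of C. This sketch follows the selection rule exactly as it appears on the slide (mask bit 2 equal to 0 picks rA, otherwise the old rD); the polarity should be checked against the core's user manual before relying on it. The names `get_byte` and `shuffle2_b` are hypothetical.

```c
#include <stdint.h>

/* Extract byte i (0 = least significant) from a 32-bit word. */
static uint8_t get_byte(uint32_t w, unsigned i)
{
    return (uint8_t)(w >> (8u * i));
}

/* For each destination byte i, bit 2 of mask byte i picks the source
 * (0: rA, 1: old rD, as written on the slide) and bits [1:0] pick the
 * byte within that source. */
static uint32_t shuffle2_b(uint32_t rD_old, uint32_t rA, uint32_t rB)
{
    uint32_t out = 0;
    for (unsigned i = 0; i < 4; i++) {
        uint8_t  sel = get_byte(rB, i);
        uint32_t src = (((sel >> 2) & 1u) == 0) ? rA : rD_old;
        out |= (uint32_t)get_byte(src, sel & 3u) << (8u * i);
    }
    return out;
}
```

A mask of 0x03020100 would copy rA unchanged, and other masks interleave bytes of rA and the previous rD.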
| |
ISA Extensions: Putting it All Together
| |
ALU architecture
| |
MUL architecture
§ (blue) 16x16 with sign selection for
short multiplications [with round and
normalization]. 5-cycle FSM for the
upper 32 bits of the 64-bit product (mulh* instructions)
35:2 compressor
| |
2D Convolution with Xpulp Extensions:
performance + less memory pressure
§ Convolution in registers
§ 5x5 convolutional filter
§ 7 Sum-of-dot-product
§ 4 move
§ 1 shuffle
§ 3 lw/sw
§ ~ 5 control instructions
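One way to read the count of seven sum-of-dot-product operations: a 5x5 filter has 25 taps, and with four 8-bit elements per operation, 25 taps padded to 28 need 7 four-wide steps. The C sketch below shows that arithmetic for one output pixel. The helper names are hypothetical, and the zero padding is an assumption of this sketch rather than the scheme used in the actual kernel (which keeps data in registers and uses moves and a shuffle).

```c
#include <stdint.h>

/* Sum of dot product on four signed 8-bit pairs, accumulated into acc. */
static int32_t sdot4(int32_t acc, const int8_t *a, const int8_t *b)
{
    for (int i = 0; i < 4; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}

/* One output pixel of a 5x5 filter. The 25 pixel/coefficient pairs are
 * padded with zeros to 28 entries so that 7 four-wide steps cover them. */
static int32_t conv5x5_pixel(const int8_t window[25], const int8_t coeff[25])
{
    int8_t w[28] = {0}, c[28] = {0};
    for (int i = 0; i < 25; i++) {
        w[i] = window[i];
        c[i] = coeff[i];
    }

    int32_t acc = 0;
    for (int k = 0; k < 7; k++)            /* 7 sum-of-dot-product steps */
        acc = sdot4(acc, &w[4 * k], &c[4 * k]);
    return acc;
}
```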
| |
PULP core examples – RV32IMC vs RV32IMCXpulp General code
• 2 bytes saved (X instructions not compressed)
• Number of instructions reduced (21 vs 18)
• Removed branch penalties
RV32IMC:
start_loop:
    addi a6,t1,-32
    c.mv a7,t5 //address of matA
    addi t3,a0,-32
loop0:
    c.mv a4,t3 //address of matA
    c.mv a2,a7
    c.li a5,0
loop1:
    lbu a3,0(a4) //load byte
    lbu a1,0(a2)
    c.addi a4,a4,1 //post increment
    mul a3,a3,a1 //mul
    c.add a5,a5,a3 //acc after mul
    andi a5,a5,255
    c.addi a2,a2,1
    bne a4,a0,loop1 // branch penalty
    sb a5,0(a6)
    c.addi a6,a6,1
    addi a7,a7,32
    bne a6,t1,loop0 //branch penalty
    addi t1,a6,32
    addi a0,a4,32
    bne t1,t4,start_loop
RV32IMCXpulp:
start_loop:
    addi t3,t5,-32
    c.mv a7,s2 //address of matB
    addi t1,t4,-32
    lp.setupi x0,32,stop0
    c.mv a3,t1 //address of matA
    c.mv a2,a7
    c.li a5,0
    sub a4,t4,t1 //loop count1
    lp.setup x1,a4,stop1 //hw loop
    p.lbu a0,1(a3!) //load byte with post increment
    p.lbu a1,32(a2!)
    p.mac a5,a0,a1 //mac
stop1: andi a5,a5,255
    p.sb a5,1(t3!) //store result with post increment
stop0: c.addi a7,a7,1
    addi t5,t5,32
    addi t4,a3,32
    bne t5,t6,start_loop
| |
PULP core examples – RV32IMCXpulp General code vs Opt code
General code:
    …
    lp.setup x1,a4,stop1
    p.lbu a0,1(a3!)
    p.lbu a1,1(a2!)
stop1: p.mac a5,a0,a1
    ….
Opt code:
    … //iterate #COL/4
    lp.setup x1,a6,stop1
    p.lw a1,4(t1!) //load 4-bytes with post inc
    p.lw a5,4(t3!)
stop1: pv.sdotsp.b a7,a1,a5 //4 mac
    ….
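The difference between the two inner loops can be sketched in portable C: the general version does one multiply-accumulate per iteration, the optimized version loads one 32-bit word per operand and does four per iteration, so it iterates COL/4 times. The function names are hypothetical, and the inner `for` over four lanes only stands in for the single dot-product instruction.

```c
#include <stdint.h>
#include <string.h>

/* Byte-at-a-time version: one multiply-accumulate per iteration. */
static int32_t dot_scalar(const int8_t *a, const int8_t *b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}

/* Four-at-a-time version: n/4 iterations. n must be a multiple of 4. */
static int32_t dot_packed(const int8_t *a, const int8_t *b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i += 4) {
        uint32_t wa, wb;
        memcpy(&wa, a + i, 4);             /* stands in for a word load with post-increment */
        memcpy(&wb, b + i, 4);
        for (int k = 0; k < 4; k++) {      /* stands in for the 4-lane dot product */
            int8_t x = (int8_t)(wa >> (8 * k));
            int8_t y = (int8_t)(wb >> (8 * k));
            acc += (int32_t)x * y;
        }
    }
    return acc;
}
```

Both functions return the same value for the same inputs; only the number of loop iterations and memory accesses changes.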
| |
PULP cores Interrupts
• Asynchronous events
• If an interrupt is taken, the core jumps to xtvec
• xtvec holds the base address to jump to
• The actual address is the base + 4*interrupt ID
• No delegation supported
• All interrupts are handled in machine mode
• An external interrupt controller interacts with the peripheral
subsystem and SW events
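The address computation described above is a one-liner. In this sketch the function name and the base address used in the usage note are hypothetical; only the "base + 4 * ID" rule comes from the slide.

```c
#include <stdint.h>

/* Handler address = vector base + 4 * interrupt ID. */
static uint32_t irq_vector_addr(uint32_t xtvec_base, unsigned irq_id)
{
    return xtvec_base + 4u * irq_id;
}
```

For instance, with an assumed base of 0x1C000000, interrupt ID 26 would vector to 0x1C000068.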
| |
PULP cores interrupts protocol
[Figure: the Interrupt Controller sends req and a 5-bit id to the PULP Core; the PULP Core answers with ack and a 5-bit id]
| |
PULP cores interrupts protocol – timing diagram
| |
Wait For Interrupt & Power manager
• The pipeline and state registers are clock gated when not used
• The ALU, Integer Multiplier and Dot Product units have separate operand registers
• In the ID stage, the decoded instruction belongs to one of these 3 domains; the other 2 are
clock gated
| |
Performance Counter 1/3
PC IF ID EX WB
| |
Performance Counter 2/3
PC IF ID EX WB
//JALR STALL
mul x10, x10, x10
jalr x11, x10, 0x4
| |
Performance Counter 3/3
| |
Example Performance Counter
| |
Simulation Tracer
• For every instruction executed, the core prints a line to a file:
• “TIME STAMP – PC – INSTRUCTION – OPERANDs and RESULTs”
[Figure: trace excerpt annotated with PC, instruction encoding and disassembled instruction]
| |
Hybrid Logarithmic Interconnect
| |
[Figure: hybrid interconnect: non-interleaved side (XBAR and BRIDGE) to APB, AXI32, ROM, L2_PRI[0], L2_PRI[1]; interleaved L2 XBAR to L2 banks 0 to 3; AXI64_to_LINT32 adapter from the 64-bit AXI bus]
| |
Interconnect performance
[Figure: L2 multiport with interleaving support: four interleaved memory banks, each built from several memory cuts; SoC clock]
| 27.02.2019 |
[Figure: hybrid interconnect diagram as before (non-interleaved APB, AXI32, ROM, L2_PRI[0], L2_PRI[1]; interleaved L2 banks 0 to 3; AXI64_to_LINT32 from the 64-bit AXI bus), annotated with a bank conflict example]
• UDMA_RX and CORE_DATA both want to write to bank 1 of the interleaved L2
• One is stalled while the other makes the transaction (bank conflict)
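A bank conflict on the interleaved L2 can be sketched as a simple address check. The word-level interleaving (consecutive 32-bit words map to consecutive banks) is an assumption of this sketch, as are the function names; the slide only shows four interleaved banks.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_BANKS 4u   /* four interleaved banks, as drawn on the slide */

/* Assumed word-level interleaving: consecutive 32-bit words map to
 * consecutive banks. */
static unsigned bank_of(uint32_t addr)
{
    return (addr >> 2) % NUM_BANKS;
}

/* Two masters issuing in the same cycle conflict when they hit the same bank. */
static bool bank_conflict(uint32_t addr_a, uint32_t addr_b)
{
    return bank_of(addr_a) == bank_of(addr_b);
}
```

Under this assumption, two masters streaming through different words of the same 16-byte line would land on different banks and proceed in parallel, while two accesses 16 bytes apart would collide.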
| |
Peripheral Interconnect
| |
Peripheral Bus
• Only one APB request at a time! Additional requests are stalled in the hybrid logarithmic interconnect (as for a bank conflict)
[Figure: peripheral bus at 0x1A1_00000 connecting FLL, GPIO, UDMA, SOC CTRL, ADV. TIMER, EVENT, INTC, DEBUG, TIMER, HWCE]
| |
µDMA: An Autonomous I/O
Subsystem
| |
I/O requirements
[Figure: I/O bandwidth requirements around a high-performance µC. Labels: up to 2.4 Gbit/s; 46 Mbit/s; 320x240@25fps; new SD standard up to 800 Mbit/s; QuadSPI up to 400 Mbit/s; 3 Mbit/s per channel; RF transceivers > 100 Mbit/s; peak BW > 1 Gbit/s]
| 27.02.2019 |
uDMA Subsystem
[Figure: µDMA core with APB, UDMA_TX and UDMA_RX ports; per-peripheral CONFIG registers (cfg_data), PERIPH TX PROTOCOL (tx_data) and PERIPH RX PROTOCOL (rx_data) blocks; DSP stream]
| |
Offload pipeline
DOUBLE BUFFERING
[Figure: timeline with three rows. CPU: Offload, Start DMA (repeated three times). Acc. DMA: Copy L2 to Acc (three times). Acc. Processing: Process Buffer0, Process Buffer1, Process Buffer0]
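The buffer alternation in the timeline can be sketched in C. This is a sequential model only: in hardware the copy of the next tile and the processing of the current tile overlap in time. `dma_copy`, `process`, `run_double_buffered` and the tile size are hypothetical stand-ins, not part of any PULP API.

```c
#include <stdint.h>
#include <string.h>

#define TILE 4   /* elements per tile (illustrative) */

/* Stand-in for a DMA transfer from L2 into an accelerator buffer. */
static void dma_copy(int32_t *dst, const int32_t *src)
{
    memcpy(dst, src, TILE * sizeof *dst);
}

/* Stand-in for the accelerator job: here, just a sum over the tile. */
static int32_t process(const int32_t *buf)
{
    int32_t s = 0;
    for (int i = 0; i < TILE; i++)
        s += buf[i];
    return s;
}

/* Processes n_tiles tiles from l2 using two alternating buffers. */
static int32_t run_double_buffered(const int32_t *l2, int n_tiles)
{
    int32_t buf[2][TILE];
    int32_t total = 0;

    if (n_tiles <= 0)
        return 0;
    dma_copy(buf[0], l2);                                     /* prologue: fill Buffer0 */
    for (int t = 0; t < n_tiles; t++) {
        if (t + 1 < n_tiles)
            dma_copy(buf[(t + 1) & 1], l2 + (t + 1) * TILE);  /* prefetch next tile */
        total += process(buf[t & 1]);                         /* work on current tile */
    }
    return total;
}
```

The prefetch always targets the buffer that is not being processed, which is what lets the real DMA and accelerator run concurrently.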
| 27.02.2019 |
Hardware Accelerator for
Neural Networks
| |
1. Motivation
[Figure: accelerator datapath; labels: static muxing, TP/32 x 32-bit, output threshold, sink TP-bit]
| |
Operation in time
CONTROLLER
REG FILE PROG
| |
PULP interrupts controller (INTC)
• Generates interrupt requests with IDs from 0 to 31
• Mapped to the APB bus
• Receives events in a FIFO from the SoC Event Generator (i.e.
from peripherals)
• These events share a unique interrupt ID (26) but have different event IDs
• Mask, pending interrupts, acknowledged interrupts, event id
registers
• Set, Clear, Read and Write operations by means of load and
store instructions (memory mapped operations)
• Interrupts come from:
• Timers
• GPIO (rise, fall events)
• HWCE
• Events, e.g. from the uDMA
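The mask and pending registers with set/clear operations can be modelled in a few lines of C. This is a software model for illustration, not the register map: the struct layout, the function names and the lowest-ID-first priority are all assumptions of this sketch.

```c
#include <stdint.h>

/* Hypothetical software model of the INTC state (32 interrupt lines). */
typedef struct {
    uint32_t mask;     /* 1 = interrupt line enabled */
    uint32_t pending;  /* 1 = interrupt line pending */
} intc_t;

static void intc_mask_set(intc_t *c, uint32_t bits)   { c->mask |= bits; }
static void intc_mask_clear(intc_t *c, uint32_t bits) { c->mask &= ~bits; }
static void intc_raise(intc_t *c, unsigned id)        { c->pending |= 1u << id; }
static void intc_ack(intc_t *c, unsigned id)          { c->pending &= ~(1u << id); }

/* Returns the lowest-numbered enabled pending line, or -1 if none.
 * The priority order is an assumption of this sketch. */
static int intc_next(const intc_t *c)
{
    uint32_t active = c->mask & c->pending;
    for (int id = 0; id < 32; id++)
        if (active & (1u << id))
            return id;
    return -1;
}
```

On the real SoC these operations would be loads and stores to memory-mapped registers on the APB bus.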
| |
PULP Event Generator (EVENT)
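[Figure: SoC event generator. Events from the system (GPIO, TIMERS, HWCE, uDMA) enter a masking unit with event configuration (EVT. CFG) from APB. Blocks and labels: serializer; 8-bit EVT ID to the INTC FIFO and to the peripherals, each with a VALID/READY handshake. Annotations: event double buffering (ACC.); the uDMA is waiting for some event, e.g. when a GPIO rises]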
| |
Interrupts Source
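[Figure: interrupt sources. Events from the system, the event from the event generator (8-bit EVT ID with VALID/READY) and SW events (INTC APB registers) feed the interrupt controller, which talks to the PULP Core through req and ack with a 5-bit id. Annotation: different from before]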
| |
TestBench
| |
PULP TestBench
| |
HANDS-ON
| |
PULPissimo on GitHub
• PULPissimo is available @ https://github.com/pulp-platform/pulpissimo
• git clone git@github.com:pulp-platform/pulpissimo.git
| |
Dependencies
• “ips_list.yml” holds the needed sub-IPs.
| |
Compile & Simulate PULP
• The PULP compilation and simulation scripts and flow are based on
ModelSim
• To compile
• cd pulpissimo/sim
• make clean lib build opt
• To execute an application
• cd yourapplicationfolder
• make clean all (to compile it)
• make dis > dis.s (to generate the object dump)
• make run gui=1 (to run modelsim with GUI)
• make run
• Use the assembler listing, simulation trace and performance counters to analyze
performance
| |
Programming PULP
| |
The C->ASM->MONITOR Loop
• Core data stack in L2 Private Bank0
• Instructions in L2 Private Bank1
• HWCE data in L2 interleaved
• None of the master ports in the Log. Interconnect will create bank conflicts
• Bank conflict on the GRANT
• Slow bus access for the VALID
[Figure: SoC with the RISC-V core and HWCE connected through the logarithmic interconnect to the L2 interleaved banks, L2 private banks and ROM]
| |
The C->ASM->MONITOR Loop
• When you write your C program, you must have in mind many
things:
• What is the ISA of my core?
• Is my kernel (e.g. MatMul) using all the instructions of my ISA in an optimized way?
• → You check this by generating the assembler and double-checking the instructions.
Try to reverse what the compiler did and see whether you can do better or not
• If not, you can use builtins or asm volatile statements to force the use of some
instructions! (Or rewrite the C code properly)
General code:
    …
    lp.setup x1,a4,stop1
    p.lbu a0,1(a3!)
    p.lbu a1,1(a2!)
stop1: p.mac a5,a0,a1
    ….
Optimized code ("I am a cool GUY"):
    … //iterate #COL/4
    lp.setup x1,a6,stop1
    p.lw a1,4(t1!)
    p.lw a5,4(t3!)
stop1: pv.sdotsp.b a7,a1,a5
    ….
| |
The C->ASM->MONITOR Loop
• When you write your C program, you must have in mind many
things:
• What performance should I expect?
• Can I achieve that performance? Why?
• If I don’t have a clue, I should open the waveform and see where the stalls are coming from
• How can I solve it? → Back to writing CODE (e.g. loop unrolling)
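As an example of the kind of source-level change meant by "loop unrolling", here is a minimal sketch of a dot product written both ways. The function names are hypothetical, and whether unrolling actually helps depends on where the stalls come from, so the waveform and performance counters remain the reference.

```c
#include <stdint.h>

/* Straightforward loop: one multiply-accumulate per iteration. */
static int32_t dot_rolled(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}

/* Unrolled by 4: less loop control per element and more independent
 * work between a load and the instruction that uses it. */
static int32_t dot_unrolled4(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        acc += (int32_t)a[i]     * b[i];
        acc += (int32_t)a[i + 1] * b[i + 1];
        acc += (int32_t)a[i + 2] * b[i + 2];
        acc += (int32_t)a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)                 /* remainder when n is not a multiple of 4 */
        acc += (int32_t)a[i] * b[i];
    return acc;
}
```

After a change like this, regenerate the disassembly and rerun the simulation to check that the stalls actually went away.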
| |
Hands-On → The Dot Product
| |
Hands-On → The 2D Convolution
| |
Thanks a lot
| |