Schiavone Wosh2019 Tutorial


Understanding and working with PULP

Pasquale Davide Schiavone and the PULP team
1 Integrated Systems Laboratory, ETH Zurich, Switzerland
2 Energy Efficient Embedded Systems Laboratory, Department of Electrical, Electronic and Information Engineering, University of Bologna, Bologna, Italy

13.06.2019
Near Sensor (aka Edge) Processing

✓ Smart Architecture
✓ Parallel Processing
✓ Power-saving Design
✓ Near-Threshold
✓ Low Power Technology
1 ÷ 3 GOPS
1 ÷ 30 mW

Idle: ~1µW
100 uW ÷ ~10 mW
Active: ~ 50mW

| |
PULPissimo Architecture

• RISC-V based advanced microcontroller
  – 512kB of L2 memory
  – 16kB of energy-efficient latch-based memory (L2 SCM bank)
• Rich set of peripherals:
  – QSPI (up to 280 Mbps)
  – Camera interface (up to 320x240@60fps)
  – I2C, I2S (up to 4 digital microphones)
  – JTAG (debug), GPIOs
  – Interrupt controller, boot ROM
• Autonomous I/O DMA subsystem (µDMA)
• Power management
  – 2 low-power FLLs (IO, SoC)

[Figure: PULPissimo SoC block diagram. Interleaved L2 banks, L2 private banks and ROM sit on the logarithmic interconnect together with the RISC-V core, the HWCE and the debug unit. The µDMA serves I2S, SPI M, CAMIF, I2C, UART and SDIO. The APB connects GPIO, pad control, INTC, EVENT, TIMER, SoC CTRL and CLK.]
| |
PULP Cluster Architecture

• 8-core RISC-V multicore cluster
  – 64kB of L1 memory
• Shared FPU for efficient resource minimization
  – 2 FPUs, 1 every 4 cores
• Shared I$
  – Optimizes cache usage
• Multi-core event unit for barriers and clock-gate management
• DMA for efficient L2 <-> L1 data transfers

[Figure: PULP cluster block diagram. Eight RI5CY cores and the DMA access the banks of the cluster tightly coupled data memory through the logarithmic interconnect. Two shared FPUs and a shared instruction cache serve the cores. The cluster bus and the peripheral interconnect attach the event unit and the timer.]
| |
The RISC-V PULP cores

| |
Different Workload? Different core

Ariane

RI5CY+FPU

RI5CY

Zero-riscy

Micro-riscy

Now part of [logo lost in extraction] under the «IBEX» name

Going to be part of [logo lost in extraction] IP core family
| |
RI5CY Processor: our workhorse core

• 4-stage pipeline
  – RV32IMFCXpulp
  – 70K GF22 nand2 equivalent gates (GE) + 30 KGE for the FPU
  – Coremark/MHz 3.19
  – Includes various extensions
    • pSIMD
    • Fixed point
    • Bit manipulations
    • HW loops
• Silicon proven
  – SMIC130, UMC65, TSMC55LP, TSMC40LP, GF22FDX
• NEW Floating Point Unit:
  – Iterative DIV/SQRT (9 cycles)
  – Parametrizable latency for MUL, ADD, SUB, Cast
  – Single cycle load, store

https://github.com/pulp-platform/riscv
| |
RI5CY simplified pipeline

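[Figure: RI5CY simplified pipeline. Labels shown: PC gen, instruction memory (Instr Address, Instr Data), align and decompress, IF/ID register, decode, RF read, operands fwd, ID/EX register, AGU, data memory (Data Address, Data), EX/WB register, RF write, plus the paths labelled "Jumps" and "Branches".]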
| |
PULP Cores Memory Interface (1/2)
§ Request with address (32 bits) and request (1 bit) signals
§ Byte Enable (BE) (4 bits): selects a byte, halfword or word memory transaction for loads/stores
§ Write Enable (WE) (1 bit)
§ wdata (32 bits): data to write in case of store operations

§ Response with grant signal and valid signal

§ The core can be interfaced with multicycle memory accesses
§ Grant comes from the arbiter
§ Valid comes from the memory subsystem
§ rdata (32 bits): data read from memory. It has to be sampled when the valid signal is high

| |
PULP Cores Memory Interface (2/2)
§ Back2Back Memory Transactions

STALL EX STAGE STALL WB STAGE


§ Slow Memory Transactions

| |
Xpulp Extensions: General Purpose Extensions 1

• Memory Access Extensions
  • Misaligned memory accesses (not an ISA extension)
    • Load or store 32/16-bit values at addresses that are not multiples of 4/2
    • Useful when dealing with packed data (32-bit values holding 2/4 elements)
    • It requires 2 accesses to the memory; the data manipulation is done in the load-store unit
    • e.g. LOAD 32 bits at 0x0000_0002
      • Read the higher 16 bits of the word at 0x0000_0000
      • Read the lower 16 bits of the word at 0x0000_0004 and pack the data
    • Saves instructions (code size) and speeds up execution
      • Without it: explicit loads from 0x0000_0000 and 0x0000_0004, shift and or operations

Original RISC-V          Misaligned Ext
lw   x2, 0(x10)          lw x2, 2(x10)
lw   x3, 4(x10)
srli x2, x2, 16
slli x3, x3, 16
or   x3, x3, x2
| |
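The two-access scheme from the misaligned-access slide above can be sketched in Python. This is only a behavioural model under stated assumptions: little-endian byte order, a memory modelled as a `bytes` object, and the helper names `aligned_word` and `misaligned_lw` are hypothetical, not part of any PULP tool.

```python
def aligned_word(mem: bytes, addr: int) -> int:
    """Read one naturally aligned little-endian 32-bit word."""
    assert addr % 4 == 0, "aligned accesses only"
    return int.from_bytes(mem[addr:addr + 4], "little")


def misaligned_lw(mem: bytes, addr: int) -> int:
    """Model a 32-bit load at any byte address using aligned accesses.

    Mirrors the srli/slli/or sequence on the slide: the upper part of the
    first word becomes the lower part of the result, and the lower part of
    the second word becomes the upper part of the result.
    """
    offset = addr % 4
    base = addr - offset
    low = aligned_word(mem, base)
    if offset == 0:
        return low  # aligned: a single access is enough
    high = aligned_word(mem, base + 4)
    shift = 8 * offset
    return ((low >> shift) | (high << (32 - shift))) & 0xFFFFFFFF


memory = bytes(range(8))  # bytes 0x00 .. 0x07
print(hex(misaligned_lw(memory, 2)))
```

With the memory holding the bytes 0x00 to 0x07, the load at address 2 combines bytes 2 to 5.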
Xpulp Extensions: General Purpose Extensions 2

• Memory Access Extensions
  • Post-increment load/store
    • Automatic register update with the computed address
    • Useful in array iterations
    • Saves instructions
    • It requires an extra register file write port or slower execution
  • Register-register load/store (and post-increment)
    • The immediate is only 12 bits
    • Use register-register address calculation for 32-bit offsets

Original RISC-V          Auto-increment Load/Store Ext
lw   x2, 0(x10)          lw x2, 4(x10!)
lw   x3, 0(x12)          lw x3, 4(x12!)
addi x10, x10, 4         ...
addi x12, x12, 4         LOOP
....
LOOP
| |
Xpulp Extensions: General Purpose Extensions 3

• Hardware loop extensions
  • HWLs or zero-overhead loops remove branch overheads in for loops
  • Smaller loops benefit more!
  • The loop needs to be set up beforehand and is fully defined by 3 special-purpose registers:
    • Start address -> lp.starti L, Imm12 -> START_REG[L] = PC + 2*Imm12
    • End address -> lp.endi L, Imm12 -> END_REG[L] = PC + 2*Imm12
    • Counter -> lp.count{-,i} L, {rs1,Imm12} -> COUNT_REG[L] = rs1/Imm12
    • Shortcut -> lp.setup{-,i} L, {rs1,ImmC}, Imm12
      • START_REG[L] = PC + 4, END_REG[L] = PC + 2*Imm12, COUNT_REG[L] = {rs1,ImmC}
  • Two register sets are implemented to support nested loops (L = 0 or 1)

• Performance:
  • Speedup can be up to a factor of 2!

Original RISC-V            HW Loop Ext
li x5, 0                   lp.setupi 100, Lend
li x4, 100                        nop
Lstart:                    Lend:  nop
  addi x4, x4, -1
  nop
  nop
  bne x4, x5, Lstart
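A small Python model can make the role of the three loop registers concrete. It is a sketch under assumptions: 4-byte instructions only, the end address pointing at the last instruction of the body (as the `Lend` label does on the slide), and a count of at least 1. The function name `run_hw_loop` and the addresses are hypothetical.

```python
def run_hw_loop(start: int, end: int, count: int, step: int = 4) -> list:
    """Return the sequence of PCs executed by one hardware loop.

    start/end/count play the role of START_REG, END_REG and COUNT_REG.
    No counter update or branch instruction appears in the trace.
    """
    trace = []
    pc = start
    while True:
        trace.append(pc)
        if pc == end:
            if count > 1:
                count -= 1
                pc = start  # zero-overhead jump back to the loop start
            else:
                break       # last iteration: fall out of the loop
        else:
            pc += step
    return trace


# Two-instruction body (the two nops on the slide), 100 iterations.
hw_instructions = len(run_hw_loop(0x104, 0x108, 100))
baseline_instructions = 100 * 4  # addi + nop + nop + bne per iteration
print(hw_instructions, baseline_instructions)
```

For the two-nop body, the hardware loop executes half as many instructions as the baseline loop, which matches the "factor of 2" on the slide when branch penalties are ignored.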
| |
Xpulp Extensions: Bit Manipulation

• Bit manipulation extensions
  • RISC-V reserved the “RVB” extension, but it is still an ongoing topic
  • PULP developed its own bit-manipulation instructions and will possibly align with RVB
  • Contribution to the official task group in the RISC-V community
  • Bit manipulation instruction list
    • Extract N bits starting from M from a word and extend (or not) with sign
    • Insert N bits starting from M in a word
    • Clear N bits starting from M in a word
    • Set N bits starting from M in a word
    • Find first bit set
    • Find last bit set
    • Count number of 1s (popcount)
    • Rotate

Original RISC-V            BitMan Ext
li x5, 0                   p.cnt x8, x8
li x7, 0
li x4, 32
Lstart:
  andi x6, x8, 1
  add  x7, x7, x6
  addi x4, x4, -1
  srli x8, x8, 1
  bne  x4, x5, Lstart
| |
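The operations in the bit-manipulation list above can be written out in a few lines of Python. This is only a reference model of the described behaviour on 32-bit words; the function names are hypothetical and do not correspond to the instruction mnemonics or their operand encodings.

```python
MASK32 = 0xFFFFFFFF


def extract(word: int, n: int, m: int, signed: bool = False) -> int:
    """Extract n bits starting from bit m, optionally sign-extending."""
    field = (word >> m) & ((1 << n) - 1)
    if signed and field & (1 << (n - 1)):
        field -= 1 << n
    return field


def insert(word: int, value: int, n: int, m: int) -> int:
    """Insert the low n bits of value into word, starting from bit m."""
    mask = ((1 << n) - 1) << m
    return ((word & ~mask) | ((value << m) & mask)) & MASK32


def popcount(word: int) -> int:
    """Count the bits set to 1, as the 32-iteration baseline loop does."""
    return bin(word & MASK32).count("1")


print(hex(extract(0xABCD1234, 8, 8)), popcount(0xF0F0))
```

Clear and set follow from `insert` with an all-zeros or all-ones value.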
Xpulp Extensions: DSP

• DSP extensions
  • General purpose
    • ABS, CLIP/Saturation
    • MIN, MAX
    • MAC and MSU
  • Fixed-point support
    • ADD and SUB with normalization and round
    • MUL and MAC with normalization and round
§ Possibility to share some resources
  § ABS reuses the adder and comparator in the ALU
  § Clip adds a comparator but reuses the adder and the previous comparator
  § Normalization is done by connecting the adder output to the shifter
  § Round is done by exploiting multi-operand adders

Original RISC-V          DSP Ext
add  x4, x4, x5          p.addRN x4, x4, x5, 1
addi x4, x4, 1
srli x4, x4, 1
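The add-with-normalization-and-round operation from the DSP slide above can be modelled as follows. This is a sketch assuming round-half-up (adding half an LSB before an arithmetic right shift) on 32-bit signed values, which is what the three-instruction baseline sequence computes for a shift of 1; `add_rn` and `to_signed32` are hypothetical helper names.

```python
def to_signed32(x: int) -> int:
    """Wrap a Python integer to a signed 32-bit value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def add_rn(a: int, b: int, shift: int) -> int:
    """Add, round and normalize: (a + b + 2^(shift-1)) >> shift."""
    rounding = (1 << (shift - 1)) if shift > 0 else 0
    return to_signed32(a + b + rounding) >> shift


# Same result as the add / addi / shift-right sequence for shift = 1.
a, b = 3, 4
baseline = ((a + b) + 1) >> 1
print(add_rn(a, b, 1), baseline)
```

Python's `>>` on negative integers is an arithmetic shift, so the model also covers negative fixed-point values.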
| |
Xpulp Extensions: packed-SIMD 1/4
• packed-SIMD extensions
  • RISC-V reserved the “RVP” extension, but it is still an ongoing topic
    • It also includes DSP extensions
  • Unlike the “RVV” vector extension, vectors are packed into the integer RF
  • Makes the best use of existing resources: more performance with little overhead
  • Targets embedded systems; RVV is for high performance
• pSIMD in 32-bit machines
  • Vectors are either 4 8-bit elements or 2 16-bit elements
• pSIMD instructions
  Computation   add, sub, shift, avg, abs, dot product
  Compare       min, max, compare
  Manipulate    extract, pack, shuffle
| |
Xpulp Extensions: packed-SIMD 2/4

• Same register file
• The instruction encodes how to interpret the content of the register

rs1 0x03 0x02 0x01 0x00

rs2 0x0D 0x0C 0x0B 0x0A

add rD, rs1, rs2 rD = 0x03020100 + 0x0D0C0B0A


add.h rD, rs1, rs2 rD[0] = 0x0100 + 0x0B0A
rD[1] = 0x0302 + 0x0D0C
add.b rD, rs1, rs2 rD[0] = 0x00 + 0x0A
rD[1] = 0x01 + 0x0B
rD[2] = 0x02 + 0x0C
rD[3] = 0x03 + 0x0D
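
The add, add.h and add.b example above can be checked with a short Python model. It is a sketch of lane-wise addition on a 32-bit register, where carries do not cross lane boundaries; the function name `simd_add` is hypothetical.

```python
def simd_add(rs1: int, rs2: int, lane_bits: int) -> int:
    """Add two 32-bit registers lane by lane (lane_bits = 8, 16 or 32).

    Each lane wraps around on overflow and never carries into its
    neighbour, which is what distinguishes add.b / add.h from add.
    """
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 32, lane_bits):
        lane_sum = ((rs1 >> shift) & mask) + ((rs2 >> shift) & mask)
        result |= (lane_sum & mask) << shift
    return result


rs1, rs2 = 0x03020100, 0x0D0C0B0A
print(hex(simd_add(rs1, rs2, 32)),  # add
      hex(simd_add(rs1, rs2, 16)),  # add.h
      hex(simd_add(rs1, rs2, 8)))   # add.b
```

For the register values on the slide no lane overflows, so all three modes give the same bit pattern. The difference shows up with an input such as 0x000000FF + 0x00000001, where the byte mode wraps to zero instead of carrying into the next byte.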
| |
Xpulp Extensions: packed-SIMD 3/4

• HW reuse for small overhead
• Vector modes: bytes, halfwords, word
  § 4 byte operations
    § With byte select
  § 2 halfword operations
    § With halfword select
  § 1 word operation
• Vectorial adder
  § Play with the carry chain
  § 32bit adder -> 35bit adder
  § Vector halfword sub -> Carry = co, 1, co, 1
  § Vector byte sub -> Carry = 1, 1, 1, 1
  § Word sub -> Carry = co, co, co, 1

| |
Xpulp Extensions: packed-SIMD 4/4

§ Shuffle instructions
§ In order to use the vector unit, the elements have to be aligned in the register file
§ Shuffle allows recombining bytes into 1 register
§ pv.shuffle2.b rD, rA, rB
    rD{3} = (rB[26]==0) ? rA:rD {rB[25:24]}
    rD{2} = (rB[18]==0) ? rA:rD {rB[17:16]}
    rD{1} = (rB[10]==0) ? rA:rD {rB[ 9: 8]}
    rD{0} = (rB[ 2]==0) ? rA:rD {rB[ 1: 0]}
§ With rX{i} = rX[(i+1)*8-1:i*8]

[Figure: the mask bits in rB select, for each byte of the new rD, one byte taken from rA or from the old rD.]
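
The byte selection rule of pv.shuffle2.b can be modelled directly from the definition on the slide above. This sketch follows the slide as transcribed (bit 2 of each selector byte picks rA when 0 and the old rD when 1); the official instruction specification should be checked before relying on that polarity. `byte_of` and `shuffle2_b` are hypothetical names.

```python
def byte_of(reg: int, i: int) -> int:
    """Return byte i of a 32-bit register, i.e. rX{i} on the slide."""
    return (reg >> (8 * i)) & 0xFF


def shuffle2_b(rd: int, ra: int, rb: int) -> int:
    """Build the new rD from bytes of rA and the old rD, steered by rB."""
    result = 0
    for i in range(4):
        selector = byte_of(rb, i)
        source = ra if (selector & 0b100) == 0 else rd  # rB[8i+2]
        index = selector & 0b11                         # rB[8i+1:8i]
        result |= byte_of(source, index) << (8 * i)
    return result


old_rd, ra, rb = 0xD3D2D1D0, 0xA3A2A1A0, 0x00050203
print(hex(shuffle2_b(old_rd, ra, rb)))
```

In this example bytes 0, 1 and 3 of the result come from rA and byte 2 comes from the old rD.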
| |
ISA Extensions: Putting it All Together

for (i = 0; i < 100; i++)
  d[i] = a[i] + b[i];

Baseline (11 cycles/output)
  li   x5, 0
  li   x4, 100
Lstart:
  lb   x2, 0(x10)
  lb   x3, 0(x11)
  addi x10, x10, 1
  addi x11, x11, 1
  add  x2, x3, x2
  sb   x2, 0(x12)
  addi x4, x4, -1
  addi x12, x12, 1
  bne  x4, x5, Lstart

Auto-incr load/store (8 cycles/output)
  li   x5, 0
  li   x4, 100
Lstart:
  lb   x2, 1(x10!)
  lb   x3, 1(x11!)
  addi x4, x4, -1
  add  x2, x3, x2
  sb   x2, 1(x12!)
  bne  x4, x5, Lstart

HW Loop (5 cycles/output)
  lp.setupi 100, Lend
  lb   x2, 1(x10!)
  lb   x3, 1(x11!)
  add  x2, x3, x2
Lend: sb x2, 1(x12!)

Packed-SIMD (1.25 cycles/output)
  lp.setupi 25, Lend
  lw   x2, 4(x10!)
  lw   x3, 4(x11!)
  pv.add.b x2, x3, x2
Lend: sw x2, 4(x12!)

| |
ALU architecture

§ Advanced ALU for Xpulp extensions
§ Optimized datapath to reduce resources
§ Multiple adders for round
§ Adder followed by shifter for fixed-point normalization
§ Clip unit uses one adder as comparator and the main comparator

| |
MUL architecture
§ (blue) 16x16 multiplier with sign selection for short multiplications [with round and normalization]. 5-cycle FSM for the upper 32 bits of the 64-bit product (mulh* instructions)

§ (purple) One single-cycle MAC unit that performs MAC, MSU and MUL

§ (red) short (halfword) parallel dot product

§ (grey) byte parallel dot product

§ Clock gating to reduce switching activity between the integer and SIMD multipliers
| |
Dot Product Multiplier

§ Dot Product: (halfword example)

C[31:0] = A[31:16]*B[31:16] + A[15:0]*B[15:0] + C[31:0]
(each of the three terms is 32 bits wide)

=> 2 multiplications, 1 addition, 1 accumulation in 1 cycle!

[Figure: dot product datapath with a partial product generator feeding a 35:2 compressor.]
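
The sum-of-dot-product formula above can be evaluated with a short Python model. This is a sketch assuming signed lanes and a 32-bit wrapping accumulator; it covers the halfword case from the formula and the byte case used later in the deck. `signed_lanes` and `sdotsp` are hypothetical helper names, not instruction mnemonics.

```python
def signed_lanes(reg: int, lane_bits: int) -> list:
    """Split a 32-bit register into signed lanes, lowest lane first."""
    mask = (1 << lane_bits) - 1
    sign = 1 << (lane_bits - 1)
    lanes = []
    for shift in range(0, 32, lane_bits):
        value = (reg >> shift) & mask
        lanes.append(value - (1 << lane_bits) if value & sign else value)
    return lanes


def sdotsp(acc: int, a: int, b: int, lane_bits: int) -> int:
    """Return acc + sum(a_lane * b_lane), wrapped to signed 32 bits."""
    products = zip(signed_lanes(a, lane_bits), signed_lanes(b, lane_bits))
    total = (acc + sum(x * y for x, y in products)) & 0xFFFFFFFF
    return total - (1 << 32) if total & 0x80000000 else total


# Halfword example: C = 2*4 + 3*5 + 10
print(sdotsp(10, 0x00020003, 0x00040005, 16))
```

With 8-bit lanes the same function performs four multiply-accumulates per call.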

| |
2D Convolution with Xpulp Extensions: performance + less memory pressure

§ Convolution in registers
§ 5x5 convolutional filter
  § 7 sum-of-dot-product
  § 4 move
  § 1 shuffle
  § 3 lw/sw
  § ~ 5 control instructions

20 instr. / output pixel -> scalar version: >100 instr. / output pixel

| |
PULP core examples – RV32IMC vs RV32IMCXpulp General code

• 2 bytes saved (X instructions are not compressed)
• Number of instructions reduced (21 vs 18)
• Removed branch penalties

RV32IMC
start_loop:
  addi   a6,t1,-32
  c.mv   a7,t5          //address of matA
  addi   t3,a0,-32
loop0:
  c.mv   a4,t3          //address of matA
  c.mv   a2,a7
  c.li   a5,0
loop1:
  lbu    a3,0(a4)       //load byte
  lbu    a1,0(a2)
  c.addi a4,a4,1        //post increment
  mul    a3,a3,a1       //mul
  c.add  a5,a5,a3       //acc after mul
  andi   a5,a5,255
  c.addi a2,a2,1
  bne    a4,a0,loop1    //branch penalty
  sb     a5,0(a6)
  c.addi a6,a6,1
  addi   a7,a7,32
  bne    a6,t1,loop0    //branch penalty
  addi   t1,a6,32
  addi   a0,a4,32
  bne    t1,t4,start_loop

RV32IMCXpulp
start_loop:
  addi   t3,t5,-32
  c.mv   a7,s2          //address of matB
  addi   t1,t4,-32
  lp.setupi x0,32,stop0
  c.mv   a3,t1          //address of matA
  c.mv   a2,a7
  c.li   a5,0
  sub    a4,t4,t1       //loop count1
  lp.setup x1,a4,stop1  //hw loop
  p.lbu  a0,1(a3!)      //load byte with post increment
  p.lbu  a1,32(a2!)
  p.mac  a5,a0,a1       //mac
stop1: andi a5,a5,255
  p.sb   a5,1(t3!)      //store result with post increment
stop0: c.addi a7,a7,1
  addi   t5,t5,32
  addi   t4,a3,32
  bne    t5,t6,start_loop
| |
PULP core examples – RV32IMCXpulp General code vs Opt code

• The innermost loop has 4x fewer iterations
• 4 bytes per matrix are loaded as one 32-bit word
• The dot product with accumulation performs 4 MACs in 1 cycle

General code
  …
  lp.setup x1,a4,stop1
  p.lbu a0,1(a3!)
  p.lbu a1,1(a2!)
stop1: p.mac a5,a0,a1
  ….

Opt code
  …                          //iterate #COL/4
  lp.setup x1,a6,stop1
  p.lw a1,4(t1!)             //load 4 bytes with post inc
  p.lw a5,4(t3!)
stop1: pv.sdotsp.b a7,a1,a5  //4 mac
  ….

| |
PULP cores Interrupts

• Asynchronous events
• If an interrupt is taken, the core jumps to the address derived from xtvec
  • xtvec holds the base address of the jump
  • 4*interrupt ID is added to compute the actual address
• No delegation supported
  • All interrupts are handled in machine mode
• The external interrupt controller interacts with the peripheral subsystem and SW events
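
The address computation for a taken interrupt is a one-line formula. The sketch below assumes interrupt IDs in the range 0 to 31, as stated later for the interrupt controller; the base address used in the example and the name `handler_address` are hypothetical.

```python
def handler_address(xtvec_base: int, irq_id: int) -> int:
    """Return the vector table entry the core jumps to for irq_id."""
    if not 0 <= irq_id < 32:
        raise ValueError("interrupt ID must be in 0..31")
    return xtvec_base + 4 * irq_id


# Hypothetical base address, chosen only for illustration.
print(hex(handler_address(0x1C000000, 5)))
```

Each entry is 4 bytes apart, so one entry holds a single uncompressed instruction, typically a jump to the actual handler.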

| |
PULP cores interrupts protocol

• Asynchronous protocol between the core and the interrupt controller
  • The core takes a few cycles before jumping
  • Meanwhile, the external interrupt controller may change the ID number
    • e.g. higher priority requests from peripherals
  • The core tells the interrupt controller which ID has been used to calculate the address in the interrupt vector table
  • The interrupt controller clears the taken ID

[Figure: the Interrupt Controller sends req and id (5 in the example) to the PULP Core; the core answers with ack and the id it used (5).]

| |
PULP cores interrupts protocol – timing diagram

| |
Wait For Interrupt & Power manager

• The WFI instruction disables the clock
  • Dynamic power is saved when the core is idle
  • Any interrupt, taken or not, wakes up the core, which resumes from the instruction after WFI
  • The core waits for all in-flight instructions to complete before switching off the clock
    • e.g. a load waiting for the valid signal, long divisions, floating-point MAC, debug

• The pipeline and state registers are clock gated when not used
  • The ALU, integer multiplier and dot product units have separate operand registers
  • In the ID stage, the decoded instruction belongs to one of these 3 domains; the other 2 are clock gated

| |
Performance Counter 1/3

§ Registers in the CSR space that count events
§ The number of cycles and the number of retired instructions are used to calculate “IPC – Instructions per cycle”
§ Performance counters are used for counting the stalls
§ Load stalls
  § Value not yet returned from memory

//LOAD STALL
lw   x10, 0x0(x2)
addi x10, x10, 0x4

PC            IF                ID                                 EX            WB
A+4 to Imem   Y from Imem[A]    add needs value from lw -> STALL   addr to Dmem  -
A+8 to Imem   Y from Imem[A]    add needs value from lw -> FWD     Bubble        D from Dmem
A+12 to Imem  Z from Imem[A+4]  decode Y                           add

| |
Performance Counter 2/3

§ Jump stalls (jalr)
  § Stall to break the path from the EX stage to Imem

//JALR STALL
mul  x10, x10, x10
jalr x11, x10, 0x4

PC               IF              ID                       EX   WB
A+4 to Imem      Y from Imem[A]  jalr needs x10 -> STALL  mul  -
x10+0x4 to Imem  Bubble          Jump to x10+0x4          -    -

| |
Performance Counter 3/3

§ Other performance counters used to monitor
  § Number of cycles lost for fetching (instruction cache, for instance)
  § Number of loads, stores, branches, taken branches, jumps and compressed instructions

Address  Perf Counter  Description
0x782    LD_STALL      Number of load data hazards
0x783    JR_STALL      Number of jump register data hazards
0x784    IMISS         Cycles waiting for instruction fetches, i.e. number of instructions wasted due to non-ideal caching
0x785    LD            Number of data memory loads executed. Misaligned accesses are counted twice
0x786    ST            Number of data memory stores executed. Misaligned accesses are counted twice
0x787    JUMP          Number of unconditional jumps (j, jal, jr, jalr)
0x788    BRANCH        Number of branches. Counts taken and not taken branches
0x789    BTAKEN        Number of taken branches
0x78A    RVC           Number of compressed instructions executed

| |
Example Performance Counter

…enable perf counters…


csrw 0x782,x0 //reset perf counter LD_STALL
// loop 100 times over load stall
lp.setupi x1,100,stop_loop
lw x10,4(x14!)
stop_loop: add x11,x11,x10 //stall due to load dependency
csrr x15,0x782 // -> x15 contains 100
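
The value expected in LD_STALL can be cross-checked with a small Python model of the load-use hazard. It is a sketch under one assumption: a stall is counted whenever the instruction immediately after a load reads the register the load writes. The tuple format `(opcode, destination, sources)` and the name `count_load_stalls` are hypothetical.

```python
def count_load_stalls(trace: list) -> int:
    """Count load-use hazards between consecutive instructions."""
    stalls = 0
    for previous, current in zip(trace, trace[1:]):
        opcode, destination, _ = previous
        if opcode == "lw" and destination in current[2]:
            stalls += 1
    return stalls


# Loop body of the example: a post-increment load followed by a dependent add.
loop_body = [
    ("lw", "x10", ("x14",)),
    ("add", "x11", ("x11", "x10")),
]
trace = loop_body * 100  # the hardware loop repeats the body 100 times
print(count_load_stalls(trace))
```

The add depends on the load in every iteration, while the load of the next iteration does not depend on the add, so the model counts one stall per iteration.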

| |
Simulation Tracer
• For every instruction executed, the core prints to a file:
  • “TIME STAMP – PC – INSTRUCTION – OPERANDs and RESULTs”

§ Trace file (build/pulpissimo/trace_core_1f_0.log):

[Figure: annotated trace excerpt highlighting the PC, the instruction encoding, the disassembled instruction and the relative jump/branch targets.]

| |
Hybrid Logarithmic Interconnect

| |
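[Figure: hybrid logarithmic interconnect. Master ports (initiators): HWCE x 4, CORE_DATA, DBG_RX, UDMA_RX, UDMA_TX, CORE_INSTR, plus a 64-bit AXI bus entering through an AXI64_to_LINT32 bridge (ports labelled read, read, write, write). Slave ports (targets): the interleaved L2 banks 0 to 3 behind XBAR L2, and the non-interleaved targets APB, AXI32, ROM, L2_PRI[0] and L2_PRI[1] behind XBAR BRIDGE, with LINT-to-APB and LINT-to-AXI32 bridges.]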
| |
Interconnect performance

• Low-latency interconnect with word-level interleaving to reduce contention
• 4-port memory capable of handling a maximum bandwidth of 4*32*Freq
• High-performance plugs to the processing subsystem

[Figure: L2 multiport memory with interleaving support. Four interleaved memory banks, each built from several memory cuts, are attached to the low-latency interconnect. Masters: the uDMA subsystem (peripheral TX and RX channels), the CPU subsystem (I and D ports) and the APB subsystem (APB bridge; peripherals such as PWM, SoC control and CLK).]
| |
Bank conflict example: UDMA_RX and CORE_DATA both want to write to bank 1 of the interleaved L2. One is stalled while the other makes the transaction (bank conflict).

[Figure: the hybrid logarithmic interconnect diagram, with the two conflicting requests from UDMA_RX and CORE_DATA highlighted.]
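
Word-level interleaving and the resulting bank conflict can be illustrated with a few lines of Python. This is a sketch under assumptions: 4 interleaved banks of 32-bit words (as in the figure), addresses given as offsets inside the interleaved region, and a fixed-priority arbiter in which the first requester wins. The real arbitration policy is not described on the slides, and the names `bank_of` and `arbitrate` are hypothetical.

```python
NUM_BANKS = 4   # interleaved L2 banks 0..3
WORD_BYTES = 4  # 32-bit words


def bank_of(offset: int) -> int:
    """Consecutive words land in consecutive banks."""
    return (offset // WORD_BYTES) % NUM_BANKS


def arbitrate(requests: dict) -> tuple:
    """Grant one master per bank in a cycle; the others are stalled."""
    granted = {}
    stalled = []
    for master, offset in requests.items():
        bank = bank_of(offset)
        if bank in granted:
            stalled.append(master)  # bank conflict: retry next cycle
        else:
            granted[bank] = master
    return granted, stalled


# Both masters target bank 1: one transaction goes through, one stalls.
print(arbitrate({"UDMA_RX": 0x04, "CORE_DATA": 0x14}))
```

When the two masters access different banks in the same cycle, both requests are granted, which is why interleaving reduces contention for streaming accesses.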

| |
Peripheral Interconnect

| |
Peripheral Bus
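• Only one APB request at a time! Additional requests are stalled in the hybrid logarithmic interconnect (as for a bank conflict)
• The peripheral bus is an APB bus coming from the hybrid logarithmic interconnect (address 0x1A10_0000 in the figure)
• Peripherals on the bus: SoC CTRL, FLL, GPIO, UDMA, EVENT, INTC, DEBUG, TIMER, ADV. TIMER, HWCE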
| |
µDMA: An Autonomous I/O
Subsystem

| |
I/O requirements
[Figure: I/O bandwidth requirements around a high-performance µC. Recoverable labels:]
• Up to 2.4 Gbit/s
• 320x240@25fps: 46 Mbit/s
• New SD standard: up to 800 Mbit/s
• QuadSPI: up to 400 Mbit/s
• 3 Mbit/s per channel
• RF transceivers: > 100 Mbit/s
• Peak BW > 1 Gbit/s
| |
uDMA Subsystem

[Figure: uDMA subsystem. External ports: APB, UDMA_TX, UDMA_RX. Labels inside: µDMA core, PERIPHERAL #ID, FILTER, CONFIG registers (cfg_data), PERIPH TX PROTOCOL (tx_data), PERIPH RX PROTOCOL (rx_data), DSP (2x tx_data, stream, rx_data).]

| |
Offload pipeline

[Figure: double-buffering timeline. I/O: RX channel n alternately fills Buffer0 and Buffer1. CPU: offloads and starts the DMA once per buffer. Acc. DMA: copies each buffer from L2 to the accelerator. Acc. processing: processes Buffer0, Buffer1, Buffer0, ... while the next buffer is being received.]

§ Efficient use of system resources
§ HW support for double buffering allows continuous data transfers
§ Multiple data streams can be time multiplexed
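
The double-buffering timeline can be sketched as a simple schedule. This is only an illustration assuming two buffers and equal-length steps; it is not the µDMA programming interface, and `double_buffer_schedule` is a hypothetical name.

```python
def double_buffer_schedule(num_blocks: int) -> list:
    """Return (buffer being filled, buffer being processed) per time step.

    While block k is received into buffer k % 2, block k - 1 is processed
    from the other buffer, so reception never has to wait for processing.
    """
    schedule = []
    for step in range(num_blocks + 1):
        filling = step % 2 if step < num_blocks else None
        processing = (step - 1) % 2 if step >= 1 else None
        schedule.append((filling, processing))
    return schedule


for filling, processing in double_buffer_schedule(3):
    print("fill:", filling, "process:", processing)
```

In every step the buffer being filled differs from the buffer being processed, which is the property that allows continuous transfers.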

| |
Hardware Accelerator for
Neural Networks

| |
XNOR Neural Engine

[Figure: XNE architecture. Control side: 32-bit peripheral target port, REG FILE, UCODE PROC, CTRL FSM (slave). Streamer side: TP-bit memory master port with INPUT SOURCE, WEIGHT SOURCE and OUTPUT SINK (TP-bit streams). Datapath: INPUT BUFFER with static muxing (TP/32 x 32-bit), XNOR & POPCOUNT (TP xnor + reduction tree to 16-bit), ACCUMULATORS (TP x 16-bit), THRESHOLD.]
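
The XNOR-and-popcount datapath computes a binary dot product. The sketch below assumes the usual binary neural network encoding (bit 1 stands for +1, bit 0 for -1), which the slide does not spell out; `xnor_popcount` and `binary_dot` are hypothetical names, and `width` plays the role of TP.

```python
def xnor_popcount(features: int, weights: int, width: int) -> int:
    """Count the bit positions where features and weights agree."""
    mask = (1 << width) - 1
    agree = ~(features ^ weights) & mask
    return bin(agree).count("1")


def binary_dot(features: int, weights: int, width: int) -> int:
    """Dot product of two {-1, +1} vectors packed one element per bit.

    Each agreeing position contributes +1 and each disagreeing one -1,
    so the result is 2 * popcount(xnor) - width.
    """
    return 2 * xnor_popcount(features, weights, width) - width


print(binary_dot(0b1011, 0b1101, 4))
```

Comparing the accumulated value with a threshold then yields one output bit, as in the THRESHOLD stage of the figure.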

| |
XNOR Neural Engine in PULPissimo

[Figure: PULPissimo with the XNE. Eight memory banks sit behind the tightly coupled data memory interconnect, shared by the XNOR Neural Engine (streamer, engine, ctrl), the RISCY core (instr and data ports, Ibuf / I$) and the uDMA with its I/O interfaces (JTAG, UART, SPI, I2S, I2C, SDIO, CPI). The APB / peripheral interconnect attaches the event unit, the clock / reset generator with FLLs, the power controller, the debug unit and the timer; part of the logic is always-on.]
| |
Operation in time

[Figure: XNE operation in time, one row per unit. STREAMER: GET FEAT, GET WEIGHT, GET WEIGHT, repeated, then PUSH CONV. FEAT REGISTER: REG FEAT. XNOR: one XNOR per fetched weight. POPCOUNT: ACCUM per XNOR, then CLEAR. UCODE PROCESSOR: UPDATE IDX. CONTROLLER: REG FILE PROG.]
| |


Interrupt Controller and Event
Generator

| |
PULP interrupts controller (INTC)
• It generates interrupt requests with IDs from 0 to 31
• Mapped on the APB bus
• Receives events in a FIFO from the SoC Event Generator (i.e. from peripherals)
  • These share a single interrupt ID (26) but carry different event IDs
• Registers: mask, pending interrupts, acknowledged interrupts, event ID
• Set, clear, read and write operations by means of load and store instructions (memory-mapped operations)
• Interrupts come from:
  • Timers
  • GPIO (rise, fall events)
  • HWCE
  • Events, e.g. uDMA
| |
PULP Event Generator (EVENT)

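[Figure: SoC event generator. Events from the system (GPIO, timers, HWCE, uDMA) pass through a masking unit with high-priority selection and a serializer. The 8-bit event ID is sent with a valid/ready handshake to the INTC FIFO, to the accelerator (event double buffering) and to the peripherals. The event configuration (EVT. CFG) comes from the APB.]

Example: the uDMA is waiting for some event, e.g. SPI starts when a GPIO rises.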

| |
Interrupts Source

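[Figure: interrupt sources (different from before). Events from the system (GPIO, timers, HWCE; labelled 23), the event from the event generator (EVT ID with valid/ready handshake) and 8 SW events (INTC APB registers) enter the masking unit and priority decoder, which exchanges req/id and ack/id (5 in the example) with the PULP core.]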

| |
TestBench

| |
PULP TestBench

• It reads the compiled file (ADDRESS – INSTRUCTION)
• It sets the configuration registers via JTAG
• It loads the compiled file into the memory via JTAG
• It writes to the FETCH_ENABLE register on the APB (SoC Control)
  • Now the core starts running the application
• It waits for the END-OF-COMPUTATION bit
  • When the core returns from the “main” function, it writes the word “1XXX_XXXX” to a specific memory location on the APB (SoC Control), where 1 indicates that the core finished its program and XXX_XXXX is the returned value (e.g. “return 0;”)
• It reports an error if XXX_XXXX is not 0
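
The end-of-computation check can be sketched as a small decoder. The exact bit layout of the status word is an assumption here: the flag is taken to be the most significant bit (bit 31) and the return value the remaining 31 bits. The slide only gives the pattern “1XXX_XXXX”, so `EOC_BIT`, `decode_status` and `run_succeeded` are hypothetical.

```python
EOC_BIT = 31  # assumption: end-of-computation flag in the top bit


def decode_status(word: int) -> tuple:
    """Split the status word into (finished, return_value)."""
    finished = bool((word >> EOC_BIT) & 1)
    return_value = word & ((1 << EOC_BIT) - 1)
    return finished, return_value


def run_succeeded(word: int) -> bool:
    """Mirror the testbench rule: finished and main() returned 0."""
    finished, return_value = decode_status(word)
    return finished and return_value == 0


print(decode_status(0x80000003), run_succeeded(0x80000000))
```

A testbench would poll the status location until the flag is set and then report an error for any non-zero return value.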

| |
HANDS-ON

| |
PULPissimo on GitHub
• PULPissimo is available @ https://github.com/pulp-platform/pulpissimo
• git clone git@github.com:pulp-platform/pulpissimo.git

| |
Dependencies
• “ips_list.yml” holds the needed sub-IPs.

• “update-ips” to download them

• iptools downloads the IPs recursively

• iptools generates compilation scripts and [synthesis] scripts

• “ips” folder contains downloaded IPs

• “rtl” folder contains PULPissimo RTL, testbench, etc


| |
PULP IPs

• Every IP is a different GIT repository


• Easier to maintain and creates little mess on many-people
projects
• Every IP has one or more maintainers of the PULP group

• “src_files.yml” for each IP to list the RTL files


• used to generate scripts, modelsim library names, options, etc

• PULPissimo IPs are also available on GitHub

• make sdk to download and install the PULP SDK


| |
PULP SDK
• The SDK contains all the tools and runtime support for PULP-based microcontrollers

• The SDK ranges from low-level bare-metal procedures for setting up the PULP cores and peripherals (e.g. crt0) all the way up to a set of higher-level functions (API) that help application developers leverage all the supported features
| |
Environment Variable

• VSIM_PATH points to your pulpissimo/sim folder
  • Execute make clean lib build opt

• PULP_RISCV_GCC_TOOLCHAIN points to the bin folder of your PULP GCC

| |
Compile & Simulate PULP
• PULP compilation and simulation scripts and flow are based on
modelsim
• To compile
• cd pulpissimo/sim
• make clean lib build opt
• To execute an application
• cd yourapplicationfolder
• make clean all (to compile it)
• make dis > dis.s (to generate the object dump)
• make run gui=1 (to run modelsim with GUI)
• make run
• Assembler, Simulation Trace and Performance counter to analyze
performance

| |
Programming PULP

• When programming for embedded systems, the very first thing that should come to your mind is
  • LIMITED RESOURCES
  • Writing an application for a microcontroller is completely different from writing one for your personal PC
  • You MUST know the total memory available, the architecture, the instructions of the core, etc.
  • In embedded programming, you can finely optimize the whole SW stack to make the best use of your HW

• Some tips are coming :)

| |
The C->ASM->MONITOR Loop

• When you write your C program, you must have many things in mind:
  • Where are my DATA? In which BANK? Will I have BANK conflicts? With whom?
  • Where are my instructions? In which BANK? Will I have BANK conflicts? With whom?
  • -> This tells you whether you will have stalls caused by the system rather than by the program per se, so it is very important to know

Example placement:
• Core data stack in L2 Private Bank0
• Instructions in L2 Private Bank1
• HWCE data in interleaved L2
-> None of the master ports in the logarithmic interconnect will create bank conflicts

[Figure: PULPissimo memory system (interleaved L2 banks, L2 private banks, ROM, logarithmic interconnect, HWCE, RISC-V core), annotated with "Bank conflict on the GRANT" and "Slow bus access for the VALID".]
| |
The C->ASM->MONITOR Loop

• When you write your C program, you must have many things in mind:
  • What is the ISA of my core?
  • Is my kernel (e.g. MatMul) using all the instructions of my ISA in an optimized way?
  • -> You check this by generating the assembly and double-checking the instructions. Try to reverse-engineer what the compiler did and see whether you can do better or not
  • If not, you can use builtins or asm volatile statements to force the use of some instructions! (Or rewrite the C code properly)

General code
  …
  lp.setup x1,a4,stop1
  p.lbu a0,1(a3!)
  p.lbu a1,1(a2!)
stop1: p.mac a5,a0,a1
  ….

Optimized code (slide annotation: "I am a cool GUY")
  …                          //iterate #COL/4
  lp.setup x1,a6,stop1
  p.lw a1,4(t1!)
  p.lw a5,4(t3!)
stop1: pv.sdotsp.b a7,a1,a5
  ….
| |
The C->ASM->MONITOR Loop

• When you write your C program, you must have many things in mind:
  • What performance should I expect?
  • Can I achieve that performance? Why?
  • If I don’t have a clue, I should open the waveform and see where the stalls are coming from
  • How can I solve it? -> Back to writing CODE (e.g. loop unrolling)

| |
Hands-On -> The Dot Product

• The dot product is an extremely common kernel in Artificial Intelligence operations
  • RISCY extensions to achieve top performance

• We are going to see
  • Optimized assembly code that uses
    • the MAC instruction
    • Zero-overhead HW loop
    • Automatic increment load/store
    • Loop unrolling to eliminate stalls
  • Optimized code that uses the SIMD extensions

| |
Hands-On -> The 2D Convolution

• The 2D convolution is the central kernel of Convolutional Neural Networks
  • RISCY extensions to achieve top performance

• We are going to see
  • Optimized C code that uses
    • gcc vectors
    • the shuffle instruction
    • the dot product
    • normalization and clip

| |
Thanks a lot

• Thanks a lot for your attention

• I hope you enjoyed it :)

• Get ready for the Hands On session

Integrated Systems Laboratory

| |