Real-Time Systems
Lecture Topic – Review of Basic Concurrent/Parallel Programming
Dr. Sam Siewert
Electrical, Computer and Energy Engineering
Embedded Systems Engineering Program
Copyright © 2019 University of Colorado
Flynn's Taxonomy – Parallel Systems

Flynn's taxonomy (2x2 grid):
– Single Instruction, Single Data – SISD (traditional uni-processor: single core, no vector instructions)
– Multiple Instruction, Single Data – MISD
– Single Instruction, Multiple Data – SIMD (e.g. SSE 4.2 vector processing); ideal for large bitwise, integer, and floating-point vector math
– Multiple Instruction, Multiple Data – MIMD, including SPMD (Single Program, Multiple Data – e.g. GP-GPU) and MPMD (Multi-threaded Program, Multi-Data)

Notes:
– R-Pi 3b+/4 – MIMD: multi-core, with NEON vector instructions
– MIMD and SPMD architectures often leverage GP-GPU co-processors
– DSP – VLIW (SIMD) or MIMD (e.g. BeagleBone AI)
Parallel Programming for Speed-up

Demonstrations:
1. erast.c
2. sharpen_grid.c

Both are threaded, but erast.c has semaphore locks and sharpen does not (compare sharpen on a single core to sharpen with a thread grid).

Questions:
– Without locks, do we risk data corruption?
– Is the test-and-set indivisible?
– Concurrent reader and writer?
– Can we just run lockless?
– Speed-up is? – Linear? Better? Worse?

Can use Shared Memory with POSIX Threads – but may need locking!
– Locking will serialize and slow down code if sequential sections are too long
– erast.c vs. erastsimp.c is a good example
Scaling and Bottlenecks

1 - Compiler optimization – simple and effective: turn on compiler optimization (~3x)
– Turn on higher levels of optimization
– Level 3 optimization: -O3 for gcc or g++
– Levels above -O3 (e.g. -O4) add nothing by themselves; further gains require feedback (profile-guided) optimization

2 - SIMD vector instructions – simple and sometimes effective: turn on NEON SIMD (~1.f x, i.e. a factor between 1 and 2)
– Turn on SIMD (NEON) instruction generation on ARM A-Series
– Flynn's taxonomy: SIMD

3 - Using multiple cores – harder and mostly effective: grid to Map and Reduce (~3.2x)
– Shared Memory POSIX Threads on Linux SMP
– Combine #1, #2, and #3

4 - Co-processing – hardest and highly effective: grid programming on 128 SPs (~70x)
– With advanced platforms like the Jetson Nano with CUDA
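Steps #1 and #2 are just build-flag changes. A hedged sketch of the gcc invocations (the exact -mcpu/-mfpu values depend on your toolchain and OS image; sharpen.c stands in for whichever demo source you are building):

```shell
# Illustrative build lines, not from the course materials.

# 1 - higher optimization level only:
gcc -O3 sharpen.c -o sharpen -lpthread

# 2 - also request NEON SIMD code generation (ARMv8 core such as the
#     R-Pi 4's Cortex-A72, running a 32-bit OS):
gcc -O3 -mcpu=cortex-a72 -mfpu=neon-fp-armv8 -mfloat-abi=hard \
    sharpen.c -o sharpen -lpthread

# On a 64-bit (AArch64) OS, NEON is part of the base ISA, so -O3 with
# the auto-vectorizer it enables is sufficient.
```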
Theoretical Speed-Up – Linear at Best

[Figure: speed-up curve, sub-linear at best]
– Speed-up is < linear
– Due to the sequential section (Mapping – Split) compared to the parallel section (Gridded – Apply)
– …and due to the final step (Combine)
Parallel Processing Speed-up

Grid Data Processing Speed-up
1. Multi-Core, Multi-threaded, Macro-blocks/Frames
2. SIMD, Vector Instructions Operating over Large Words (Many Times Instruction Set Size)
3. Co-Processor Operates in Parallel to CPU(s)

SPMD – GPU or GP-GPU Co-Processor
– PCI-Express Bus Interfaces
– Transfer Program and Data to Co-Processor
– Threads and Blocks to Transform Data Concurrently

Image Data Processing – Few Data Dependencies
– Good Speed-up by Amdahl's Law
– P = Parallel Portion
– (1 - P) = Sequential Portion
– S = # of Cores (Concurrency)
– Overhead for Co-Processor
– IO for Co-Processing

Multicore_Speed_Up = 1 / ((1 - P) + P/S)

Max_Speed_Up = 1 / ((1 - P) + 0) = 1 / (1 - P)   (S is infinite here)
Conceptual View of Hardware Resources

Three-Space View of CPU-bound HPC vs. RT or Fair Utilization

Requirements:
– CPU Margin?
– IO Latency (and Bandwidth) Margin?
– Memory Capacity (and Latency) Margin?

[Figure: three axes – CPU-Use, IO-Use, Memory-Use; the origin is the high-margin corner and the upper right front corner is low-margin; regions mark CPU-bound, I/O-bound, memory-bound, and CPU+I/O+Mem-bound operation]

Goal is to fully use all resources to scale!
CPU + I/O + Memory Bound?! – Bad day!