SIMD (single instruction, multiple data) refers to computers that can perform the same operation on multiple data points simultaneously. Vector processors are CPUs that operate on arrays of data called vectors using SIMD instructions. MMX was an early SIMD extension for x86 processors that packed multiple small data types like bytes and words together, enabling the same arithmetic instruction to operate on multiple data elements in parallel. It used the FPU registers to maintain compatibility but this limited its usage. Later SIMD extensions improved on MMX.
Single instruction, multiple data (SIMD)

Contents
• Parallel Processors
• Flynn's taxonomy
• What is SIMD?
• Types of Processing
• Scalar Processing
• Vector Processing
• Architecture for Vector Processing
• Vector processors
• Vector Processor Architectures
• Components of Vector Processors
• Advantages of Vector Processing
• Array processors
• Array Processor Classification
• Array Processor Architecture
• Dedicated Memory Organization
• Global Memory Organization
• ILLIAC IV
• ILLIAC IV Architecture
• Super Computers
• Cray X1
• Multimedia Extension

Parallel Processors
In computers, parallel processing is the processing of program instructions by dividing them among multiple processors with the objective of running a program in less time.
In the earliest computers, only one program ran at a time. A computation-intensive program that took one hour to run and a tape-copying program that took one hour to run would take a total of two hours to run. An early form of parallel processing allowed the interleaved execution of both programs together.

The computer would start an I/O operation and, while it was waiting for the operation to complete, it would execute the processor-intensive program. The total execution time for the two jobs would be a little over one hour.

Flynn's taxonomy
Flynn's taxonomy is a classification of computer architectures proposed by Michael J. Flynn in 1966. The classification system has stuck, and has been used as a tool in the design of modern processors and their functionalities.

Classification
The four classifications defined by Flynn are based upon the number of concurrent instruction (or control) streams and data streams available in the architecture:
• Single instruction stream, single data stream (SISD)
• Single instruction stream, multiple data streams (SIMD)
• Multiple instruction streams, single data stream (MISD)
• Multiple instruction streams, multiple data streams (MIMD)
Single instruction, multiple threads (SIMT) is a later execution model related to SIMD; it is not one of Flynn's four original classes.

Evolution of Intel Vector Instructions
■ MMX (1996, Pentium)
  CPU-based MPEG decoding
  Integers only, 64 bits divided into 2 x 32 up to 8 x 8
  Phased out with SSE4
■ SSE (1999, Pentium III)
  CPU-based 3D graphics
  4-way float operations, single precision
  8 new 128-bit registers, 70 new instructions
■ SSE2 (2000, Pentium 4)
  High-performance computing
  Adds 2-way float operations, double precision; same registers as 4-way single precision
  Integer SSE instructions make MMX obsolete
■ SSE3 (2004, Pentium 4E Prescott)
  Scientific computing
  New 2-way and 4-way vector instructions for complex arithmetic
■ SSSE3 (2006, Core 2 Duo)
  Minor advancement over SSE3
■ SSE4 (2007, Core 2 Duo Penryn)
  Modern codecs, cryptography
  New integer instructions
  Better support for unaligned data, super shuffle engine

What is SIMD?
Single instruction, multiple data (SIMD) is a class of parallel computers in Flynn's taxonomy.
It describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously. Thus, such machines exploit data-level parallelism.
There are simultaneous (parallel) computations, but only a single process (instruction) at a given moment.

[Figure: how SIMD processing works]

Types of Processing

Scalar Processing
A scalar processor is a CPU that performs computations on one number or one set of data at a time. It is known as a "single instruction stream, single data stream" (SISD) CPU.

Vector Processing
A vector processor or array processor is a central processing unit (CPU) that implements an instruction set containing instructions that operate on 1-D arrays of data called vectors.

Architecture for Vector Processing
Two architectures suitable for vector processing are:
• Pipelined vector processors
• Parallel array processors

Pipelined vector processors
• A CPU that implements an instruction set that operates on 1-D arrays, called vectors
• Vectors contain multiple data elements
• The number of data elements per vector is typically referred to as the vector length
• Both instructions and data are pipelined to reduce decoding time

Advantages of Vector Processing
• Quick fetch and decode of a single instruction for multiple operations
• The instruction provides a regular source of data, which arrives at each cycle and can be processed efficiently in a pipelined fashion
• Easier addressing of main memory
• Elimination of memory wastage
• Simplification of control hazards
• Reduced code size

Array Processors
An array processor is a processor that performs computations on a large array of data. It is a synchronous parallel computer with multiple ALUs, called processing elements (PEs), that operate in parallel in lockstep. It is composed of N identical PEs under the control of a single control unit, plus a number of memory modules.

Array Processor Classification
• A SIMD array processor has a single-instruction, multiple-data organization. It executes vector instructions by means of multiple functional units responding to a common instruction.
• An attached array processor is an auxiliary processor attached to a general-purpose computer. Its purpose is to improve the performance of the host computer in specific numeric calculation tasks.

SIMD Array Processor Architecture
SIMD array processors have two basic configurations:
• Array processors using RAM, also known as dedicated memory organization. Examples: ILLIAC IV, CM-2, MP-1
• Associative processors using content-addressable memory, also known as global memory organization. Example: BSP

MMX (Multimedia Extensions)

Development
• MMX (Multimedia Extension) was introduced in 1996 (Pentium with MMX and Pentium II).
• SSE (Streaming SIMD Extension) was introduced with Pentium III.
• SSE2 was introduced with Pentium 4.
• SSE3 was introduced with the Pentium 4 supporting Hyper-Threading Technology. SSE3 adds 13 more instructions.

MMX
After analyzing many existing applications (graphics, MPEG, music, speech recognition, games, image processing), Intel found that many multimedia algorithms execute the same instructions on many pieces of data in a large data set.
• Typical elements are small: 8 bits for pixels, 16 bits for audio, 32 bits for graphics and general computing.
• New data type: a 64-bit packed data type.
• Why 64 bits? Good enough, and practical.

Data Types of MMX
The four MMX technology data types are:
• Packed byte: eight bytes packed into one 64-bit quantity.
• Packed word: four 16-bit words packed into one 64-bit quantity.
• Packed doubleword: two 32-bit doublewords packed into one 64-bit quantity.
• Quadword: one 64-bit quantity.

Compatibility
To be fully compatible with the existing Intel Architecture (IA), no new mode or state was created. Hence, no extra state needs to be saved on a context switch. To reach this goal, MMX is hidden behind the FPU: the MMX registers are aliased onto the floating-point registers, so whenever floating-point state is saved or restored, MMX state is saved or restored with it. This allows an existing OS to context-switch processes that execute MMX instructions without being aware of MMX.
However, it also means that MMX and the FPU cannot be used at the same time, and switching between them carries a large overhead.
Intel defended its decision to alias MMX onto the FPU on compatibility grounds, but it was actually a bad decision: an OS could simply have been updated or shipped a service pack. This is why Intel later introduced SSE without any aliasing.

Saturation Arithmetic
In an 8-bit grayscale picture, 255 is the value for pure white and 0 is the value for pure black. In a regular 8-bit register (AL, BL, CL ...), if we add one to white, we get black! This is because regular registers "roll over" (wrap around) to 0 when a result exceeds the maximum value. MMX gets around this with a technique called "saturation arithmetic", which it offers alongside ordinary wraparound arithmetic. In saturation arithmetic, the value never rolls over to 0 again; it is clamped at the largest (or smallest) representable value.
This means that in the MMX world, we have the following equations for unsigned 8-bit values:
255 + 100 = 255
200 + 100 = 255
0 - 100 = 0
99 - 100 = 0
This may seem counter-intuitive at first to people who are used to their registers rolling over, but it makes sense in some situations: if we try to make white brighter, it shouldn't become black.

MMX Registers
MMX defines eight registers, called MM0 through MM7, and operations that operate on them. Each register is 64 bits wide and can hold either one 64-bit integer or multiple smaller integers in a "packed" format: a single instruction can then be applied to two 32-bit integers, four 16-bit integers, or eight 8-bit integers at once.

Instructions
The MMX registers are 64 bits wide, but can be broken down as follows:
• two 32-bit values
• four 16-bit values
• eight 8-bit values
The MMX registers cannot easily be used for 64-bit arithmetic.

Let's say that we have 4 bytes loaded in an MMX register: 10, 25, 128, 255. We have them arranged as such:
MM0: | 10 | 25 | 128 | 255 |
And we do the following pseudo-code operation, adding 10 to every byte:
MM0 + 10
We would get the following result:
MM0: | 10+10 | 25+10 | 128+10 | 255+10 | = | 20 | 35 | 138 | 255 |
Remember that our arithmetic "saturates" in the last box, so the value doesn't go over 255. Using MMX, we are essentially performing 4 additions in the time it takes to perform 1 addition using the regular registers, with a quarter as many instructions.

MMX Instructions
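As a rough illustration of the packed byte format and the saturating addition described above, the sketch below emulates a 64-bit register in plain Python. It is not real MMX code: the helper names (`pack_bytes`, `unpack_bytes`, `add_bytes_wrap`, `add_bytes_saturate`, `sub_bytes_saturate`) are hypothetical, and the lane order (first value in the lowest byte) is an assumption made for the example. The wraparound and saturating variants loosely mirror the behavior of the MMX instructions PADDB, PADDUSB and PSUBUSB.

```python
def pack_bytes(values):
    """Pack up to eight unsigned bytes into one 64-bit integer.

    Assumption for this sketch: the first value goes in the lowest byte.
    """
    if len(values) > 8:
        raise ValueError("a 64-bit register holds at most eight bytes")
    word = 0
    for lane, value in enumerate(values):
        if not 0 <= value <= 255:
            raise ValueError("each lane must fit in an unsigned byte")
        word |= value << (8 * lane)
    return word


def unpack_bytes(word, count=8):
    """Return the lowest `count` byte lanes of a packed 64-bit integer."""
    return [(word >> (8 * lane)) & 0xFF for lane in range(count)]


def add_bytes_wrap(a, b):
    """Lane-wise add with wraparound (modulo 256), like a regular register."""
    lanes = [(x + y) & 0xFF for x, y in zip(unpack_bytes(a), unpack_bytes(b))]
    return pack_bytes(lanes)


def add_bytes_saturate(a, b):
    """Lane-wise unsigned add that clamps each lane at 255."""
    lanes = [min(x + y, 255) for x, y in zip(unpack_bytes(a), unpack_bytes(b))]
    return pack_bytes(lanes)


def sub_bytes_saturate(a, b):
    """Lane-wise unsigned subtract that clamps each lane at 0."""
    lanes = [max(x - y, 0) for x, y in zip(unpack_bytes(a), unpack_bytes(b))]
    return pack_bytes(lanes)


# The worked example from the text: add 10 to each of 10, 25, 128, 255.
mm0 = pack_bytes([10, 25, 128, 255])
ten = pack_bytes([10, 10, 10, 10])

print(unpack_bytes(add_bytes_saturate(mm0, ten), 4))  # [20, 35, 138, 255]
print(unpack_bytes(add_bytes_wrap(mm0, ten), 4))      # [20, 35, 138, 9]
```

The only difference between the two additions is the last lane: with wraparound, 255 + 10 rolls over to 9, while the saturating version stays at 255. A real MMX instruction performs the lane-wise work in hardware in a single step; the Python loops here only model the result, not the speed.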