Computer Science

Single Instruction Multiple Data (SIMD)

SIMD is a type of parallel computing architecture that allows multiple processing elements to simultaneously execute the same instruction on different data. It is commonly used in applications that require high-performance computing, such as video and audio processing, scientific simulations, and machine learning. SIMD can significantly improve processing speed and efficiency by reducing the number of instructions needed to perform a task.

Written by Perlego with AI-assistance

9 Key excerpts on "Single Instruction Multiple Data (SIMD)"

  • Computer Architecture

    A Quantitative Approach

    • John L. Hennessy, David A. Patterson(Authors)
    • 2011(Publication Date)
    • Morgan Kaufmann
      (Publisher)
    A question for the single instruction, multiple data (SIMD) architecture, which Chapter 1 introduced, has always been just how wide a set of applications has significant data-level parallelism (DLP). Fifty years later, the answer is not only the matrix-oriented computations of scientific computing, but also the media-oriented image and sound processing. Moreover, since a single instruction can launch many data operations, SIMD is potentially more energy efficient than multiple instruction multiple data (MIMD), which needs to fetch and execute one instruction per data operation. These two answers make SIMD attractive for Personal Mobile Devices. Finally, perhaps the biggest advantage of SIMD versus MIMD is that the programmer continues to think sequentially yet achieves parallel speedup by having parallel data operations.
    This chapter covers three variations of SIMD: vector architectures, multimedia SIMD instruction set extensions, and graphics processing units (GPUs).
    The first variation, which predates the other two by more than 30 years, means essentially pipelined execution of many data operations. These vector architectures are easier to understand and to compile to than other SIMD variations, but they were considered too expensive for microprocessors until recently. Part of that expense was in transistors and part was in the cost of sufficient DRAM bandwidth, given the widespread reliance on caches to meet memory performance demands on conventional microprocessors.
    The second SIMD variation borrows the SIMD name to mean basically simultaneous parallel data operations and is found in most instruction set architectures today that support multimedia applications. For x86 architectures, the SIMD instruction extensions started with the MMX (Multimedia Extensions) in 1996, which were followed by several SSE (Streaming SIMD Extensions) versions in the next decade, and they continue to this day with AVX (Advanced Vector Extensions). To get the highest computation rate from an x86 computer, you often need to use these SIMD instructions, especially for floating-point programs.
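    To make these instruction set extensions concrete, here is a minimal editorial sketch in C (not from the excerpt) using the standard SSE intrinsics available on x86-64 compilers; the function and array names are illustrative only, and n is assumed to be a multiple of 4 to keep the sketch short:

        #include <immintrin.h>  /* x86 SIMD intrinsics (SSE/AVX) */

        /* Add two float arrays four elements at a time with 128-bit SSE. */
        void add_sse(const float *a, const float *b, float *c, int n)
        {
            for (int i = 0; i < n; i += 4) {
                __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats */
                __m128 vb = _mm_loadu_ps(&b[i]);
                __m128 vc = _mm_add_ps(va, vb);    /* one instruction, 4 adds */
                _mm_storeu_ps(&c[i], vc);          /* store 4 results */
            }
        }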
    The third variation on SIMD comes from the GPU community, offering higher potential performance than is found in traditional multicore computers today. While GPUs share features with vector architectures, they have their own distinguishing characteristics, in part due to the ecosystem in which they evolved. This environment has a system processor and system memory in addition to the GPU and its graphics memory. In fact, to recognize those distinctions, the GPU community refers to this type of architecture as heterogeneous
  • Elements of Parallel Computing
    Chapter 5, where we’ll also explore the other factors that degrade performance.

    4.2 SIMD: STRICTLY DATA PARALLEL

    SIMD architectures have played a significant role in the history of parallel computers. They continue to be important because data parallel programming is very common in scientific computing and in graphics processing. Earlier in Section 2.1.1 we studied an example of array summation with SIMD instructions.
    There are many such cases where the same operation is applied independently to consecutive elements of an array. This pattern has been called strict data parallel [44], where a single stream of instructions executes in lockstep on elements of arrays. This contrasts with data parallel programs executing on MIMD platforms: the multiple streams have identical instructions, but they don’t operate in lockstep, relying instead on synchronization when all instruction streams must be at the same point. This latter type of programming is called Single Program Multiple Data (SPMD) and will be seen below when we examine programming on shared and distributed memory platforms.
    SIMD programming can be done in several ways. A vectorizing compiler can produce vector instructions from a source program if the iterations of the loop are independent and the number of iterations is known in advance (see Section 2.1.1). Some programming languages feature array notation, which makes the compiler’s job much easier. In this case arrays a and b can be summed by writing something like c[0 : n − 1] ← a[0 : n − 1] + b[0 : n − 1], or even c ← a + b. This statement indicates explicitly that the operations occur independently on all elements. The compiler can then break down the sequence into chunks with a number of elements that depends on the number of bits in the data type and in the vector registers. The evocative name for this process is strip mining. For example, if we have arrays of 2²⁰ double precision elements, and the platform is Intel’s Xeon Phi co-processor, which features 512-bit vector registers, then the loop can be broken up into 2¹⁷ chunks of eight elements each.
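    As a rough editorial illustration of strip mining, the following C sketch processes full chunks of eight doubles (the width of one 512-bit register) and mops up the remainder with a scalar loop; the names are made up, and the fixed-width inner loop is what a vectorizing compiler would map onto a single vector instruction:

        /* Strip-mined array sum: full strips of 8, then a scalar remainder. */
        void strip_mined_add(const double *a, const double *b, double *c, long n)
        {
            long i = 0;
            for (; i + 8 <= n; i += 8) {       /* one "strip" per vector op */
                for (long j = 0; j < 8; j++)   /* maps to one 512-bit add */
                    c[i + j] = a[i + j] + b[i + j];
            }
            for (; i < n; i++)                 /* remainder, executed scalar */
                c[i] = a[i] + b[i];
        }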
  • Computer Architecture

    Software Aspects, Coding, and Hardware

    • John Y. Hsu(Author)
    • 2017(Publication Date)
    • CRC Press
      (Publisher)
    CHAPTER 8: Vector and Multiple-Processor Machines

    8.1 VECTOR PROCESSORS

    A SIMD machine provides a general-purpose set of instructions to operate on arrays, namely, vectors. As an example, one add-vector instruction can add two arrays and store the result in a third array. That is, each corresponding pair of words in the first and second arrays is added and stored in the corresponding word of the third array. This also means that after a single instruction is fetched and decoded, its EU (execution unit) provides control signals to fetch many operands and process them in a loop. As a consequence, the overhead of instruction retrievals and decodes is reduced. Because a vector means an array in programming, the terms vector processor, array processor, and SIMD machine are all synonymous. A vector processor provides general-purpose instructions, such as integer arithmetic, floating-point arithmetic, logical, and shift operations, on vectors. Each instruction contains an opcode, the size of the vector, and the addresses of the vectors. A SIMD or vector machine may have its data stream transmitted in serial or in parallel. A parallel data machine uses more hardware logic than a serial data machine.
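    In C-like terms, the semantics of a single add-vector instruction amount to the hypothetical loop below (an editorial sketch; real vector ISAs encode the length and operand addresses in the instruction itself, and the EU iterates internally after one fetch and decode):

        /* Hypothetical semantics of one "add vector" instruction:
           opcode ADDV, vector length n, sources a and b, destination c.
           The front end fetched and decoded only one instruction. */
        void addv(const int *a, const int *b, int *c, int n)
        {
            for (int i = 0; i < n; i++)
                c[i] = a[i] + b[i];  /* one operand fetch/execute per element */
        }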
    8.1.1 Serial Data Transfer
    The execution unit is called the processing element (PE); it is where the operations are performed. If one PE is connected to one processing element memory (PEM), we have a SIMD machine with serial data transfer, as shown in Figure 8.1a. After decoding a vector instruction in the CU (control unit), an operand stream is fetched and executed serially in a hardware loop: serial data are transferred on the data bus between the PE and PEM on a continuous basis until the execution is completed. In a serial data SIMD machine, there is one PE and one PEM, but one instruction retrieval is followed by many operand fetches.
    8.1.2 Parallel Data Transfer
    If multiple PEs are tied to the CU and each PE is connected to a PEM, we have a parallel data machine, as shown in Figure 8.1b
  • Parallel Processing Algorithms For GIS
    • Richard Healey, Steve Dowers, Bruce Gittings, Mike J Mineter(Authors)
    • 2020(Publication Date)
    • CRC Press
      (Publisher)
    SISD machines follow the von Neumann model. SIMD machines include the so-called “array-processors” such as the Connection Machine CM-2 and CM-200 and the ICL DAP. SIMD machines consist of large numbers of simple processing elements which are sent instructions by a controlling processor. Within MIMD computers, each processor (which is normally much more powerful than a single SIMD processing element) executes its own program. Examples of early MIMD machines are the Sequent Balance and Meiko CS-1, and more recently the Cray T3D. MISD computers have not been built; no application seems to require this architecture at present.

    2.2.1 Classification of memory models for parallel computers

    Flynn’s Taxonomy classifies computers according to their behaviour in terms of data and instructions. There is a further classification which is based on how the processors are able to access memory. With a single processor, all memory has to be addressed by that processor; with multiple processors there is a choice between allowing each processor access to the whole of the memory and allowing each to access only a certain part of memory. The former model is referred to as shared memory, the latter as distributed memory. Both techniques offer advantages and suffer disadvantages.
    2.2.1.1 Shared Memory
    In a shared memory environment, every processor in the computer can access every memory location. This offers a simple and fast mechanism for communication between processors, although mechanisms must be found to deal with contention for the memory hardware, which arises when several processors attempt to access the same location in memory simultaneously. The effects of contention have historically called into question the ability of shared memory architectures to scale well, i.e. to show a continuing improvement of performance as more processors are added.
    In addition there are problems with controlling the way in which processors share variables to ensure consistency. This is usually resolved by protecting the shared variable using critical sections and locks as described in the next chapter.
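    As a minimal editorial illustration of the critical sections and locks mentioned above, here is a pthreads sketch in C; the counter and function names are made up for the example:

        #include <pthread.h>

        /* A shared variable protected by a lock: every thread must enter
           the critical section before updating it, keeping it consistent. */
        static long shared_counter = 0;
        static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

        void *worker(void *arg)
        {
            (void)arg;
            for (int i = 0; i < 100000; i++) {
                pthread_mutex_lock(&counter_lock);   /* begin critical section */
                shared_counter++;                    /* safe shared update */
                pthread_mutex_unlock(&counter_lock); /* end critical section */
            }
            return NULL;
        }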
  • Advanced Computer Architectures
    • Sajjan G. Shiva(Author)
    • 2018(Publication Date)
    • CRC Press
      (Publisher)
    Section 5.7 provides further details on these language extensions. There are also compilers that translate serial programs into data-parallel object codes.
    An algorithm that is efficient for SISD implementation may not be efficient for an SIMD, as illustrated by the matrix multiplication algorithm of Section 5.3 . Thus, the major challenge in programming SIMDs is in devising an efficient algorithm and corresponding data partitioning such that all the PEs in the system are kept busy throughout the execution of the application. This also requires minimizing conditional branch operations in the algorithm.
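    One common way to minimize conditional branches on SIMD hardware is to replace an if with a lane-wise select, so every PE executes the same instruction stream and none sits idle. Here is an editorial sketch using SSE4.1 intrinsics; the operation shown (an element-wise maximum) is illustrative only:

        #include <smmintrin.h>  /* SSE4.1: _mm_blendv_ps */

        /* Branch-free c[i] = (a[i] > b[i]) ? a[i] : b[i] over 4 lanes.
           The comparison produces a per-lane mask; blendv selects per lane. */
        __m128 lane_max(__m128 a, __m128 b)
        {
            __m128 mask = _mm_cmpgt_ps(a, b);  /* all-ones where a > b */
            return _mm_blendv_ps(b, a, mask);  /* pick a where mask is set */
        }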
    The data exchange characteristics of the algorithm dictate the type of IN needed. If the desired type of IN is not available, routing strategies that minimize the number of hops needed to transmit data between non-neighboring PEs will have to be devised.
    5.6 Example Systems
    The Intel and MIPS processors described in Chapter 1 have SIMD features. The supercomputer systems described in Chapter 4 also operate in an SIMD mode. This section provides brief descriptions of the hardware, software, and application characteristics of two SIMD systems. The ILLIAC-IV has been the most famous experimental SIMD architecture and is selected for its historical interest. Thinking Machines Corporation’s Connection Machine series, although no longer in production, was originally envisioned for data-parallel symbolic computations and later accommodated numeric applications.
    5.6.1 ILLIAC-IV
    The ILLIAC-IV project was started in 1966 at the University of Illinois. The objective was to build a parallel machine capable of executing 10⁹ instructions per second. To achieve this speed, a system with 256 processors controlled by a control processor was envisioned. The set of processors was divided into 4 quadrants of 64 processors each, each quadrant to be controlled by one control unit. Only one quadrant was built and it achieved a speed of 2 × 10⁸ instructions per second.
    Figure 5.22
  • High Performance Parallel Runtimes

    Design and Implementation

    will then expose these as cores 0 through 7.
    In an operating system, these logical cores then show up as regular cores that the OS can use to schedule processes and threads. Of course, their presence adds to the complexity of the job that the OS and parallel runtime system have to perform. Multiple hardware threads that are multiplexed in a single physical core also share caches at all levels of the hierarchy. Therefore, the performance of code running on one hardware thread can have a significant effect on code running on another on the same core. Thus, the OS needs to take the SMT structure of the system into account when scheduling runnable processes and threads.

    3.1.7  Single-instruction multiple-data

    The final task that will add even more parallelism to the processor core is to add support for single instruction multiple data (SIMD) instructions to the execution units, introducing data parallelism according to Flynn’s taxonomy [39].
    Instead of making the pipeline wider or deeper, the instruction set architecture is changed to offer instructions that process multiple data elements in one go. Common choices for the SIMD width of modern processors are 128-, 256-, and 512-bit SIMD registers. The Scalable Vector Extension (SVE) [121] for Arm processors goes even further and supports up to 2,048-bit SIMD.
    Besides providing a register file that can hold the wide SIMD registers, this requires us to replicate the arithmetic logic units until the desired SIMD length is achieved. For, say, a 512-bit wide SIMD execution unit, the logic to add double-precision floating-point numbers will be present eight times, once for each double-precision floating-point number that can be stored in the SIMD register. This number will vary if the instruction set supports different data types for SIMD execution, e.g., int8_t, int16_t, float
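    Concretely, with AVX-512 intrinsics all eight replicated double-precision adders are driven by one instruction; an editorial sketch, assuming an AVX-512-capable compiler and CPU, with illustrative names:

        #include <immintrin.h>

        /* One 512-bit add: eight double-precision additions at once. */
        void add8(const double *a, const double *b, double *c)
        {
            __m512d va = _mm512_loadu_pd(a);    /* 8 doubles = 512 bits */
            __m512d vb = _mm512_loadu_pd(b);
            __m512d vc = _mm512_add_pd(va, vb); /* uses all 8 FP adders */
            _mm512_storeu_pd(c, vc);
        }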
  • Parallel and High Performance Computing
    • Robert Robey, Yuliana Zamora(Authors)
    • 2021(Publication Date)
    • Manning
      (Publisher)
    add instructions in the instruction queue, which reduces the pressure on the instruction queue and cache. The biggest benefit is that it takes about the same power to perform eight additions in a vector unit as one scalar addition. Figure 6.1 shows a vector unit that has a 512-bit vector width, offering a vector length of eight double-precision values.
    Figure 6.1 A scalar operation does a single double-precision addition in one cycle. It takes eight cycles to process a 64-byte cache line. In comparison, a vector operation on a 512-bit vector unit can process all eight double-precision values in one cycle.
    Let’s briefly summarize vectorization terminology:
    • Vector (SIMD) lane—A pathway through a vector operation on vector registers for a single data element much like a lane on a multi-lane freeway.
    • Vector width—The width of the vector unit, usually expressed in bits.
    • Vector length—The number of data elements that can be processed by the vector in one operation.
    • Vector (SIMD) instruction sets—The set of instructions that extend the regular scalar processor instructions to utilize the vector processor.
    Vectorization is produced through both a software and a hardware component. The requirements are
    • Generate instructions—The vector instructions must be generated by the compiler or manually specified through intrinsics or assembler coding.
    • Match instructions to the vector unit of the processor—If there is a mismatch between the instructions and the hardware, newer hardware can usually process the instructions, but older hardware will just fail to run. (AVX instructions do not run on ten-year-old chips. Sorry!)
    There is no fancy process that converts regular scalar instructions on the fly. If you use an older version of your compiler, as many programmers do, it will not have the capability to generate the instructions for the latest hardware. Unfortunately, it takes time for compiler writers to include new hardware capabilities and instruction sets. It can also take a while for the compiler writers to optimize these capabilities.
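    As an editorial illustration of giving the compiler what it needs to generate vector instructions, here is a loop written to be easy to auto-vectorize; the build line uses GCC’s -O3, -march=native, and -fopt-info-vec flags as an example (exact vectorization-report options vary by compiler and version):

        /* restrict promises no aliasing, so iterations are independent
           and the compiler is free to emit vector instructions.
           Example build: gcc -O3 -march=native -fopt-info-vec saxpy.c */
        void saxpy(long n, float a,
                   const float *restrict x, float *restrict y)
        {
            for (long i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }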
  • A Survey of Computational Physics
    eBook - ePub

    A Survey of Computational Physics

    Introductory Computational Science

    This [B] = [A][B] multiplication is an example of data dependency, in which the data elements used in the computation depend on the order in which they are used. In contrast, the matrix multiplication [C] = [A][B] is a data parallel operation in which the data can be used in any order. So already we see the importance of communication, synchronization, and understanding of the mathematics behind an algorithm for parallel computation. The processors in a parallel computer are placed at the nodes of a communication network. Each node may contain one CPU or a small number of CPUs, and the communication network may be internal to or external to the computer. One way of categorizing parallel computers is by the approach they employ in handling instructions and data. From this viewpoint there are three types of machines:
    • Single-instruction, single-data (SISD): These are the classic (von Neumann) serial computers executing a single instruction on a single data stream before the next instruction and next data stream are encountered.
    • Single-instruction, multiple-data (SIMD): Here instructions are processed from a single stream, but the instructions act concurrently on multiple data elements. Generally the nodes are simple and relatively slow but are large in number.
    • Multiple-instruction, multiple-data (MIMD): In this category each processor runs independently of the others with independent instructions and data. These are the types of machines that employ message-passing packages, such as MPI, to communicate among processors. They may be a collection of workstations linked via a network, or more integrated machines with thousands of processors on internal boards, such as the Blue Gene computer described in §14.13. These computers, which do not have a shared memory space, are also called multicomputers
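    To make the MIMD/message-passing category concrete, here is a minimal MPI sketch in C (an editorial example, not from the excerpt): every process runs the same program in the SPMD style noted earlier, yet each follows its own independent instruction stream.

        #include <mpi.h>
        #include <stdio.h>

        /* Each process discovers its rank and acts independently. */
        int main(int argc, char **argv)
        {
            int rank, size;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* my process id */
            MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total processes */
            printf("process %d of %d\n", rank, size);
            MPI_Finalize();
            return 0;
        }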
  • Professional Parallel Programming with C#

    Master Parallel Extensions with .NET 4

    • Gastón C. Hillar(Author)
    • 2010(Publication Date)
    • Wrox
      (Publisher)
    Chapter 11: Vectorization, SIMD Instructions, and Additional Parallel Libraries

    What's in this Chapter?
    • Understanding SIMD and vectorization
    • Understanding extended instruction sets
    • Working with Intel Math Kernel Library
    • Working with multicore-ready, highly optimized software functions
    • Mixing task-based programming with external optimized libraries
    • Generating pseudo-random numbers in parallel
    • Working with the ThreadLocal<T> class
    • Using Intel Integrated Performance Primitives
    In the previous 10 chapters, you learned to create and coordinate code that runs many tasks in parallel to improve performance. If you want to improve throughput even further, you can take advantage of other possibilities offered by modern hardware related to parallelism. This chapter is about the usage of additional performance libraries and includes examples of their integration with .NET Framework 4 and the new task-based programming model. In addition, the chapter provides examples of the usage of the new thread-local storage classes and the lazy-initialization capabilities provided by these classes.

    Understanding SIMD and Vectorization

    The “Parallel Programming and Multicore Programming” section of Chapter 1, “Task-Based Programming,” introduced the different kinds of parallel architectures. This section also explained that most modern microprocessors can execute Single Instruction, Multiple Data (SIMD) instructions. Because the execution units for SIMD instructions usually belong to a physical core, it is possible to run as many SIMD instructions in parallel as there are physical cores available. The usage of these vector-processing capabilities in parallel can provide important speedups in certain algorithms.
    Here's a simple example that will help you understand the power of SIMD instructions. Figure 11-1 shows a diagram that represents the PABSD
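    The excerpt breaks off at the figure reference, but PABSD is the x86 packed-absolute-value instruction for signed 32-bit integers; as an editorial aside, its effect can be sketched in C via the corresponding SSSE3 intrinsic (the function name is illustrative):

        #include <tmmintrin.h>  /* SSSE3: _mm_abs_epi32 maps to PABSD */

        /* |x| for four packed 32-bit integers with a single instruction. */
        __m128i abs4(__m128i x)
        {
            return _mm_abs_epi32(x);
        }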