
Pipelining

Pipelining in computer science refers to a technique that allows multiple instructions to be overlapped in execution. It involves breaking down the processing of instructions into a series of stages, with each stage handling a different part of the instruction. This approach helps to improve the overall throughput and efficiency of the processor.

Written by Perlego with AI-assistance
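
To make the stage overlap concrete, here is a minimal sketch in Python. It is illustrative only: the classic five-stage names (fetch, decode, execute, memory access, write-back) and the four-instruction stream are assumptions, not taken from the excerpts below.

```python
# Print which instruction occupies each pipeline stage on each clock cycle.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]   # assumed five-stage pipeline
instructions = ["i1", "i2", "i3", "i4"]    # hypothetical instruction stream

def pipeline_schedule(instructions, stages):
    """Return, per clock cycle, which instruction occupies each stage."""
    n_cycles = len(instructions) + len(stages) - 1
    schedule = []
    for cycle in range(n_cycles):
        row = {}
        for s, stage in enumerate(stages):
            i = cycle - s                  # instruction that entered s cycles ago
            if 0 <= i < len(instructions):
                row[stage] = instructions[i]
        schedule.append(row)
    return schedule

for cycle, row in enumerate(pipeline_schedule(instructions, STAGES), start=1):
    print(f"cycle {cycle}: {row}")
```

With four instructions and five stages, everything completes in 4 + 5 − 1 = 8 cycles, versus 4 × 5 = 20 cycles if each instruction had to finish before the next could start.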

4 Key excerpts on "Pipelining"

Index pages curate the most relevant extracts from our library of academic textbooks. They’ve been created using an in-house natural language model (NLM), with each page adding context and meaning to key research topics.
  • Embedded Systems: Hardware, Design and Implementation
    • Krzysztof Iniewski (Author)
    • 2012 (Publication Date)
    • Wiley (Publisher)

    ...These techniques will not embed the pipeline elements in a fixed systolic array. Instead, the pipeline element will be free to accept possibly unrelated data for an unrelated computation on the very next clock cycle, and a simple means will be given for tracking the flow of a computation and eliminating conflicts. This will enable a small number of computational units to implement a relatively complex “virtual” systolic array, or arrays, when the input does not arrive on each clock cycle. Reconfigurable devices, such as FPGAs, can of course be used for either pipelined or multicore architectures, or a combination of the two. The techniques described here are not intended to completely eliminate the need for parallelism among multiple identical units or cores. But by increasing the work done by each module or core, and reducing the number of cores required, problems of memory and bus resource contention and algorithm subdivision can be greatly reduced.

    8.2 HISTORY AND RELATED METHODS

    The pipeline is a time-honored technique for dividing a computation into parts, so that new inputs may be accepted and outputs generated at much shorter time intervals than that required for the entire computation. Early supercomputers commonly used pipelines for floating-point computations, and medium- to high-performance modern microcomputers use pipelines for the process of fetching, decoding, executing, and storing the results of instructions. In computational uses, the most studied problem involves vector operations, in which the fetching of operands in a known sequence can be optimized to feed the pipe...
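
    The distinctive idea in this excerpt is that a pipeline element need not be locked to one computation: it can accept possibly unrelated data on the very next clock cycle, provided each item carries a tag so the flow of each computation can be tracked. The sketch below is a loose illustration of that idea, not the book's method: the stage functions and tags are invented, and no conflict-elimination scheme is modeled.

    ```python
    from collections import deque

    def tagged_pipeline(items, stage_fns):
        """Clock (tag, value) items through a list of stage functions.
        A new, possibly unrelated item may enter on every clock."""
        regs = [None] * len(stage_fns)   # regs[s] = item about to be processed by stage s
        pending = deque(items)
        results = []
        while pending or any(r is not None for r in regs):
            # Retire the item leaving the final stage, keeping its tag.
            if regs[-1] is not None:
                tag, value = regs[-1]
                results.append((tag, stage_fns[-1](value)))
            # Clock edge: each item advances one stage, doing that stage's work.
            for s in range(len(stage_fns) - 1, 0, -1):
                prev = regs[s - 1]
                regs[s] = None if prev is None else (prev[0], stage_fns[s - 1](prev[1]))
            # The front register is free again, so unrelated work can enter.
            regs[0] = pending.popleft() if pending else None
        return results

    # Two interleaved, unrelated computations ("a" and "b") share one pipeline.
    stages = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
    print(tagged_pipeline([("a", 10), ("b", 100), ("a", 20), ("b", 200)], stages))
    # [('a', 19), ('b', 199), ('a', 39), ('b', 399)]
    ```

    The tags are what let one set of hardware behave as a “virtual” array: results come out interleaved, and each can still be routed to the computation it belongs to.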

  • Microelectronics
    • Jerry C. Whitaker (Author)
    • 2018 (Publication Date)
    • CRC Press (Publisher)

    ...16.3(a). Several instructions execute simultaneously in the pipeline, each in a different stage. If the total delay through the pipeline is D and there are n stages in the pipeline, then the minimum clock period would be D/n and, optimally, a new instruction would be completed every clock. A deeper pipeline would have a higher value of n and thus a faster clock cycle.

    FIGURE 16.3 Comparison of the basic processor pipeline technique: (a) not pipelined, (b) pipelined.

    Today most commercial computers use Pipelining to increase performance. Significant research has gone into minimizing the clock cycle in a pipeline, determining the problems associated with Pipelining an instruction stream, and trying to overcome these problems through techniques such as prefetching instructions and data, compiler techniques, and caching of data and/or instructions. The delay through each stage of a pipeline is also determined by the complexity of the logic in that stage. In many cases, the actual delay of the logic in each stage is much larger than the optimal value, D/n. Queues can be added between pipeline stages to absorb any differences in execution time through the combinational logic or propagation delay between chips (Fig. 16.4). Asynchronous techniques, including handshaking, are sometimes used between the pipeline stages to transfer data between logic or chips running on different clocks.

    It is generally accepted that, for computer hardware design, simpler is usually better. Systems that minimize the number of logic functions are easier to design, test, and debug, as well as less power consuming and faster (working at higher clock rates). There are two important architectures that utilize this concept most effectively: reduced instruction set computers (RISC) and SIMD machines. RISC architectures trade off increased code length and fetching overhead for faster clock cycles and less instruction set complexity...
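
    The excerpt's relation between pipeline depth and clock period is easy to check with numbers. In the sketch below, D and the per-stage delays are invented values; it shows the ideal period D/n and how the slowest stage, rather than the average, sets the achievable clock, which is why the inter-stage queues mentioned above help.

    ```python
    D = 10.0                     # assumed total combinational delay (ns)
    n = 5                        # assumed number of pipeline stages
    ideal_period = D / n
    print(f"ideal clock period: {ideal_period:.1f} ns")        # 2.0 ns

    # Stages are rarely balanced; the clock must accommodate the slowest one.
    stage_delays = [1.6, 2.0, 2.9, 1.8, 1.7]                   # hypothetical, sum = 10.0
    actual_period = max(stage_delays)
    print(f"actual clock period: {actual_period:.1f} ns")      # 2.9 ns
    print(f"slowdown versus ideal: {actual_period / ideal_period:.2f}x")
    ```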

  • VLSI Design
    • M. Michael Vai (Author)
    • 2017 (Publication Date)
    • CRC Press (Publisher)

    ...However, if the application has a continuous stream of data sets to be processed, performance can still be gained by overlapping the processing of these data sets. This concept has been generalized into an approach called Pipelining that exploits temporal parallelism.

    Fig. 11.17 Message routing in an omega interconnection network.

    A pipeline architecture works on the same principle as an assembly line to speed up a strictly sequential task. An assembly line passes the products being built consecutively through workstations of workers and equipment, so that different parts can be added until the product is completed. In 1913, Ford Motor Company introduced the assembly line into car production and increased the rate of production eightfold. Fig. 11.18 shows a PE processing a continuous stream of tasks. For example, the PE may evaluate a mathematical expression, with the tasks representing different argument sets to be used in the expression. This is analogous to the Henry Ford assembly line that produced multiple identical cars.

    Fig. 11.18 Continuous stream of tasks processed by a single PE.

    Assume that the steps performed in one task are strictly sequential, so that spatial parallelism among the steps is out of the question. Instead, we can exploit parallelism between tasks by replicating the PE. Fig. 11.19 shows that the PE in Fig. 11.18 has been replicated four times. It is easy to conclude that n PEs will produce a speed-up of n. As we have previously described, this represents a theoretical upper bound of the speed-up and can only be achieved in an ideal case. A number of issues have to be considered in the calculation of a more realistic speed-up. Recall that the tasks arrive in a sequence. A multiplexing scheme is thus needed to direct an arriving task to the next available PE. Similarly, the outputs of the PEs must be directed appropriately...
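
    A quick back-of-the-envelope calculation, using assumed unit-time steps, shows why the assembly-line analogy works and why the speed-up of an s-stage pipeline, like that of n replicated PEs, is a theoretical upper bound approached only for a long task stream.

    ```python
    def sequential_time(k, s):
        """One PE: k tasks of s unit-time sequential steps each."""
        return k * s

    def pipelined_time(k, s):
        """s-stage pipeline: s cycles to fill, then one finished task per cycle."""
        return s + (k - 1)

    s = 4
    for k in (1, 4, 100, 10_000):
        print(f"{k:>6} tasks, {s} stages: speedup = "
              f"{sequential_time(k, s) / pipelined_time(k, s):.2f}")
    # Speedup tends to s = 4 as k grows, matching the n-PE upper bound above.
    ```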

  • VLSI Architectures for Modern Error-Correcting Codes
    • Xinmiao Zhang (Author)
    • 2017 (Publication Date)
    • CRC Press (Publisher)

    ...However, the area requirement of a parallel architecture increases proportionally with the number of duplicated copies. Hence, to achieve higher throughput, Pipelining and retiming should be considered before parallel processing if a higher clock frequency is allowed. It should also be noted that duality exists between Pipelining and parallel processing. If multiple independent sequences are processed in an interleaved manner by a pipelined architecture, they can also be computed by duplicated units in a parallel architecture. If the output sequence of a circuit, y(n) (n = 0, 1, 2, …), can be expressed as a mathematical formula in terms of the input sequence x(n), then parallel processing can be achieved by rewriting the formula in terms of parallel inputs and outputs. To obtain a J-parallel architecture that processes J inputs and generates J outputs in each clock cycle, the inputs and outputs need to be divided into J groups x(Jk), x(Jk + 1), …, x(Jk + J − 1) and y(Jk), y(Jk + 1), …, y(Jk + J − 1) (k = 0, 1, 2, …). Once the formulas expressing the parallel outputs in terms of the parallel inputs are derived, the parallel architecture can be drawn accordingly.

    Example 21 Let x(n) and y(n) (n = 0, 1, 2, …) be the input and output sequences, respectively, of the 3-tap filter shown in Fig. 2.8(a). The function of this filter can be described by

    y(n) = a0 x(n) + a1 x(n − 1) + a2 x(n − 2).  (2.1)

    Here x(n − 1) is the sequence x(n) delayed by one clock cycle. To achieve 3-parallel processing, the inputs are divided into three sequences x(3k), x(3k + 1), and x(3k + 2). Similarly, the outputs are divided into y(3k), y(3k + 1), and y(3k + 2).

    FIGURE 2.8 Parallel processing of a 3-tap filter: (a) serial architecture; (b) 3-parallel architecture.

    Replacing n in (2.1) by 3k, 3k + 1, and 3k + 2, it can be derived...
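
    The derivation in Example 21 can be checked numerically. The sketch below implements both the serial 3-tap filter of Eq. (2.1) and the 3-parallel form obtained by substituting n = 3k, 3k + 1, 3k + 2; the coefficients, the input data, and the zero initial history are assumptions for illustration.

    ```python
    a0, a1, a2 = 0.5, 0.3, 0.2               # made-up filter coefficients
    x = [1, 2, 3, 4, 5, 6, 7, 8, 9]          # made-up input sequence x(n)

    def serial_filter(x):
        """Reference serial architecture: one output per input sample."""
        xm1 = xm2 = 0                        # x(n-1), x(n-2), zero history assumed
        y = []
        for xn in x:
            y.append(a0 * xn + a1 * xm1 + a2 * xm2)       # Eq. (2.1)
            xm2, xm1 = xm1, xn
        return y

    def parallel3_filter(x):
        """3-parallel architecture: n = 3k, 3k+1, 3k+2 substituted into (2.1)."""
        xm1 = xm2 = 0                        # delayed samples carried between blocks
        y = []
        for k in range(len(x) // 3):
            x0, x1, x2 = x[3 * k], x[3 * k + 1], x[3 * k + 2]
            y += [a0 * x0 + a1 * xm1 + a2 * xm2,   # y(3k)
                  a0 * x1 + a1 * x0 + a2 * xm1,    # y(3k+1)
                  a0 * x2 + a1 * x1 + a2 * x0]     # y(3k+2)
            xm2, xm1 = x1, x2                      # history for the next block
        return y

    assert serial_filter(x) == parallel3_filter(x)
    print(parallel3_filter(x))
    ```

    The assertion confirms that the unrolled architecture produces exactly the serial outputs while consuming three inputs and emitting three outputs per clock, which is the throughput-for-area trade the excerpt describes.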