Intel Pentium 4 Processor


I)                   INTRODUCTION:

The high-performance Pentium® 4 processor delivers an extremely powerful computing experience, whether you are browsing the Web over a broadband connection, playing cutting-edge online games, watching or creating videos, or running other performance-intensive applications. The chip intensifies the 3D action of your favorite games and enables clear, smooth audio and video streaming.

Pentium® 4 processors provide the performance to power the connected home, linking all your digital devices in order to extend their capabilities, from your digital camera and MP3 player to your entire home entertainment system. Increased performance and headroom let you take advantage of the emerging Internet and computer technologies enabling the connected home. Pentium® 4 processors also deliver high performance when networking multiple PCs, or when attaching your PC to home consumer electronic systems and new peripherals.


II)                AN OVERVIEW OF PENTIUM 4:

The Pentium 4 carries a whopping 42 million transistors, 14 million more than the currently available Pentium III Coppermine processors. This massive increase in transistor count correlates directly with die size, so naturally, the Pentium 4 is significantly larger than its predecessor. So, why did Intel decide to make the Pentium 4 larger? Since the .13-micron process is not quite ready yet (and won't be until next year), the P4 will be etched using the same .18-micron, aluminum-trace process as the Coppermine. It does not take a mathematician to realize that 42 million transistors will not fit in a smaller space than a current product with 28 million. The question then becomes: what purpose do the extra transistors serve?

Figure 1. Intel Pentium 4 Block Diagram.

Intel NetBurst Micro-Architecture:

For the first time since the Pentium Pro, Intel has revamped its micro-architecture, adding features that it says will allow it to deliver leading performance for the next several years.

The first important point to consider is that processor performance is not determined solely by frequency (raw MHz). Rather, it is a function of frequency multiplied by IPC, or instructions per clock cycle. In order to overcome the frequency limitations of the P6 architecture implemented in Pentium II and III systems, Intel developed an architecture that slightly reduced the number of instructions per clock but reaped significantly higher frequency capabilities. Several new features comprise the new architecture, so in order to better understand the workings of the Pentium 4 we have broken them down individually.
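The frequency-times-IPC relationship can be sketched in a few lines (the clock speeds and IPC values below are illustrative assumptions, not Intel's published figures):

```python
# Performance = frequency x IPC: a design that trades some IPC for a much
# higher clock can still come out ahead. Numbers here are hypothetical.
def performance(frequency_mhz, ipc):
    """Millions of instructions completed per second."""
    return frequency_mhz * ipc

p6_style = performance(1000, 1.0)        # e.g. a 1 GHz core at 1.0 IPC
netburst_style = performance(1500, 0.8)  # e.g. a 1.5 GHz core at 0.8 IPC

print(netburst_style / p6_style)  # 1.2 -> 20% faster despite lower IPC
```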


Quite simply, the deeper pipeline provides for increased scalability, which has allowed Intel to debug the Pentium 4 at speeds of 1.5 and 1.4GHz using the same etching process as the Pentium III.

Not all things are peachy in the land of the 20-stage pipeline, however. By doubling the depth of the branch prediction pipe, the penalty associated with mis-predictions is greatly increased: rather than flushing 10 speculatively executed instructions, the Pentium 4 has to flush 20, and start execution over again in the correct program branch. The recovery time on the 20-stage pipe is much longer than on the 10-stage pipe, resulting in a lower average number of instructions successfully executed per clock cycle. To compensate for the lower IPC, Intel has implemented a couple of features that greatly reduce the inherent mis-predict penalty: the Execution Trace Cache and the Advanced Dynamic Execution engine.
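A simple model shows how a larger flush penalty drags down average IPC (the branch frequency and mis-predict rate below are illustrative assumptions, not measured Pentium 4 data):

```python
# Effective IPC once mispredicted branches force a pipeline flush.
# Extra cycles per instruction = branch_freq * mispredict_rate * flush_penalty.
def effective_ipc(base_ipc, branch_freq, mispredict_rate, flush_penalty):
    cycles_per_instr = 1.0 / base_ipc + branch_freq * mispredict_rate * flush_penalty
    return 1.0 / cycles_per_instr

# Assume 20% branches, 5% mispredicted, flush penalty = pipeline depth.
shallow = effective_ipc(1.0, 0.20, 0.05, 10)  # 10-stage pipe
deep    = effective_ipc(1.0, 0.20, 0.05, 20)  # 20-stage pipe
print(shallow, deep)  # the deeper pipe loses more IPC per misprediction
```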

Figure 2.

·        Execution Trace Cache

Level 1 cache is normally split between the instruction and data caches, both of which are 16KB on the Pentium III. This go 'round, Intel has decreased the data cache to 8KB and has re-implemented the instruction cache to store micro-ops in the path of the program execution so that results of program branches are integrated into the same cache line. Latency is eliminated because the execution engine can retrieve decoded operations from the cache directly, rather than fetching and decoding commonly used instructions over and over again. In addition, instructions that are not used do not get stored in the cache, making the Execution Trace Cache more efficient than previous implementations.

·        Advanced Dynamic Execution

The second key to minimizing the branch mis-predict penalty lies with Intel's Dynamic Execution Engine, which keeps the Arithmetic Logic Units busy with instructions to execute. As opposed to the Pentium III, which only provided 42 instructions from which the execution units could choose, the Pentium 4 offers 126, increasing the probability that the data needed after a cache miss will be available immediately rather than having to wait to fetch it from memory. As processor frequency ramps upwards, this becomes increasingly important since system memory speed does not scale with the processor.

In addition to providing a greater window of instructions for the execution engine to choose from, enhanced branch prediction has also been provided to further reduce the number of mis-predictions. Intel estimates this number to be about 33% lower than the P6's branch prediction capabilities because of an enhanced prediction algorithm and a 4KB branch target buffer that stores detail on the history of past branches.

·        Rapid Execution Engine

If you have yet to pick up on a recurring theme for the Pentium 4, here's a clue: execution. In order to further compensate for the lower IPC of the NetBurst architecture, Intel has clocked the Arithmetic Logic Units at twice the frequency of the processor core. So, on a 1.5GHz Pentium 4, the ALUs are screaming at 3GHz, with latency that is half the duration of the core clock.
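The double-pumped ALU relationship is simple enough to state as a one-liner (a trivial model of the 2x clocking described above):

```python
# The double-pumped ALUs run at twice the core clock, so a simple integer
# op can be accepted every half core cycle (illustrative model).
def alu_clock_ghz(core_clock_ghz):
    return core_clock_ghz * 2

print(alu_clock_ghz(1.5))  # 3.0 -> 3 GHz ALUs on a 1.5 GHz Pentium 4
```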

Intel estimates that as processor speeds increase, the integer performance of the Pentium 4 will improve, since the speed of the ALUs (which most significantly impact integer performance) escalates twice as fast.

·        400MHz Front Side Bus

One of the most dramatic additions to the NetBurst architecture is a quad-pumped 100MHz system bus, delivering the equivalent of 3.2GB/s of bandwidth. The idea behind the accelerated 64-bit bus is to match the bandwidth of the dual RDRAM channels, which also provide 3.2GB/s of theoretical bandwidth. Of course, the signaling scheme put in place by Intel could not be 100% efficient, so there is also a buffer to help facilitate sustained 400MHz data transfers. With such a high-speed bus in place, the Pentium 4 is able to push more than three times the amount of data as the Pentium III (which is limited to 1.06GB/s on a 133MHz bus).
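The bus-bandwidth figures above fall straight out of the bus width and transfer rate; a quick check, using the numbers from the text:

```python
# Front side bus bandwidth: width (bytes) x transfers per clock x clock rate.
# A 64-bit bus is 8 bytes wide; "quad-pumped" means 4 transfers per clock.
def bus_bandwidth_gb_s(width_bytes, transfers_per_clock, clock_mhz):
    return width_bytes * transfers_per_clock * clock_mhz * 1e6 / 1e9

p4_fsb = bus_bandwidth_gb_s(8, 4, 100)  # quad-pumped 100 MHz -> 3.2 GB/s
p3_fsb = bus_bandwidth_gb_s(8, 1, 133)  # single-pumped 133 MHz -> ~1.06 GB/s
print(p4_fsb, p3_fsb)
```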

·        Advanced Transfer Cache

Like the Pentium III before it, the Pentium 4 boasts 256KB of on-die cache on a 256-bit bus. Unlike the Coppermine, however, the Pentium 4's L2 cache transfers data on each core clock rather than every other cycle. Given the following equation we can calculate the data transfer rate of the L2 to the CPU's core.

(256-bit (32 byte) x 1 (data transferred per clock) x 1.5GHz) = 48GB/s for Pentium 4 1.5GHz

(256-bit (32 byte) x .5 (data transferred per clock) x 1GHz) = 16GB/s for Pentium III 1GHz

Again, as processor frequencies increase, so does the memory bandwidth of the L2. For example, once Intel hits 2GHz, the L2 will be able to provide 64GB/s of bandwidth - another example of Intel striving to keep the execution units busy rather than sitting idle.
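The L2 bandwidth equations above generalize to a single formula, reproduced here with the figures from the text:

```python
# L2-to-core bandwidth: bus width (bytes) x transfers per core clock x core GHz.
# The Pentium 4 transfers on every clock (1.0); Coppermine on every other (0.5).
def l2_bandwidth_gb_s(width_bytes, transfers_per_clock, core_ghz):
    return width_bytes * transfers_per_clock * core_ghz

p4_15 = l2_bandwidth_gb_s(32, 1.0, 1.5)  # 48 GB/s for a 1.5 GHz Pentium 4
p3_10 = l2_bandwidth_gb_s(32, 0.5, 1.0)  # 16 GB/s for a 1 GHz Pentium III
p4_20 = l2_bandwidth_gb_s(32, 1.0, 2.0)  # 64 GB/s once Intel hits 2 GHz
print(p4_15, p3_10, p4_20)
```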

Figure 3.


In the case of the Pentium 4 a new architecture was the only route to increasing the clock speed, as the aging P6 core had already long since exceeded its design limits.

However, having a processor running at 1+ GHz is useless if it is sitting idle, waiting for data to process. Therefore Intel has to make sure that the rest of the system is capable of feeding enough data to keep it running efficiently. One of the biggest bottlenecks is the memory subsystem responsible for data storage and retrieval. A processor capable of 2 GB/s of bandwidth will be severely bottlenecked by a memory bandwidth of only 800 MB/s. Most code is executed from main memory, and approximately 80% of a processor's cycles are devoted to manipulating this data. With current processor and memory architectures, a 1+ GHz processor demands a memory bus actually capable of that bandwidth. Significant performance benefits await adequate chipsets.



A 1+ GHz CPU runs into its own set of problems, chief among them that the time available per clock cycle shrinks to the point where completing a full instruction within a single cycle is no longer feasible. The CPU needs time to execute an instruction, or, in the case of a pipelined CPU, time to execute multiple instructions.

In essence a CPU is nothing more than an extremely fast calculator, capable of only simple arithmetic and simple logical decisions. For example, take the value of ‘A’ and add it to the value of ‘B’, or determine if ‘A’ is greater than ‘B’. The processor must first know where the values are stored, and what specifically to do with them (e.g., add, multiply). Further, once the instructions and data have been located, interpreted, and executed, the result must be stored in memory for later use. To process an instruction, the processor must:

·        Locate and retrieve the data from memory: Fetch

·        Interpret or translate the instruction from the software: Decode

·        Perform the given instruction on the given data: Execute

·        Place the result back into a memory location: Store

Of course, the above is an extremely simplified version of the process. Suffice it to say that each time an instruction is to be performed, the processor must fetch the data, decode the instruction, execute the instruction, and store the result. All of which has to be performed in one clock cycle; the time required is known as the execution latency.
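The fetch-decode-execute-store loop can be made concrete with a toy machine (the instruction format and names below are invented for this sketch, not a real ISA):

```python
# A toy fetch-decode-execute-store loop for a made-up one-instruction machine.
memory = {"A": 3, "B": 4, "R": 0}
program = [("ADD", "A", "B", "R")]       # R = A + B

for instr in program:                    # fetch the next instruction
    op, src1, src2, dst = instr          # decode its fields
    if op == "ADD":
        result = memory[src1] + memory[src2]  # execute the operation
    memory[dst] = result                 # store the result back to memory

print(memory["R"])  # 7
```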

To increase the performance of a CPU, so that it executes these instructions faster and reduces the instruction latency, the obvious answer is to increase clock speed and thus complete the ‘fetch-decode-execute-store’ loop faster. That’s quite viable, and is frequently used, but can only go so far. Once Intel can’t make the CPU execute any faster, why not give it less to do per cycle? Instead of fetching, decoding, executing, and storing in one cycle, suppose Intel breaks the work into four steps: fetch, decode, execute, and store are each done in a single clock cycle. This is a 4-stage pipeline, and it effectively quadruples the clock speed. However, with only one instruction in flight at a time, the pipelined CPU will not be any faster than the original one, as each instruction takes the same total time to finish; both designs execute identically.

In reality, the different stages of the fetch-decode-execute-store loop do not need to be executed sequentially: why wait to fetch the next instruction until the first fetch-decode-execute-store loop is finished? Simply start fetching the next instruction right away. As a result, only the first instruction requires four clock cycles; after that, one instruction completes per clock cycle. That is, after 100 clock cycles our 4-stage pipelined CPU will actually complete 97 instructions (4 cycles for the first instruction, then one instruction per clock for the subsequent 96 clocks), not the 25 a non-overlapped design would manage. This gives the 4-stage CPU an IPC (instructions per clock cycle) rating of about 0.97, much better than 0.25, but still less than the 1.0 IPC of the non-pipelined CPU. Although the IPC rating is slightly lower than that of Intel’s non-pipelined CPU, the clock speed is 400% faster, so Intel’s 4-stage CPU is actually a much faster design (4 x 0.97 ≈ 3.9 times).
This has been one of the most important motivations for Intel's design of the Pentium 4 micro-architecture, as the P6 architecture could not be made to run much faster than a GHz without extensive rework of its fundamentals. One of the most prominent features of the Pentium 4 architecture is therefore its deep 20-stage pipeline, implemented to reduce the work done per pipeline stage and increase the clock-speed scalability of the architecture.
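The pipeline arithmetic above can be checked with a short calculation (using the illustrative 4-stage example from the discussion, not actual Pentium 4 figures):

```python
# With an N-stage pipeline and no stalls, instruction i completes at cycle
# i + N, so `cycles` clocks retire (cycles - N + 1) instructions: the first
# takes N cycles to drain through, then one completes per clock.
def instructions_completed(cycles, pipeline_depth):
    if cycles < pipeline_depth:
        return 0
    return cycles - pipeline_depth + 1

completed = instructions_completed(100, 4)
print(completed, completed / 100)  # 97 instructions, IPC of about 0.97
```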

Figure 4. Pentium 4

·        Quad-Pumped System Bus in the Pentium 4

Not only the features specified above are important for speed, but also the Level 2 cache and the system bus. The latter is the connection between processor and main memory, and is usually clocked slower than the processor. In principle, the faster the bus clock, the faster the total output of the computer. The Power Macs of the first generations had a bus clock of only 50 MHz; the beige G3 Macs managed up to 66.7 MHz and the PowerBooks up to 83.4 MHz. The blue-and-white G3s and all G4s reached 100 MHz. PCs, however, have faster bus systems: the best Pentium III configurations currently run at 133 MHz. The Pentium 4 catapults the bus clock about three times upward. With the help of the Intel i850 chipset, which requires Rambus RDRAM as memory, system bus clocks of 400 MHz should be possible. Compared with the current G4s, this is a ratio of 4 to 1 in favor of the Pentium PC.

·        Cache in Full Processor Speed

The system bus clock also used to determine the clock of the Level 2 cache, but onboard, inline, and backside caches became important performance factors as well. The goal of these developments was a fast data supply: the system bus and the other system components could no longer keep up with continually increasing processor speeds. So that fast processors are not slowed down by these components, computer manufacturers tried to create a cache that is faster than the system bus. While the first PowerPC and Pentium computers still had to get by without a Level 2 cache, Pentium III computers used 256 KB located in the processor core. The advantage: compared to the backside cache of a Power Mac, which is usually clocked at half the processor rate, the Pentium III can use the full clock of the processor. The Pentium 4 offers the same concept: 256 KB in the processor core with a 1:1 clock ratio between CPU and cache. In contrast to the Pentium III, the bandwidth tripled, which amounts to a substantial performance boost. But according to Intel, the so-called trace cache brings the actual advantage in performance: similar to Transmeta's approach, the code is already translated rather than decoded just in time in the L1 instruction store. This procedure saves additional waiting time.


III)              CONCLUSIONS:

It’s very clear that Intel has put a lot of thought into the Pentium 4’s overall design. Intel’s main objective for the Pentium 4 was to greatly enhance multimedia performance, because Intel believes that multimedia is where the greatest demand for CPU performance lies. Intel is definitely onto a winner with the new Pentium 4 processor, and there are many reasons for this. One major reason is the large number of new features the Pentium 4 has to offer: the NetBurst architecture, the quad-pumped FSB, Hyper Pipelined Technology, and the SSE2 instructions are what will make the Pentium 4 a real killer. With Intel’s future plans for the Pentium 4, it is only going to get better. Low latency and high bandwidth are the keys to the Pentium 4’s high-performance caches: the high-hit-rate L1 cache and the extremely high-bandwidth L2 cache make the Pentium 4 a solid starting ground for any future NetBurst micro-architecture based designs. The 144 new SSE2 instructions that the Pentium 4 features will create a major gain in performance once they are fully integrated into new software titles. Finally, the Pentium 4’s high clock frequencies will make it very attractive to end users seeking the very best and latest technology the PC market has to offer.