Intel® Processors, Hardware and software spot: Intel's Core i7 processors

Intel's Core i7 processors

Those of us who are conversant with technology are more or less conditioned to accept and even expect change as a natural part of the course of things. New gadgets and gizmos debut regularly, each one offering some set of advantages or refinements over the prior generation. As a result, well, you folks are a rather difficult lot to impress, frankly speaking. But today is a day when one should sit up and take notice. I've been reviewing processors for nearly ten years now, and the Core i7 processors we're examining here represent one of the most consequential shifts in the industry during that entire span.
Intel, as you know, has been leading its smaller rival AMD in the performance sweeps for some time now, with a virtually unbroken lead since the debut of the first Core 2 processors more than two years ago. Even so, AMD has retained a theoretical (and sometimes practical) advantage in terms of basic system architecture throughout that time, thanks to the changes it introduced with its original K8 (Athlon 64 and Opteron) processors five years back. Those changes included the integration of the memory controller onto the CPU die, the elimination of the front-side bus, and its replacement with a fast, narrow chip-to-chip interconnect known as HyperTransport. This system architecture has served AMD quite well, particularly in multi-socket servers, where the Opteron became a formidable player in very short order and has retained a foothold even with AMD's recent struggles.
Now, Intel aims to rob AMD of that advantage by introducing a new system architecture of its own, one that mirrors AMD's in key respects but is intended to be newer, faster, and better. At the heart of this project is a new microprocessor, code-named Nehalem during its development and now officially christened as the Core i7.
Yeah, I dunno about the name, either. Let's just roll with it.
The Core i7 design is based on current Core 2 processors but has been widely revised, from its front end to its memory and I/O interfaces and nearly everywhere in between. The Core i7 integrates four cores into a single chip, brings the memory controller onboard, and introduces a low-latency point-to-point interconnect called QuickPath to replace the front-side bus. Intel has modified the chip to take advantage of this new system infrastructure, tweaking it throughout to accommodate the increased flow of data and instructions through its four cores. The memory subsystem and cache hierarchy have been redesigned, and simultaneous multithreading—better known by its marketing name, Hyper-Threading—makes its return, as well. The end result blurs the line between an evolutionary new product and a revolutionary one, with vastly more bandwidth and performance potential than we've ever seen in a single CPU socket.
How well does the Core i7 deliver on that potential? Let's find out.
An overview of the Core i7

The Core i7 modifies the landscape quite a bit, but much of what you need to know about it is apparent in the picture of the processor die below, with the major components labeled.

The Core i7 die and major components. Source: Intel.

What you're seeing, incidentally, is a pretty good-sized chip: an estimated 731 million transistors arranged into a 263 mm² area via the same 45nm, high-k fabrication process used to produce "Penryn" Core 2 chips. Penryn has roughly 410 million transistors and a die area of 107 mm², but of course, it takes two Penryn dies to make one quad-core product. Meanwhile, AMD's native quad-core Phenom chips have 463 million transistors but occupy a larger die area of 283 mm² because they're made on a 65nm process and have a higher ratio of (less dense) logic to (denser) cache transistors. Then again, size is to some degree relative; the GeForce GTX 280 GPU is over twice the size of a Core i7 or Phenom.
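To put those die figures side by side, here is a quick back-of-the-envelope sketch in Python. It uses only the transistor counts and areas quoted above; the dictionary labels and the density helper are my own names for illustration, and the results are rough averages over each whole die rather than anything Intel or AMD publishes.

```python
# Transistor counts (in millions) and die areas (in mm^2) quoted in the text.
chips = {
    "Core i7 (45nm)": (731, 263),
    "Penryn, one die (45nm)": (410, 107),
    "Phenom (65nm)": (463, 283),
}


def density(transistors_millions, area_mm2):
    """Average density in millions of transistors per square millimeter."""
    return transistors_millions / area_mm2


densities = {name: density(*spec) for name, spec in chips.items()}

for name, value in densities.items():
    print(f"{name}: {value:.2f} M transistors/mm^2")

# A quad-core Core 2 product packages two Penryn dies together.
penryn_quad = (2 * 410, 2 * 107)
print(f"Two Penryn dies: {penryn_quad[0]} M transistors, {penryn_quad[1]} mm^2")
```

The Phenom's lower average density lines up with the point above about its older process and its higher share of logic transistors.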

Nehalem's four cores are readily apparent across the center of the chip in the image above, as are the other components (Intel calls these, collectively, the "uncore") around the periphery. The uncore occupies a substantial portion of the die area, most of which goes to the large, shared L3 cache.
This L3 cache is the last level of a fundamentally reworked cache hierarchy. Although not clearly marked in the image above, inside of each core is a 32 kB L1 instruction cache, a 32 kB L1 data cache (it's 8-way set associative), and a dedicated 256 kB L2 cache (also 8-way set associative). Outside of the cores is the L3, which is much larger at 8 MB and smarter (16-way associative) than the L2s. This basic arrangement may be familiar from AMD's native quad-core Phenom processors, and as with the Phenom, the Core i7's L3 cache serves as the primary means of passing data between its four cores. The Core i7's cache setup differs from the Phenom's in key respects, though, including the fact that it's inclusive—that is, it replicates the contents of the higher level caches—and runs at higher clock frequencies. As a result of these and other design differences, including a revamped TLB hierarchy, the Core i7's cache latencies are much lower than the Phenom's, even though its L3 cache is four times the size.
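For a feel of what those associativity numbers mean, the sketch below works out how many sets each cache would have. The sizes and way counts come from the paragraph above, but the 64-byte line size is my assumption (the article doesn't state it), and num_sets is just an illustrative helper. The L1 instruction cache is left out because its associativity isn't given here.

```python
KB = 1024
MB = 1024 * KB

# Assumption: 64-byte cache lines. The article does not state the line size.
LINE_BYTES = 64


def num_sets(size_bytes, ways, line_bytes=LINE_BYTES):
    """Number of sets in a set-associative cache: size / (ways * line size)."""
    return size_bytes // (ways * line_bytes)


# (capacity in bytes, associativity) as quoted in the text.
caches = {
    "L1 data": (32 * KB, 8),
    "L2": (256 * KB, 8),
    "L3": (8 * MB, 16),
}

sets = {name: num_sets(size, ways) for name, (size, ways) in caches.items()}

for name, count in sets.items():
    size, ways = caches[name]
    print(f"{name}: {size // KB} kB, {ways}-way, {count} sets")
```

Under that line-size assumption, an address maps to exactly one set, and the way count is the number of lines that can live in that set at once.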
One mechanism Intel uses to make its caches more effective is prefetching, in which the hardware examines memory access patterns and attempts to fill the caches speculatively with data that's likely to be requested soon. Intel claims the Core i7's prefetching algorithm is both more efficient than Penryn's—some server admins wound up disabling hardware prefetch in Xeons because it harmed performance with certain workloads, a measure Intel says should no longer be needed—and more aggressive, as well.
The Core i7 can get to main memory very quickly, too, thanks to its integrated memory controller, which eliminates the chip-to-chip "hop" required when going over a front-side bus to an external north bridge. Again, this is a familiar page from AMD's template, but Intel has raised the stakes by incorporating support for three channels of DDR3 memory. Officially, the maximum memory speed supported by the first Core i7 processors is 1066 MHz, which is a little conservative for DDR3, but frequencies of 1333, 1600, and 2000 MHz are possible with the most expensive Core i7, the 965 Extreme Edition. In fact, we tested it with 1600 MHz memory, since this is a more likely configuration for a thousand-dollar processor.
For a CPU, the bandwidth numbers involved here are considerable. Three channels of memory at 1066 MHz can achieve an aggregate of 25.6 GB/s of bandwidth. At 1333 MHz, you're looking at 32 GB/s. At 1600 MHz, the peak would be 38.4 GB/s, and at 2000 MHz, 48 GB/s. By contrast, the peak effective memory bandwidth on a Core 2 system would be 12.8 GB/s, limited by the throughput of a 1600MHz front-side bus. With dual channels of DDR2 memory at 1066MHz, the Phenom's peak would be 17.1 GB/s. The Core i7 is simply in another league. In fact, our Core i7-965 Extreme test rig with 1600MHz memory has the same total bus width (192 bits) and theoretical memory bandwidth as a GeForce 9600 GSO graphics card.
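Those peak figures all fall out of the same arithmetic: transfer rate times channel count times bus width. Here's a minimal sketch, assuming 64-bit (8-byte) memory channels and a 64-bit front-side bus; the function name is mine. The nominal 1066 and 1333 rates are rounded, so the results match the quoted numbers to one decimal place.

```python
def peak_bandwidth_gbs(rate_mts, channels, bus_bits=64):
    """Theoretical peak in GB/s for a given transfer rate (millions of
    transfers per second), channel count, and per-channel bus width."""
    bytes_per_transfer = bus_bits // 8
    return rate_mts * channels * bytes_per_transfer / 1000


# Core i7: three DDR3 channels at the speeds mentioned in the text.
core_i7 = {rate: peak_bandwidth_gbs(rate, channels=3)
           for rate in (1066, 1333, 1600, 2000)}

# Core 2: a single 1600MHz front-side bus is the limit.
core2_fsb = peak_bandwidth_gbs(1600, channels=1)

# Phenom: two DDR2 channels at 1066MHz.
phenom = peak_bandwidth_gbs(1066, channels=2)

for rate, gbs in core_i7.items():
    print(f"Core i7, 3 x DDR3-{rate}: {gbs:.1f} GB/s")
print(f"Core 2, 1600MHz FSB: {core2_fsb:.1f} GB/s")
print(f"Phenom, 2 x DDR2-1066: {phenom:.1f} GB/s")
```

These are theoretical ceilings only; measured throughput will come in lower.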
With the memory controller onboard and the front-side bus gone, the Core i7 communicates with the rest of the system via the QuickPath interconnect, or QPI. QuickPath is Intel's answer to HyperTransport, a high-speed, narrow, packet-based, point-to-point interconnect between the processor and the I/O chip (or other CPUs in multi-socket systems). The QPI link on the Core i7-965 Extreme operates at 6.4 GT/s. At 16 bits per transfer, that adds up to 12.8 GB/s in each direction, and since QPI links involve a dedicated pair for each direction, the total bandwidth is 25.6 GB/s. Lower-end Core i7 processors have 4.8 GT/s QPI links with up to 19.2 GB/s of bandwidth. Obviously, these are both just starting points, and Intel will likely ramp up QPI speeds from here in successive product generations. Still, both are somewhat faster than the HyperTransport 3 interconnects in today's Phenoms, which peak at either 16 or 14.4 GB/s, depending on the chip.
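The QPI numbers work the same way. This small sketch reproduces them from the transfer rate and the 16 bits of data per transfer given above; the function name and its return shape are just my choices for illustration.

```python
def qpi_bandwidth_gbs(gt_per_s, bits_per_transfer=16):
    """Return (per-direction, both-directions) peak bandwidth in GB/s."""
    per_direction = gt_per_s * bits_per_transfer / 8
    return per_direction, 2 * per_direction


# Core i7-965 Extreme: 6.4 GT/s link.
fast_one_way, fast_total = qpi_bandwidth_gbs(6.4)

# Lower-end Core i7 models: 4.8 GT/s link.
slow_one_way, slow_total = qpi_bandwidth_gbs(4.8)

print(f"6.4 GT/s: {fast_one_way:.1f} GB/s each way, {fast_total:.1f} GB/s total")
print(f"4.8 GT/s: {slow_one_way:.1f} GB/s each way, {slow_total:.1f} GB/s total")
```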

A block diagram of the Core i7 system architecture. Source: Intel.

This first, high-end desktop implementation of Nehalem is code-named Bloomfield, and it's essentially the same silicon that should go into two-socket servers eventually. As a result, Bloomfield chips come with two QPI links onboard, as the die shot above indicates. However, the second QPI link is unused. In 2P servers based on this architecture, that second interconnect will link the two sockets, and over it, the CPUs will share cache coherency messages (using a new protocol) and data (since the memory subsystem will be NUMA)—again, very similar to the Opteron.
In order to take advantage of this radically modified system architecture, the design team tweaked Nehalem's processor cores in a range of ways big and small. Although the Core 2's basic four-issue-wide design and execution resources remain more or less unchanged, almost everything around the execution units has been altered to keep them more fully occupied. The instruction decoder can fuse more types of x86 instructions together and, unlike Core 2, it can do so when running in 64-bit mode. The branch predictor's accuracy has been enhanced, too. Many of the changes involve the memory subsystem: not just the caches and memory controller, which we've already discussed, but the inside of the core itself. The load and store buffers have been increased in size, for instance.
These modifications make sense in light of the Core i7's much higher system-level throughput, but they also help make another new mechanism in the chip work better: the resurrected Hyper-Threading, or simultaneous multithreading (SMT). Each core in Nehalem can track two independent hardware threads, much like some other Intel processors, including later versions of the Pentium 4 and, more recently, the Atom. SMT takes advantage of the explicit parallelism built into multithreaded software to keep the CPU's execution units more fully occupied, and done well, it can be a clear win, delivering solid performance gains at very little cost in terms of additional die area or power use. Intel architect Ronak Singhal outlined how Nehalem's implementation of Hyper-Threading works at this past Fall IDF. Some hardware, such as the registers, must be duplicated for each thread, but much of it can be shared. Nehalem's load, store, and reorder buffers are statically partitioned between the two threads, for example, while the reservation station and caches are shared dynamically based on demand. The execution units themselves don't need to be altered at all.
The upshot of all of this is that a single Core i7 processor supports a total of eight threads, which makes for a pretty wicked looking Task Manager window. Because of the resource sharing involved, of course, Hyper-Threading won't likely double performance, even in the best-case scenario. We'll look at its precise impact on performance in the following pages.
The changes to Nehalem's cores don't stop there, either. Intel has improved the performance of the synchronization primitives used by multithreaded applications, added a handful of instructions known as SSE 4.2—including some for string handling, cyclic redundancy checks, and popcount—and introduced enhancements for hardware-assisted virtualization. There's too much to cover here, really. If you want more detailed information, I suggest you check out Singhal's IDF presentation or David Kanter's Nehalem overview.
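For readers who haven't met them, here is a plain-Python sketch of what two of those SSE 4.2 operations compute. This is software emulation for illustration only, not how you'd call the hardware instructions. One detail not covered above: as I understand it, the SSE 4.2 CRC instruction uses the CRC-32C (Castagnoli) polynomial, so that's the variant sketched here. The function names are my own.

```python
def popcount(x: int) -> int:
    """Count the 1 bits in a non-negative integer (what POPCNT computes)."""
    return bin(x).count("1")


# Bit-reflected form of the CRC-32C (Castagnoli) polynomial.
CRC32C_POLY_REFLECTED = 0x82F63B78


def crc32c(data: bytes) -> int:
    """Bit-at-a-time CRC-32C: slow, but easy to follow."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; fold in the polynomial if that bit was set.
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


print(popcount(0b1011))             # three bits set
print(hex(crc32c(b"123456789")))    # standard CRC-32C check input
```

The hardware versions do this work in a single instruction per operand, which is the whole appeal for things like checksumming storage and network traffic.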


Thursday, December 11, 2008
