Ever since its Conroe architecture took the world by storm, Intel, the world's largest CPU manufacturer, has been on an undeniable roll. AMD has been unable to hit back with anything concrete, and competition has been lean. Intel has even gone so far as to promise us something new in processor architecture every two years or so. In fact, many people have made much of Intel's "tick-tock" development model: each tick represents a die shrink, and each tock denotes a major architectural change. Late last year, Intel delivered the latest tock in the form of the processor codenamed Nehalem, finally given the marketable moniker of Core i7, with its first processor family dubbed "Bloomfield".
We tested a Nehalem processor, the Core i7-965 Extreme Edition, a few months back and compared it to the (then) fastest desktop processor in the world, the Intel Core 2 Extreme QX9770 (3.2 GHz). Needless to say, the Core i7 was at least 15 per cent faster, 90 per cent of the time, in both real-world tests and synthetic benchmarks. Not only was the i7 significantly faster than any AMD Phenom processor, it was also noticeably faster than the fastest quad-core Intel Yorkfield processor around. When you consider that the Core 2 architecture isn't very old, and that Intel has been continuously churning out faster Core 2 processors, Nehalem becomes something rather special.
One of the firsts for Intel in the desktop space is that Nehalem is a native quad-core. Earlier, only AMD offered native quad-cores with its Phenom X4 range. Intel's quad-core processors before the i7 were basically two dual-core dies slapped together in a single package; a fact that has been of great interest to AMD, which has thumbed its nose rather regularly at what it calls a Jurassic design. The market is, however, more concerned with real-world performance than with on-paper specifications, and the Core architecture is much faster than anything AMD has been able to conjure up. Now the newcomer is out with promises of even more processing grunt. Of course, AMD isn't whittling sticks either; its Phenom II processors have just made an entry, but they have a very tough opponent to beat.
Intel claims to have designed the Core i7 with scalability in mind, and looking at its released block diagrams we have to agree. Everything about the i7 design is modular, and while the current Nehalem CPUs are native quad-cores, we could just as easily see eight-core CPUs, or even dual-cores with integrated graphics, in the future. Nehalem is designed from the ground up to be flexible, and Intel can easily add or remove cores and other components as market needs and demands dictate. Nehalem also resurrects an old trick, Hyper-Threading (HT), which had been dropped after the Pentium 4. Hyper-Threading is Intel's name for simultaneous multithreading (SMT), a technique in which instructions from more than one thread can be processed by a single processor core.
A quad-core Nehalem CPU can, therefore, process eight threads, while an eight-core Nehalem would process a previously unheard-of 16 threads. This gives the i7 much better multi-threaded performance while keeping thermal envelopes and silicon requirements much lower than they would be if additional cores were added instead.
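The thread arithmetic is simple enough to sketch in a few lines of Python; the `logical_threads` helper below is purely our own illustration, not any Intel API:

```python
# Under Hyper-Threading (Intel's SMT), each physical core exposes
# two logical processors to the operating system.
THREADS_PER_CORE = 2  # Nehalem's SMT width

def logical_threads(physical_cores: int) -> int:
    """Hardware threads exposed by an SMT-enabled Nehalem CPU."""
    return physical_cores * THREADS_PER_CORE

print(logical_threads(4))  # quad-core Core i7: 8 threads
print(logical_threads(8))  # hypothetical eight-core Nehalem: 16 threads
```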
Doing more with less seems to be Intel's design motto for the i7. HT also means fewer wasted clock cycles, as each core has a much better chance of being kept busy: it can be fed two threads at a time instead of one. Nehalem's cores are much faster than those of the Pentium 4, partly because they are derived from the much faster Core architecture and partly because of their much shorter execution pipeline; HT on an i7 should therefore be far more efficient than it ever was on the Pentium 4.
Intel's Penryn CPUs added support for SSE4.1, and Nehalem brings with it support for SSE4.2. The main targets of the new instructions are faster XML processing, voice recognition, error-detection code (CRC) calculation, and DNA sequencing. What we're seeing here is a greater emphasis on performance per watt. While Nehalem CPUs have a higher TDP (Thermal Design Power) than Penryn, 130 watts versus 95 watts, Intel claims that the performance hike is more than proportionate to the increased power requirement, or at worst equal to it. In fact, Nehalem is designed on the Atom processor ethos: maintain at least a 2:1 ratio between increments in performance and in power consumption at all times.
Nehalem is a native quad-core, with four cores sitting on a single die. Unlike previous quad-cores, wherein two cores shared an L2 cache, the i7 gives each core discrete L1 and L2 caches, with an L3 cache in common. The Core i7 is built from a budget of 731 million transistors; in comparison, Yorkfield needed 820 million. The die size of Nehalem is larger though, increasing from 214 mm² to 263 mm². That's fewer transistors at reduced density, which should bring thermal benefits. Nehalem has the same L1 cache as the Yorkfield CPUs: a 32 KB instruction cache and a 32 KB data cache per core. The L2 cache, however, is significantly smaller: each core in a quad-core Nehalem gets only 256 KB (1 MB in total), while Yorkfield CPUs had as much as 3 MB of L2 cache per core. Finally, there is a single 8 MB L3 cache to which all cores have access.
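Tallying the cache budget described above shows just how lopsided the new hierarchy is towards the shared L3. A quick sketch (figures from the text; the variable names are ours):

```python
KB = 1024
MB = 1024 * KB

CORES = 4
L1_PER_CORE = 32 * KB + 32 * KB  # instruction + data caches, per core
L2_PER_CORE = 256 * KB           # private L2, per core
L3_SHARED = 8 * MB               # single L3, common to all four cores

total_l2 = CORES * L2_PER_CORE
total_cache = CORES * (L1_PER_CORE + L2_PER_CORE) + L3_SHARED

print(total_l2 // MB)     # 1 (MB of L2 in total, as noted above)
print(total_cache // KB)  # 9472 (KB of cache on the whole die)
```

The shared 8 MB L3 dwarfs the combined 1 MB of private L2, which is the opposite of Yorkfield's cache balance.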
Now why did Intel make a high-end CPU with a fraction of the L2 cache? And why introduce an L3 cache, which we know is slower than any L2 cache? The answer lies in one of the most radical architectural differences between Core i7 and Core 2: the inclusion of an on-die, or integrated, memory controller (IMC). Gone is the memory controller on the Northbridge; this is where AMD fans will scream "copycat". That's not all. Gone, too, is the ancient bi-directional Front-Side Bus (FSB), in favour of what Intel calls the QuickPath Interconnect (QPI), which is equivalent to AMD's HyperTransport with a few differences. Intel had pretty much reached the upper limits of what it could do with its FSB; with FSB frequencies approaching 1,600 MHz, even faster memory could not solve the problem of keeping the CPU fed with data. In fact, it is the IMC that resulted in the changed cache sizes on Nehalem, and not the other way around. Industry experts have rightly criticized Intel for hiding the huge memory latencies of an FSB-based CPU-to-RAM interconnect with a brute-force approach: increasing the processor cache to sometimes crazy levels. This is an inefficient, not to mention costly, approach. Now, thanks to the IMC, Intel can make Nehalem shine with a relatively small L2 cache. Another trick is the inclusion of a new second-level Translation Lookaside Buffer (TLB); earlier CPUs made do with a single TLB. The TLB is a CPU cache used by memory-management hardware to speed up virtual-address translation. Intel also copies AMD's way of designating specific processor regions: Nehalem is divided into two parts, termed the "core" and the rather awkward "uncore". The core includes the physical CPU cores and the L1 and L2 caches, while the uncore consists of everything else on the chip, including the L3 cache and the IMC.
The branch-prediction mechanism of the Core i7 has also seen a significant change. A second-level branch predictor, with a much larger data set and a deeper history, has been added to the prediction unit in each Nehalem core. The deeper history, in conjunction with the larger data set, allows for broader searching. Although this secondary predictor is slower than the primary unit, its ability to correct wrong predictions made at the first level, and thus avoid costly pipeline flushes, should more than compensate for its reduced speed. Then there is a buffer called the Renamed Return Stack Buffer, which ensures that correct data is pulled from the processor's stack even after a mispredicted branch. Nehalem thus improves the chances of a hit, and minimizes overhead and latency in case of a miss: a huge plus, to be sure. HT together with this two-stage branch prediction should bring immense benefits when working with multiple threads. It could also do wonders for single-threaded applications: since each core's resources are shared according to the workload, if two threads are pushed to a core, each of them gets half its resources; if a single thread is executed, it gets the full attention of the core it is pushed to, giving it greater performance. While on the topic of parallelism, Nehalem can keep more micro-ops in flight than Penryn; the count is up from 96 to 128. Micro-ops, short for micro-operations, are the low-level internal instructions into which complex machine instructions are decoded. So the i7 gains parallelism at the instruction level as well: a serious benefit when you consider that each core supports HT.
Now a little bit about that shiny new IMC. Intel's IMC supports three memory channels instead of two, so be prepared to see i7-based motherboards with six memory slots. The IMC supports only DDR3, and up to 24 GB of RAM, which is good for servers or workstations needing more than 8 GB. Although the IMC doesn't support the fastest DDR3 memory, it doesn't need to: the theoretical bandwidth of a tri-channel configuration, even at memory speeds of 1,333 MHz, is higher than that of a dual-channel configuration running memory clocked at 1,800 MHz. Besides offering greater bandwidth than all previous Intel systems, the IMC on each i7 ensures that latencies are minimal. In fact, slower memory also means tighter timings, which we've seen to be more conducive to performance than higher memory clock speeds. All this will sound very familiar to anyone following processor technology over the past five years; it is exactly what AMD told us about its own memory controller: theoretical bandwidth doesn't matter as much as an IMC does, because most of the increased bandwidth gets eaten up by FSB latencies. It seems Intel was listening to AMD as well. Thanks to the IMC, multi-CPU systems based on Nehalem will have a dedicated RAM pool per CPU. If one CPU needs data that resides in another's pool, the QPI link between the two CPUs comes into play. This adds some latency, but it should not be a lot; and it obviously won't affect desktop users with a single CPU.
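The bandwidth claim is easy to verify on paper: each DDR3 channel is 64 bits wide, so it moves 8 bytes per transfer. A back-of-the-envelope check (the helper function is ours):

```python
BYTES_PER_TRANSFER = 8  # one 64-bit DDR3 channel moves 8 bytes per transfer

def peak_bandwidth_gbs(mt_per_s: float, channels: int) -> float:
    """Theoretical peak bandwidth, in GB/s, for a DDR3 configuration."""
    return mt_per_s * 1e6 * BYTES_PER_TRANSFER * channels / 1e9

tri_1333 = peak_bandwidth_gbs(1333, 3)   # Core i7: three channels of DDR3-1333
dual_1800 = peak_bandwidth_gbs(1800, 2)  # enthusiast dual-channel DDR3-1800

print(round(tri_1333, 1), round(dual_1800, 1))  # ~32.0 GB/s vs 28.8 GB/s
```

Even with the slower modules, the third channel pushes the i7 ahead on paper.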
One of the main talking points of the Nehalem architecture is the QPI link itself. The QPI is a point-to-point interface, so there is no contention for bandwidth. It is the means by which CPUs talk to each other; it also connects the CPU to the motherboard chipset, through which it reaches PCI Express, USB, and SATA components. So far, all the Nehalem desktop CPUs Intel has released have a single QPI link; server CPUs will have two, the second link allowing crosstalk between CPUs in multi-socket systems. The highest-end Nehalem, the Core i7-965 Extreme Edition, has a link that runs at 6.4 GT per second (GT stands for gigatransfers: yet another new term!), while the lower Nehalems have a QPI link that runs at 4.8 GT per second. The total bandwidth offered by each QPI connection is 25.6 GB per second, carried over a pair of 20-bit links, one for each direction. Note that this 25.6 GB per second is double the bandwidth available on the 1,600 MHz FSB of Intel's X48 chipset. In truly modular fashion, Intel can add more QPI links to a Nehalem CPU if the need arises.
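The 25.6 GB/s figure falls out of the link parameters: 16 of each 20-bit link's bits carry payload, so at 6.4 GT/s each direction moves 12.8 GB/s. A quick check (the function name is ours):

```python
DATA_BITS_PER_LINK = 16  # 16 of the 20 bits per QPI link carry payload

def qpi_bandwidth_gbs(gt_per_s: float) -> float:
    """Per-direction QPI bandwidth in GB/s."""
    return gt_per_s * DATA_BITS_PER_LINK / 8

per_direction = qpi_bandwidth_gbs(6.4)  # 12.8 GB/s each way
total = per_direction * 2               # both directions: 25.6 GB/s

fsb_1600 = 1.6 * 8  # 1,600 MT/s on a 64-bit FSB = 12.8 GB/s

print(total, total / fsb_1600)  # 25.6 GB/s, exactly 2x the X48 FSB
```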
The clock speed of every Nehalem CPU is derived from the base QPI clock of 133 MHz. This is much lower than the base FSB clocks of previous Intel CPUs, which were 333 MHz or even 400 MHz; therefore all Nehalem CPUs have much higher multipliers. The L3 cache and the IMC operate on a different clock, called the uncore clock, which is 20× the base QPI clock, i.e. 2.66 GHz. This is similar to AMD's Phenom design, where multiple clocks are used for different parts of the chip. In terms of power management Nehalem differs as well: while Phenom allows each core to request a different clock speed (incidentally the reason one needs to install the AMD Cool'n'Quiet driver), Nehalem attempts to run all cores at the same clock speed at all times, the only exception being when Turbo Mode (see below) kicks in. If a particular core isn't being used, it is simply power-gated and effectively turned off.
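The multiplier arithmetic works out as follows; a brief sketch (the helper is ours, and we use the more precise 133.33 MHz base figure, so the uncore value rounds to 2.67 rather than the 2.66 quoted above):

```python
BASE_CLOCK_MHZ = 133.33  # base QPI clock; other clocks are multiples of it

def clock_ghz(multiplier: int) -> float:
    """Clock derived from the base QPI clock, in GHz."""
    return multiplier * BASE_CLOCK_MHZ / 1000

print(round(clock_ghz(24), 2))  # i7-965 EE cores: 24 x 133.33 MHz ~ 3.2 GHz
print(round(clock_ghz(20), 2))  # uncore: 20 x 133.33 MHz ~ 2.67 GHz
```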
Talking about multipliers brings up the inevitable topic of overclocking. A decade back, the mere mention of overclocking was enough to get people at Intel all riled up, but with Nehalem, Intel is in one sense encouraging it. The i7 supports a feature called Turbo Mode, also marketed as Turbo Boost. A processor is rated to operate within certain constraints: a maximum temperature, current, and power consumption. If Turbo Mode is enabled, and the processor's on-die Power Control Unit (PCU), which monitors the thermal condition and power usage of the cores, indicates that it is running within these constraints, one or more cores will start to operate at higher clock speeds, delivering higher performance for the threads they are executing.
When cores are clocked up in this way, steps of 133.33 MHz are used: the base frequency mentioned earlier. When this increase occurs depends on conditions inside the machine, most particularly the cooling system; the more efficient the cooling, the more likely clock speeds will rise. The most likely situation for Turbo Boost to be effective is when a single- or dual-threaded compute-intensive application is running, with little demand on the other cores. In that case, the core running the main thread is very likely to be clocked up, with the other cores idle, reducing power consumption and heat dissipation.
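The PCU's decision can be thought of as a small rule: if the constraints allow, add whole 133.33 MHz bins, with more headroom when fewer cores are active. The toy model below is entirely our own illustration; the bin counts are hypothetical, not Intel's actual tables:

```python
BIN_MHZ = 133.33  # Turbo steps come in whole multiples of the base clock

def turbo_clock_mhz(rated_mhz: float, active_cores: int,
                    within_limits: bool = True) -> float:
    """Toy model of Nehalem's Turbo Mode.

    If the PCU reports the chip within its thermal/power limits, a lightly
    threaded load (one active core) gets two bins, heavier loads get one.
    The bin counts here are illustrative, not Intel's real tables.
    """
    if not within_limits:
        return rated_mhz  # constraints exceeded: stay at the rated clock
    bins = 2 if active_cores <= 1 else 1
    return rated_mhz + bins * BIN_MHZ

print(turbo_clock_mhz(3200, active_cores=1))  # single-threaded: +2 bins
print(turbo_clock_mhz(3200, active_cores=4))  # all cores loaded: +1 bin
```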
This feature addresses the standing criticism of multi-core processors: not all compute-intensive applications have the inherent parallelism to benefit from multiple cores, and such processors offer no performance improvement for single-threaded applications. With Turbo Boost enabled, such an application can gain extra performance through the increased clock speed of the core on which it runs. This is perhaps the most telling expression of the scalability Intel claims to have built into the new architecture.
At the outset, it's evident that Intel adopted a very cost-conscious approach to designing the i7. It clearly draws on the best points of previous architectures and, with a borrowed trick or two, manages to raise the performance bar considerably. Partly inspired by the Atom design, the i7 has a very low die-area cost. We found Nehalem to be around 20 per cent faster than its predecessor, and all this on a new platform that is bound to see optimizations and further performance increments. LGA 1366 should be a good platform to invest in for those looking at a powerful desktop processor; yet the cost of a new motherboard, coupled with the high prices of DDR3, means Nehalem may find few takers in price-sensitive markets such as India. As of now, the X58 is the only chipset available, although NVIDIA has obtained the necessary license from Intel and we could see NVIDIA boards supporting the i7 soon. Intel also says that future eight-core Nehalems will be designed for socket 1366; that is serious longevity for the platform, although whether Intel keeps its promise remains to be seen. The mess of platforms of the Pentium 4 / Pentium D era is all too fresh in our minds; we hope the 1,366-pin socket stays awhile, else early adopters will surely have grounds for complaint.
It seems Intel deliberately chose features that ensure it excels in the server domain, which has, by far, been its Achilles' heel over the past two years. In fact, a number of the tricks under Nehalem's 45 nm hood won't even be noticed unless the i7 is used in multi-socket servers; most of them involve parallelism and, of course, more effective caches and prediction systems. Desktop developers need to get cracking on making their applications more multithreaded, because this is where the i7 shines, and future processors will follow suit. Regardless of intent, the i7 offers a nice performance boost across desktop applications as well; gaming, 3D rendering, and video encoding are the three biggest beneficiaries. It even delivers improved performance on a core-to-core basis, so Intel has another architectural winner.
Just when we thought it couldn't get much better, Intel throws us a few more delightful surprises. We've heard that Intel plans to introduce a new Bloomfield derivative family named "Lynnfield". Largely identical to the i7, these CPUs sacrifice the QPI for a more traditional DMI link and forego tri-channel DDR3 for a dual-channel setup. They will use a 1,156-pin socket, and here's the killer: there's an on-package PCI Express controller supporting 16 PCI Express lanes. Yes, the PCIe lanes are on the CPU itself! These 16 lanes can be used as a single x16 connection or as two x8 connections. Intel has also discussed "Havendale", a dual-core variant with 4 MB of L3 cache. These will be the first Intel CPUs to have a graphics core integrated into the CPU package. Havendale also gets Lynnfield's 16-lane PCI Express controller, but it cannot be run as two x8 connections. Of course, Lynnfield is still a good four months or so from market, while Havendale won't debut until the end of 2010. The only thing better than living in interesting times is the knowledge that the best is yet to come!