Friday, January 9, 2009

AMD K10 Micro-Architecture



Introduction:

AMD promises to introduce its new quad-core processors with K10 micro-architecture at the end of August or beginning of September this year. The first processors on the new micro-architecture will be server Opteron chips based on a core codenamed Barcelona. Unfortunately, AMD engineers have failed to reach mass-production quantities of high-frequency chips with the current revision. The main obstacle to higher clock speeds appears to be that four cores running at high frequency consume far more power than the platform TDP allows. With every new revision and each transition to a finer production technology, power consumption should keep falling and clock speeds should keep rising. In the meantime, AMD has to start selling processors immediately to improve its financial situation, so the first chip to go on sale will be a quad-core server model running at 2.0GHz.
In Q4 2007 AMD promises to raise Opteron clock speeds to 2.4-2.5GHz and to release desktop processors based on the K10 micro-architecture:
Phenom FX (codenamed Agena FX) – 4 cores, 2MB L3 cache, clock frequencies starting at 2.2-2.4GHz, Socket AM2+ or Socket F+;
Phenom X4 (codenamed Agena) – 4 cores, 2MB L3 cache, clock frequencies starting at 2.2-2.4GHz, Socket AM2+.
Later, in early 2008, AMD promises to introduce “lite” modifications of the new processors:
Phenom X2 (codenamed Kuma) – 2 cores, 2MB L3 cache, clock frequencies starting at 2.2-2.6GHz, Socket AM2+;
Athlon X2 (codenamed Rana) – 2 cores, no L3 cache, clock frequencies starting at 2.2GHz, Socket AM2+;
Sempron (codenamed Spica) – 1 core, clock frequencies starting at 2.2-2.4GHz, Socket AM2+.
But that is all still in the future. For now, let’s take a look at the innovations introduced in the new AMD micro-architecture. In today’s article I will try to describe all the new architectural details and see what practical value they hold for us.

Instruction Fetch:

The processor starts working on code by fetching instructions from the L1I instruction cache and decoding them. x86 instructions have variable length, which makes it harder to determine their boundaries before decoding starts. To ensure that determining instruction lengths does not limit decoding speed, K8/K10 processors pre-decode instructions while the lines are being loaded into the L1I cache. The boundary-marking information is stored in special fields of the L1I cache (3 bits of predecode information per instruction byte). Because predecoding is performed on load into the cache, instruction boundaries are determined outside the decode pipes, which allows a steady decoding rate regardless of instruction format and length.
The processor loads blocks of instructions from the cache and then picks out the instructions that need to be sent for decoding. A K10 CPU fetches instructions from the L1I cache in aligned 32-byte blocks, while K8 and Core 2 processors fetch 16-byte blocks. At 16 bytes per clock, instructions are fetched quickly enough for K8 and Core 2 to send three instructions with an average length of 5 bytes to the decoder every cycle. However, an x86 instruction can be up to 15 bytes long, and in some code the average length of several adjacent instructions exceeds 5 bytes, which makes it impossible to decode three instructions per clock in such cases (Pic. 1).

Pic 1: A few adjacent long instructions limit decoding speed when instructions are fetched in 16-byte blocks.


For example, a simple SSE2 instruction with register-register operands (such as movapd xmm0, xmm1) is 4 bytes long. If the instruction addresses memory through a base register plus an offset (for example, movapd xmm0, [eax+16]), it grows to 6-9 bytes, depending on the offset. If the extended registers (xmm8-xmm15) are used in 64-bit mode, another single-byte REX prefix is added to the instruction code, so SSE2 instructions in 64-bit mode can reach 7-10 bytes. SSE1 instructions are 1 byte shorter if they are vector instructions (that is, if they work on four 32-bit values), but scalar SSE1 instructions (operating on a single value) can also be 7-10 bytes long under the same conditions.
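To make these figures concrete, here is a short NASM-syntax listing with the standard machine-code bytes each encoding produces (the stack pointer is chosen as the base register here because it requires an extra SIB byte, which is what pushes the encodings toward the upper end of the ranges above):

    movapd xmm0, xmm1         ; 66 0F 28 C1                   = 4 bytes
    movapd xmm0, [rsp+16]     ; 66 0F 28 44 24 10             = 6 bytes (8-bit offset)
    movapd xmm0, [rsp+256]    ; 66 0F 28 84 24 00 01 00 00    = 9 bytes (32-bit offset)
    movapd xmm8, [rsp+256]    ; 66 44 0F 28 84 24 00 01 00 00 = 10 bytes (extra REX
                              ;   byte needed to reach xmm8 in 64-bit mode)
    movaps xmm0, xmm1         ; 0F 28 C1                      = 3 bytes (vector SSE1:
                              ;   no 66 prefix, one byte shorter)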
Fetching 16-byte blocks is not a limitation for the K8 processor in this case, because it cannot decode vector instructions faster than 3 per 2 clocks anyway. For the K10 micro-architecture, however, a 16-byte block could become a bottleneck, so increasing the maximum fetch block to 32 bytes is a well-justified measure.


By the way, Core 2 processors fetch 16-byte instruction blocks, just like K8 processors, which is why they can decode 4 instructions per clock efficiently only as long as the average instruction length does not exceed 4 bytes. Otherwise, the decoder cannot sustain 4, or even 3, instructions per clock. However, Core 2 processors have a special internal 64-byte buffer that stores the last four requested 16-byte blocks. Instructions are fetched from this buffer at a rate of 32 bytes per clock. This buffer caches short loops, removing the fetch-rate limitation and saving up to one clock each time the branch back to the beginning of the loop is predicted. For that, a loop must contain no more than 18 instructions, no more than 4 conditional branches, and no ret instructions; an example that fits these constraints follows below.
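Here is a minimal NASM-syntax loop that would fit such a buffer: four instructions, one conditional branch, no ret. The register roles are assumptions made for the sake of the example.

    ; assumes rdi = array base, rcx = element count > 0, rax = 0
    sum_loop:
        add  rax, [rdi]       ; accumulate the current 64-bit element
        add  rdi, 8           ; advance to the next element
        dec  rcx              ; one element fewer to go
        jnz  sum_loop         ; the only branch: back to the loop start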


Branch Prediction:

When the instruction stream branches, the CPU must try to predict the program’s further direction, so that decoding can continue along the most probable path without interruption. Branch prediction algorithms are used to choose the next block of instructions to fetch. K8 processors use a two-level adaptive branch prediction algorithm that takes into account the prediction history not only of the current instruction, but also of the 8 preceding branch instructions. The main drawback of the K8 branch predictor was its inability to predict indirect branches with dynamically alternating targets.
Indirect branches are branches through a pointer that is computed dynamically while the program runs. Compilers typically emit them for switch-case constructions; they are also used for calls through function pointers and for virtual function calls in object-oriented programs. The K8 always uses the last target address of an indirect branch to fetch the next block of code. If the target has changed, the pipeline is flushed. If the target alternates from one execution to the next, the processor mispredicts constantly. Prediction of dynamically changing indirect branch targets was first introduced in the Pentium M processor. Since the K8 has no such algorithm, it is less efficient on object-oriented code.
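Here is a minimal NASM-syntax sketch of the kind of indirect branch a compiler emits for a switch-case (the labels and table are hypothetical):

    ; rdi = case selector, assumed already checked against the table size
    dispatch:
        lea  rax, [rel jump_table]
        jmp  qword [rax + rdi*8]   ; indirect branch: the target changes with rdi

    jump_table:
        dq case0, case1, case2     ; table of handler addresses

    case0:  ; ... handler for case 0
    case1:  ; ... handler for case 1
    case2:  ; ... handler for case 2

If the selector alternates from one execution to the next, a K8, which always predicts the previous target, mispredicts almost every time; a predictor that tracks per-branch target history can follow such patterns.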
As expected, K10 features improved branch prediction algorithms:

It gained a prediction algorithm for dynamically changing indirect branch targets, using a table of 512 elements.
The global history register grew from 8 to 12 bits. It records the taken/not-taken history of preceding branch instructions.
The return-address stack deepened from 12 to 24 entries. This stack supplies a function’s return address immediately, so that fetching can continue without waiting for the ret instruction to read the return address from the stack in memory (see the sketch after this list).
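A schematic NASM-style illustration of what the return-address stack buys (the labels are hypothetical):

    f1: call f2          ; return address pushed onto the return-address stack
        ret
    f2: call f3          ; stack depth now 2
        ret              ; predicted target comes straight from the stack top
    f3: ret              ; call chains nested deeper than the stack capacity
                         ; (24 entries on K10 vs. 12 on K8) overflow it, and
                         ; the returns beyond that depth mispredict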
These improvements should help K10 run programs written in high-level object-oriented languages noticeably faster. Unfortunately, it is very hard to estimate the efficiency of the K10 branch prediction unit objectively, but according to some data it may in certain cases still fall short of that of Intel processors.

Decoding:

The blocks received from the instruction cache are copied into the Predecode/Pick Buffer, where individual instructions are singled out of the block, their types are determined, and they are then sent to the corresponding decoder pipes. Simple instructions that decode into one (Single) or two (Double) macro-operations are sent to the “simple” decoder, called DirectPath. Complex instructions that decode into 3 or more macro-operations are sent to the micro-program (microcode) decoder, known as VectorPath.


Up to 3 macro-operations (MOPs) can leave the decoder pipes each clock cycle. Every cycle the DirectPath decoder can process 3 single-MOP instructions, or one 2-MOP instruction plus one single-MOP instruction, or 1.5 2-MOP instructions (three 2-MOP instructions every two clocks). Decoding a complex instruction can require more than 3 MOPs, which is why it may take several clocks. To avoid conflicts at the exit of the decoder pipes, K8 and K10 do not select simple and complex instructions for decoding simultaneously.
MOPs consist of two micro-operations (micro-ops): an integer or floating-point arithmetic operation and a memory address operation. The micro-operations are split out of the MOPs by the scheduler, which then issues them for execution independently of one another.
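For example, a read-modify instruction maps naturally onto this scheme; assuming it decodes into a single MOP, that MOP carries both micro-ops:

    add rax, [rbx]       ; one MOP holding two micro-ops:
                         ;   1) a load micro-op that reads 8 bytes at [rbx]
                         ;   2) an ALU micro-op adding the loaded value to rax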
The MOPs leaving the decoder each clock are combined into groups of three. Sometimes the decoder produces a group of only 2 or even 1 MOP because of alternating DirectPath and VectorPath instructions or various delays in instruction selection. Such an incomplete group is padded with empty MOPs up to three and then sent on for execution.
Vector SSE, SSE2 and SSE3 instructions on the K8 are split into pairs of MOPs that separately process the lower and upper 64-bit halves of a 128-bit SSE register on the 64-bit execution units. This halves the decoding rate for these instructions and halves the number of instructions that fit into the scheduler queue.
Thanks to the wider 128-bit FPU units in K10 processors, vector SSE instructions no longer need to be split into 2 MOPs. Most SSE instructions that the K8 decoded as DirectPath Double are decoded on K10 as DirectPath Single, into 1 MOP. Moreover, some SSE instructions that used to go through the K8 micro-program VectorPath decoder are handled on K10 by the simple DirectPath decoder, generating fewer MOPs: 1 or 2, depending on the operation.
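A single packed instruction shows the difference described above:

    addpd xmm0, xmm1     ; K8:  2 MOPs, one adding the lower 64-bit halves
                         ;      and one the upper halves on 64-bit FPU units
                         ; K10: 1 MOP executed on a full 128-bit FPU unit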
Decoding of stack instructions has also been simplified. Most stack instructions used in function calls and returns (CALL, RET, PUSH, POP) are now also processed by the simple decoder as a single MOP. Moreover, a special Sideband Stack Optimizer unit tracks the stack-pointer updates itself and transforms these instructions into a chain of independent micro-operations that can be executed in parallel.
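A sketch of the dependency chain the optimizer removes (NASM syntax):

    push rax             ; rsp = rsp - 8, then store rax
    push rbx             ; rsp = rsp - 8, then store rbx
    push rcx             ; rsp = rsp - 8, then store rcx
    ; each push implicitly reads and writes the stack pointer, forming a
    ; serial chain; the Sideband Stack Optimizer computes the cumulative
    ; offsets (-8, -16, -24) at decode time, so the three stores no longer
    ; depend on one another and can execute in parallel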
