Saturday, January 10, 2009

AMD's Next Generation Microarchitecture

On July 27, 2006, Intel officially introduced its new Core 2 processor to the public. Based on the Conroe core, it proved to be a breakthrough in CPU performance, and AMD simply doesn't have a processor that can match the top-end models of the Core 2 series.
Unfortunately for AMD, the company grew overconfident after the success of its K8 series and didn't prepare a timely response. According to recently published roadmaps, AMD will release a new processor, currently known by its codename K8L (other sources refer to it as the K10), in the middle of next year, i.e. a full year after Intel's release of the Core 2.
The K8L is designed from the start as a quad-core chip. Its four execution cores will reside on a single die, all connected to a shared L3 cache, a common crossbar and a common memory controller.
In this article we will discuss the microarchitectural improvements that are already known or expected in the upcoming processor, compare it with the Conroe microarchitecture, and evaluate its likely strengths, weaknesses and future prospects.


Fig-1


Instruction Fetching:

Each clock cycle the K8 processor fetches instructions from the L1I instruction cache in aligned 16-byte blocks into a buffer, where individual instructions are extracted from the block and sent to the decoder's channels. The 16-byte-per-cycle fetch rate allows three instructions with an average length of up to about 5 bytes to be sent for decoding each cycle. In certain algorithms, however, the average instruction length in a sequence may exceed 5 bytes.
In particular, a simple SSE2 instruction with register-register operands (for example, MOVAPD XMM0, XMM1) is 4 bytes long. If an instruction uses indirect addressing (with a base register and an offset, as in MOVAPD XMM0, [EAX+16]), its length grows to 6-8 bytes. In 64-bit mode a one-byte REX prefix is added to the instruction code when the additional registers are employed, so SSE2 instructions may reach 7-9 bytes in 64-bit mode. An SSE1 vector instruction (one that operates on four 32-bit values) may be one byte shorter, but a scalar SSE1 instruction (operating on a single value) is also 7-9 bytes long under the same conditions.
In this situation, a fetch rate of 16 bytes per cycle doesn't seem fast enough to sustain a decoding rate of 3 instructions per cycle. This limitation is not important for the K8 because vector SSE and SSE2 instructions are decoded at a rate of 3 instructions per 2 clock cycles (i.e. 1.5 instructions per cycle), which is enough to load the two 64-bit FPUs. The future processor, however, must sustain a rate of at least 3 instructions per cycle. Considering this, the 32-byte fetch blocks announced in the presentation of the K8L's architectural innovations don't look excessive: if a sequence of such long instructions spans several neighboring 16-byte blocks, an average fetch tempo of 3 instructions per clock cannot be achieved with 16-byte blocks.
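For illustration, here are the encoded lengths of a few such instructions (byte counts are for these specific forms; exact lengths vary with the addressing mode):

movaps xmm0, xmm1        ; SSE1, register-register: 3 bytes
movapd xmm0, xmm1        ; SSE2, register-register: 4 bytes
movapd xmm0, [eax+80h]   ; SSE2, base register + 32-bit offset: 8 bytes
movapd xmm8, [rax+80h]   ; the same in 64-bit mode with a REX prefix: 9 bytes

Three consecutive instructions like the longer ones above average well over 16/3 = 5.3 bytes, so a 16-byte fetch block cannot deliver them at 3 per cycle.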

Fig-2


Figure 2 illustrates the positioning of five long instructions in a 32-byte block, all of which can be fetched in one clock cycle. If the instructions were fetched in 16-byte blocks, a fetch rate of 3 instructions per cycle would be impossible.
By the way, Conroe processors fetch instructions in 16-byte blocks, just like K8 processors do, so they can decode the instruction stream at a rate of 4 instructions per clock only when the average instruction length is no more than 4 bytes. Otherwise the decoder cannot sustain even 3 instructions per clock, let alone 4. To mitigate this in short loops, the Conroe has a special internal buffer that caches loops up to 64 bytes long (four 16-byte blocks) and allows fetching within such loops at a rate of 32 bytes per cycle. A loop longer than four blocks cannot be cached in this buffer.
The fetching of the next block of instructions relies on the branch prediction mechanism whenever branch instructions are present. Branches are predicted in the K8 by simpler algorithms than those employed in the Conroe. For example, the K8 cannot predict alternating indirect branches (which may hurt the execution of object-oriented, polymorphic code) and does not always correctly predict branches that alternate in regular patterns. The branch prediction mechanism will be improved in the K8L, but there is no detailed information about it yet. The branch tables and counters will probably be enlarged, and the algorithm for predicting regularly alternating branches may be improved.
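As a hypothetical illustration, the copy loop below encodes into roughly 15 bytes, so it fits easily into the Conroe loop buffer and can be streamed at the doubled fetch rate:

copy_loop:
    movaps xmm0, [esi]   ; read 16 bytes
    movaps [edi], xmm0   ; write 16 bytes
    add    esi, 16
    add    edi, 16
    dec    ecx
    jnz    copy_loop     ; the whole loop body is well under 64 bytes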
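A typical problem case is the indirect call behind a virtual method invocation. The fragment below is hypothetical, but shows the pattern: if the call target alternates between two object types, a predictor that remembers only one target per branch mispredicts on every switch:

mov  eax, [esi]          ; load the object's vtable pointer
call [eax+8]             ; indirect call through a vtable slot; the target
                         ; changes whenever the object's type changes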

Decoding:

The x86 instructions extracted from the block of bytes are decoded into macro-operations. A macro-op consists of up to two micro-operations: an integer or floating-point arithmetic micro-op and an address micro-op for memory access. The splitting into micro-ops is done by the scheduler just before they are sent for execution. The decoder of K8 processors distinguishes between three types of instructions:

DirectPath Single instructions are decoded into one macro-op by the hardware decoder;
DirectPath Double instructions are decoded into two macro-ops by the hardware decoder;
VectorPath instructions are decoded into three or more macro-ops using the on-chip microcode-engine ROM.
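A few examples of each class (classifications as given in AMD's optimization documentation for the K8):

add   eax, ebx           ; DirectPath Single: one macro-op
add   eax, [esi]         ; DirectPath Single: the load and the ALU operation
                         ; travel together in the same macro-op
addpd xmm0, xmm1         ; DirectPath Double on the K8: split into two
                         ; 64-bit macro-ops
div   ecx                ; VectorPath: microcoded, several macro-ops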

In a K8 processor, DirectPath and VectorPath instructions cannot be dispatched simultaneously. The decoder issues its results at a rate of 3 macro-ops per cycle. Thus, the hardware decoder can handle 3 Single instructions, 1 Double plus 1 Single instruction, or 1.5 Double instructions (3 Double instructions per two cycles) per cycle. Since one VectorPath instruction can produce more than 3 macro-ops, decoding it may take more than one cycle.
The macro-ops produced by the decoder each clock cycle are united into groups. A group of only 2 or even 1 macro-op is possible due to the alternation of DirectPath and VectorPath instructions and to various instruction fetch delays. Such a group is padded with empty macro-ops so that there are three macro-ops in total, and is then dispatched.
Vector instructions from the SSE, SSE2 and SSE3 sets are split in the K8 into pairs of macro-ops that separately process the upper and lower 64-bit halves of a 128-bit SSE register on the 64-bit execution units. That's why such instructions are decoded in the K8 at a rate of 3 instructions per 2 clock cycles. The SSE units of the future K8L will be widened to 128 bits, so there will be no need to split vector instructions in two. The decoding algorithm will obviously be changed so that vector instructions are decoded into single 128-bit macro-ops at a rate of 3 instructions per cycle.
Although the K8L's decoder may not be able to decode 4-5 instructions per cycle the way the Conroe can under favorable conditions, this should not hinder program execution, because instructions are on average executed at fewer than 3 per cycle. Moreover, the K8 usually decodes an x86 instruction into fewer macro-ops than the Conroe does. This, together with the 32-byte instruction fetching, should let the K8L's decoder keep its execution units fed.
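To make the difference concrete, here is how a single packed-double add might decode on each chip (a sketch based on the disclosed plans):

addpd xmm0, xmm1         ; packed add of two 128-bit registers
; K8 : two 64-bit macro-ops (low and high halves), so the decoder
;      sustains only 3 such instructions per 2 cycles
; K8L: one 128-bit macro-op, so the decoder can sustain 3 per cycle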

Integer Instructions:

The decoded triplets of macro-ops arrive at the instruction control unit (ICU), which records information about them in the reorder buffer (ROB) and then passes them to the schedulers. The ROB keeps track of the status of the macro-ops and controls the order of their retirement. The macro-ops enter the queues and retire in groups of three (in "lines"), in the same order in which they arrived at the ICU, but they are picked up by the schedulers and dispatched to the execution units independently.
The macro-ops from each group are distributed among the scheduler's three independent queues of 8 entries each (24 macro-ops in total), assigned to the three symmetrical integer channels. The queue number corresponds to the macro-op's position in the group as formed at the decoder's output. As soon as its data is ready, the scheduler can dispatch one integer operation to the ALU and one address operation to the AGU from each queue, with at most two simultaneous memory accesses. Thus, each clock cycle up to 3 integer operations and 2 memory operations (64-bit reads and writes in any combination) can be dispatched. Integer operations are dispatched from the queues out of order as soon as their data is ready, but memory load operations are performed in program order, for example:


add ebx, ecx
mov eax, [ebx+10h]   ; quick address calculation
mov ecx, [eax+ebx]   ; the address depends on the result of the previous instruction
mov edx, [ebx+24h]   ; this instruction won't be executed until the addresses of all
                     ; the previous instructions are calculated


This is one of the limiting factors in the K8 processor. Because of it, the K8, although it can dispatch two read instructions per clock, may use memory less efficiently than the Conroe, which can dispatch only one read instruction per clock but has a mechanism for speculative out-of-order execution of reads past earlier reads and writes (called Memory Disambiguation). Fortunately, a mechanism for out-of-order loads will appear in the K8L, and this bottleneck will be eliminated. Details of this mechanism have not yet been disclosed, but the reordering of reads probably will not extend past writes, which may make some types of code execute less efficiently.
A group of three macro-ops is removed from the ROB after all the instructions in the group have been executed. Queuing and retiring macro-ops in groups simplifies resource tracking and helps load the schedulers more efficiently. If one of the three queues is full, new triplets of macro-ops cannot enter the scheduler, and empty slots may appear in the other queues. In practice, however, the percentage of empty slots is small and does not noticeably hurt CPU efficiency.
Besides that, the scheduler's efficiency can theoretically suffer from the static binding of a macro-op's position in the group to a scheduler queue: one queue may hold two or more macro-ops ready for execution while another holds none (Figure 3). In practice this is improbable and rarely observed, since there are usually enough execution-ready instructions in the pipeline.
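The kind of situation Memory Disambiguation helps with can be sketched as follows (hypothetical code):

mov [edi], eax           ; store whose address may not be resolved yet
mov ebx, [esi+8]         ; independent load: the Conroe may speculatively
                         ; execute it ahead of the store, while the K8 must
                         ; wait until the store address is known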




Fig-3


Unlike the K8, the Conroe has one common queue for all instructions, including floating-point ones; the queue is 32 macro-ops long. The common queue theoretically helps avoid empty slots and the limitations of static binding to the execution units. Besides that, the stack engine mechanism reduces the number of data dependencies between the instructions PUSH, POP, CALL and RET. In practice, however, it is very difficult to build a fully associative queue from which all 5 operations could be dispatched simultaneously, so the common queue is still divided into sections. This produces a reverse effect: the insufficiently ordered selection of execution-ready instructions leads to the so-called chaotic scheduling problem of P6+ family processors and reduces execution speed. Besides that, the Conroe limits the number of registers that can be read from the ROB simultaneously (no more than three), which constrains the scheduling of the instruction stream.
The out-of-order execution mechanisms of the K8L and the Conroe thus differ only slightly, since both processors can send up to 5 operations per clock cycle for execution (3 ALU + 2 memory). The peculiarities of their scheduling and execution algorithms may show up differently depending on the code generated by the compiler.
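Returning to the stack engine mentioned above, the dependencies it removes look like this (a sketch):

push eax                 ; esp = esp - 4
push ebx                 ; esp = esp - 4 again: depends on the new esp
call func                ; pushes the return address: depends on both
                         ; the stack engine tracks the esp offset in the
                         ; front-end, making these operations independent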

Floating Point Instructions:

In the K8 processor the floating-point scheduler is separate from the integer scheduler and is designed differently. Its buffer can hold up to 12 groups of three macro-ops (theoretically, 36 floating-point operations). Moreover, a queue consolidation mechanism merges two incomplete triplets into one complete and one empty triplet. Unlike the integer execution unit with its symmetrical computation channels, the FPU contains three different units, FADD, FMUL and FMISC (also called FSTORE), for floating-point addition, multiplication and auxiliary operations, so the scheduler buffer does not bind the position of a macro-op in an instruction group to a particular execution unit (Figure 4).



Fig-4


Each clock cycle one operation can be dispatched to each of the K8's 80-bit floating-point units. 128-bit SSE instructions are split into two 64-bit macro-ops at the decoding stage, and these macro-ops are then dispatched sequentially over two cycles. Theoretically, up to three macro-ops can be dispatched per cycle, but this rate is unachievable in practice due to decoding limitations, since besides floating-point instructions the code also contains auxiliary loads, loop-control instructions, etc. Moreover, the simple scheduling algorithm doesn't always distribute operations to the free units in the optimal order, which can reduce the dispatch rate through inefficient utilization of the execution units.
Thanks to its two 64-bit read buses, the K8 can receive up to two 64-bit operands per cycle from the L1 cache, which helps it keep up a high execution rate when floating-point instructions frequently access memory. This is an important feature of the architecture: four operands are needed to execute two instructions in parallel (two operands per instruction), and in many algorithms for processing streaming data, two of those four operands are read from memory.
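A step of a streaming kernel shows why. In the hypothetical fragment below, two of the four operands per pair of arithmetic instructions come from memory, which the two 64-bit read buses can deliver in a single cycle:

movsd xmm0, [esi]        ; first 64-bit read:  a[i]
mulsd xmm0, xmm2         ; a[i] * x            (x kept in a register)
addsd xmm1, [edi]        ; second 64-bit read: b[i], added to the sum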
In the K8L processor the FADD and FMUL units will be widened to 128 bits (Figure 5), which will double the theoretical floating-point performance of code using vector SSE instructions (not only due to the doubled dispatch rate, but also due to increased decoding and retirement rates thanks to the reduced number of generated macro-ops).



Fig-5

The buses for reading data from the cache will also become twice as wide, enabling the processor to perform two 128-bit loads from the L1 cache per cycle. The ability to perform two 128-bit reads per cycle can give the K8L an advantage in some algorithms over a Conroe-core processor, which is only capable of performing one 128-bit load per cycle.
According to the disclosed information, the FMISC (FSTORE) unit will remain 64 bits wide. This looks illogical: writes from the 128-bit SSE registers to memory are executed by the FMISC (FSTORE) unit, and if it is left as it is now, it will automatically become a bottleneck in streaming calculations, since the CPU would be unable to perform a 128-bit write each clock cycle. Hence the opinion that AMD's presentation contains an error and the FMISC (FSTORE) unit will in fact perform writes at the full rate of 128 bits per cycle. Or perhaps 128-bit stores will be handled by the other two units while the auxiliary-operations unit works at half rate, which would not be crucial for overall performance. The scheduling algorithm, which doesn't always work optimally in the K8, calls for improvement, too.
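For example, the inner step of a vector dot product needs two 128-bit loads per multiply-add pair; the K8L could keep its widened FMUL and FADD units busy every cycle, while a Conroe would need two cycles just for the loads (a sketch):

movapd xmm0, [esi]       ; first 128-bit load:  a[i], a[i+1]
mulpd  xmm0, [edi]       ; second 128-bit load: b[i], b[i+1], multiplied in
addpd  xmm1, xmm0        ; accumulate the partial sums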
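The worry is easy to picture with a hypothetical streaming store loop: with a 64-bit FSTORE unit each 128-bit store would occupy the unit for two cycles, halving the achievable write bandwidth:

store_loop:
    movapd [edi], xmm0   ; one 128-bit store per iteration: two passes through
                         ; a 64-bit FSTORE unit, one through a 128-bit one
    add    edi, 16
    dec    ecx
    jnz    store_loop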
As we said above, the Conroe, unlike the K8, uses a common queue for integer and floating-point instructions, with all the ensuing advantages and shortcomings. Besides that, its dispatch ports for integer and floating-point instructions are shared, which prohibits certain combinations of integer and floating-point instructions from issuing together. Another limitation of the Conroe, which can show up in floating-point algorithms that use x87 instructions (i.e. without SSE optimizations), is the halved dispatch rate of the multiplication instruction FMUL.
Besides the wider floating-point units, the K8L will also have wider integer units inside the FADD and FMUL blocks for processing integer SSE2 instructions. As a result, integer applications using these instruction sets will run faster. The K8L will also support a few additional SSE instructions, which we won't discuss here.
All in all, the FPU in the K8L is going to be very effective, and in some respects (such as the ability to read two 128-bit values per cycle) even more effective than the Conroe's.

