17. DISCUSSION OF SPECIAL INSTRUCTIONS
======================================

17.1 TEST
---------
The TEST instruction with an immediate operand is only pairable if the
destination is AL, AX, or EAX, and only if it is coded in a certain way.

TEST register,register and TEST register,memory are always pairable.

TEST EAX,immediate can be coded in three ways:
a.  Two bytes instruction code + 4 bytes data: not pairable
b.  Two bytes instruction code + 1 byte sign extended data: not pairable
c.  One byte instruction code + 4 bytes data: pairable

The assembler will always choose the shortest form of an instruction. An
immediate constant between -128 and +127 can be written as a sign extended
byte, which will cause the assembler to pick form b, which is not pairable.
To make it pairable you have to hard code form c, e.g.:
    DB 0A9H / DD const
or change it to TEST AL,const if possible. If the constant is not between
-128 and +127 or if the destination is AL, then the shortest form is also
the pairable form.

Examples:
    TEST ECX,ECX                   ; pairable
    TEST [mem],EBX                 ; pairable
    TEST EDX,256                   ; not pairable
    TEST DWORD PTR [EBX],8000H     ; not pairable
To make it pairable, use any of the following methods:
    MOV EAX,[EBX] / TEST EAX,8000H
    MOV EDX,[EBX] / AND EDX,8000H
    MOV AL,[EBX+1] / TEST AL,80H
    MOV AL,[EBX+1] / TEST AL,AL    ; (result in sign flag)
It is also possible to test a bit by shifting it into the carry flag:
    MOV EAX,[EBX] / SHR EAX,16     ; (result in carry flag)
but this method has a penalty on the PentiumPro when the shift count is
more than one.

(The reason for this non-pairability is probably that the first byte of the
2-byte instruction is the same as for some other non-pairable instructions,
and the Pentium cannot afford to check the second byte too when determining
pairability.)

17.2 WAIT
---------
You can often increase speed by omitting the WAIT instruction.
The WAIT instruction has three functions:

a. The 8087 processor requires a WAIT before _every_ floating point
   instruction.

b. WAIT is used to coordinate memory access between the floating point
   unit and the integer unit. Examples:

   b.1.  FISTP [mem32]
         WAIT                ; wait for f.p. unit to write before..
         MOV EAX,[mem32]     ; reading the result with the integer unit

   b.2.  FILD [mem32]
         WAIT                ; wait for f.p. unit to read value before..
         MOV [mem32],EAX     ; overwriting it with integer unit

   b.3.  FLD DWORD PTR [ESP]
         WAIT                ; prevent an accidental hardware interrupt from..
         ADD ESP,4           ; overwriting value on stack before it is read

c. WAIT is sometimes used to check for exceptions. It will generate an
   interrupt if there is an unmasked exception bit in the f.p. status
   word set by a preceding floating point instruction.

Regarding a:
The function in point a is never needed on any other processors than the old
8087. Unless you want your code to be compatible with the 8087 you should
tell your assembler to not put in these WAITs by specifying a higher
processor.

Regarding b:
WAIT instructions to coordinate memory access are definitely needed on the
8087 and 80287. A superscalar processor like the Pentium has special
circuitry to detect memory conflicts so you wouldn't need the WAIT for this
purpose on code that only runs on a Pentium or higher. I have made some tests
on other Intel processors and have not been able to provoke any error by
omitting the WAIT on any 32 bit Intel processor, although Intel manuals say
that the WAIT is needed for this purpose except after FNSTSW and FNSTCW. If
you want to be certain that your code will work on any 32 bit processor
(including non-Intel processors) then I would recommend that you include the
WAIT here in order to be safe.

Regarding c:
The assembler automatically inserts a WAIT for this purpose before the
following instructions:  FCLEX, FINIT, FSAVE, FSTCW, FSTENV, FSTSW
You can omit the WAIT by writing FNCLEX, etc.
My tests show that the WAIT is unnecessary in most cases because these
instructions without WAIT will still generate an interrupt on exceptions,
except for FNCLEX and FNINIT on the 80387. (There is some inconsistency about
whether the IRET from the interrupt points to the FN.. instruction or to the
next instruction).

Almost all other floating point instructions will also generate an interrupt
if a previous floating point instruction has set an unmasked exception bit,
so the exception is likely to be detected sooner or later anyway.

You may still need the WAIT if you want to know exactly where an exception
occurred in order to recover from the situation. Consider, for example, the
code under b.3 above: If you want to be able to recover from an exception
generated by the FLD here, then you need the WAIT because an interrupt after
ADD ESP,4 would overwrite the value to load.

17.3 FCOM + FSTSW AX
--------------------
The usual way of doing floating point comparisons is:
    FLD [a]
    FCOMP [b]
    FSTSW AX
    SAHF
    JB ASmallerThanB

You may improve this code by using FNSTSW AX rather than FSTSW AX and
test AH directly rather than using the non-pairable SAHF.
(TASM version 3.0 has a bug with the FNSTSW AX instruction)

    FLD [a]
    FCOMP [b]
    FNSTSW AX
    SHR AH,1
    JC ASmallerThanB

Testing for zero or equality:

    FTST
    FNSTSW AX
    AND AH,40H
    JNZ IsZero     ; (the zero flag is inverted!)

Test if greater:

    FLD [a]
    FCOMP [b]
    FNSTSW AX
    AND AH,41H
    JZ AGreaterThanB

Do not use TEST AH,41H as it is not pairable. Do not use TEST EAX,4100H as it
would produce a partial register stall on the PentiumPro. Do not test the
flags after multibit shifts, as this has a penalty on the PentiumPro.

It is often faster to use integer instructions for comparing floating point
values, as described in paragraph 18 below.

17.4 LEA
--------
The LEA instruction is useful for many purposes because it can do a shift,
two additions, and a move in just one pairable instruction taking one clock
cycle. Example:
    LEA EAX,[EBX+8*ECX-1000]
is much faster than
    MOV EAX,ECX / SHL EAX,3 / ADD EAX,EBX / SUB EAX,1000
The LEA instruction can also be used to do an add or shift without changing
the flags. The source and destination need not have the same word size, so
LEA EAX,[BX] is a useful replacement for MOVZX EAX,BX.

You must be aware, however, that the LEA instruction will suffer an AGI stall
if it uses a base or index register which has been changed in the preceding
clock cycle.

Since the LEA instruction is pairable in the V-pipe and shift instructions
are not, you may use LEA as a substitute for a SHL by 1, 2, or 3 if you want
the instruction to execute in the V-pipe.

The 32 bit processors have no documented addressing mode with a scaled index
register and nothing else, so an instruction like LEA EAX,[EAX*2] is actually
coded as LEA EAX,[EAX*2+00000000] with an immediate displacement of 4 bytes.
You may reduce the instruction size by instead writing LEA EAX,[EAX+EAX] or
even better ADD EAX,EAX. The latter code cannot have an AGI delay. If you
happen to have a register which is zero (like a loop counter after a loop),
then you may use it as a base register to reduce the code size:
    LEA EAX,[EBX*4]         ; 7 bytes
    LEA EAX,[ECX+EBX*4]     ; 3 bytes

17.5 integer multiplication
---------------------------
An integer multiplication takes approximately 9 clock cycles. It is therefore
advantageous to replace a multiplication by a constant with a combination of
other instructions such as SHL, ADD, SUB, and LEA. Example:
    IMUL EAX,10
can be replaced with
    MOV EBX,EAX / ADD EAX,EAX / SHL EBX,3 / ADD EAX,EBX
or
    LEA EAX,[EAX+4*EAX] / ADD EAX,EAX

Floating point multiplication is faster than integer multiplication on a
Pentium without MMX, but the time used to convert integers to float and
convert the product back again is usually more than the time saved by using
floating point multiplication, except when the number of conversions is low
compared with the number of multiplications.

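The two replacement sequences for IMUL EAX,10 can be checked for arithmetic
equivalence with a small model. The following is only a sketch in Python, not
part of the original text: the function names and the MASK32 constant are
made up for illustration, and a 32 bit register is modelled by masking every
result to 32 bits. It says nothing about timing, only that the results agree.

```python
MASK32 = 0xFFFFFFFF  # assumption: model a 32 bit register by masking


def mul10_shift_add(eax):
    """Model of: MOV EBX,EAX / ADD EAX,EAX / SHL EBX,3 / ADD EAX,EBX"""
    ebx = eax                        # MOV EBX,EAX
    eax = (eax + eax) & MASK32       # ADD EAX,EAX   -> 2*x
    ebx = (ebx << 3) & MASK32        # SHL EBX,3     -> 8*x
    return (eax + ebx) & MASK32      # ADD EAX,EBX   -> 10*x


def mul10_lea(eax):
    """Model of: LEA EAX,[EAX+4*EAX] / ADD EAX,EAX"""
    eax = (eax + 4 * eax) & MASK32   # LEA           -> 5*x
    return (eax + eax) & MASK32      # ADD EAX,EAX   -> 10*x


if __name__ == "__main__":
    # Compare both sequences with a plain multiplication, including
    # values where the 32 bit result wraps around.
    for x in (0, 1, 7, 12345, 0x19999999, 0xFFFFFFFF):
        expected = (x * 10) & MASK32
        assert mul10_shift_add(x) == expected
        assert mul10_lea(x) == expected
    print("both sequences agree with x*10 on the sample values")
```

The same kind of model can be used to check other constant multipliers before
committing a shift/add decomposition to assembly code.
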
17.6 division
-------------
Division is quite time consuming. The DIV instruction takes 17, 25, or 41
clock cycles for byte, word, and dword divisors respectively. The IDIV
instruction takes 5 clock cycles more. It is therefore preferable to use the
smallest operand size possible that won't generate an overflow, even if it
costs an operand size prefix, and use unsigned division if possible.

Unsigned division by a power of two can be done with SHR. Division of a
signed number by a power of two can be done with SAR, but the result with SAR
is rounded towards minus infinity, whereas the result with IDIV is truncated
towards zero.

Floating point division takes 39 clock cycles. It is possible to do a
floating point division and an integer division in parallel to save time.
Example: A = A1 / A2; B = B1 / B2

    FILD [B1]
    FILD [B2]
    MOV EAX,[A1]
    MOV EBX,[A2]
    CDQ
    FDIV
    IDIV EBX       ; signed divide, to match the sign extension done by CDQ
    FISTP [B]
    MOV [A],EAX
(make sure you set the floating point unit to the desired rounding method)

Obviously, you should always try to minimize the number of divisions.
For example:
    if (A/B > C)...
can be rewritten as
    if (A > B*C)...
when B is positive, and the opposite when B is negative.

    A/B + C/D
can be rewritten as
    (A*D + C*B) / (B*D)

If you are using integer division, then you should be aware that the rounding
errors may be different when you rewrite the formulas.

17.7 string instructions
------------------------
String instructions without a repeat prefix are too slow, and should always
be replaced by simpler instructions. The same applies to LOOP and JECXZ.

String instructions with repeat may be optimal. Always use the dword version
if possible, and make sure that both source and destination are aligned by 4.

REP MOVSD is the fastest way to move blocks of data when the destination is
in the cache. See section 19 for an alternative.

REP STOSD is optimal when the destination is in the cache.

REP LODS, REP SCAS, and REP CMPS are not optimal, and may be replaced by
loops.
See section 16 example 10 for an alternative to REP SCASB.

17.8 XCHG
---------
The XCHG register,memory instruction is dangerous. By default this
instruction has an implicit LOCK prefix which prevents it from using the
cache. The instruction is therefore very time consuming, and should always be
avoided.

17.9 rotates through carry
--------------------------
RCR and RCL with a count different from one are slow and should be avoided.

17.10 bit scan
--------------
BSF and BSR are the poorest optimized instructions on the Pentium, taking
11 + 2*n clock cycles, where n is the number of zeros skipped.
(on later processors it takes only 1)

The following code emulates BSF ECX,EAX:
        TEST    EAX,EAX
        JZ      SHORT BS6
        PUSH    EAX
        XOR     ECX,ECX
        TEST    EAX,0FFFFH      ; (only pairable if register is EAX)
        JNZ     SHORT BS1
        SHR     EAX,16
        ADD     ECX,16
BS1:    TEST    AL,AL
        JNZ     SHORT BS2
        MOV     AL,AH
        ADD     ECX,8
BS2:    TEST    AL,0FH
        JNZ     SHORT BS3
        SHR     AL,4
        ADD     ECX,4
BS3:    TEST    AL,3
        JNZ     SHORT BS4
        SHR     AL,2
        ADD     ECX,2
BS4:    TEST    AL,1
        JNZ     SHORT BS5
        INC     ECX
BS5:    POP     EAX
BS6:

The following code emulates BSR ECX,EAX:
        TEST    EAX,EAX
        JZ      SHORT BS7
        MOV     DWORD PTR [TEMP],EAX
        MOV     DWORD PTR [TEMP+4],0
        FILD    QWORD PTR [TEMP]
        FSTP    QWORD PTR [TEMP]
        WAIT    ; WAIT only needed for compatibility with earlier processors
        MOV     ECX, DWORD PTR [TEMP+4]
        SHR     ECX,20
        SUB     ECX,3FFH
        TEST    EAX,EAX         ; clear zero flag
BS7:

17.11 bit test
--------------
BT, BTC, BTR, and BTS instructions should preferably be replaced by
instructions like TEST, AND, OR, XOR, or shifts.

17.12 FPTAN
-----------
According to the manuals, FPTAN returns two values X and Y and leaves it to
the programmer to divide Y by X to get the result, but in fact it always
returns 1 in X so you can save the division. My tests show that on all 32 bit
Intel processors with floating point unit or coprocessor, FPTAN always
returns 1 in X regardless of the argument. If you want to be sure that your
code will run correctly on all processors, then you may test if X is 1, which
is faster than dividing by X.
The Y value may be very high, but never infinity, so you don't have to test
if Y contains a valid value.


18. USING INTEGER INSTRUCTIONS TO DO FLOATING POINT OPERATIONS
==============================================================
Integer instructions are generally faster than floating point instructions,
so it is often advantageous to use integer instructions for doing simple
floating point operations. The most obvious example is moving data. Example:
    FLD QWORD PTR [ESI] / FSTP QWORD PTR [EDI]
Change to:
    MOV EAX,[ESI] / MOV EBX,[ESI+4] / MOV [EDI],EAX / MOV [EDI+4],EBX
The former code takes 4 clocks, the latter takes 2.

Testing if a floating point value is zero:
The floating point value of zero is usually represented as 32 or 64 bits of
zero, but there is a pitfall here: The sign bit may be set! Minus zero is
regarded as a valid floating point number, and the processor may actually
generate a zero with the sign bit set if for example multiplying a negative
number with zero. So if you want to test if a floating point number is zero,
you should not test the sign bit. Example:
    FLD DWORD PTR [EBX] / FTST / FNSTSW AX / AND AH,40H / JNZ IsZero
Use integer instructions instead, and shift out the sign bit:
    MOV EAX,[EBX] / ADD EAX,EAX / JZ IsZero
The former code takes 9 clocks, the latter takes only 2.
If the floating point number is double precision (QWORD) then you only have
to test bits 32-62. If they are zero, then the lower half will also be zero
if it is a valid floating point number.

Testing if negative:
A floating point number is negative if the sign bit is set and at least one
other bit is set. Example:
    MOV EAX,[NumberToTest] / CMP EAX,80000000H / JA IsNegative

Manipulating the sign bit:
You can change the sign of a floating point number simply by flipping the
sign bit. Example:
    XOR BYTE PTR [a] + (TYPE a) - 1, 80H
Likewise you may get the absolute value of a floating point number by simply
ANDing out the sign bit.

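The integer tests above (zero, negative, sign flip, absolute value) can be
tried out on real bit patterns. The sketch below is in Python and is not part
of the original text; the helper names are made up for illustration, and it
assumes single precision values in the usual IEEE 754 layout, obtained with
the standard struct module.

```python
import struct


def float32_bits(x):
    """Return the 32 bit pattern of x stored as a single precision float."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def is_zero(bits):
    # MOV EAX,[EBX] / ADD EAX,EAX / JZ IsZero
    # Doubling shifts the sign bit out, so +0 and -0 both give zero.
    return ((bits + bits) & 0xFFFFFFFF) == 0


def is_negative(bits):
    # MOV EAX,[NumberToTest] / CMP EAX,80000000H / JA IsNegative
    # Unsigned "above": sign bit set and at least one other bit set.
    return bits > 0x80000000


def flip_sign(bits):
    # XOR the top bit to change the sign.
    return bits ^ 0x80000000


def absolute(bits):
    # AND out the sign bit.
    return bits & 0x7FFFFFFF


if __name__ == "__main__":
    for value in (0.0, -0.0, 1.5, -1.5):
        b = float32_bits(value)
        print(repr(value), hex(b), is_zero(b), is_negative(b))
```

Note that minus zero has the pattern 80000000H, so it is reported as zero and
not as negative, which is the behaviour the text asks for.
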
Comparing numbers:
Floating point numbers are stored in a unique format which allows you to use
integer instructions for comparing floating point numbers, except for the
sign bit. If you are certain that two floating point numbers both are
positive then you may simply compare them as integers. Example:
    FLD [a] / FCOMP [b] / FNSTSW AX / AND AH,1 / JNZ ASmallerThanB
Change to:
    MOV EAX,[a] / MOV EBX,[b] / CMP EAX,EBX / JB ASmallerThanB
This method only works if the two numbers have the same precision and you are
certain that none of the numbers have the sign bit set.

If one or both numbers may be negative or minus zero, then you have to take
all combinations into account which makes the code so complicated that you
probably would prefer to do a floating point compare.


19. USING FLOATING POINT INSTRUCTIONS TO DO INTEGER OPERATIONS
==============================================================

19.1 Moving data
----------------
Floating point instructions can be used to move 8 bytes at a time:
    FILD QWORD PTR [ESI] / FISTP QWORD PTR [EDI]
This is only an advantage if the destination is not in the cache. The optimal
way to move a block of data to uncached memory on the Pentium is:

TopOfLoop:
    FILD QWORD PTR [ESI]
    FILD QWORD PTR [ESI+8]
    FXCH
    FISTP QWORD PTR [EDI]
    FISTP QWORD PTR [EDI+8]
    ADD ESI,16
    ADD EDI,16
    DEC ECX
    JNZ TopOfLoop

The source and destination should of course be aligned by 8. The extra time
used by the slow FILD and FISTP instructions is compensated for by the fact
that you only have to do half as many write operations. Note that this method
is only advantageous on the Pentium and only if the destination is not in the
cache. On all other processors the optimal way to move blocks of data is
REP MOVSD, or if you have a processor with MMX you may use the MMX
instructions instead to write 8 bytes at a time.

19.2 Integer multiplication
---------------------------
Floating point multiplication is faster than integer multiplication on the
Pentium without MMX, but the price for converting integer factors to float
and converting the result back to integer is high, so floating point
multiplication is only advantageous if the number of conversions needed is
low compared to the number of multiplications. Integer multiplication is
faster than floating point on other processors.

19.3 Integer division
---------------------
Floating point division is not faster than integer division, but you can do
other integer operations (including integer division, but not integer
multiplication) while the floating point unit is working on the division.
See paragraph 17.6 above for an example.

19.4 Converting binary to decimal numbers
-----------------------------------------
The FBSTP instruction converts a binary number to decimal faster than using
repeated division if you have more than a few digits.


20. LIST OF INTEGER INSTRUCTIONS
================================

Explanations:
Operands: r=register, m=memory, i=immediate data, sr=segment register
          m32= 32 bit memory operand, etc.

Clock cycles: The numbers are minimum values. Cache misses, misalignment,
              and exceptions may increase the clock counts considerably.

Pairability: u=pairable in U-pipe, v=pairable in V-pipe, uv=pairable in
             either pipe, np=not pairable

Opcode                 Operands             Clock cycles      Pairability
----------------------------------------------------------------------------
NOP                                         1                 uv
MOV                    r/m, r/m/i           1                 uv
MOV                    r/m, sr              1                 np
MOV                    sr, r/m              >= 2 b)           np
XCHG                   (E)AX, r             2                 np
XCHG                   r, r                 3                 np
XCHG                   r, m                 >20               np
XLAT                                        4                 np
PUSH                   r/i                  1                 uv
POP                    r                    1                 uv
PUSH                   m                    2                 np
POP                    m                    3                 np
PUSH                   sr                   1 b)              np
POP                    sr                   >= 3 b)           np
PUSHF                                       4                 np
POPF                                        6                 np
PUSHA POPA                                  5                 np
LAHF SAHF                                   2                 np
MOVSX MOVZX            r, r/m               3 a)              np
LEA                    r/m                  1                 uv
LDS LES LFS LGS LSS    m                    4 c)              np
ADD SUB AND OR XOR     r, r/i               1                 uv
ADD SUB AND OR XOR     r, m                 2                 uv
ADD SUB AND OR XOR     m, r/i               3                 uv
CMP                    r, r/i               1                 uv
CMP                    m, r/i               2                 uv
TEST                   r, r                 1                 uv
TEST                   m, r                 2                 uv
TEST                   r, i                 1                 f)
TEST                   m, i                 2                 np
ADC SBB                r/m, r/m/i           1/3               u
INC DEC                r                    1                 uv
INC DEC                m                    3                 uv
NEG NOT                r/m                  1/3               np
MUL IMUL               r8/r16/m8/m16        11                np
MUL IMUL               all other versions   9 d)              np
DIV                    r8/r16/r32           17/25/41          np
IDIV                   r8/r16/r32           22/30/46          np
CBW CWDE                                    3                 np
CWD CDQ                                     2                 np
SHR SHL SAR SAL        r, i                 1                 u
SHR SHL SAR SAL        m, i                 3                 u
SHR SHL SAR SAL        r/m, CL              4/5               np
ROR ROL RCR RCL        r/m, 1               1/3               u
ROR ROL                r/m, i(><1)          1/3               np
ROR ROL                r/m, CL              4/5               np
RCR RCL                r/m, i(><1)          8/10              np
RCR RCL                r/m, CL              7/9               np
SHLD SHRD              r, i/CL              4 a)              np
SHLD SHRD              m, i/CL              5 a)              np
BT                     r, r/i               4 a)              np
BT                     m, i                 4 a)              np
BT                     m, r                 9 a)              np
BTR BTS BTC            r, r/i               7 a)              np
BTR BTS BTC            m, i                 8 a)              np
BTR BTS BTC            m, r                 14 a)             np
BSF BSR                r, r/m               7-73 a)           np
SETcc                  r/m                  1 a)              np
JMP CALL               short/near           1                 v
JMP CALL               far                  >= 3              np
conditional jump       short/near           1/4/5 e)          v
CALL JMP               r/m                  2                 np
RETN                                        2                 np
RETN                   i                    3                 np
RETF                                        4                 np
RETF                   i                    5                 np
J(E)CXZ                short                5-10              np
LOOP                   short                5-10              np
BOUND                  r, m                 8                 np
CLC STC CMC CLD STD                         2                 np
CLI STI                                     6-7               np
LODS                                        2                 np
REP LODS                                    7+3*n g)          np
STOS                                        3                 np
REP STOS                                    10+n g)           np
MOVS                                        4                 np
REP MOVSB                                   12+1.8*n g)       np
REP MOVSW                                   12+1.5*n g)       np
REP MOVSD                                   12+n g)           np
SCAS                                        4                 np
REP(N)E SCAS                                9+4*n g)          np
CMPS                                        5                 np
REP(N)E CMPS                                8+5*n g)          np
BSWAP                                       1 a)              np
----------------------------------------------------------------------------

Notes:
a) this instruction has a 0FH prefix which takes one clock cycle extra to
   decode on a Pentium without MMX unless preceded by a multicycle
   instruction (see section 13 above).
b) versions with FS and GS have a 0FH prefix. see note a.
c) versions with SS, FS, and GS have a 0FH prefix. see note a.
d) versions with two operands and no immediate have a 0FH prefix, see note a.
e) see section 12 above
f) only certain versions are pairable. see paragraph 17.1 above
g) add one clock cycle for decoding the repeat prefix unless preceded by a
   multicycle instruction (such as CLD. see section 13 above).


21. LIST OF FLOATING POINT INSTRUCTIONS
=======================================

Explanations:
Operands: r=register, m=memory, m32=32 bit memory operand, etc.

Clock cycles: The numbers are minimum values. Cache misses, misalignment,
              and exceptions may increase the clock counts considerably.

Pairability: +=pairable with FXCH, np=not pairable

i-ov:  Overlap with integer instructions. i-ov = 4 means that the last four
       clock cycles can overlap with subsequent integer instructions.

fp-ov: Overlap with floating point instructions. fp-ov = 2 means that the
       last two clock cycles can overlap with subsequent floating point
       instructions. (WAIT is considered a floating point instruction here)

Opcode              Operand      Clock cycles   Pairability   i-ov     fp-ov
-----------------------------------------------------------------------------
FLD                 r/m32/m64    1              +             0        0
FLD                 m80          3              np            0        0
FBLD                m80          49             np            0        0
FST(P)              r            1              np            0        0
FST(P)              m32/m64      2 h)           np            0        0
FST(P)              m80          3 h)           np            0        0
FBSTP               m80          153            np            0        0
FILD                m            3              np            2        2
FIST(P)             m            6              np            0        0
FLDZ FLD1                        2              np            0        0
FLDPI FLDL2E etc.                5              np            0        0
FNSTSW              AX/m16       6              np            0        0
FLDCW               m16          8              np            0        0
FNSTCW              m16          2              np            0        0
FADD(P)             r/m          3              +             2        2
FSUB(R)(P)          r/m          3              +             2        2
FMUL(P)             r/m          3              +             2        2 i)
FDIV(R)(P)          r/m          39             + j)          38 k)    2
FCHS FABS                        1              +             0        0
FCOM(P)(P) FUCOM    r/m          1              +             0        0
FIADD FISUB(R)      m            6              np            2        2
FIMUL               m            6              np            2        2
FIDIV(R)            m            42             np            38 k)    2
FICOM               m            4              np            0        0
FTST                             1              np            0        0
FXAM                             17             np            4        0
FPREM                            18-33          np            2        2
FPREM1                           20-49          np            2        2
FRNDINT                          19             np            0        0
FSCALE                           32             np            5        0
FXTRACT                          12-66          np            0        0
FSQRT                            70             np            69 k)    2
FSIN FCOS FSINCOS                varies         np            2        2
F2XM1 FYL2X FYL2XP1              varies         np            2        2
FPATAN                           varies         np            2        2
FPTAN                            varies         np            36 k)    0
FNOP                             2              np            0        0
FXCH                r            1              np            0        0
FINCSTP FDECSTP                  2              np            0        0
FFREE               r            2              np            0        0
FNCLEX                           6-9            np            0        0
FNINIT                           22             np            0        0
FNSAVE              m            ca.300         np            0        0
FRSTOR              m            73             np            0        0
WAIT                             1              np            0        0
-----------------------------------------------------------------------------

Notes:
h) The value to store is needed one clock cycle in advance.
i) 1 if the overlapping instruction is also a FMUL.
j) If the FXCH is followed by an integer instruction then it will still
   pair, but take an extra clock cycle so that the integer instruction will
   begin in clock cycle 3.
k) Cannot overlap integer multiplication instructions.


22. TESTING SPEED
=================
The Pentium has an internal 64 bit clock counter which can be read into
EDX:EAX using the instruction RDTSC (read time stamp counter). This is very
useful for testing exactly how many clock cycles a piece of code takes.

The program below makes such a measurement. It executes the code to test
10 times and stores the 10 clock counts. The program can be used in both
16 and 32 bit mode.

RDTSC   MACRO                   ; define RDTSC instruction
        DB      0FH,31H
ENDM

ITER    EQU     10              ; number of iterations

.DATA                           ; data segment
ALIGN   4
COUNTER DD      0               ; loop counter
TICS    DD      0               ; temporary storage of clock
RESULTLIST  DD  ITER DUP (0)    ; list of test results

.CODE                           ; code segment
BEGIN:  MOV     [COUNTER],0     ; reset loop counter
TESTLOOP:                       ; test loop
;****************   Do any initializations here:    ************************
        FINIT
;****************   End of initializations          ************************
        RDTSC                   ; read clock counter
        MOV     [TICS],EAX      ; save count
        CLD                     ; non-pairable filler
REPT    8
        NOP                     ; eight NOP's to avoid shadowing effect
ENDM

;****************   Put instructions to test here:  ************************
        FLDPI                   ; this is only an example
        FSQRT
        RCR     EBX,10
        FSTP    ST
;********************* End of instructions to test  ************************

        CLC                     ; non-pairable filler with shadow
        RDTSC                   ; read counter again
        SUB     EAX,[TICS]      ; compute difference
        SUB     EAX,15          ; subtract the clock cycles used by fillers
        MOV     EDX,[COUNTER]   ; loop counter
        MOV     [RESULTLIST][EDX],EAX   ; store result in table
        ADD     EDX,TYPE RESULTLIST     ; increment counter
        MOV     [COUNTER],EDX           ; store counter
        CMP     EDX,ITER * (TYPE RESULTLIST)
        JB      TESTLOOP                ; repeat ITER times

; insert here code to read out the values in RESULTLIST

The 'filler' instructions before and after the piece of code to test are
critical. The CLD is a non-pairable instruction which has been inserted to
make sure the pairing is the same the first time as the subsequent times. The
eight NOP instructions are inserted to prevent any prefixes in the code to
test from being decoded in the shadow of the preceding instructions. Single
byte instructions are used here to obtain the same pairing the first time as
the subsequent times. The CLC after the code to test is a non-pairable
instruction which has a shadow under which the 0FH prefix of the RDTSC can be
decoded so that it is independent of any shadowing effect from the code to
test.

The RDTSC instruction cannot execute in virtual mode, so if you are running
under DOS you must skip the EMM386 (or any other memory manager) in your
CONFIG.SYS and not run under a DOS box in Windows.

The Pentium processor has special performance monitor counters which can
count events such as cache misses, misalignments, AGI stalls, etc. Details
about how to use the performance monitor counters are not covered by this
manual and must be sought elsewhere.


23. CONSIDERATIONS FOR OTHER MICROPROCESSORS
============================================
Most of the optimizations described in this document have little or no
negative effects on other microprocessors, including non-Intel processors,
but there are some problems to be aware of.

Using a full register after writing to part of the register will cause a
moderate delay on the 80486 and a severe delay on the PentiumPro. Example:
    MOV AL,[EBX] / MOV ECX,EAX
On the PentiumPro you may avoid this penalty by zeroing the full register
first:
    XOR EAX,EAX / MOV AL,[EBX] / MOV ECX,EAX
or by using MOVZX.

Scheduling floating point code for the Pentium often requires a lot of extra
FXCH instructions. This will slow down execution on earlier microprocessors,
but not on the PentiumPro and advanced non-Intel processors.

As mentioned in the introduction, Intel has announced new MMX versions of the
Pentium and PentiumPro chips with special instructions for integer vector
operations. These instructions will be very useful for massively parallel
integer calculations.

The PentiumPro chip is faster than the Pentium in some respects, but inferior
in other respects. Knowing the strong and weak sides of the PentiumPro can
help you make your code work well on both processors.

The most important advantage of the PentiumPro is that it does much of the
optimization for you: reordering instructions and splitting complex
instructions into simple ones. But for perfectly optimized code there is less
difference between the two processors. The two processors have basically the
same number of execution units, so the throughput should be nearly the same.
The PPro has separate units for memory read and write so that it can do three
operations simultaneously if one of them is a memory read, but on the other
hand it cannot do two memory reads or two writes simultaneously as the
Pentium can.

The PPro is better than the Pentium in the following respects:

- out of order execution

- one cache miss does not delay subsequent independent instructions

- splitting complex instructions into smaller micro-ops

- automatic register renaming to avoid unnecessary dependencies

- better jump prediction algorithm than Pentium without MMX

- many instructions which are unpairable and poorly optimized on the
  Pentium perform better on the PPro, for example integer multiplication,
  movzx, cdq, bit scan, bit test, shifts by cl, and floating point store

- floating point instructions and simple integer instructions can
  execute simultaneously

- memory reads and writes do not occupy the ALU's

- indirect memory read instructions have no AGI stall

- new conditional move instructions can be used instead of branches in
  some cases

- new FCOMI instruction eliminates the need for the slow FNSTSW AX

- higher maximum clock frequency

The PPro is inferior to the Pentium in the following respects:

- mispredicted jumps are very expensive (10-15 clock cycles!)

- poor performance on 16 bit code and segmented models

- prefixes are expensive (except 0F extended opcode)

- long stall when mixing 8, 16, and 32 bit registers

- fadd, fsub, fmul, fchs have longer latency

- cannot do two memory reads or two memory writes simultaneously

- some instruction combinations cannot execute in parallel, like
  push+push, push+call, compare+conditional jump

As a consequence of this, the PentiumPro may actually be slower than the
Pentium on perfectly optimized code with a lot of unpredictable branches, and
a lot of floating point code with little or no natural parallelism.

Most of the drawbacks of each processor can be circumvented by careful
optimization and running 32 bit flat mode. But the problem with mispredicted
jumps on the PPro cannot be avoided except in the cases where you can use a
conditional move instead.

Taking advantage of the new instructions in the MMX and PentiumPro processors
will create problems if you want your code to be compatible with earlier
microprocessors. The solution may be to write several versions of your code,
each optimized for a particular processor. Your program should automatically
detect which processor it is running on and select the appropriate version of
code. Such a complicated approach is of course only needed for the most
critical parts of your program.
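
The idea of keeping several versions of a critical routine and selecting one
at startup can be sketched as follows. This is only an illustration in Python
and not part of the original text: all function names and the feature strings
"mmx" and "cmov" are hypothetical, and the actual detection of the processor
(for example with the CPUID instruction) is deliberately left out and
represented by a set of feature names passed in by the caller.

```python
def copy_generic(src):
    """Baseline version, assumed to run on every processor."""
    return list(src)


def copy_mmx(src):
    """Hypothetical stand-in for a version written with MMX instructions."""
    return list(src)


def copy_ppro(src):
    """Hypothetical stand-in for a version using PentiumPro-only
    instructions such as conditional moves."""
    return list(src)


def select_version(features):
    """Return the most specific version whose required feature is present.

    `features` is a set of strings describing the processor. How this set
    is obtained is an assumption left outside the sketch.
    """
    if "cmov" in features:
        return copy_ppro
    if "mmx" in features:
        return copy_mmx
    return copy_generic


if __name__ == "__main__":
    detected = {"mmx"}                      # pretend result of detection
    block_copy = select_version(detected)   # chosen once, at startup
    print(block_copy.__name__, block_copy([1, 2, 3]))
```

The selection is done once, so the critical code pays no detection cost each
time it is called. Every version must produce the same result, which makes
the generic version a convenient reference when testing the others.
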