AMD x86 user manual / service manual
AMD Athlon™ Processor x86 Code Optimization Guide
Trademarks: AMD, the AMD logo, AMD Athlon, K6, 3DNow!, and combinations thereof, K86, and Super7 are trademarks, and AMD-K6 is a registered trademark of Advanced Micro Devices, Inc. Microsoft, Windows, and Windows NT are registered trademarks of Microsoft Corporation.
22007E/0 - November 1999, AMD Athlon™ Processor x86 Code Optimization

Contents
Revision History . . . xv
1 Introduction . . . 1
About this Document . . .

Switch Statement Usage . . . 21
Optimize Switch Statements . . .

Use 8-Bit Sign-Extended Displacements . . . 39
Code Padding Using Neutral Code Fillers . . .

7 Scheduling Optimizations . . . 67
Schedule Instructions According to their Latency . . . 67
Unrolling Loops . . .

Signed Derivation for Algorithm, Multiplier, and Shift Factor . . . 95
9 Floating-Point Optimizations . . . 97
Ensure All FPU Data is Aligned .

Fast Conversion of Signed Words to Floating-Point . . . 113
Use MMX PXOR to Negate 3DNow! Data . . . 113
Use MMX PCMP Instead of 3DNow! PFCMP .

Integer Execution Unit . . . 135
Floating-Point Scheduler . . .

PerfCtr[3:0] MSRs (MSR Addresses C001_0004h – C001_0007h) . . . 167
Starting and Stopping the Performance-Monitoring Counters . . .

List of Figures
Figure 1. AMD Athlon™ Processor Block Diagram . . . 131
Figure 2. Integer Execution Pipeline . . .

List of Tables
Table 1. Latency of Repeated String Instructions . . . 84
Table 2. Integer Pipeline Operation Types . . .

Table 29. VectorPath Integer Instructions . . . 231
Table 30. VectorPath MMX Instructions . . .
Revision History
Date: Nov. 1999. Rev: E. Description: Added “About this Document” on page 1. Further clarification of “Consider the Sign of Integer Operands” on page 14.
1 Introduction
The AMD Athlon™ processor is the newest microprocessor in the AMD K86™ family of microprocessors.
previous-generation processors and describes how those optimizations are applicable to the AMD Athlon processor. This guide contains the following chapters: Chapter 1: Introduction.
Appendix B: Pipeline and Execution Unit Resources Overview. Describes in detail the execution units and their relation to the instruction pipeline.
AMD Athlon™ Processor Microarchitecture Summary
The AMD Athlon processor brings superscalar performance and high operating frequency to PC systems running industry-standard x86 software.
AMD Athlon execution core to achieve and sustain maximum performance.
The coding techniques for achieving peak performance on the AMD Athlon processor include, but are not limited to, those for the AMD-K6, AMD-K6-2, Pentium®, Pentium Pro, and Pentium II processors.
2 Top Optimizations
This chapter contains concise descriptions of the best optimizations for improving the performance of the AMD Athlon™ processor.
■ Avoid Placing Code and Data in the Same 64-Byte Cache Line
Optimization Star
The top optimizations described in this chapter are flagged with a star.
anywhere, in any type of code (integer, x87, 3DNow!, MMX, etc.). Use the following formula to determine prefetch distance:
Prefetch Length = 200 (DS / C)
■ Round up to the nearest cache line.
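The formula above can be sketched in C as follows. This excerpt does not define DS and C; the sketch assumes DS is the data stride in bytes per loop iteration and C is the number of cycles per loop iteration. The function name and the 64-byte line size constant are illustrative choices, not part of the guide.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache-line size in bytes (the guide refers to a 64-byte line). */
#define CACHE_LINE_BYTES 64u

/*
 * Prefetch distance in bytes: 200 * (DS / C), rounded up to a whole
 * number of cache lines. The meanings of the two parameters are
 * assumptions (see the paragraph above).
 */
size_t prefetch_distance(size_t data_stride_bytes, size_t cycles_per_iter)
{
    assert(cycles_per_iter > 0);
    /* ceil(200 * DS / C) in integer arithmetic */
    size_t raw = (200u * data_stride_bytes + cycles_per_iter - 1) / cycles_per_iter;
    /* round up to the next multiple of the cache-line size */
    return (raw + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES * CACHE_LINE_BYTES;
}
```

For instance, a loop that consumes 64 bytes every 60 cycles gives 200 * 64 / 60 ≈ 213.3 bytes, which rounds up to 256 bytes (four cache lines).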
Avoid Load-Execute Floating-Point Instructions with Integer Operands
Do not use load-execute floating-point instructions with integer operands.
Avoid Placing Code and Data in the Same 64-Byte Cache Line
Consider that the AMD Athlon processor cache line is twice the size of previous processors.
3 C Source Level Optimizations
This chapter details C programming practices for optimizing code for the AMD Athlon™ processor.
Consider the Sign of Integer Operands
In many cases, the data stored in integer variables determines whether a signed or an unsigned integer type is appropriate.
Example (Avoid):
int i;        ====>   MOV EAX, i
                      CDQ
i = i / 4;            AND EDX, 3 .
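The cost difference behind the example can be illustrated in C. The sketch below is not taken from the guide; the function names are hypothetical. It shows that signed division by 4 needs a bias for negative dividends (the role the CDQ / AND EDX, 3 pair appears to play), while unsigned division by 4 is a plain shift. The manual variant assumes that right-shifting a negative int is an arithmetic shift, which is implementation-defined in C.

```c
#include <assert.h>

/* Signed: i / 4 rounds toward zero, so negative values need a fix-up. */
int div4_signed(int i) { return i / 4; }

/* Unsigned: u / 4 is exactly a logical shift right by 2. */
unsigned div4_unsigned(unsigned u) { return u >> 2; }

/* Hand-written equivalent of the signed case: bias negative dividends by 3,
   then shift. Assumes >> on a negative int is an arithmetic shift. */
int div4_signed_manual(int i)
{
    int bias = (i < 0) ? 3 : 0;
    return (i + bias) >> 2;
}

/* Small usage check comparing the variants. */
void div4_check(void)
{
    assert(div4_signed_manual(-7) == div4_signed(-7));
    assert(div4_signed_manual(9) == div4_signed(9));
    assert(div4_unsigned(9u) == 9u / 4u);
}
```

When a value can never be negative, declaring it unsigned lets the compiler drop the bias step.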
Note that source code transformations will interact with a compiler’s code generator and that it is difficult to control the generated machine code from the source level.
*res++ = dp; /* write transformed z */
dp = vv->x * *m++;
dp += vv->.
Completely Unroll Small Loops
Take advantage of the AMD Athlon processor’s large, 64-Kbyte instruction cache and completely unroll small loops.
code in a way that avoids the store-to-load dependency. In some instances the language definition may prohibit the compiler from using code transformations that would remove the store-to-load dependency.
Consider Expression Order in Compound Branch Conditions
Br .
Switch Statement Usage
Optimize Switch Statements
Switch statements are translated using a variety of algorithms. The most common of these are jump tables and comparison chains/trees.
Use Const Type Qualifier
Use the “const” type qualifier as much as possible.
Generalization for Multiple Constant Control Code
To generalize this further for multiple constant control code some more work may have to be done to create the proper outer loop.
case combine( 1, 1 ):
    for( i ... ) {
        DoWork1( i );
        DoWork3( i );
    }
    break;
default: .
which might inhibit certain optimizations with some compilers, for example, aggressive inlining.
lead to unexpected results. Fortunately, in the vast majority of cases, the final result will differ only in the least significant bits.
Example 1
Avoid:
double a,b,c,d,e,f;
e = b*c/d;
f = b/d*a;
Preferred:
d.
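The "Preferred" listing is cut off in this excerpt. The following C sketch is only a guess at the kind of rewrite meant: the quotient b/d is computed once and reused, so one division replaces two. The struct, function names, and temporary t are hypothetical. As noted earlier, reassociating floating-point operations can change the least significant bits of the result.

```c
#include <assert.h>

/* Hypothetical holder for the two results e and f. */
typedef struct { double e, f; } ef_pair;

/* "Avoid" form: two divisions. */
ef_pair compute_two_divides(double a, double b, double c, double d)
{
    ef_pair r;
    r.e = b * c / d;
    r.f = b / d * a;
    return r;
}

/* Sketch of a shared-subexpression form: b/d is computed once. */
ef_pair compute_shared_quotient(double a, double b, double c, double d)
{
    assert(d != 0.0);
    double t = b / d;            /* common subexpression */
    ef_pair r = { c * t, a * t };
    return r;
}
```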
Pad by Multiple of Largest Base Type Size
Pad the structure to a multiple of the largest base type size of any member.
quadword alignment), so that quadword operands might be misaligned, even if this technique is used and the compiler does allocate variables in the order they are declared.
necessary for the currently selected precision. This means that setting precision control to single precision (versus Win32 default of double precision) lowers the latency of those operations.
Avoid Unnecessary Integer Division
Integer division is the slowest of all integer arithmetic operations and should be avoided wherever possible.
Example 1 (Avoid):
//assumes pointers are diff.
4 Instruction Decoding Optimizations
This chapter discusses ways to maximize the number of instructions decoded by the instruction decoders in the AMD Athlon™ processor.
Select DirectPath Over VectorPath Instructions
Use DirectPath instructions rather than VectorPath instructions.
Use Load-Execute Floating-Point Instructions with Floating-Point Operands
When opera.
Example 1 (Avoid):
FLD QWORD PTR [foo]
FIMUL DWORD PTR [bar]
FIADD DWORD .
Example 2 (Preferred):
05 78 56 34 12    add eax, 12345678h ;uses single byte
                  ; .
Replace Certain SHLD Instructions with Alternative Code
Certain instances of the SHLD instruction can be replaced by alternative code using SHR and LEA.
Use 8-Bit Sign-Extended Displacements
Use 8-bit sign-extended displacements for conditional branches.
Recommendations for the AMD Athlon™ Processor
For code that is optimi.
Recommendations for AMD-K6® Family and AMD Athlon™ Processor Blende.
NOP3_ECX TEXTEQU <DB 08Dh,00Ch,021h> ;lea ecx, [ecx]
NOP3_EDX TEXTEQU <.
;lea edi ,[edi+00000000]
NOP6_EDI TEXTEQU <DB 08Dh,0BFh,0,0,0,0>
;lea e.
5 Cache and Memory Optimizations
This chapter describes code optimization techniques that take advantage of the large L1 caches and high-bandwidth buses of the AMD Athlon™ processor.
Align Data Where Possible
In general, avoid misaligned data references. All data whose size is a power of 2 is considered aligned if it is naturally aligned.
PREFETCH/W versus PREFETCHNTA/T0/T1/T2
The PREFETCHNTA/T0/T1/T2 instructions in the MMX extensions are processor implementation dependent.
MOV ECX, (-LARGE_NUM) ;used biased index
MOV EAX, OFFSET array_a ;get .
The following optimization rules were applied to this example.
■ Loops should be unrolled to make sure that the data stride per loop iteration is equal to the length of a cache line.
Take Advantage of Write Combining
Operating system and device driver programmers should take advantage of the write-combining capabilities of the AMD Athlon processor.
Store-to-Load Forwarding Restrictions
Store-to-load forwarding refers to the process of a load reading (forwarding) data from the store buffer (LS2).
Narrow-to-Wide Store-Buffer Data Forwarding Restriction
If the following co.
Example 5 (Preferred):
MOVD [foo], MM1 ;store lower half
PUNPCKHDQ MM1, MM1 ;get upper half into lower half
MOVD [foo+4], MM1 ;store lower half .
One Supported Store-to-Load Forwarding Case
There is one case of mismatched store-to-load forwarding that is supported by the AMD Athlon processor.
Example (Preferred):
Prolog:
PUSH EBP
MOV EBP, ESP
SUB ESP, SIZE.
Example:
struct { char a[5]; long k; double x; } baz;
The structure components should be allocated (lowest to highest address) as follows:
x, k, a[4], a[3], a[2], a[1], a[0], padbyte6, .
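At the C source level, the same idea can be approximated by declaring members largest-first and padding explicitly. The sketch below is an illustration under assumptions: int32_t stands in for the example's long (assumed to be 32 bits, as on the compilers of the time), the struct tags and the pad member are hypothetical, and the array keeps its normal ascending C element order rather than the descending order listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Declaration order from the example: smallest member first. */
struct baz_as_declared {
    char    a[5];
    int32_t k;       /* stands in for a 32-bit long (assumption) */
    double  x;
};

/* Members sorted by decreasing base type size, padded to a multiple
   of the largest base type size: 8 + 4 + 5 + 7 = 24 bytes. */
struct baz_sorted {
    double  x;
    int32_t k;
    char    a[5];
    char    pad[7];  /* explicit padding bytes */
};

_Static_assert(sizeof(struct baz_sorted) % sizeof(double) == 0,
               "size should be a multiple of the largest base type");
```

The seven pad bytes match the count implied by the "padbyte6" entry in the listing above.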
6 Branch Optimizations
While the AMD Athlon™ processor contains a very sophisticated branch unit, certain optimizations increase the effectiveness of the branch prediction unit.
AMD Athlon™ Processor Specific Code
Example 1 - Signed integer AB.
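The example title is cut off here; assuming it refers to a signed absolute-value computation, a common branch-free formulation looks like the C sketch below. This is a general technique, not the guide's listing, and the function name is hypothetical. It assumes that right-shifting a negative value is an arithmetic shift (implementation-defined in C).

```c
#include <assert.h>
#include <stdint.h>

/* Branch-free absolute value: mask is 0 for x >= 0 and -1 for x < 0. */
int32_t abs_branchless(int32_t x)
{
    assert(x != INT32_MIN);     /* |INT32_MIN| is not representable */
    int32_t mask = x >> 31;     /* arithmetic shift assumed */
    return (x ^ mask) - mask;   /* negative x: ~x + 1; otherwise x */
}
```

Because no conditional jump is involved, the result does not depend on how predictable the sign of the data is.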
Example 6 - Increment Ring Buffer Offset:
//C Code
char buf[BUFSIZE];
int a;
if (a < .
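The C listing is truncated in this excerpt. As an illustration of the general idea (wrapping a ring buffer offset without a conditional branch), the sketch below compares a branching form with a mask-based form. The BUFSIZE value, the function names, and the exact wrap condition are assumptions, not the guide's code.

```c
#include <assert.h>

#define BUFSIZE 16   /* hypothetical buffer size */

/* Conventional form: advance the offset, wrapping to 0 at the end. */
int ring_next_branchy(int a)
{
    return (a < BUFSIZE - 1) ? a + 1 : 0;
}

/* Mask-based form: keep is all ones until the last slot, then 0. */
int ring_next_branchless(int a)
{
    assert(a >= 0 && a < BUFSIZE);
    int keep = -(a < BUFSIZE - 1);
    return (a + 1) & keep;
}
```

A compiler may already turn the first form into branch-free code; the second form simply makes that intent explicit.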
Replace Branches with Computation in 3DNow!™ Code
Branches negatively impact the performance of 3DNow! code.
Example 2 (Preferred):
; r = (x < y) ? a : b
;
; in: mm0 a
; .
Example 2:
C code:
float x,z;
z = abs(x);
if (z >= 1) {
z = 1.
Example 4:
C code:
#define PI 3.
Example 5:
C code:
#define PI 3.
Avoid the Loop Instruction
The LOOP instruction in the AMD Athlon processor requires eight cycles to execute.
Avoid Recursive Functions
Avoid recursive functions due to the danger of overflowing the return address stack. Convert end-recursive functions to iterative code.
7 Scheduling Optimizations
This chapter describes how to code instructions for efficient scheduling. Guidelines are listed in order of importance.
unrolling reduces register pressure by removing the loop counter. To completely unroll a loop, remove the loop control and replicate the loop body N times.
Without Loop Unrolling:
MOV ECX, MAX_LENGTH
MOV EAX, OFFSET A
MOV EBX, OFFSET B
$add_loop:
FLD QWORD PTR [EAX]
FADD QWORD PTR [EBX]
FSTP QWORD PTR [EAX]
ADD EAX, 8
ADD EBX, 8
DEC ECX
JNZ $add_loop
The loop consists of seven instructions.
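For readers working at the C level, the loop above corresponds to an element-wise array add. The sketch below shows that loop and one possible partial unrolling by a factor of two, with a peeled iteration for odd lengths. It is an illustration, not the guide's unrolled listing; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward loop: one add per iteration plus loop overhead. */
void add_arrays(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}

/* Partially unrolled by 2: half as many loop-control instructions. */
void add_arrays_unrolled2(double *a, const double *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));
    size_t i = 0;
    if (n & 1) {            /* peel one iteration when n is odd */
        a[0] += b[0];
        i = 1;
    }
    for (; i < n; i += 2) { /* remaining count is even */
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
    }
}
```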
no faster than three iterations in 10 cycles, or 6/10 floating-point adds per cycle, or 1.4 times as fast as the original loop.
Deriving Loop Control For Partially Unrolled Loops
A frequently used loop construct is a counting loop.
Use Function Inlining
Overview
Make use of the AMD Athlon processor’s large 64-Kbyte instruction cache by inlining small routines to avoid procedure-call overhead.
Always Inline Functions if Called from One Site
A function should always be inlined if it can be established that it is called from just one site in the code.
Example 1 (Avoid):
ADD EBX, ECX ;inst 1
MOV EAX, DWORD PTR [10h] ;inst 2 (fast address calc.)
MOV ECX, DWORD PTR [EAX+EBX] ;inst 3 (slow address calc.
Example 1 (Avoid):
int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE], i;
for (i=0; i <.
variable that starts with a negative value and reaches zero when the loop expires. Note that if the base addresses are held in registers (e.
8 Integer Optimizations
This chapter describes ways to improve integer performance through optimized programming techniques.
Signed Division Utility
In the opt_utilities directory of the AMD documentation CD-ROM, run sdiv.exe in a DOS window to find the fastest code for signed division by a constant.
Example 1:
;In: EAX = dividend
;Out: EDX = quotient
XOR EDX, EDX ;0
CMP EAX, d ;CF = (.
;algorithm 1
MOV EAX, m
MOV EDX, dividend
MOV ECX, EDX
IMUL EDX
ADD EDX, ECX
SHR ECX,.
Remainder of Signed Integer 2^n or -(2^n)
;IN:EAX = dividend .
by 11:
LEA REG2, [REG1*8+REG1] ;3 cycles
ADD REG1, REG1
ADD REG1,.
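The "by 11" sequence can be read as 11x = 9x + 2x. The C sketch below mirrors that decomposition; the last instruction is truncated in this excerpt, so the final step (adding the 9x term) is an assumption, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply by 11 with shifts and adds only. */
uint32_t mul11(uint32_t x)
{
    uint32_t nine = x * 8u + x;  /* LEA REG2, [REG1*8+REG1] */
    uint32_t two  = x + x;       /* ADD REG1, REG1 */
    return two + nine;           /* assumed final ADD of the two terms */
}

/* Small usage check against an ordinary multiply. */
void mul11_check(void)
{
    assert(mul11(7u) == 77u);
    assert(mul11(123456u) == 123456u * 11u);
}
```

Unsigned arithmetic wraps modulo 2^32 in both forms, so the two agree for every input.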
by 26: use IMUL
by 27:
LEA REG2, [REG1*4+REG1] ;3 cycles
SHL REG1, 5
S.
In addition, using MMX instructions increases the available parallelism. The AMD Athlon processor can issue three integer OPs and two MMX OPs per cycle.
Ensure DF=0 (UP)
Always make sure that DF = 0 (UP) (after execution of CLD) for REP MOVS and REP STOS.
Use XOR Instruction to Clear Integer Registers
To clear an integer register to all 0s, use “XOR reg, reg”.
Example 4 (Left shift):
;shift operand in EDX:EAX left, shift count in ECX (cou.
Example 7 (Division):
;_ulldiv divides two unsigned 64-bit integers, and returns
; the quotient.
MOV ECX, EAX ;save quotient
IMUL EDI, EAX ;quotient * divisor hi-word
              ; (low only)
MUL DWORD PTR [ESP+20] ;quotient * divisor lo-word
ADD EDX, EDI ;EDX:EAX = quotient * divisor
SUB EBX, EAX ;dividend_lo - (quot.
$r_two_divs:
MOV ECX, EAX ;save dividend_lo in ECX
MOV EAX, EDX ;get dividend_hi .
Efficient Implementation of Population Count Function
Population count is an operation that determines the number of set bits in a bit string.
Step 3
For the first time, the value in each k-bit field is small enough that adding two k-bit fields results in a value that still fits in the k-bit field.
ADD EAX, EDX ;x = (w & 0x33333333) + ((w >>.
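The stepwise field-summing approach described above is commonly written in C as shown below. The w and x names follow the comment in the assembly fragment; the remaining steps and constants are the widely used form of this technique and are assumed, since the full listing is not part of this excerpt.

```c
#include <assert.h>
#include <stdint.h>

/* Population count of a 32-bit value by summing ever-wider bit fields. */
uint32_t popcount32(uint32_t v)
{
    uint32_t w = v - ((v >> 1) & 0x55555555u);                 /* 2-bit field sums */
    uint32_t x = (w & 0x33333333u) + ((w >> 2) & 0x33333333u); /* 4-bit field sums */
    uint32_t y = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit field sums */
    return (y * 0x01010101u) >> 24;                            /* add the four bytes */
}

/* Small usage check. */
void popcount32_check(void)
{
    assert(popcount32(0xFFFFFFFFu) == 32u);
    assert(popcount32(0x80000001u) == 2u);
}
```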
;algorithm 1
MOV EDX, dividend
MOV EAX, m
MUL EDX
ADD EAX, m
AD.
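As a concrete instance of dividing by a constant with a multiplier and a shift, the C sketch below divides an unsigned 32-bit value by 10. It uses the plain multiply-then-shift form rather than the sequence listed above, and the multiplier and shift are the well-known pair for this divisor, not values produced by the guide's derivation code. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* n / 10 for any 32-bit unsigned n, using m = ceil(2^35 / 10). */
uint32_t div10(uint32_t n)
{
    const uint64_t m = 0xCCCCCCCDu;       /* 3435973837 */
    return (uint32_t)((n * m) >> 35);     /* 64-bit product, then shift */
}

/* Small usage check against ordinary division. */
void div10_check(void)
{
    assert(div10(4294967295u) == 4294967295u / 10u);
    assert(div10(99u) == 9u);
}
```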
/* Generate m, s for algorithm 1. Based on: Magenheimer, D.J.; et al: “Integer Multiplication and Division on the HP Precision Architecture”.
;algorithm 1
MOV EAX, m
MOV EDX, dividend
MOV ECX, EDX
IMUL EDX.
9 Floating-Point Optimizations
This chapter details the methods used to optimize floating-point code to the pipelined floating-point unit (FPU).
Use FFREEP Macro to Pop One Register from the FPU Stack
In FPU intensive code, frequently accessed data is often pre-loaded at the bottom of the FPU stack before processing floating-point data.
These instructions are much faster than the classical approach using FSTSW, because FSTSW is essentially a serializing instruction on the AMD Athlon processor.
Minimize Floating-Point-to-Integer Conversions
C++, C, and Fortran define floating-point-to-integer conversions as truncating.
FPU into truncating mode, and performing all of the conversions before restoring the original control word. The speed of the above code is somewhat dependent on the nature of the code surrounding it.
Example 3 (Potentially faster):
MOV ECX, DWORD PTR[X+4] ;get upper 32 bits of double
XOR EDX, EDX ;i = 0
MOV EAX, ECX ;save sign bit
AND ECX, 07FF00000h ;isolate exponent field
CMP ECX, 03FF00000h ;if abs(x) < 1.
Floating-Point Subexpression Elimination
There are cases which do not require an FXCH instruction after every instruction to allow access to two new stack entries.
If an “argument out of range” is detected, a range reduction subroutine is invoked which reduces the argument to less than 2^63 before the instruction is attempted again.
Since out-of-range arguments are extremely uncommon, the conditional branch will be perfectly predicted, and the other instructions used to guard the trigonometric instruction can execute in parallel to it.
10 3DNow!™ and MMX™ Optimizations
This chapter describes 3DNow! and MMX code optimization techniques for the AMD Athlon™ processor.
FEMMS instruction is supported for backward compatibility with AMD-K6 family processors, and is aliased to the EMMS instruction.
Pipelined Pair of 24-Bit Precision Divides
This divide operation.
Use 3DNow!™ Instructions for Fast Square Root and Recip.
Newton-Raphson Reciprocal Square R.
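The heading above is cut off, but it appears to refer to Newton-Raphson refinement of a reciprocal square root estimate. The standard iteration for 1/sqrt(b) is x_next = 0.5 * x * (3 - b * x * x). The scalar C sketch below illustrates one refinement step; it is generic mathematics rather than the guide's 3DNow! sequence, and the function name is hypothetical.

```c
#include <assert.h>

/* One Newton-Raphson step toward 1/sqrt(b), starting from estimate x0. */
float rsqrt_refine(float b, float x0)
{
    assert(b > 0.0f);
    return 0.5f * x0 * (3.0f - b * x0 * x0);
}
```

Starting from a rough estimate of 0.4 for b = 4 (true value 0.5), one step gives about 0.472 and a second step about 0.4977, so the error shrinks quickly with each iteration.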
Page 112 (3DNow!™ and MMX™ Intra-Operand Swapping): Example:
    PXOR MM2, MM2 ; 0 | 0
    MOVD MM0, [ab] ; 0 0 | b a
    MOVD MM1, [.
Page 113 (Fast Conversion of Signed Words to Floating-Point): In many applications there is a need to quickly convert data consisting of packed 16-bit signed integers into floating-point numbers.
Page 114 (Use MMX™ PCMP Instead of 3DNow!™ PFCMP): cycle bypassing penalty, and another one cycle penalty if the result goes to a 3DNow! operation.
Page 115 (Use MMX™ Instructions for Block Copies and Block Fills): For moving or filling small blocks of data (e.g.
Page 116 (Use MMX™ Instructions for Block Copies and Block Fills):
    $xfer:
    movq mm0, [eax]
    add edx, 64
    movq mm1, [eax+8]
    add .
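The `$xfer` fragment above moves data in 8-byte MMX loads and advances by 64 bytes per loop iteration. A portable C sketch of the same shape is shown below. It is an illustration only: `copy_blocks64` is a hypothetical name, the block size is assumed to be a multiple of 64 bytes, and the eight-quadword staging buffer stands in for the MMX registers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n_bytes from src to dst in 64-byte chunks.
 * Assumption: n_bytes is a multiple of 64; any tail is left uncopied. */
void copy_blocks64(void *dst, const void *src, size_t n_bytes)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    for (size_t off = 0; off + 64 <= n_bytes; off += 64) {
        uint64_t q[8];                 /* models eight 8-byte registers */
        memcpy(q, s + off, sizeof q);  /* eight quadword loads */
        memcpy(d + off, q, sizeof q);  /* eight quadword stores */
    }
}
```

Grouping all loads before all stores within a chunk mirrors the unrolled assembly loop; a compiler is free to lower the two `memcpy` calls to wide register moves.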
Page 117 (Use MMX™ Instructions for Block Copies and Block Fills): AMD Athlon™ Processor Specific Code. The following .
Page 118 (Use MMX™ PXOR to Clear All Bits in an MMX™ Register): /* block fill (destination QWORD aligned) */ __asm { mov.
Page 119 (Use MMX™ PCMPEQD to Set All Bits in an MMX™ Register): Use MMX™ PCMPEQD to Set All Bits in an MMX™ R.
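The two idioms named on pages 118 and 119 rely on simple identities: XORing a register with itself yields all zeros, and comparing a register with itself for equality yields all ones in every lane. The C model below is a sketch of those identities on a 64-bit value; the function names are hypothetical and the lane loop is only a software model of a packed doubleword compare.

```c
#include <stdint.h>

/* Model of PXOR reg, reg: any value XORed with itself is zero. */
uint64_t clear_all_bits(uint64_t r)
{
    return r ^ r;
}

/* Model of a packed doubleword equality compare:
 * each 32-bit lane becomes 0xFFFFFFFF if equal, otherwise 0. */
uint64_t pcmpeqd_model(uint64_t a, uint64_t b)
{
    uint64_t out = 0;
    for (int lane = 0; lane < 2; lane++) {
        uint32_t x = (uint32_t)(a >> (32 * lane));
        uint32_t y = (uint32_t)(b >> (32 * lane));
        if (x == y) {
            out |= (uint64_t)0xFFFFFFFFu << (32 * lane);
        }
    }
    return out;
}

/* Model of PCMPEQD reg, reg: every lane equals itself, so all bits are set. */
uint64_t set_all_bits(uint64_t r)
{
    return pcmpeqd_model(r, r);
}
```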
Page 120 (Optimized Matrix Multiplication): /* Function XForm performs a fully generalized 3D transform on an array of vertices pointed to by "v" and stores the transformed vertices in the location pointed to by "res".
Page 121 (Optimized Matrix Multiplication):
    $$xform:
    ADD EBX, 16 ;res++
    MOVQ MM0, QWORD PTR [EDX] ;v->y | v->x
    MOVQ MM1, Q.
Page 122 (Efficient 3D-Clipping Code Computation Using 3DNow!™ Instructions): Clipping is one of the major activities occurring in a 3D graphics pipeline.
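To make the term "clipping code" concrete, here is a scalar C sketch of a conventional clip-code computation for a homogeneous vertex. It is not taken from the guide: the bit assignments, the `clip_code` name, and the choice of testing each coordinate against the range [-w, w] are assumptions for illustration.

```c
/* Hypothetical clip-code bit layout (not from the guide). */
enum {
    CLIP_LEFT   = 1u << 0,  /* x < -w */
    CLIP_RIGHT  = 1u << 1,  /* x >  w */
    CLIP_BOTTOM = 1u << 2,  /* y < -w */
    CLIP_TOP    = 1u << 3,  /* y >  w */
    CLIP_NEAR   = 1u << 4,  /* z < -w */
    CLIP_FAR    = 1u << 5   /* z >  w */
};

/* Return a bit mask naming every clip plane the vertex lies outside of.
 * A result of 0 means the vertex is inside the view volume. */
unsigned clip_code(float x, float y, float z, float w)
{
    unsigned code = 0;
    if (x < -w) code |= CLIP_LEFT;
    if (x >  w) code |= CLIP_RIGHT;
    if (y < -w) code |= CLIP_BOTTOM;
    if (y >  w) code |= CLIP_TOP;
    if (z < -w) code |= CLIP_NEAR;
    if (z >  w) code |= CLIP_FAR;
    return code;
}
```

The six comparisons are independent of each other, which is what makes this computation a natural fit for packed compare instructions.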
Page 123 (Use 3DNow!™ PAVGUSB for MPEG-2 Motion Compensation):
    ;;
    ;; DESTROYS MM0,MM1,MM2,MM3,MM4
    PXOR MM0, MM0 ; 0 | 0
    MOVQ .
Page 124 (Use 3DNow!™ PAVGUSB for MPEG-2 Motion): Example 1 (Avoid):
    MOV ESI, DWORD PTR Src_MB
    MOV EDI, DWORD PTR Dst_M.
Page 125 (Stream of Packed Unsigned Bytes): The following code fragment uses the 3DNow! PAVGUSB instruction to perform.
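PAVGUSB computes a rounded average of each pair of unsigned bytes. As a scalar reference for what one lane produces, the following C sketch may help when checking an assembly version; the function names are hypothetical, and the loop processes one byte at a time rather than eight per instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Rounded average of two unsigned bytes: (a + b + 1) >> 1.
 * The sum is formed in a wider type so it cannot overflow. */
uint8_t avg_round_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* Apply the rounded average element by element over two byte streams. */
void avg_bytes(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = avg_round_u8(a[i], b[i]);
    }
}
```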
Page 126 (Complex Number Arithmetic): Complex numbers have a “real” part and an “imaginary” part. Multiplying complex numbers (ex.
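For reference, the product of two complex numbers (a + bi)(c + di) is (ac - bd) + (ad + bc)i. The C sketch below spells that out; the `cplx` type and `cplx_mul` name are hypothetical, and the code is a scalar illustration rather than the guide's 3DNow! sequence.

```c
/* Hypothetical complex type: real part and imaginary part. */
typedef struct {
    float re;
    float im;
} cplx;

/* (a.re + a.im*i) * (b.re + b.im*i)
 *   = (a.re*b.re - a.im*b.im) + (a.re*b.im + a.im*b.re)*i */
cplx cplx_mul(cplx a, cplx b)
{
    cplx r;
    r.re = a.re * b.re - a.im * b.im;
    r.im = a.re * b.im + a.im * b.re;
    return r;
}
```

The four multiplies and two add/subtract operations are independent in pairs, which is the property a packed implementation exploits.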
Page 127 (Short Forms): 11 General x86 Optimization Guidelines. This chapter describes general code optimization techniques.
Page 128 (Dependencies): Spread out true dependencies to increase the opportunities for parallel execution. Anti-dependencies and output dependencies do not impact performance.
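One common way to spread out true dependencies in source code is to split a single accumulator into several independent chains. The sketch below assumes a simple summation loop, which is not an example from the guide; `sum_two_chains` is a hypothetical name. Unsigned arithmetic is used so that regrouping the additions cannot change the result.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum an array using two independent accumulators.
 * Each add into s0 depends only on the previous s0, and likewise for s1,
 * so the two chains can execute in parallel. */
uint32_t sum_two_chains(const uint32_t *v, size_t n)
{
    uint32_t s0 = 0, s1 = 0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        s0 += v[i];
        s1 += v[i + 1];
    }
    if (i < n) {
        s0 += v[i];  /* leftover element when n is odd */
    }
    return s0 + s1;
}
```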
Page 129 (Introduction): Appendix A, AMD Athlon™ Processor Microarchitecture. Introduction. When discussing processor design, it is important to understand the following terms: architecture, microarchitecture, and design implementation.
Page 130 (AMD Athlon™ Processor Microarchitecture): AMD Athlon™ Processor Microarchitecture. The innovative AMD Athlo.
Page 131 (AMD Athlon™ Processor Microarchitecture): Figure 1. AMD Athlon™ Processor Block Diagram. Instruction Cache. The out-of-order execute engine of the AMD Athlon processor contains a very large 64-Kbyte L1 instruction cache.
Page 132 (AMD Athlon™ Processor Microarchitecture): replacement is based on a least-recently used (LRU) replacement algorithm. The L1 instruction cache has an associated two-level translation look-aside buffer (TLB) structure.
Page 133 (AMD Athlon™ Processor Microarchitecture): return stack. Subsequent RETs pop a predicted return address off the top of the stack. Early Decoding. The DirectPath and VectorPath decoders perform early-decoding of instructions into MacroOPs.
Page 134 (AMD Athlon™ Processor Microarchitecture): Instruction Control Unit. The instruction control unit (ICU) is the control center for the AMD Athlon processor.
Page 135 (AMD Athlon™ Processor Microarchitecture): Integer Scheduler. The integer scheduler is based on a three-wide queuing system (also known as a reservation station) that feeds three integer execution positions or pipes.
Page 136 (AMD Athlon™ Processor Microarchitecture): Each of the three IEUs is general purpose in that each performs logic functions, arithmetic functions, conditional functions, divide step functions, status flag multiplexing, and branch resolutions.
Page 137 (AMD Athlon™ Processor Microarchitecture): Floating-Point Execution Unit. The floating-point execution unit (FPU) is implemented as a coprocessor that has its own out-of-order control in addition to the data path.
Page 138 (AMD Athlon™ Processor Microarchitecture): Load-Store Unit (LSU). The load-store unit (LSU) manages data load and store accesses to the L1 data cache and, if required, to the backside L2 cache or system memory.
Page 139 (AMD Athlon™ Processor Microarchitecture): L2 Cache Controller. The AMD Athlon processor contains a very flexible onboard L2 controller. It uses an independent backside bus to access up to 8-Mbytes of industry-standard SRAMs.
Page 141 (Fetch and Decode Pipeline Stages): Appendix B, Pipeline and Execution Unit Resources Overview. The AMD Athlon™ processor contains two independent execution pipelines: one for integer operations and one for floating-point operations.
Page 142 (Fetch and Decode Pipeline Stages): Figure 5. Fetch/Scan/Align/Decode Pipeline Hardware. The most common x86 instructions flow through the DirectPath pipeline stages and are decoded by hardware.
Page 143 (Fetch and Decode Pipeline Stages): Cycle 1 – FETCH. The FETCH pipeline stage calculates the address of the next x86 instruction window to fetch from the processor caches or system memory.
Page 144 (Integer Pipeline Stages): operands mapped to registers. Both integer and floating-point MacroOPs are placed into the ICU.
Page 145 (Integer Pipeline Stages): Cycle 7 – SCHED. In the scheduler (SCHED) pipeline stage, the scheduler buffers can contain MacroOPs that are waiting for integer operands from the ICU or the IEU result bus.
Page 146 (Floating-Point Pipeline Stages): The floating-point unit (FPU) is implemented as a coprocessor that has its own out-of-order control in addition to the data path.
Page 147 (Floating-Point Pipeline Stages): Cycle 7 – STKREN. The stack rename (STKREN) pipeline stage in cycle 7 receives up to three MacroOPs from IDEC and maps stack-relative register tags to virtual register tags.
Page 148 (Execution Unit Resources): Terminology. The execution units operate with two types of register values: operands and results.
Page 149 (Execution Unit Resources): Integer Pipeline Operations. Table 2 shows the category or type of operations handled by the integer pipeline. Table 3 shows examples of the decode type.
Page 150 (Execution Unit Resources): Floating-Point Pipeline Operations. Table 4 shows the category or type of operations handled by the floating-point execution units. Table 5 shows examples of the decode types.
Page 151 (Execution Unit Resources): Load/Store Pipeline Operations. The AMD Athlon processor decodes any instruction that references memory into primitive load/store operations.
Page 152 (Execution Unit Resources): Code Sample Analysis. The samples in Table 7 on page 153 and Table 8 on page 154 show the execution behavior of several series of instructions as a function of decode constraints, dependencies, and execution resource constraints.
Page 153 (Execution Unit Resources): Table 7. Sample 1 – Integer Register Operations. Instruction Number, Decode Pipe, Deco.
Page 154 (Execution Unit Resources): Table 8. Sample 2 – Integer Register and Memory Load Operations. Instruc Num, Decode Pip.
Page 155 (Introduction): Appendix C, Implementation of Write Combining. Introduction. This appendix describes the memory write-combining feature as implemented in the AMD Athlon™ processor family.
Page 156 (Write-Combining Definitions and Abbreviations): This appendix uses .
Page 157 (Write-Combining Operations): signature in register EAX, where EAX[11–8] contains the instruction family code. For the AMD Athlon processor, the instruction family code is six.
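Extracting the family code described above is a shift and a mask on the signature value. The sketch below assumes the signature has already been read into a 32-bit variable (how it is obtained is outside this snippet); the function names and the sample signature values used in any test are hypothetical.

```c
#include <stdint.h>

/* Return bits 11..8 of the processor signature (the instruction family code). */
unsigned family_code(uint32_t eax_signature)
{
    return (eax_signature >> 8) & 0xFu;
}

/* The text gives six as the family code for the AMD Athlon processor. */
int is_family_six(uint32_t eax_signature)
{
    return family_code(eax_signature) == 6u;
}
```

Software that enables write combining only on this processor family would typically gate on a check like `is_family_six` in addition to a vendor check, which is not shown here.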
Page 158 (Write-Combining Operations): Table 9. Write Combining Completion Events. Event: Non-WB write outside of current buffer. Comment: The first non-WB write to a different cache block address closes combining for previous writes.
Page 159 (Write-Combining Operations): Sending Write-Buffer Data to the System. Once write combining is closed for a 64-byte write buffer, the contents of the write buffer are eligible to be sent to the system as one or more AMD Athlon system bus commands.
Page 161 (Overview): Appendix D, Performance-Monitoring Counters. This chapter describes how to use the AMD Athlon™ processor performance monitoring counters.
Page 162 (Performance Counter Usage): These registers can be read from and written to using the RDMSR and WRMSR instructions, respectively. The PerfEvtSel[3:0] registers are located at MSR locations C001_0000h to C001_0003h.
Page 163 (Performance Counter Usage): Unit Mask Field (Bits 8–15). These bits are used to further qualify the event selected in the event select field. For example, for some cache events, the mask is used as a MESI-protocol qualifier of cache states.
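Since the unit mask occupies bits 8 to 15 of the event-select value, setting or reading it is a matter of masking and shifting. The helpers below are a sketch with hypothetical names; they only manipulate a 64-bit value in memory and do not touch any MSR. The layout of the other fields is not assumed.

```c
#include <stdint.h>

#define UNIT_MASK_SHIFT 8
#define UNIT_MASK_BITS  ((uint64_t)0xFF << UNIT_MASK_SHIFT)  /* bits 15..8 */

/* Return evtsel with its unit mask field replaced by unit_mask. */
uint64_t with_unit_mask(uint64_t evtsel, uint8_t unit_mask)
{
    return (evtsel & ~UNIT_MASK_BITS) | ((uint64_t)unit_mask << UNIT_MASK_SHIFT);
}

/* Extract the unit mask field from evtsel. */
uint8_t unit_mask_of(uint64_t evtsel)
{
    return (uint8_t)((evtsel & UNIT_MASK_BITS) >> UNIT_MASK_SHIFT);
}
```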
Page 164 (Performance Counter Usage): greater than or equal to the counter mask. Otherwise if this field is zero, then the counter increments by the total number of events.
Page 165 (Performance Counter Usage): 65h BU: 1xxx_xxxxb = reserved; x1xx_xxxxb = WB; xx1x_xxxxb = WP; xxx1_xxxxb = WT; bits 11–10 .
Page 166 (Performance Counter Usage): 7Ah BU: Cycles that at least one fill request waited to use the L2. 80h PC: Instruction c.
Page 167 (Performance Counter Usage): PerfCtr[3:0] MSRs (MSR Addresses C001_0004h–C001_0007h). The performance-counter MSRs contain the event or duration counts for the selected events being counted.
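The address ranges quoted in this appendix (PerfEvtSel[3:0] at C001_0000h to C001_0003h, PerfCtr[3:0] at C001_0004h to C001_0007h) map a counter index to an MSR number by simple addition. The sketch below encodes that mapping; the macro and function names are hypothetical, and returning 0 for an out-of-range index is a convention chosen for this example.

```c
#include <stdint.h>

#define PERF_EVT_SEL_BASE 0xC0010000u  /* PerfEvtSel0 */
#define PERF_CTR_BASE     0xC0010004u  /* PerfCtr0 */
#define PERF_COUNTER_COUNT 4u

/* MSR number of PerfEvtSel[n], or 0 if n is not in 0..3. */
uint32_t perf_evt_sel_msr(unsigned n)
{
    return (n < PERF_COUNTER_COUNT) ? PERF_EVT_SEL_BASE + n : 0u;
}

/* MSR number of PerfCtr[n], or 0 if n is not in 0..3. */
uint32_t perf_ctr_msr(unsigned n)
{
    return (n < PERF_COUNTER_COUNT) ? PERF_CTR_BASE + n : 0u;
}
```

The resulting numbers are what privileged code would pass to RDMSR and WRMSR; issuing those instructions is not shown because it requires kernel-mode execution.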
Page 168 (Event and Time-Stamp Monitoring Software): allows writing both positive and negative values to the performance counters. The performance counters may be initialized using a 64-bit signed integer in the range -2^47 and +2^47.
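A range check for a proposed initial count can be sketched as follows. Two assumptions are made here and are not settled by the text above: the upper endpoint +2^47 is treated as excluded (the most a 48-bit two's-complement value can hold is 2^47 - 1), and preloading a counter with -N so that it reaches zero after N events is shown as a common profiling pattern rather than as the guide's procedure. All names are hypothetical.

```c
#include <stdint.h>

#define PERF_CTR_LIMIT ((int64_t)1 << 47)  /* 2^47 */

/* Nonzero if value lies in [-2^47, 2^47 - 1] (conservative reading). */
int perf_ctr_init_ok(int64_t value)
{
    return value >= -PERF_CTR_LIMIT && value < PERF_CTR_LIMIT;
}

/* Initial count that reaches zero after the given number of events.
 * Returns 0 if the request is zero or cannot be represented. */
int64_t perf_ctr_preload(uint64_t events_until_zero)
{
    if (events_until_zero == 0 || events_until_zero > (uint64_t)PERF_CTR_LIMIT) {
        return 0;
    }
    return -(int64_t)events_until_zero;
}
```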
Page 169 (Monitoring Counter Overflow): The initialization and start counters procedure sets the PerfEvtSel0 and/or PerfEvtSel1 MSRs for the events to be counted and the method used to count them and initializes the counter MSRs (PerfCtr[3:0]) to starting counts.
Page 170 (Monitoring Counter Overflow): An event monitor application utility or another application program can read the collected performance information of the profiled application.
Page 171 (Introduction): Appendix E, Programming the MTRR and PAT. Introduction. The AMD Athlon™ processor includes a set of memory type and range registers (MTRRs) to control cacheability and access to specified memory regions.
Page 172 (Memory Type Range Register (MTRR) Mechanism): There are two types of address ranges: fixed and variable. (See Figure 12.) For each address range, there is a memory type.
Page 173 (Memory Type Range Register (MTRR) Mechanism): Figure 12. MTRR Mapping of Physical Memory (figure labels: 0, FFFFFFFFh, 512 Kbyt.)
Page 174 (Memory Type Range Register (MTRR) Mechanism): Memory Types. Five standard memory types are defined by the AMD Athlon processor: writethrough (WT), writeback (WB), write-protect (WP), write-combining (WC), and uncacheable (UC).
Page 175 (Memory Type Range Register (MTRR) Mechanism): MTRR Default Type Register Format. The MTRR default type register is defined as follows. Figure 14. MTRR Default Type Register Format. E: MTRRs are enabled when set.
Page 176 (Memory Type Range Register (MTRR) Mechanism): Note that if two or more variable memory ranges match then the interactions are defined as follows: 1. If the memory types are identical, then that memory type is used.
Page 177 (Page Attribute Table (PAT)): not affected by this issue, only the variable range (and MTRR DefType) registers are affected.
Page 178 (Page Attribute Table (PAT)): Accessing the PAT. A 3-bit index consisting of the PATi, PCD, and PWT bits of t.
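The 3-bit index can be illustrated with a small C sketch. It assumes the bits are combined in the order listed, with PATi as the most significant bit and PWT as the least significant, and that the eight PAT entries are stored one per byte of a 64-bit register value with the type in the low three bits of each byte. Both are stated here as assumptions; the function names and the sample register value in the test are hypothetical.

```c
#include <stdint.h>

/* Build the 3-bit PAT index: PATi is bit 2, PCD is bit 1, PWT is bit 0
 * (assumed ordering). Only the low bit of each argument is used. */
unsigned pat_index(unsigned pat_i, unsigned pcd, unsigned pwt)
{
    return ((pat_i & 1u) << 2) | ((pcd & 1u) << 1) | (pwt & 1u);
}

/* Select entry `index` (0..7) from a 64-bit PAT register value,
 * assuming one entry per byte with the type in bits 2..0 of that byte. */
unsigned pat_entry(uint64_t pat_register, unsigned index)
{
    return (unsigned)((pat_register >> (8u * (index & 7u))) & 0x7u);
}
```

A page-table walker model would call `pat_index` with the three bits taken from the page-table entry and then `pat_entry` to obtain the encoded memory type.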
Page 179 (Page Attribute Table (PAT)): Table 15. Effective Memory Type Based on PAT and MTRRs. PAT Memory Type, MTRR M.
Page 180 (Page Attribute Table (PAT)): Table 16. Final Output Memory Types. Input Memory Type, Output Memory Type, Note. RdMem, WrMem, Effective.
Page 181 (Page Attribute Table (PAT)): ● ● CD - ●● CD ●● WC - ●● WC ●● WT - ●● WT ●● WP - ●● WP ●● WB - ● ● WT 4 ●● - ●● ● CD 2 Notes: 1 .
Page 182 (Page Attribute Table (PAT)): MTRR Fixed-Range Register Format. The memory types defined for memory segments defined in each of the MTRR fixed-range registers are defined in Table 17 (Also See “Standard MTRR Types and Properties” on page 176.
Page 183 (Page Attribute Table (PAT)): Variable-Range MTRRs. A variable MTRR can be programmed to start at address 0000_0000h because the fixed MTRRs always override the variable ones.
Page 184 (Page Attribute Table (PAT)): Figure 17. MTRRphysMaskn Register Format. Note: A software attempt to write to reserved bits will generate a general protection exception.
Page 185 (Page Attribute Table (PAT)): MTRR MSR Format. This table defines the model-specific registers related to the memory type range register implementation. All MTRRs are defined to be 64 bits.
Page 187 (Instruction Dispatch and Execution Resources): Appendix F, Instruction Dispatch and Execution Resources. This chapter describes the MacroOPs generated by each decoded instruction, along with the relative static execution latencies of these groups of operations.
Page 188 (Instruction Dispatch and Execution Resources):
    ■ disp16/32: 16-bit or 32-bit displacement value
    ■ disp32/4.
Page 189 (Instruction Dispatch and Execution Resources):
    ADC mreg8, reg8 10h 11-xxx-xxx DirectPath
    ADC mem8, reg8 10h mm-xx.
Page 190 (Instruction Dispatch and Execution Resources):
    AND mem8, reg8 20h mm-xxx-xxx DirectPath
    AND mreg16/32, reg16/3.
Page 191 (Instruction Dispatch and Execution Resources):
    BT mem16/32, imm8 0Fh BAh mm-100-xxx DirectPath
    BTC mreg16/32, reg.
Page 192 (Instruction Dispatch and Execution Resources):
    CMOVE/CMOVZ reg16/32, reg16/32 0Fh 44h 11-xxx-xxx DirectPath
    CMOVE.
Page 193 (Instruction Dispatch and Execution Resources):
    CMP EAX, imm16/32 3Dh DirectPath
    CMP mreg8, imm8 80h 11-111-xxx.
Page 194 (Instruction Dispatch and Execution Resources):
    DIV EAX, mreg16/32 F7h 11-110-xxx VectorPath
    DIV EAX, mem16/32 .
Page 195 (Instruction Dispatch and Execution Resources):
    INC mreg8 FEh 11-000-xxx DirectPath
    INC mem8 FEh mm-000-xxx DirectPa.
Page 196 (Instruction Dispatch and Execution Resources):
    JP/JPE near disp16/32 0Fh 8Ah DirectPath
    JNP/JPO near disp16/32 .
Page 197 (Instruction Dispatch and Execution Resources):
    LOOPE/LOOPZ disp8 E1h VectorPath
    LOOPNE/LOOPNZ disp8 E0h Vect.
Page 198 (Instruction Dispatch and Execution Resources):
    MOV EDX, imm16/32 BAh DirectPath
    MOV EBX, imm16/32 BBh DirectPath
    MO.
Page 199 (Instruction Dispatch and Execution Resources):
    NOT mem8 F6h mm-010-xx DirectPath
    NOT mreg16/32 F7h 11-010-xxx .
Page 200 (Instruction Dispatch and Execution Resources):
    POP EBX 5Bh VectorPath
    POP ESP 5Ch VectorPath
    POP EBP 5Dh VectorP.
Page 201 (Instruction Dispatch and Execution Resources):
    RCL mreg8, 1 D0h 11-010-xxx DirectPath
    RCL mem8, 1 D0h mm-010-x.
Page 202 (Instruction Dispatch and Execution Resources):
    ROL mreg16/32, 1 D1h 11-000-xxx DirectPath
    ROL mem16/32, 1 D1h mm-0.
Page 203 (Instruction Dispatch and Execution Resources):
    SBB mreg16/32, reg16/32 19h 11-xxx-xxx DirectPath
    SBB mem16/32,.
Page 204 (Instruction Dispatch and Execution Resources):
    SETS mreg8 0Fh 98h 11-xxx-xxx DirectPath
    SETS mem8 0Fh 98h mm-xxx-x.
Page 205 (Instruction Dispatch and Execution Resources):
    SHR mem16/32, imm8 C1h mm-101-xxx DirectPath
    SHR mreg8, 1 D0h 11-.
Page 206 (Instruction Dispatch and Execution Resources):
    SUB reg8, mreg8 2Ah 11-xxx-xxx DirectPath
    SUB reg8, mem8 2Ah mm-x.
Page 207 (Instruction Dispatch and Execution Resources):
    XADD mreg8, reg8 0Fh C0h 11-100-xxx VectorPath
    XADD mem8, reg8 0.
Page 208 (Instruction Dispatch and Execution Resources): Table 20. MMX™ Instructions. Instruction Mnemonic, Prefix Byte(.
Page 209 (Instruction Dispatch and Execution Resources):
    PANDN mmreg1, mmreg2 0Fh DFh 11-xxx-xxx DirectPath FADD/FMUL
    P .
Page 210 (Instruction Dispatch and Execution Resources):
    PSRAW mmreg1, mmreg2 0Fh E1h 11-xxx-xxx DirectPath FADD/FMUL
    PSR.
Page 211 (Instruction Dispatch and Execution Resources):
    PUNPCKHDQ mmreg1, mmreg2 0Fh 6Ah 11-xxx-xxx DirectPath FADD/FMU.
Page 212 (Instruction Dispatch and Execution Resources):
    PMINSW mmreg, mem64 0Fh EAh mm-xxx-xxx DirectPath FADD/FMUL
    PMI .
Page 213 (Instruction Dispatch and Execution Resources):
    FCMOVB ST(0), ST(i) DAh C0-C7h VectorPath
    FCMOVE ST(0), ST(i) DAh C .
Page 214 (Instruction Dispatch and Execution Resources):
    FIADD [mem32int] DAh mm-000-xxx VectorPath
    FIADD [mem16int] DEh mm-00.
Page 215 (Instruction Dispatch and Execution Resources):
    FLDCW [mem16] D9h mm-101-xxx VectorPath
    FLDENV [mem14byte] D9h.
Page 216 (Instruction Dispatch and Execution Resources):
    FSTCW [mem16] D9h mm-111-xxx VectorPath
    FSTE .
Page 217 (Instruction Dispatch and Execution Resources): Table 23. 3DNow!™ Instructions. Instruction Mnemonic, Prefix Byte.
Page 218 (Instruction Dispatch and Execution Resources):
    PFRSQRT mmreg, mem64 0Fh, 0Fh 97h mm-xxx-xxx DirectPath FMUL
    PF.
Page 219 (Select DirectPath Over VectorPath Instructions): Appendix G, DirectPath versus VectorPath Instructions. Select DirectPath Over VectorPath Instructions. Use DirectPath instructions rather than VectorPath instructions.
Page 220 (DirectPath Instructions): Table 25. DirectPath Integer Instructions. Instruction Mnemonic:
    ADC mreg8, reg8
    ADC mem8, .
Page 221 (DirectPath Instructions):
    CMOVBE/CMOVNA reg16/32, reg16/32
    CMOVBE/CMOVNA reg16/32, mem16/32
    CMOVE/CMOVZ reg16/.
Page 222 (DirectPath Instructions):
    JNO short disp8
    JB/JNAE short disp8
    JNB/JAE short disp8
    JZ/JE short disp8
    JNZ/JNE s.
Page 223 (DirectPath Instructions):
    MOV mem16/32, imm16/32
    MOVSX reg16/32, mreg8
    MOVSX reg16/32, mem8
    MOVSX reg32, mreg16
    MO.
Page 224 (DirectPath Instructions):
    ROL mreg8, CL
    ROL mem8, CL
    ROL mreg16/32, CL
    ROL mem16/32, CL
    ROR mreg8, imm8
    ROR mem8, im.
Page 225 (DirectPath Instructions):
    SETL/SETNGE mreg8
    SETL/SETNGE mem8
    SETGE/SETNL mreg8
    SETGE/SETNL mem8
    SETLE/SETNG .
Page 226 (DirectPath Instructions):
    XOR reg16/32, mem16/32
    XOR AL, imm8
    XOR EAX, imm16/32
    XOR mreg8, imm8
    XOR mem8, imm8
    XOR mreg16/32, imm16/32
    XOR mem16/32, imm16/32
    XOR mreg16/32, imm8 (sign extended)
    XOR mem16/32, imm8 (sign extended)
    Table 25.
Page 227 (DirectPath Instructions): Table 26. DirectPath MMX™ Instructions. Instruction Mnemonic:
    EMMS
    MOVD mmreg, mem32
    MO.
Page 228 (DirectPath Instructions):
    PSRLD mmreg, imm8
    PSRLQ mmreg1, mmreg2
    PSRLQ mmreg, mem64
    PSRLQ mmreg, imm8
    PSRLW.
Page 229 (DirectPath Instructions): Table 28. DirectPath Floating-Point Instructions. Instruction Mnemonic:
    FABS
    FADD ST, ST.
Page 230 (DirectPath Instructions):
    FSUB ST(i), ST
    FSUBP ST, ST(i)
    FSUBR [mem32real]
    FSUBR [mem64real]
    FSUBR ST, ST(i)
    FSUBR ST(i), ST
    FSUBRP ST(i), ST
    FTST
    FUCOM
    FUCOMP
    FUCOMPP
    FWAIT
    FXCH
    Table 28.
Page 231 (VectorPath Instructions): The following tables contain VectorPath instructions, .
Page 232 (VectorPath Instructions):
    DIV EAX, mem16/32
    ENTER
    IDIV mreg8
    IDIV mem8
    IDIV EAX, mreg16/32
    IDIV EAX, mem16/3.
Page 233 (VectorPath Instructions):
    MUL EAX, mem32
    OUT imm8, AL
    OUT imm8, AX
    OUT imm8, EAX
    OUT DX, AL
    OUT DX, AX
    OUT DX,.
Page 234 (VectorPath Instructions):
    STI
    STOSB mem8, AL
    STOSW mem16, AX
    STOSD mem32, EAX
    STR mreg16
    STR mem16
    SYSCALL
    SY.
Page 235 (VectorPath Instructions): Table 32. VectorPath Floating-Point Instructions. Instruction Mnemonic:
    F2XM1
    FBLD [mem80.
Page 237 (Index): Numerics. 3DNow!™ Instructions: 10, 107. 3DNow! and MMX™ Intra-Operand Swapping: 112. Clipping .
Page 238 (Index): Instruction Cache: 131. Control Unit .
Page 239 (Index): T. TBYTE Variables: 55. Trigonometric Instructions .