The most important thing to know about 65816 optimization is not to emulate other CPU architectures. A lot of games (the ones with a lot of slowdown) are written in a style where the programmer pretends that a couple of memory locations are 68000 registers or high-level local variables, and the accumulator is only there as an operand buffer. Please avoid this style of programming. It is the biggest cause of slowdown in games, and is also the most time-consuming to fix.

## Peephole optimization

A peephole optimization is the replacement of a very short sequence of instructions with an equivalent, improved sequence.

The example below was originally posted to the NESdev BBS. It starts with this 68000-oid code:

```
lda $00      ;; add $01 to $00
clc
adc $01
sta $00

lda $00      ;; add $02 to $00
clc
adc $02
sta $00

lda $1000    ;; move $1000 to zero page so it can run faster (this is sarcasm btw)
sta $03

lda $03      ;; add $00 to $03
clc
adc $00
sta $03

lda $03
sta $1000    ;; move $03 back to $1000
```

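First, remove each load that immediately follows a store to the same address, since the accumulator already holds that value:

```
lda $00
clc
adc $01
sta $00      ;; remove lda $00 after store
clc
adc $02
sta $00
lda $1000    ;; move $1000 to $03 so it can run faster
sta $03      ;; remove lda $03 after store
clc
adc $00
sta $03
sta $1000
```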
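The rule used in this step is mechanical enough to automate. The Python sketch below is only an illustration. It models a program as a list of `(mnemonic, operand)` pairs, a representation assumed here rather than taken from the original post, and the function name `drop_reload_after_store` is hypothetical. It also assumes the `adc` operands named in the listing's comments, that every address is ordinary RAM, and that nothing tests the N/Z flags the removed `lda` would have set. Those last two assumptions do not hold for hardware registers or for a load followed directly by a branch.

```python
def drop_reload_after_store(program):
    """Remove `lda X` when the instruction just before it is `sta X`.

    Assumes plain RAM operands and that the flags set by the dropped
    `lda` are not tested before another instruction overwrites them.
    """
    out = []
    for op, operand in program:
        if op == "lda" and out and out[-1] == ("sta", operand):
            continue  # A still holds the value that was just stored
        out.append((op, operand))
    return out


# The starting listing, with the additions its comments describe (assumed).
BEFORE = [
    ("lda", "$00"), ("clc", None), ("adc", "$01"), ("sta", "$00"),
    ("lda", "$00"), ("clc", None), ("adc", "$02"), ("sta", "$00"),
    ("lda", "$1000"), ("sta", "$03"),
    ("lda", "$03"), ("clc", None), ("adc", "$00"), ("sta", "$03"),
    ("lda", "$03"), ("sta", "$1000"),
]

AFTER = drop_reload_after_store(BEFORE)
```

Running the pass once over `BEFORE` drops the three reloads (`lda $00` once, `lda $03` twice) and leaves every other instruction in place.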

Then remove stores whose value is provably unused:

```
lda $00
clc
adc $01
clc          ;; remove unused sta $00
adc $02
sta $00
lda $1000
clc
adc $00
sta $03
sta $1000
```

Addition of this type is commutative (`ram[$1000] + ram[$00] = ram[$00] + ram[$1000]`), so the two operands can be swapped:

```
lda $00
clc
adc $01
clc
adc $02
sta $00
lda $00      ;; group accesses to same address
clc
adc $1000
sta $03
sta $1000
```

Which allows removing another load after store:

```
lda $00
clc
adc $01
clc
adc $02
sta $00      ;; remove lda $00 after store
clc
adc $1000
sta $03
sta $1000
```
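One way to gain confidence in a chain of rewrites like this is to run both versions on a toy model and compare the results. The sketch below is not a 65816 emulator. It assumes an 8-bit accumulator in binary (non-decimal) mode, supports only `lda`, `sta`, `clc` and `adc` with absolute or direct-page operands, and takes the `adc` operands from the comments in the starting listing. The helper names `assemble`, `run` and `equivalent` are made up for this illustration.

```python
from itertools import product

ORIGINAL = """
lda $00
clc
adc $01
sta $00
lda $00
clc
adc $02
sta $00
lda $1000
sta $03
lda $03
clc
adc $00
sta $03
lda $03
sta $1000
"""

OPTIMIZED = """
lda $00
clc
adc $01
clc
adc $02
sta $00
clc
adc $1000
sta $03
sta $1000
"""


def assemble(source):
    """Turn listing text into (mnemonic, address) pairs; ';' starts a comment."""
    program = []
    for line in source.strip().splitlines():
        parts = line.split(";")[0].split()
        if not parts:
            continue
        addr = int(parts[1].lstrip("$"), 16) if len(parts) > 1 else None
        program.append((parts[0], addr))
    return program


def run(program, ram):
    """Execute on a copy of ram; return (ram, accumulator, carry)."""
    ram = dict(ram)
    a, carry = 0, 0
    for op, addr in program:
        if op == "lda":
            a = ram[addr]
        elif op == "sta":
            ram[addr] = a
        elif op == "clc":
            carry = 0
        elif op == "adc":
            total = a + ram[addr] + carry
            a, carry = total & 0xFF, total >> 8
        else:
            raise ValueError(f"unsupported instruction: {op}")
    return ram, a, carry


def equivalent(p, q, samples=(0x00, 0x01, 0x7F, 0x80, 0xFF)):
    """Compare final RAM, A and carry over a grid of edge-case inputs."""
    for v00, v01, v02, v1000 in product(samples, repeat=4):
        ram = {0x00: v00, 0x01: v01, 0x02: v02, 0x03: 0, 0x1000: v1000}
        if run(p, ram) != run(q, ram):
            return False
    return True
```

Calling `equivalent(assemble(ORIGINAL), assemble(OPTIMIZED))` compares the two listings over 625 input combinations. This is a spot check rather than a proof, but it catches mistakes such as dropping the second `clc`, which changes the result whenever the first `adc` leaves the carry set.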

Thus this section of code is provably equivalent to the original, yet small enough that you can repeat the unused-store analysis for `$00` and `$03` in the rest of the snippet. If it turns out those stores are not needed, you end up with perfectly idiomatic 6502-family assembly:

``````lda \$00
clc