The most important thing to know about 65816 optimization is not to emulate other CPU architectures. A lot of games (the ones with a lot of slowdown) are written in a style where the programmer pretends that a couple of memory locations are 68000 registers or high-level local variables, and the accumulator is only there as an operand buffer. Please avoid this style of programming. It is the biggest cause of slowdown in games, and it is also the most time-consuming to fix.
A peephole optimization is the replacement of a very short sequence of instructions with an equivalent, improved sequence.
The following example was originally posted to NESdev BBS. It starts with the following 68000-oid code:
lda $00 ;; add $01 to $00
clc
adc $01
sta $00
lda $00 ;; add $02 to $00
clc
adc $02
sta $00
lda $1000 ;; move $1000 to zero page so it can run faster (this is sarcasm btw)
sta $03
lda $03 ;; add $00 to $03
clc
adc $00
sta $03
lda $03
sta $1000 ;; move $03 back to $1000
First remove loads after stores to the same address:
lda $00
clc
adc $01
sta $00 ;; remove lda $00 after store
clc
adc $02
sta $00
lda $1000 ;; move $1000 to $03 so it can run faster
sta $03 ;; remove lda $03 after store
clc
adc $00
sta $03
sta $1000
Then remove stores whose value is provably unused:
lda $00
clc
adc $01
clc ;; remove unused sta $00
adc $02
sta $00
lda $1000
clc
adc $00
sta $03
sta $1000
Addition of this type is commutative (ram[$1000] + ram[$00] = ram[$00] + ram[$1000]):
lda $00
clc
adc $01
clc
adc $02
sta $00
lda $00 ;; group accesses to same address
clc
adc $1000
sta $03
sta $1000
Which allows removing another load after store:
lda $00
clc
adc $01
clc
adc $02
sta $00 ;; remove lda $00 after store
clc
adc $1000
sta $03
sta $1000
Thus this section of code is provably equivalent, yet small enough to repeat the unused-store analysis for $00 and $03 in the rest of the snippet. If it turns out they’re not needed, you end up with perfectly idiomatic 6502-family assembly:
lda $00
clc
adc $01
clc
adc $02 ;; remove unused sta $00
clc
adc $1000 ;; remove unused sta $03
sta $1000
And half the instructions are gone.
You have probably heard that the direct page is faster than normal memory, and so you’re moving stuff to and from direct page in order to speed up certain parts of your code. The problem is that more time gets wasted moving variables back and forth than you gain from using direct page memory.
For 16-bit memory moves, it takes 9 cycles to get a variable into direct page and 9 cycles to get it back out, for 18 cycles total, while each direct page access saves only 1 cycle per word.
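The cycle arithmetic can be sketched on the tail end of the earlier example. This is a rough illustration rather than a measurement: the counts assume a 16-bit accumulator and a direct page register whose low byte is zero, and they count CPU cycles only, ignoring differences in memory access speed.

```asm
;; Copying through direct page (assumed: 16-bit A, D low byte = $00)
lda $1000   ;; 5 cycles, absolute load
sta $03     ;; 4 cycles, direct page store  -> 9 cycles to copy in
lda $03     ;; 4 cycles
clc         ;; 2 cycles
adc $00     ;; 4 cycles
sta $03     ;; 4 cycles
lda $03     ;; 4 cycles, direct page load
sta $1000   ;; 5 cycles, absolute store     -> 9 cycles to copy out
            ;; total: 32 cycles

;; Working on $1000 in place
lda $1000   ;; 5 cycles
clc         ;; 2 cycles
adc $00     ;; 4 cycles
sta $1000   ;; 5 cycles
            ;; total: 16 cycles
```

Under these assumptions the round trip costs 18 cycles up front and each direct page access wins back 1, so the copy only pays off if the variable is accessed more than 18 times before it is written back.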