The most important thing to know about 65816 optimization is not to emulate other CPU architectures. Many games (the ones with a lot of slowdown) are written in a style where the programmer pretends that a couple of memory locations are 68000 registers or high-level local variables, and the accumulator is only there as an operand buffer. Please avoid this style of programming. It is the biggest cause of slowdown in games, and it is also the most time-consuming to fix.

Peephole optimization

A peephole optimization is the replacement of a very short sequence of instructions with an equivalent, improved sequence.

The following example was originally posted to the NESdev BBS. It starts with this 68000-oid code:

lda $00      ;; add $01 to $00
clc
adc $01
sta $00

lda $00      ;; add $02 to $00
clc
adc $02
sta $00

lda $1000    ;; move $1000 to zero page so it can run faster (this is sarcasm btw)
sta $03

lda $03      ;; add $00 to $03
clc
adc $00
sta $03

lda $03
sta $1000    ;; move $03 back to $1000

First remove loads after stores to the same address:

lda $00
clc
adc $01
sta $00      ;; remove lda $00 after store
clc
adc $02
sta $00
lda $1000    ;; move $1000 to $03 so it can run faster
sta $03      ;; remove lda $03 after store
clc
adc $00
sta $03
sta $1000

Then remove stores whose value is provably unused:

lda $00
clc
adc $01
clc          ;; remove unused sta $00
adc $02
sta $00
lda $1000    ;; remove unused sta $03
clc
adc $00
sta $03
sta $1000

Addition of this type is commutative (ram[$1000] + ram[$00] = ram[$00] + ram[$1000]), so the load and the add can swap operands:

lda $00
clc
adc $01
clc
adc $02
sta $00
lda $00      ;; group accesses to same address
clc
adc $1000
sta $03
sta $1000

This allows removing another load after a store:

lda $00
clc
adc $01
clc
adc $02
sta $00      ;; remove lda $00 after store
clc
adc $1000
sta $03
sta $1000

This version is provably equivalent to the original, and it is short enough that you can repeat the unused store analysis for $00 and $03 against the code that follows the snippet. If it turns out that nothing later reads them, you end up with perfectly idiomatic 6502-family assembly:

lda $00
clc
adc $01
clc
adc $02      ;; remove unused sta $00
clc
adc $1000    ;; remove unused sta $03
sta $1000

And half the instructions are gone: 8 remain out of the original 16.
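
To put a rough number on the saving, here is the final listing again with a cycle count on each instruction. This is only a sketch under stated assumptions: an 8-bit accumulator, a direct page register whose low byte is zero, and CPU cycles taken from the standard 65816 opcode timings, ignoring memory access speed. Under the same assumptions the original listing adds up to 47 cycles.

```asm
lda $00      ;; 3 cycles (assumes 8-bit accumulator, D low byte = $00)
clc          ;; 2 cycles
adc $01      ;; 3 cycles
clc          ;; 2 cycles
adc $02      ;; 3 cycles
clc          ;; 2 cycles
adc $1000    ;; 4 cycles
sta $1000    ;; 4 cycles
             ;; total: 23 cycles, down from 47
```

With a 16-bit accumulator every memory access costs one more cycle, which gives 28 cycles against 60, so the ratio stays about the same.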

Avoid Frequent Memory Moves to and from Direct Page Memory

You have probably heard that direct page is faster than normal memory, so you may be moving variables to and from direct page in order to speed up certain parts of your code. The problem is that more time gets wasted moving variables back and forth than you gain from using direct page memory.

A 16-bit memory move takes 9 cycles to get a variable into direct page and 9 cycles to get it back out, for 18 cycles total, while each direct page access only saves 1 cycle over absolute addressing. The round trip only pays off if the variable is accessed more than 18 times in between.
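
As a rough illustration, compare incrementing a 16-bit variable at $1000 by way of a direct page copy against incrementing it in place. The scratch address $10 is made up for this example, and the cycle counts assume native mode, a 16-bit accumulator, and a direct page register whose low byte is zero.

```asm
;; Round trip through direct page ($10 is a hypothetical scratch location)
lda $1000    ;; 5 cycles
sta $10      ;; 4 cycles (9 to copy in)
inc $10      ;; 7 cycles
lda $10      ;; 4 cycles
sta $1000    ;; 5 cycles (9 to copy out)
             ;; total: 25 cycles

;; Operate on the variable where it lives
inc $1000    ;; 8 cycles
```

The copy buys a 1 cycle saving on the inc (7 instead of 8) and pays 18 cycles for it.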