The most important rule of 65816 optimization is not to emulate other CPU architectures. Many games (notably the ones with a lot of slowdown) are written in a style where the programmer pretends that a couple of memory locations are 68000 registers or high-level local variables, and the accumulator serves only as an operand buffer. Please avoid this style of programming. It is the biggest cause of slowdown in games, and it is also the most time-consuming to fix.

Peephole Optimization

A peephole optimization is replacement of a very short sequence of instructions with an equivalent improved sequence.

The following example was originally posted to NESdev BBS. It starts with the following 68000-oid code:

LDA $00      ;; add $01 to $00
CLC
ADC $01
STA $00

LDA $00      ;; add $02 to $00
CLC
ADC $02
STA $00

LDA $1000    ;; move $1000 to zero page so it can run faster (this is sarcasm btw)
STA $03

LDA $03      ;; add $00 to $03
CLC
ADC $00
STA $03

LDA $03
STA $1000    ;; move $03 back to $1000

First remove loads after stores to the same address:

LDA $00
CLC
ADC $01
STA $00      ;; remove LDA $00 after store
CLC
ADC $02
STA $00
LDA $1000    ;; move $1000 to $03 so it can run faster
STA $03      ;; remove LDA $03 after store
CLC
ADC $00
STA $03
STA $1000

Then remove stores whose value is provably unused:

LDA $00
CLC
ADC $01
CLC          ;; remove unused STA $00
ADC $02
STA $00
LDA $1000
CLC
ADC $00
STA $03
STA $1000

Addition of this type is commutative (ram[$1000] + ram[$00] = ram[$00] + ram[$1000]):

LDA $00
CLC
ADC $01
CLC
ADC $02
STA $00
LDA $00      ;; group accesses to same address
CLC
ADC $1000
STA $03
STA $1000

Which allows removing another load after store:

LDA $00
CLC
ADC $01
CLC
ADC $02
STA $00      ;; remove LDA $00 after store
CLC
ADC $1000
STA $03
STA $1000

The code is still provably equivalent to the original, and it is now short enough to repeat the unused-store analysis for $00 and $03 against the rest of the program. If it turns out that neither value is needed later, you end up with perfectly idiomatic 6502-family assembly:

LDA $00
CLC
ADC $01
CLC
ADC $02      ;; remove unused STA $00
CLC
ADC $1000    ;; remove unused STA $03
STA $1000

And half the instructions are gone.
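
To put a number on the saving, here is the final listing again with per-instruction cycle counts. This is a rough tally that assumes an 8-bit accumulator, a Direct Page register whose low byte is $00, and no distinction between slow and fast memory (these are assumptions, not something the original post states):

```
LDA $00      ;; 3 cycles
CLC          ;; 2 cycles
ADC $01      ;; 3 cycles
CLC          ;; 2 cycles
ADC $02      ;; 3 cycles
CLC          ;; 2 cycles
ADC $1000    ;; 4 cycles
STA $1000    ;; 4 cycles, 23 in total
```

Under the same assumptions the original 16 instructions take 47 cycles, so the cycle count drops by about half as well.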

Memory Hierarchy

Knowing which parts of memory are fastest is important. Two factors determine this: address size and memory latency.

Address size is the number of bits an absolute address takes up. There are 3 sizes: 8-bit, 16-bit and 24-bit addresses. When the 65816's accumulator is in 8-bit mode:

  • LDA $xx takes 3 cycles
  • LDA $xxxx takes 4 cycles
  • LDA $xxxxxx takes 5 cycles

8-bit addressing (called "Direct Page addressing") normally covers the first 256 bytes of memory, $00:0000-$00:00FF, but it can be moved around using the Direct Page register. By doing:

PEA $2100        ;; important, these 2 instructions aren't ALWAYS the fastest way of changing the DP register.  Optimization depends on context.
PLD

...you can set the direct page to $00:2100-$00:21FF, which allows fast PPU writing.
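
As a minimal sketch of what that looks like in practice, the following writes two PPU registers through the direct page. It assumes an 8-bit accumulator and an assembler that picks direct page addressing for one-byte operands. The register values are only illustrative:

```
PHD              ;; save the current Direct Page register
PEA $2100
PLD              ;; direct page = $00:2100-$00:21FF
LDA #$0F
STA $00          ;; $2100 (INIDISP): screen on, full brightness
LDA #$01
STA $05          ;; $2105 (BGMODE): mode 1
PLD              ;; restore the previous Direct Page register
```

Each STA here is a 2-byte, 3-cycle instruction instead of the 3-byte, 4-cycle absolute form, so this only pays off when enough writes follow to cover the cost of changing the register.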

You can even point the Direct Page register at object slots, so you can have more moving sprite objects on-screen without lag. The downside is that the direct page is confined to the first 64KiB of the 65816's address space (bank $00), even when the data bank register (explained below) is set to another bank. On the Super Nintendo, bank $00 contains 8KiB of RAM, the system registers, and 32KiB of cartridge ROM. If you use direct page memory for object slots, you have to balance the number of objects against the amount of RAM per object so that everything fits within 8KiB. The more objects you use, the less memory you have per object.
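
A sketch of the object-slot idea follows. The slot layout is entirely hypothetical (16 slots of 32 bytes starting at $00:0400, with a 16-bit X position at offset $00 and a 16-bit X velocity at offset $02), and the code assumes a 16-bit accumulator:

```
REP #$20         ;; 16-bit accumulator
LDA #$0400       ;; hypothetical base of the first object slot
next_object:
TCD              ;; direct page = this object's slot
LDA $00          ;; X position (hypothetical field)
CLC
ADC $02          ;; X velocity (hypothetical field)
STA $00
TDC              ;; A = base of the current slot
CLC
ADC #$0020       ;; hypothetical slot size of 32 bytes
CMP #$0600       ;; end of the 16 slots ($0400 + 16 * $20)
BCC next_object
LDA #$0000
TCD              ;; put the direct page back at $0000
```

The same code handles every object without using an index register. One caveat: the 65816 adds a cycle to every direct page access when the low byte of the Direct Page register is not $00, so slots that are not 256-byte aligned give up the cycle advantage while keeping the shorter instructions and the free index registers.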

This limitation can be turned into a strength, because you have fast access to 2 banks at the same time: bank $00 through the direct page, and another bank through the data bank register. That said, Nintendo was smart enough to mirror the same 8KiB of RAM at $0000-$1FFF in every bank from $00-$3F and $80-$BF, as well as in bank $7E.

As shown above, the Direct Page register can be changed. Two instructions load it: "PLD" and "TCD". PLD pulls DP from the stack, whereas TCD copies the 16-bit accumulator into DP. The following sequence is faster than PEA/PLD, but it only works if the accumulator is already in 16-bit mode:

LDA #$2100
TCD

The previous method works for both 8-bit and 16-bit mode, but this method is recommended when in 16-bit mode, because it is faster.
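
For reference, the two sequences compare as follows. The cycle counts are the standard 65816 figures and ignore slow versus fast memory. Note also that the faster version overwrites the accumulator, while PEA/PLD leaves it untouched:

```
PEA $2100        ;; 5 cycles
PLD              ;; 5 cycles, 10 in total, works in any accumulator mode

LDA #$2100       ;; 3 cycles (16-bit accumulator required)
TCD              ;; 2 cycles, 5 in total, but A is overwritten
```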

16-bit absolute addressing is slightly slower than direct page addressing, but still very flexible. Unlike direct page addressing, 16-bit absolute addressing can reach any bank. The Data Bank register plays a role similar to the Direct Page register: it selects where 16-bit addresses point, just as the Direct Page register selects where 8-bit addresses point. The difference is that the direct page can be offset by individual bytes, whereas the data bank can only be offset in whole banks of 64KiB. The only instruction that loads the Data Bank register directly is "PLB", which pulls the data bank "B" from the stack (the block move instructions MVN and MVP also leave it set to the destination bank as a side effect). PLB always pulls a single byte, which makes it tricky to change banks while in 16-bit accumulator mode, because PHA then pushes 2 bytes and "PEA $xxxx" only works with 16-bit values. My advice is to stick to the same bank as much as possible, and only change it in the specific cases where doing so speeds up a routine. I typically stay in bank $80, and occasionally switch to $7E or $7F when a routine needs heavy RAM access.
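
A few common ways of loading the Data Bank register are sketched below. The bank numbers are only examples:

```
;; 8-bit accumulator: push one byte, pull one byte
LDA #$7E
PHA
PLB              ;; data bank = $7E

;; any accumulator size: PEA pushes 2 bytes, so pull twice
PEA $7E7E
PLB
PLB              ;; data bank = $7E, stack is balanced again

;; make the data bank match the bank the code is running in
PHK
PLB
```

The PEA form uses the same value in both bytes so that it does not matter which byte is pulled last, and it leaves the accumulator untouched.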

24-bit long addressing is the slowest and least flexible of the 3. The advantage is that you can access out-of-bank memory without fiddling with the data bank register. The disadvantages are that it is slightly slower, that several instructions lack it, and that index registers are only 16 bits wide. There is no "long,Y" addressing mode, and long addressing isn't available for LDY, STY, LDX, STX, ASL, LSR, ROL, ROR, INC, DEC, BIT, TSB and TRB. Even though "long,Y" isn't available, "[dp],Y" is, and it can be used as a substitute. In some cases you can use the data bank register and the Y index register together as a pseudo 24-bit address register.
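
A minimal sketch of the "[dp],Y" substitute follows. The pointer location ($10-$12) and the target address ($7F:2000) are made up for illustration, and the code assumes the assembler has been told about the register sizes selected by REP/SEP:

```
REP #$30         ;; 16-bit accumulator and index registers
LDA #$2000
STA $10          ;; pointer bytes 0-1: address within the bank
SEP #$20         ;; 8-bit accumulator, index registers stay 16-bit
LDA #$7F
STA $12          ;; pointer byte 2: bank
LDY #$0004
LDA [$10],Y      ;; reads $7F:2004, whatever the data bank register holds
```

Once the pointer is set up, it can be reused for any number of accesses with different Y values.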

The last aspect of "memory speed" to explain is memory latency. On the Super Nintendo, CPU cycles come in 2 lengths: "slow cycles" that last 8 master cycles, and "fast cycles" that last only 6 master cycles. "Dummy" cycles are always fast, S-PPU and system registers are almost always fast, internal WRAM is always slow, and cartridge ROM can be either fast or slow depending on the ROM chips. A system register controls the ROM access speed. If it is set, the second half of ROM, in banks $80-$FF, is accessed at fast speed. For some reason banks $00-$7D always remain slow. Amusingly, Contra III sets the register to fast ROM but forgets to jump to banks $80-$FF. If you can afford fast ROM (which you can, because it is 2018), using immediates and unrolled code can give you a speed boost. Adding some cartridge SRAM gives an even bigger speed boost, because you can then use self-modifying code. This is why I prefer using $80 as my data bank setting: not only is it in the fast ROM area, it is also almost always a perfect mirror of bank $00, which, as explained above, is a very important bank.
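
As a sketch of enabling fast ROM: on the SNES the register in question is $420D (commonly called MEMSEL), where bit 0 selects fast access. The label fast_entry is hypothetical and stands for code assembled to run from bank $80 or above. The snippet also assumes the data bank is one where $420D reaches the system registers ($00-$3F or $80-$BF):

```
SEP #$20         ;; 8-bit accumulator
LDA #$01
STA $420D        ;; bit 0 set: banks $80-$FF use fast ROM timing
JML fast_entry   ;; continue in bank $80+, otherwise execution stays in slow ROM
```

Without the long jump the program keeps running from the slow mirror, which is the mistake described above.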

Avoid Unnecessary Memory Moves To And From Direct Page

A mistake a lot of people make is thinking that you optimize code with direct page memory the same way you optimize code for a CPU with a lot of registers, such as the 68000. Both are used to speed up code, but with very different strategies. With the 68000's registers, you swap variables in and out as often as necessary to make them fit in 16 registers at once. The 65816's direct page is much bigger than the 68000 register set, so it's better to keep variables there for longer periods of time, or permanently. The location of the direct page is also movable with the Direct Page register, so you can have multiple pools of direct page memory and choose between them depending on which set of variables is currently in use. Finally, the cost of absolute addressing on the 68000 is MUCH higher than the cost of absolute addressing on the 65816.

65816 cycle counts for 8-bit and 16-bit accesses:

  • Immediate addressing: 2 and 3 cycles
  • Direct Page addressing: 3 and 4 cycles
  • Absolute addressing: 4 and 5 cycles
  • Long absolute addressing: 5 and 6 cycles

68000 cycle counts for 8-bit and 16-bit accesses:

  • Immediate addressing: 8 cycles
  • Register Direct: 4 cycles
  • Short absolute addressing: 12 cycles
  • Long absolute addressing: 16 cycles

The reason register allocation is such an important part of 68000 optimization is that accessing memory is 3x slower than accessing a register, so the benefit of using registers is big enough to overcome the overhead of loading and storing them. Direct page memory on the 65816 only saves a cycle per access compared with absolute addressing. Since these are memory-to-memory moves, you need 2 instructions to move a value into the direct page and 2 instructions to move it back out (with the exception of PEI, which can move a value from the direct page onto the stack in one instruction; this is very situational, but can be used to pull off some pretty cool optimizations). With 16-bit accesses, this adds up to 18 cycles of overhead, which means that you must use that variable more than 18 times before you get any speed benefit. For the most part, keep every variable statically located, keep the most frequently used variables in direct page memory, move the Direct Page register when you need to focus on a separate set of variables, and only move variables in and out of direct page memory when it's absolutely worth doing.
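
The 18-cycle figure can be reconstructed as follows, assuming 16-bit accesses and the cycle counts listed above (the addresses are arbitrary):

```
LDA $1234        ;; 5 cycles, absolute load
STA $10          ;; 4 cycles, direct page store (copy in: 9)
;; ... every access to $10 here saves 1 cycle over accessing $1234 ...
LDA $10          ;; 4 cycles, direct page load
STA $1234        ;; 5 cycles, absolute store (copy out: 9, overhead: 18)
```

With an 8-bit accumulator the same round trip costs 4 + 3 + 3 + 4 = 14 cycles, so the break-even point is lower but the conclusion is the same.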