Optimizing C array lookups for the 6502
Most of the platforms in the IDE that are powerful enough to support a C compiler are Z80-based. The Z80 isn’t the easiest target for C, but at least it has a lot of registers.
While the 6502 only has three 8-bit registers, the 6502 has strengths the Z80 doesn’t, as we’ll see here.
Let’s compare the two dominant C compilers for these CPUs.
We’ll compile a function with the CC65 C compiler for the 6502, and then with the SDCC
(Small Device C Compiler) for the Z80.
We’ll call the function getValue()
:
int arr16[128];
int getValue(int index) {
return (arr16[index & 0x7f] & 0xff) >> 1;
}
This function just looks up a value in a array of 16-bit ints, truncates the result to 8 bits, then divides it by 2.
We’ll first compile it with the default flags for both compilers:
6502 VERSION (cc65) Z80 VERSION (sdcc)
jsr pushax push ix
ldy #$01 ld ix,#0
jsr ldaxysp add ix,sp
ldx #$00 ld bc, #_arr16+0
and #$7F ld l, 4 (ix)
jsr aslax1 res 7, l
clc ld h, #0x00
adc #<(_arr16) add hl, hl
tay add hl, bc
txa ld c, (hl)
adc #>(_arr16) ld h, #0x00
tax sra h
tya rr c
ldy #$01 ld l, c
jsr ldaxidx pop ix
ldx #$00
jsr asrax1
jmp L0004
L0004: jsr incsp2
rts
The 6502 version is a little more verbose here, compiling to 41 bytes vs. 31 bytes for Z80. One big difference: SDCC turns on most optimizations by default. That’s not fair, so let’s see what optimizations we can activate for CC65.
We’ll compile it again, but this time we’ll pass the -O
flag to CC65.
We’ll try to get the Z80 version even smaller by passing --opt-code-size
to SDCC:
6502 VERSION (cc65 -O) Z80 VERSION (sdcc --opt-code-size)
jsr pushax call ___sdcc_enter_ix
ldy #$00 ld bc, #_arr16+0
lda (sp),y ld l, 4 (ix)
asl a res 7, l
tay ld h, #0x00
lda _arr16,y add hl, hl
ldx #$00 add hl, bc
lsr a ld c, (hl)
jmp incsp2 ld h, #0x00
sra h
rr c
ld l, c
pop ix
ret
The 6502 version is a lot smaller, down to 18 bytes! The Z80 version is only a little smaller due to the replacement of the stack frame setup with a function call.
What happened here to make the 6502 version so small?
The pedigree of CC65 goes back to 1980 and the Small C compiler. As a result, its optimization methods depend highly on peephole optimization. This is a method of scanning code linearly to look for patterns, then rewriting those patterns to be more efficient.
One such pattern is one I recently added in my own branch. It looks for 16-bit array lookups of the form:
array16[index & 0x7f] // or index < 0x7f
CC65 defers most optimization until after code generation. The pseudocode before optimization looks more-or-less like this:
AX = index
AX = AX & 0x7f
AX = AX << 1
AX = AX + address(arr16)
AX = read16(AX)
AX = AX >> 1
return AX
This may not look like a lot of code, but AX
is a 16-bit value (actually
the AX register pair)
and manipulating 16-bit values is expensive on the 6502.
We need to find a way to reduce it to an 8-bit problem.
The key here is the index & 0x7f
mask.
This guarantees that the array index will be less than 128.
Each array element takes two bytes, so we only have to index up to 256 bytes
of data – enough for 8 bits.
Instead of computing a 16-bit offset for the array, we can use the 6502’s
indexed addressing mode, i.e. LDA address,Y
.
The pseudocode now looks like this:
// jsr pushax
// ldy #$00
A = index // lda (sp),y
A = A << 1 // asl a
Y = A // tay
A = arr16[Y] // lda _arr16,y
// ldx #$00
A = A >> 1 // lsr a
return A // jmp incsp2
Because we’re dealing mainly with 8-bit values, this code becomes much smaller and
faster. Note that we don’t even need the index & 0x7f
mask anymore, because
the asl a
instruction takes off that high bit for us.
Let’s compare the non-optimized 6502 output with the optimized version:
ORIGINAL VERSION OPTIMIZED VERSION
jsr pushax jsr pushax
ldy #$01 ldy #$00
jsr ldaxysp lda (sp),y
ldx #$00
and #$7F
jsr aslax1 asl a
clc
adc #<(_arr16)
tay
txa
adc #>(_arr16)
tax
tya
ldy #$01 tay
jsr ldaxidx lda _arr16,y
ldx #$00 ldx #$00
jsr asrax1 lsr a
jmp L0004
L0004: jsr incsp2 jmp incsp2
rts
The Z80 doesn’t have a equivalent addressing mode. To do a fast 8-bit table lookup, we have to align our table to a 256-byte page. Then we can stuff the index in the lower 8 bits of a 16-bit register pair. The result would look sort of like this:
ld h, #>(_arr16) ; H = MSB of table (constant)
ld a, 4 (ix) ; index parameter
sla a ; shift left
ld l, a ; L = LSB of table (computed)
ld a, (hl) ; look up 8-bit value in HL
The 6502 doesn’t require page alignment, so it has the edge here in
optimizing this case. The Z80 is better at other things, like dealing with
structs
.
Thanks for reading! If you want to play with the CC65 C compiler, you can do so right now in the 8bitworkshop IDE using the venerable Apple II platform (with maybe more coming in the future!)