Optimizing C array lookups for the 6502¶

Most of the platforms in the IDE that are powerful enough to support a C compiler are Z80-based. The Z80 isn’t the easiest target for C, but at least it has a lot of registers.

While the 6502 only has three 8-bit registers, the 6502 has strengths the Z80 doesn’t, as we’ll see here.

Let’s compare the two dominant C compilers for these CPUs. We’ll compile a function with the CC65 C compiler for the 6502, and then with the SDCC (Small Device C Compiler) for the Z80. We’ll call the function getValue():

int arr16[128];

int getValue(int index) {
return (arr16[index & 0x7f] & 0xff) >> 1;
}



This function just looks up a value in a array of 16-bit ints, truncates the result to 8 bits, then divides it by 2.

We’ll first compile it with the default flags for both compilers:

        6502 VERSION (cc65)		Z80 VERSION (sdcc)

jsr     pushax                  push    ix
ldy     #$01 ld ix,#0 jsr ldaxysp add ix,sp ldx #$00                    ld      bc, #_arr16+0
and     #$7F ld l, 4 (ix) jsr aslax1 res 7, l clc ld h, #0x00 adc #<(_arr16) add hl, hl tay add hl, bc txa ld c, (hl) adc #>(_arr16) ld h, #0x00 tax sra h tya rr c ldy #$01                    ld      l, c
jsr     ldaxidx                 pop     ix
ldx     #$00 jsr asrax1 jmp L0004 L0004: jsr incsp2 rts  The 6502 version is a little more verbose here, compiling to 41 bytes vs. 31 bytes for Z80. One big difference: SDCC turns on most optimizations by default. That’s not fair, so let’s see what optimizations we can activate for CC65. We’ll compile it again, but this time we’ll pass the -O flag to CC65. We’ll try to get the Z80 version even smaller by passing --opt-code-size to SDCC:  6502 VERSION (cc65 -O) Z80 VERSION (sdcc --opt-code-size) jsr pushax call ___sdcc_enter_ix ldy #$00                    ld      bc, #_arr16+0
lda     (sp),y                  ld      l, 4 (ix)
asl     a                       res     7, l
tay                             ld      h, #0x00
ldx     #$00 add hl, bc lsr a ld c, (hl) jmp incsp2 ld h, #0x00 sra h rr c ld l, c pop ix ret  The 6502 version is a lot smaller, down to 18 bytes! The Z80 version is only a little smaller due to the replacement of the stack frame setup with a function call. What happened here to make the 6502 version so small? The pedigree of CC65 goes back to 1980 and the Small C compiler. As a result, its optimization methods depend highly on peephole optimization. This is a method of scanning code linearly to look for patterns, then rewriting those patterns to be more efficient. One such pattern is one I recently added in my own branch. It looks for 16-bit array lookups of the form: array16[index & 0x7f] // or index < 0x7f  CC65 defers most optimization until after code generation. The pseudocode before optimization looks more-or-less like this: AX = index AX = AX & 0x7f AX = AX << 1 AX = AX + address(arr16) AX = read16(AX) AX = AX >> 1 return AX  This may not look like a lot of code, but AX is a 16-bit value (actually the AX register pair) and manipulating 16-bit values is expensive on the 6502. We need to find a way to reduce it to an 8-bit problem. The key here is the index & 0x7f mask. This guarantees that the array index will be less than 128. Each array element takes two bytes, so we only have to index up to 256 bytes of data – enough for 8 bits. Instead of computing a 16-bit offset for the array, we can use the 6502’s indexed addressing mode, i.e. LDA address,Y. The pseudocode now looks like this:  // jsr pushax // ldy #$00
A = index       // lda (sp),y
A = A << 1      // asl a
Y = A           // tay
A = arr16[Y]    // lda _arr16,y
// ldx #$00 A = A >> 1 // lsr a return A // jmp incsp2  Because we’re dealing mainly with 8-bit values, this code becomes much smaller and faster. Note that we don’t even need the index & 0x7f mask anymore, because the asl a instruction takes off that high bit for us. Let’s compare the non-optimized 6502 output with the optimized version:  ORIGINAL VERSION OPTIMIZED VERSION jsr pushax jsr pushax ldy #$01                    ldy     #$00 jsr ldaxysp lda (sp),y ldx #$00
and     #$7F jsr aslax1 asl a clc adc #<(_arr16) tay txa adc #>(_arr16) tax tya ldy #$01                    tay
jsr     ldaxidx                 lda     _arr16,y
ldx     #$00 ldx #$00
jsr     asrax1                  lsr     a
jmp     L0004
L0004:  jsr     incsp2                  jmp     incsp2
rts


The Z80 doesn’t have a equivalent addressing mode. To do a fast 8-bit table lookup, we have to align our table to a 256-byte page. Then we can stuff the index in the lower 8 bits of a 16-bit register pair. The result would look sort of like this:

        ld      h, #>(_arr16)    ; H = MSB of table (constant)
ld      a, 4 (ix)        ; index parameter
sla     a                ; shift left
ld      l, a             ; L = LSB of table (computed)
ld      a, (hl)          ; look up 8-bit value in HL


The 6502 doesn’t require page alignment, so it has the edge here in optimizing this case. The Z80 is better at other things, like dealing with structs.

Thanks for reading! If you want to play with the CC65 C compiler, you can do so right now in the 8bitworkshop IDE using the venerable Apple II platform (with maybe more coming in the future!)