Optimizing C array lookups for the 6502

Most of the platforms in the IDE that are powerful enough to support a C compiler are Z80-based. The Z80 isn’t the easiest target for C, but at least it has a lot of registers.

While the 6502 only has three 8-bit registers, the 6502 has strengths the Z80 doesn’t, as we’ll see here.

Let’s compare the two dominant C compilers for these CPUs. We’ll compile a function with the CC65 C compiler for the 6502, and then with the SDCC (Small Device C Compiler) for the Z80. We’ll call the function getValue():

int arr16[128];

int getValue(int index) {
    return (arr16[index & 0x7f] & 0xff) >> 1;
}

This function just looks up a value in a array of 16-bit ints, truncates the result to 8 bits, then divides it by 2.

We’ll first compile it with the default flags for both compilers:

	6502 VERSION (cc65)		Z80 VERSION (sdcc)

        jsr     pushax		        push    ix
        ldy     #$01                    ld      ix,#0
        jsr     ldaxysp                 add     ix,sp
        ldx     #$00                    ld      bc, #_arr16+0
        and     #$7F                    ld      l, 4 (ix)
        jsr     aslax1                  res     7, l
        clc                             ld      h, #0x00
        adc     #<(_arr16)              add     hl, hl
        tay                             add     hl, bc
        txa                             ld      c, (hl)
        adc     #>(_arr16)              ld      h, #0x00
        tax                             sra     h
        tya                             rr      c
        ldy     #$01                    ld      l, c
        jsr     ldaxidx                 pop     ix
        ldx     #$00
        jsr     asrax1
        jmp     L0004
L0004:  jsr     incsp2
        rts

The 6502 version is a little more verbose here, compiling to 41 bytes vs. 31 bytes for Z80. One big difference: SDCC turns on most optimizations by default. That’s not fair, so let’s see what optimizations we can activate for CC65.

We’ll compile it again, but this time we’ll pass the -O flag to CC65. We’ll try to get the Z80 version even smaller by passing --opt-code-size to SDCC:

        6502 VERSION (cc65 -O)          Z80 VERSION (sdcc --opt-code-size)

        jsr     pushax		        call    ___sdcc_enter_ix
        ldy     #$00                    ld      bc, #_arr16+0        
        lda     (sp),y                  ld      l, 4 (ix)            
        asl     a                       res     7, l                 
        tay                             ld      h, #0x00             
        lda     _arr16,y                add     hl, hl               
        ldx     #$00                    add     hl, bc               
        lsr     a                       ld      c, (hl)              
        jmp     incsp2                  ld      h, #0x00             
                                        sra     h                    
                                        rr      c                    
                                        ld      l, c                 
                                        pop     ix                   
                                        ret                          

The 6502 version is a lot smaller, down to 18 bytes! The Z80 version is only a little smaller due to the replacement of the stack frame setup with a function call.

What happened here to make the 6502 version so small?

The pedigree of CC65 goes back to 1980 and the Small C compiler. As a result, its optimization methods depend highly on peephole optimization. This is a method of scanning code linearly to look for patterns, then rewriting those patterns to be more efficient.

One such pattern is one I recently added in my own branch. It looks for 16-bit array lookups of the form:

array16[index & 0x7f]	// or index < 0x7f

CC65 defers most optimization until after code generation. The pseudocode before optimization looks more-or-less like this:

AX = index
AX = AX & 0x7f
AX = AX << 1
AX = AX + address(arr16)
AX = read16(AX)
AX = AX >> 1
return AX

This may not look like a lot of code, but AX is a 16-bit value (actually the AX register pair) and manipulating 16-bit values is expensive on the 6502. We need to find a way to reduce it to an 8-bit problem.

The key here is the index & 0x7f mask. This guarantees that the array index will be less than 128. Each array element takes two bytes, so we only have to index up to 256 bytes of data – enough for 8 bits.

Instead of computing a 16-bit offset for the array, we can use the 6502’s indexed addressing mode, i.e. LDA address,Y. The pseudocode now looks like this:

		// jsr pushax
		// ldy #$00
A = index	// lda (sp),y
A = A << 1	// asl a
Y = A		// tay
A = arr16[Y]	// lda _arr16,y
		// ldx #$00
A = A >> 1	// lsr a
return A	// jmp incsp2

Because we’re dealing mainly with 8-bit values, this code becomes much smaller and faster. Note that we don’t even need the index & 0x7f mask anymore, because the asl a instruction takes off that high bit for us.

Let’s compare the non-optimized 6502 output with the optimized version:

        ORIGINAL VERSION                OPTIMIZED VERSION

        jsr     pushax		        jsr     pushax
        ldy     #$01			ldy     #$00
        jsr     ldaxysp		        lda     (sp),y
        ldx     #$00
        and     #$7F
        jsr     aslax1		        asl     a
        clc
        adc     #<(_arr16)
        tay
        txa
        adc     #>(_arr16)
        tax
        tya
        ldy     #$01		        tay
        jsr     ldaxidx		        lda     _arr16,y
        ldx     #$00		        ldx     #$00
        jsr     asrax1		        lsr     a
        jmp     L0004
L0004:  jsr     incsp2		        jmp     incsp2
        rts

The Z80 doesn’t have a equivalent addressing mode. To do a fast 8-bit table lookup, we have to align our table to a 256-byte page. Then we can stuff the index in the lower 8 bits of a 16-bit register pair. The result would look sort of like this:

        ld      h, #>(_arr16)    ; H = MSB of table (constant)
        ld      a, 4 (ix)        ; index parameter
        sla     a                ; shift left
        ld      l, a		 ; L = LSB of table (computed)
        ld      a, (hl)          ; look up 8-bit value in HL

The 6502 doesn’t require page alignment, so it has the edge here in optimizing this case. The Z80 is better at other things, like dealing with structs.

Thanks for reading! If you want to play with the CC65 C compiler, you can do so right now in the 8bitworkshop IDE using the venerable Apple II platform (with maybe more coming in the future!)