# Optimizing C array lookups for the 6502 Most of the platforms in the IDE that are powerful enough to support a C compiler are Z80-based. The Z80 isn't the easiest target for C, but at least it has a lot of registers. While the 6502 only has three 8-bit registers, the 6502 has strengths the Z80 doesn't, as we'll see here. Let's compare the two dominant C compilers for these CPUs. We'll compile a function with the CC65 C compiler for the 6502, and then with the SDCC (Small Device C Compiler) for the Z80. We'll call the function `getValue()`: ```c int arr16[128]; int getValue(int index) { return (arr16[index & 0x7f] & 0xff) >> 1; } ``` This function just looks up a value in a array of 16-bit ints, truncates the result to 8 bits, then divides it by 2. We'll first compile it with the default flags for both compilers: ```asm 6502 VERSION (cc65) Z80 VERSION (sdcc) jsr pushax push ix ldy #$01 ld ix,#0 jsr ldaxysp add ix,sp ldx #$00 ld bc, #_arr16+0 and #$7F ld l, 4 (ix) jsr aslax1 res 7, l clc ld h, #0x00 adc #<(_arr16) add hl, hl tay add hl, bc txa ld c, (hl) adc #>(_arr16) ld h, #0x00 tax sra h tya rr c ldy #$01 ld l, c jsr ldaxidx pop ix ldx #$00 jsr asrax1 jmp L0004 L0004: jsr incsp2 rts ``` The 6502 version is a little more verbose here, compiling to 41 bytes vs. 31 bytes for Z80. One big difference: SDCC turns on most optimizations by default. That's not fair, so let's see what optimizations we can activate for CC65. We'll compile it again, but this time we'll pass the `-O` flag to CC65. We'll try to get the Z80 version even smaller by passing `--opt-code-size` to SDCC: ```asm 6502 VERSION (cc65 -O) Z80 VERSION (sdcc --opt-code-size) jsr pushax call ___sdcc_enter_ix ldy #$00 ld bc, #_arr16+0 lda (sp),y ld l, 4 (ix) asl a res 7, l tay ld h, #0x00 lda _arr16,y add hl, hl ldx #$00 add hl, bc lsr a ld c, (hl) jmp incsp2 ld h, #0x00 sra h rr c ld l, c pop ix ret ``` The 6502 version is a lot smaller, down to 18 bytes! The Z80 version is only a little smaller due to the replacement of the stack frame setup with a function call. What happened here to make the 6502 version so small? The pedigree of CC65 goes back to 1980 and the [Small C](https://en.wikipedia.org/wiki/Small-C) compiler. As a result, its optimization methods depend highly on [peephole optimization](https://en.wikipedia.org/wiki/Peephole_optimization). This is a method of scanning code linearly to look for patterns, then rewriting those patterns to be more efficient. One such pattern is one I recently added in [my own branch](https://github.com/sehugg/cc65). It looks for 16-bit array lookups of the form: ```c array16[index & 0x7f] // or index < 0x7f ``` CC65 defers most optimization until after code generation. The pseudocode before optimization looks more-or-less like this: ```c AX = index AX = AX & 0x7f AX = AX << 1 AX = AX + address(arr16) AX = read16(AX) AX = AX >> 1 return AX ``` This may not look like a lot of code, but `AX` is a 16-bit value (actually the AX register pair) and manipulating 16-bit values is expensive on the 6502. We need to find a way to reduce it to an 8-bit problem. The key here is the `index & 0x7f` mask. This guarantees that the array index will be less than 128. Each array element takes two bytes, so we only have to index up to 256 bytes of data -- enough for 8 bits. Instead of computing a 16-bit offset for the array, we can use the 6502's *indexed addressing* mode, i.e. `LDA address,Y`. The pseudocode now looks like this: ```c // jsr pushax // ldy #$00 A = index // lda (sp),y A = A << 1 // asl a Y = A // tay A = arr16[Y] // lda _arr16,y // ldx #$00 A = A >> 1 // lsr a return A // jmp incsp2 ``` Because we're dealing mainly with 8-bit values, this code becomes much smaller and faster. Note that we don't even need the `index & 0x7f` mask anymore, because the `asl a` instruction takes off that high bit for us. Let's compare the non-optimized 6502 output with the optimized version: ```asm ORIGINAL VERSION OPTIMIZED VERSION jsr pushax jsr pushax ldy #$01 ldy #$00 jsr ldaxysp lda (sp),y ldx #$00 and #$7F jsr aslax1 asl a clc adc #<(_arr16) tay txa adc #>(_arr16) tax tya ldy #$01 tay jsr ldaxidx lda _arr16,y ldx #$00 ldx #$00 jsr asrax1 lsr a jmp L0004 L0004: jsr incsp2 jmp incsp2 rts ``` The Z80 doesn't have a equivalent addressing mode. To do a fast 8-bit table lookup, we have to align our table to a 256-byte page. Then we can stuff the index in the lower 8 bits of a 16-bit register pair. The result would look sort of like this: ```asm ld h, #>(_arr16) ; H = MSB of table (constant) ld a, 4 (ix) ; index parameter sla a ; shift left ld l, a ; L = LSB of table (computed) ld a, (hl) ; look up 8-bit value in HL ``` The 6502 doesn't require page alignment, so it has the edge here in optimizing this case. The Z80 is better at other things, like dealing with `structs`. *Thanks for reading! If you want to play with the CC65 C compiler, you can do so right now in the [8bitworkshop IDE](http://8bitworkshop.com/) using the venerable Apple II platform (with maybe more coming in the future!)*