Retargeting a C Compiler to 6502#

I’ve written before about cc65 and some of its optimization challenges. It’s a stable compiler with a robust toolchain, but its code generator doesn’t full advantage of the 6502 when performing 8-bit operations. It places function parameters and many local variables on a separate stack, which requires many calls to helper functions.

I’ve done a survey of 6502 high-level languages, and while many are intriguing, none have the stability and C compatiblity of cc65.

So what’s so hard about the 6502? What’s keeping us from just retargeting a modern C compiler, bringing its powerful optimization routines to 6502? Well, it’s complicated. Like, really complicated.

LLVM#

LLVM is the dominant compiler backend library, and there’s documentation on how to write a LLVM backend.

However, many of the instructions read like “draw two circles, then draw the rest of the damn owl”. LLVM is a freakin’ huge library, and backends tend to be complex. It wasn’t designed for 8-bit systems like the 6502 where stack access is cumbersome and registers are scarce.

SDCC#

But what if there were something already in the 8bitworkshop pipeline, right under our nose? Yes, it’s SDCC, the Small Device C Compiler, the same one we use for Z80! What if we could retarget it to the 6502?

The benefit of SDCC is that it already supports a handful of microcontrollers more constrained than the Z80. The Motorola/Freescale HC08 is similar to the 6502 in some ways. It has an 8-bit accumulator and two 8-bit index registers. It also differentiates between direct (0-255, or zero page) and extended (0-65535, or absolute) addressing modes.

Might this be a good start for our own 6502 backend? Let’s find out.

The Test Program#

We’ll use this simple C program to compare compilers:

#include <stdint.h>
#include <stdbool.h>

bool get_pixel(uint_fast8_t x, uint_fast8_t y);
void set_pixel(uint_fast8_t x, uint_fast8_t y);

void fill_line_left(uint_fast8_t x, const uint_fast8_t y) {
        for(;; x--)	{
                if(get_pixel(x, y))
                        return;
                set_pixel(x, y);
        }
}

Caveat: This test program is from a research paper about register allocation written by SDCC’s maintainer, so we might be cherry-picking a bit.

The CC65 Version#

Let’s see how CC65 compiles this:

.proc	_fill_line_left: near
        jsr     pusha
        ldy     #$01
L000E:  lda     (sp),y
        jsr     pusha
        ldy     #$01
        lda     (sp),y
        jsr     _get_pixel
        tax
        bne     L0003
        ldy     #$01
        lda     (sp),y
        jsr     pusha
        ldy     #$01
        lda     (sp),y
        jsr     _set_pixel
        ldy     #$01
        lda     (sp),y
        sec
        sbc     #$01
        sta     (sp),y
        jmp     L000E
L0003:	jmp     incsp2

CC65 stores its parameters on the stack, requiring an expensive indirect (sp),y instruction. Note the jsr pusha subroutine calls, which take about 40 cycles each, used to push parameters on the stack for get_pixel and set_pixel function calls.

The HC08 Target#

HC08 is the backend on which we’ll base the 6502 backend, so let’s see how it does:

_fill_line_left:
        stx	_fill_line_left_y_1_3
00104$:
        psha                            ;fill.c:9: if(get_pixel(x, y))
        ldx	_fill_line_left_y_1_3
        jsr	_get_pixel
        tax
        pula
        tstx
        beq	00102$
        rts                             ;fill.c:10: return;
00102$:
        psha                            ;fill.c:11: set_pixel(x, y);
        ldx	_fill_line_left_y_1_3
        jsr	_set_pixel
        pula                            ;fill.c:8: for(;; x--)	{
        deca
        bra	00104$
        rts

It’s a lot smaller than the CC65 version, and has no indirect loads or parameter pushes. This backend passes two 8-bit parameters in the A and X registers – additional parameters are generally passed in static locations.

It also looks pretty similar to 6502 code, doesn’t it? One might be tempted to convert this directly to 6502 without mucking about in the backend.

The Peephole Assembly Code Translator#

SDCC has a peephole optimizer that we can configure to translate to 6502 code. For example, HC08’s TST instruction can be converted to 6502’s BIT instruction:

        tst	%0
=
        BIT	%0

Some HC08 instructions aren’t directly mappable. For example, indirect load of X requires a zero-page pointer:

        lda	,x
=
        STX	ptr
        STY	ptr+1
        LDY	#0
        LDA	(ptr),Y
        PHP
        LDY	ptr+1
        PLP

Going down the rabbit hole, we’ll discover that this approach isn’t workable. HC08 isn’t exactly like 6502:

  • Not little-endian

  • 16-bit index register (HX) vs two 8-bit index registers (X and Y)

  • Different flag behavior

  • Additional instructions and addressing modes

So now we’ll go mucking around in the compiler source. Hacking the HC08 backend to produce 6502 code is easy, right?

  • Figure out what it is doing

  • Make it do the other thing

Really Doing a Compiler Backend#

Turns out there is a lot involved in “making it do the other thing”, e.g. emitting working 6502 code:

  • Change big-endian to little-endian.

  • Modify the register allocator to allocate both X and Y properly.

  • Make sure to pop/push/transfer registers without clobbering other registers.

  • Various optimizations:

    • Use the absolute X/Y indexing modes when possible.

    • Prefer 16-bit values and pointers to be in zero-page addresses.

    • Implement bit-shifting logic when multiplying/dividing by powers of two.

  • Create assembler and linker tools.

  • Port the standard SDCC library to 6502.

  • Create a 6502 simulator and test suite.

  • Fix the broken tests.

After all that and more, we finally get output from the experimental SDCC 6502 backend:

_fill_line_left:
        stx	_fill_line_left_y_65536_3
        pha
        ldx	_fill_line_left_y_65536_3
        jsr	_get_pixel
        tax
        pla
        cpx	#0x00
        beq	00102$
        rts
00102$:
        pha
        ldx	_fill_line_left_y_65536_3
        jsr	_set_pixel
        pla
        sec
        sbc	#0x01
        jmp	00104$
        rts

Yikes#

What kind of obstacles did we run into?

  • The code base is 20 years old, and the HC08 backend was hacked up by someone other than the author. The maintainer later came back and shoehorned a new, fancy register allocator in C++. The whole thing is hard for mere mortals to understand.

  • There are lots of special cases to fix – multibyte operations especially.

  • There’s a test suite, we had to fix many of the examples, as well as bring up a 6502 command-line simulator.

  • Registers can be free, in use, or dead (free after the current operation).

  • Pointers required rearranging much of the code, since the HC08 uses a register for indirect access, and the 6502 uses zero-page.

  • Had to transfer the S -> X register for stack access.

  • Had to make special case for absolute indexing (aaaa,x or aaaa,y)

  • Some instructions have addressing limitations, like BIT and shift instructions.

  • PLA modifies the flags in 6502 land, but not the HC08.

  • Register allocation is a big complex thing that I’m afraid to touch, and when I do ASSERTs are thrown. It actually does a “dry run” compiling multiple times to figure out the optimal register allocation.

  • Some registers had to be “pushed” onto a temporary zero-page stack, since vanilla 6502 can’t push/pop X or Y.

  • Needed to keep a pointer to the stack frame in zero page when there’s stuff on the stack.

  • Memory allocation (calloc of asmops, random freeing of objs)

  • Regression tests only expose one bug on average, so it goes very slow. There’s one test case per 14,000 lines of code. Had to use my own 6502 simulator to find some bugs.

  • Longlong routines yuk (found bug in longlong shift functions)

  • AST operations take lots of arguments (e.g. a == b << c) so tends to be a lot of logic to test.

  • Endless bugs from transferring registers in the wrong order

Most of the regression tests pass, but even after heroic efforts fixing things, many of them still fail. (“Debugging something into existence” is a phrase that keeps popping into my head.)

If I started from scratch, I’d think about writing a 6502 code generation layer before touching the SDCC backend, that would:

  • Track register liveness

  • Track bitmasks/ranges of values

  • Track equivalences (register == constant, etc)

  • Choose stack vs. memory for temporaries

  • Handle multiple register transfers (e.g. X -> A -> Y)

  • Convert pointers to absolute indexing

  • Handle 6502 vs 65C02 instructions

  • Convert long branches to JMP

Try it Out#

You can download the experimental fork of SDCC with 6502 support here:

https://github.com/sehugg/sdcc-m6502

Look at the manual (PDF) for build instructions.