Retargeting a C Compiler to 6502#
I’ve written before about cc65 and some of its optimization challenges. It’s a stable compiler with a robust toolchain, but its code generator doesn’t full advantage of the 6502 when performing 8-bit operations. It places function parameters and many local variables on a separate stack, which requires many calls to helper functions.
I’ve done a survey of 6502 high-level languages, and while many are intriguing, none have the stability and C compatiblity of cc65.
So what’s so hard about the 6502? What’s keeping us from just retargeting a modern C compiler, bringing its powerful optimization routines to 6502? Well, it’s complicated. Like, really complicated.
LLVM#
LLVM is the dominant compiler backend library, and there’s documentation on how to write a LLVM backend.
However, many of the instructions read like “draw two circles, then draw the rest of the damn owl”. LLVM is a freakin’ huge library, and backends tend to be complex. It wasn’t designed for 8-bit systems like the 6502 where stack access is cumbersome and registers are scarce.
SDCC#
But what if there were something already in the 8bitworkshop pipeline, right under our nose? Yes, it’s SDCC, the Small Device C Compiler, the same one we use for Z80! What if we could retarget it to the 6502?
The benefit of SDCC is that it already supports a handful of microcontrollers more constrained than the Z80. The Motorola/Freescale HC08 is similar to the 6502 in some ways. It has an 8-bit accumulator and two 8-bit index registers. It also differentiates between direct (0-255, or zero page) and extended (0-65535, or absolute) addressing modes.
Might this be a good start for our own 6502 backend? Let’s find out.
The Test Program#
We’ll use this simple C program to compare compilers:
#include <stdint.h>
#include <stdbool.h>
bool get_pixel(uint_fast8_t x, uint_fast8_t y);
void set_pixel(uint_fast8_t x, uint_fast8_t y);
void fill_line_left(uint_fast8_t x, const uint_fast8_t y) {
for(;; x--) {
if(get_pixel(x, y))
return;
set_pixel(x, y);
}
}
Caveat: This test program is from a research paper about register allocation written by SDCC’s maintainer, so we might be cherry-picking a bit.
The CC65 Version#
Let’s see how CC65 compiles this:
.proc _fill_line_left: near
jsr pusha
ldy #$01
L000E: lda (sp),y
jsr pusha
ldy #$01
lda (sp),y
jsr _get_pixel
tax
bne L0003
ldy #$01
lda (sp),y
jsr pusha
ldy #$01
lda (sp),y
jsr _set_pixel
ldy #$01
lda (sp),y
sec
sbc #$01
sta (sp),y
jmp L000E
L0003: jmp incsp2
CC65 stores its parameters on the stack, requiring an expensive indirect (sp),y
instruction.
Note the jsr pusha
subroutine calls, which take about 40 cycles each,
used to push parameters on the stack for get_pixel
and set_pixel
function calls.
The HC08 Target#
HC08 is the backend on which we’ll base the 6502 backend, so let’s see how it does:
_fill_line_left:
stx _fill_line_left_y_1_3
00104$:
psha ;fill.c:9: if(get_pixel(x, y))
ldx _fill_line_left_y_1_3
jsr _get_pixel
tax
pula
tstx
beq 00102$
rts ;fill.c:10: return;
00102$:
psha ;fill.c:11: set_pixel(x, y);
ldx _fill_line_left_y_1_3
jsr _set_pixel
pula ;fill.c:8: for(;; x--) {
deca
bra 00104$
rts
It’s a lot smaller than the CC65 version, and has no indirect loads or parameter pushes. This backend passes two 8-bit parameters in the A and X registers – additional parameters are generally passed in static locations.
It also looks pretty similar to 6502 code, doesn’t it? One might be tempted to convert this directly to 6502 without mucking about in the backend.
The Peephole Assembly Code Translator#
SDCC has a peephole optimizer that we can configure to translate to 6502 code. For example, HC08’s TST instruction can be converted to 6502’s BIT instruction:
tst %0
=
BIT %0
Some HC08 instructions aren’t directly mappable. For example, indirect load of X requires a zero-page pointer:
lda ,x
=
STX ptr
STY ptr+1
LDY #0
LDA (ptr),Y
PHP
LDY ptr+1
PLP
Going down the rabbit hole, we’ll discover that this approach isn’t workable. HC08 isn’t exactly like 6502:
Not little-endian
16-bit index register (HX) vs two 8-bit index registers (X and Y)
Different flag behavior
Additional instructions and addressing modes
So now we’ll go mucking around in the compiler source. Hacking the HC08 backend to produce 6502 code is easy, right?
Figure out what it is doing
Make it do the other thing
Really Doing a Compiler Backend#
Turns out there is a lot involved in “making it do the other thing”, e.g. emitting working 6502 code:
Change big-endian to little-endian.
Modify the register allocator to allocate both X and Y properly.
Make sure to pop/push/transfer registers without clobbering other registers.
Various optimizations:
Use the absolute X/Y indexing modes when possible.
Prefer 16-bit values and pointers to be in zero-page addresses.
Implement bit-shifting logic when multiplying/dividing by powers of two.
Create assembler and linker tools.
Port the standard SDCC library to 6502.
Create a 6502 simulator and test suite.
Fix the broken tests.
After all that and more, we finally get output from the experimental SDCC 6502 backend:
_fill_line_left:
stx _fill_line_left_y_65536_3
pha
ldx _fill_line_left_y_65536_3
jsr _get_pixel
tax
pla
cpx #0x00
beq 00102$
rts
00102$:
pha
ldx _fill_line_left_y_65536_3
jsr _set_pixel
pla
sec
sbc #0x01
jmp 00104$
rts
Yikes#
What kind of obstacles did we run into?
The code base is 20 years old, and the HC08 backend was hacked up by someone other than the author. The maintainer later came back and shoehorned a new, fancy register allocator in C++. The whole thing is hard for mere mortals to understand.
There are lots of special cases to fix – multibyte operations especially.
There’s a test suite, we had to fix many of the examples, as well as bring up a 6502 command-line simulator.
Registers can be free, in use, or dead (free after the current operation).
Pointers required rearranging much of the code, since the HC08 uses a register for indirect access, and the 6502 uses zero-page.
Had to transfer the S -> X register for stack access.
Had to make special case for absolute indexing (
aaaa,x
oraaaa,y
)Some instructions have addressing limitations, like
BIT
and shift instructions.PLA
modifies the flags in 6502 land, but not the HC08.Register allocation is a big complex thing that I’m afraid to touch, and when I do ASSERTs are thrown. It actually does a “dry run” compiling multiple times to figure out the optimal register allocation.
Some registers had to be “pushed” onto a temporary zero-page stack, since vanilla 6502 can’t push/pop X or Y.
Needed to keep a pointer to the stack frame in zero page when there’s stuff on the stack.
Memory allocation (calloc of asmops, random freeing of objs)
Regression tests only expose one bug on average, so it goes very slow. There’s one test case per 14,000 lines of code. Had to use my own 6502 simulator to find some bugs.
Longlong routines yuk (found bug in longlong shift functions)
AST operations take lots of arguments (e.g. a == b << c) so tends to be a lot of logic to test.
Endless bugs from transferring registers in the wrong order
Most of the regression tests pass, but even after heroic efforts fixing things, many of them still fail. (“Debugging something into existence” is a phrase that keeps popping into my head.)
If I started from scratch, I’d think about writing a 6502 code generation layer before touching the SDCC backend, that would:
Track register liveness
Track bitmasks/ranges of values
Track equivalences (register == constant, etc)
Choose stack vs. memory for temporaries
Handle multiple register transfers (e.g. X -> A -> Y)
Convert pointers to absolute indexing
Handle 6502 vs 65C02 instructions
Convert long branches to JMP
Try it Out#
You can download the experimental fork of SDCC with 6502 support here:
https://github.com/sehugg/sdcc-m6502
Look at the manual (PDF) for build instructions.