Exploring the New 6502 Compiler Based on LLVM#

Today, homebrew developers enjoy the convenience of programming the 6502 in C. However, the 6502 is a very different architecture than modern CPUs, and was never designed for C programming.

The 6502 has a small number of 8-bit registers, and limited addressing modes that cannot access the stack efficiently. This makes it difficult to retarget a modern C compiler to the 6502.

In a previous blog post, we explored the process of retargeting a microcontroller C compiler (SDCC) to the 6502. This required development of a brand new code generation backend, which required looking at over 80,000 lines of code. Gabriele Gorla has since improved it to the point where it has been merged into the main SDCC repository.

Enter LLVM#

LLVM (Low Level Virtual Machine) is a modern compiler infrastructure, providing a toolkit to develop compilers. Some enthusiasts have developed a working LLVM backend for the 6502. It relies on the new GlobalISel framework, which enables aggressive global optimizations.

In theory, LLVM should be able to generate better code than SDCC or CC65, since it’s a more modern compiler infrastructure.

Is this the case?

Let’s dive into some of the assembly code that the LLVM 6502 compiler produces. For context, the 6502 uses an accumulator-based architecture, where most operations involve an implicit “accumulator” register (A). The X and Y registers are often used for counting or indexing.

In our previous post, we used this code as a test case:

void fill_line_left(uint_fast8_t x, const uint_fast8_t y) {
        for(;; x--)	{
                if(get_pixel(x, y))
                        return;
                set_pixel(x, y);
        }
}

Let’s see what the LLVM 6502 compiler produces:

fill_line_left:
; %bb.0:
	sta	__rc16
	lda	__rc20
	pha
	lda	__rc21
	pha
	lda	__rc16
	sta	__rc20
	stx	__rc21
	jmp	.LBB0_2
.LBB0_1:
	ldx	__rc21
	lda	__rc20
	jsr	set_pixel
	dec	__rc20
	ldx	__rc21
	lda	__rc20
.LBB0_2:
	jsr	get_pixel
	and	#1
	beq	.LBB0_1
; %bb.3:
	pla
	sta	__rc21
	pla
	sta	__rc20
	rts

The LLVM version is 44 bytes of code. Compare this to the CC65 version, which emits 49 bytes of code.

LLVM was designed for architectures with a large number of registers, so it’s not surprising that it relies on zero-page temporary variables (prefixed by __rc). After all, this is how the 6502 was designed to be used.

Like the SDCC compiler, the LLVM code generator doesn’t always use an explicit stack. The CC65 function calls expensive 16-bit stack-manipulation functions like pusha and incsp2. The LLVM version passes function arguments in registers and uses the pha and pla instructions only to save and restore 8-bit temporary register values.

Array Indexing Optimization#

Let’s look at another function, fill_line_left, discussed in this blog post:

int arr16[128];

int getValue(int index) {
    return (arr16[index & 0x7f] & 0xff) >> 1;
}

This is the LLVM version:

getValue:
	and	#127
	asl
	tax
	lda	arr16+1,x
	lsr
	lda	arr16,x
	ror
	and	#127
	ldx	#0
	rts

This LLVM version is 17 bytes, while the CC65 version is 19 bytes and contains two calls to stack-manipulation routines. Note that the LLVM version infers that the arr16 array is less than 256 bytes long, and thus can use the absolute indexed addressing mode.

LZG Decoder Example#

Let’s try a more complex example: an LZG decoder. Here’s the function prototype:

unsigned char* lzg_decode_vram(const unsigned char* _src, 
                     unsigned char* _dest, 
                     unsigned char* _end);

The CC65 version compiles this function to 501 bytes, while LLVM compiles to 696 bytes, even with the -Os (minimum size) optimization flag. The greater size may be because LLVM does not use any helper functions for 16-bit manipulation.

I would guess that the LLVM version is faster, but I haven’t benchmarked it. On the 6502, speed roughly correlates with size, unless you are doing zero-page indirect addressing or subroutines.

The Future?#

This new LLVM 6502 compiler is still in development, but looks promising for some applications. It’s not yet integrated into the 8bitworkshop IDE compiler pipeline, but maybe someday? We may occassionally gripe at CC65’s code generation, but it’s a very mature compiler with a robust toolchain, and sometimes it generates smaller code than its competition.

The speed-size tradeoff is almost forgotten in this era of gigabytes of RAM and terabytes of storage. Looking at 6502 compilers reinforces this tension. After all, Woz wrote a 16-bit virtual machine for the 6502 (SWEET16) just to reduce the size of code, specifically for Integer BASIC line renumbering.

Maybe an optimal C compiler will have to make use of multiple modes: full native mode, 16-bit stack mode, and even a 16-bit virtual machine mode. Or maybe we’ll just keep writing assembler code :)