VCSLib: A C Library for the Atari 2600#

The previous blog post, “Your 8-bit System is a Weird PDP-11”, discussed several challenges faced when developing in C, especially for the Atari 2600/VCS.

We’ve overcome those challenges as best we can, and now we can introduce VCSLib, a support library for programming the Atari 2600/VCS with CC65.

Let’s take a tour.

The kernel loop draws the frame#

The most important task in a VCS program is displaying the screen. This requires setting up PIA timers to generate an exact number of scanlines per frame. In assembly, we use the TIMER_SETUP and TIMER_END macros for this.

In VCSLib, the helper functions kernel_1/2/3/4() do this for you. Just make your own loop, and call these functions in sequence, interleaving your own functions:

void kernel_loop() {
  while (1) {       // run until reset switch
    kernel_1();
    my_preframe();  // your pre-frame routine
    kernel_2();
    my_doframe();   // your display kernel routine
    kernel_3();
    my_postframe(); // your post-frame routine
    kernel_4();
  }
}

Here my_preframe() and my_postframe() run offscreen, while my_doframe() executes your kernel routines. You still have to be careful not to take too long and overrun the PIA timers.

The helper functions also check the reset switch and, when it is pressed, issue a BRK instruction to restart the ROM.

Kernel macros help build display kernels#

The VCS supports five moveable graphic objects (player 0-1, missile 0-1, and ball) and a background playfield. These are drawn scanline-by-scanline by a display kernel routine.

If you wanted to draw all of these objects every scanline, it would require 10 register writes – at least 30 CPU cycles, plus however long it takes to read the values to be written. Therefore, our display kernels must compromise on number of objects, vertical resolution, color resolution, artifact prevention, or other qualities.

There is no “default” display kernel as with batariBASIC – you must build each of them in assembly. There are several macros to help you with this. For example, this kernel draws two player sprites with colormaps, and two missiles:

.proc _kernel_2ppmm
  tay             ; A = number of lines to draw
@loop:
  DO_PCOLOR 0     ; set 1st player color
  DO_PCOLOR 1     ; set 2nd player color
  DO_DRAW 0,1     ; draw 1st player sprite (and WSYNC)   
  DO_DRAW 1,0     ; draw 2nd player sprite
  DO_MISSILE 2    ; draw 1st missile
  DO_MISSILE 3    ; draw 2nd missile
  dey
  jne @loop
  jmp _reset_gfx  ; clear graphics registers
.endproc

The DO_ macros are in kernel.inc, and can be combined in several ways, as long as they don’t overflow the 76-cycle-per-scanline limit. VCSLib kernels use WSYNC to wait for the start of the next scanline, so a kernel may use fewer than 76 cycles per scanline, but never more.

If a kernel writes to a register during the visible portion of the scanline, it can generate artifacts. Above a certain complexity level you can’t prevent artifacts completely, but you can move the WSYNC around to see where they are least offensive.

The two kernels used in the demo are defined in demo_kernels.ca65.

Kernel variables#

The kernel macros rely on several variables, defined in your main C program:

byte k_height[NOBJS];         // height of each object
byte k_ypos[NOBJS];           // Y position (modified by kernel)
byte* k_bitmap[NSPRITES];     // bitmap address
byte* k_colormap[NSPRITES];   // colormap address
const byte* k_playfield;      // playfield address (may be modified)

Your code should set up k_ypos every frame, since the kernels modify its values. There are several other steps you must take before calling the kernel.

Setting horizontal positions#

Note that there is no k_xpos variable. The kernels don’t care about X position, so you need to define your own variable and set the horizontal position of each object, preferably before the kernel:

  set_horiz_pos(OBJ_PLAYER_0 | xpos_player_0);

The set_horiz_pos() function looks for the object index (0-4) in the high byte of the parameter. This is because the fastcall convention only allows for a single parameter, which it passes in the A and X registers. Without fastcall, CC65 passes parameters on the stack, which costs both CPU time and stack space. Even two byte-sized parameters would have to go on the stack.
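The packing described above can be sketched in plain C. This is only an illustration of the bit layout, not VCSLib's implementation; the `sk_` names and the enum values are hypothetical, chosen to match the 0-4 object numbering in the text.

```c
#include <stdint.h>

/* Hypothetical object indices (assumption: players 0-1, missiles 2-3, ball 4). */
enum { SK_PLAYER_0 = 0, SK_PLAYER_1 = 1, SK_MISSILE_0 = 2, SK_MISSILE_1 = 3, SK_BALL = 4 };

/* Pack the object index into the high byte and the X position into the low byte. */
static uint16_t sk_pack_horiz(uint8_t obj, uint8_t xpos) {
    return (uint16_t)(((uint16_t)obj << 8) | xpos);
}

/* Recover the two fields, as a receiving routine would. */
static uint8_t sk_horiz_obj(uint16_t packed)  { return (uint8_t)(packed >> 8); }
static uint8_t sk_horiz_xpos(uint16_t packed) { return (uint8_t)(packed & 0xFF); }
```

With fastcall, a 16-bit value like this travels in the A and X registers, so both fields reach the callee without touching the stack.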

Setting up player sprites for the kernel#

Setting up a player sprite is more complex, and goes something like this:

void setup_player(byte nlines, byte index) {
  byte y = ypos[index] >> 1;
  byte ofs = nlines - y + 1;
  k_bitmap[index] = (char*) Frame0 - ofs;
  k_colormap[index] = (char*) ColorFrame0 - (ofs - 1 - (ypos[index] & 1));
  k_ypos[index] = y;
  set_horiz_pos((index<<8) | xpos[index]);
}

TIA.vdelp0 = ypos[0];
TIA.vdelp1 = ypos[1];

There’s a lot going on here that’s specific to this particular kernel, and the DO_DRAW macro.

The DO_DRAW macro requires that the bitmap and colormap address be calculated from the sprite’s Y position and the height of the kernel (that’s why we pass nlines.)

We also set the VDELP0/1 registers to get away with only setting sprite registers every two scanlines. This also requires us to modify the colormap address based on whether the sprite Y position is even or odd.
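To make the arithmetic in setup_player() easier to follow, here is the same offset calculation pulled out into a host-side C function. It is a sketch for working through values by hand; the struct and the `sk_` names are hypothetical, and the 96-line kernel height used below is only an assumed example.

```c
#include <stdint.h>

typedef struct {
    uint8_t kernel_y;    /* value stored in k_ypos: Y position in two-line units */
    uint8_t bitmap_ofs;  /* amount subtracted from the bitmap base address */
    uint8_t color_ofs;   /* amount subtracted from the colormap base address */
} sk_sprite_setup;

/* Mirrors the expressions in setup_player(), using 8-bit wraparound arithmetic. */
static sk_sprite_setup sk_calc_setup(uint8_t nlines, uint8_t ypos) {
    sk_sprite_setup s;
    uint8_t odd = ypos & 1;                        /* odd Y shifts the colormap by one line */
    s.kernel_y   = ypos >> 1;
    s.bitmap_ofs = (uint8_t)(nlines - s.kernel_y + 1);
    s.color_ofs  = (uint8_t)(s.bitmap_ofs - 1 - odd);
    return s;
}
```

For an assumed nlines of 96, a sprite at Y = 100 gives offsets 47 and 46, while Y = 101 gives 47 and 45: the bitmap offset is unchanged, and only the colormap offset moves with the odd line.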

Bank-switching gives you more ROM and RAM#

VCSLib uses the 3E mapping scheme, which is similar to TigerVision (3F) but allows for extra RAM.

This scheme defines two areas:

  • Switchable area ($F000-$F7FF) for ROM and RAM.

  • Permanent area ($F800-$FFFF) for ROM.

The permanent area is important for CC65, because the CPU vectors and support routines must be at a fixed address.

The switchable ROM banks are labeled ROM0 through ROM7. ROM0 starts at $1000, ROM1 at $3000, etc. Only one of these may be selected at a given time. (The VCS considers each of these regions the same area, but we prefer to have the debug symbols not overlap.)

The permanent ROM bank lives at $F800-$FFFF and is labeled PERM. It should be used sparingly, as it fills up quickly.

The extended RAM bank is called XDATA and lives at $F000-$F7FF. It takes over the switchable ROM area when selected.

Trampolines and wrapped-call help you switch banks#

CC65 has some support for bank-switching, but there are some tricks.

When calling a function in another bank, you need to switch to that bank. But unless you call from the PERM bank, your bank will switch out from under you before you make the jump!

A common solution is a trampoline. This is a piece of code that lives at a fixed address, and is never switched out. You pass the trampoline the bank index and the address you want to call, and it performs the switch and calls the function.
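The idea can be simulated on a host machine. In this sketch a plain variable stands in for the bank-select hardware, and all `sk_` names are hypothetical; it shows the control flow of a trampoline, not CC65's or VCSLib's actual code.

```c
#include <stdint.h>

typedef void (*sk_func)(void);

static uint8_t sk_current_bank = 0;  /* stands in for the hardware bank register */
static uint8_t sk_bank_seen = 0xFF;  /* records which bank the callee ran in */

static void sk_select_bank(uint8_t bank) { sk_current_bank = bank; }

/* The trampoline conceptually lives at a fixed address that is never
   switched out, so changing banks cannot pull it out from under itself. */
static void sk_trampoline(uint8_t bank, sk_func fn) {
    uint8_t caller_bank = sk_current_bank;  /* remember where we came from */
    sk_select_bank(bank);                   /* map in the callee's bank */
    fn();                                   /* call the target function */
    sk_select_bank(caller_bank);            /* restore the caller's bank */
}

/* Example callee, imagined to live in bank 3. */
static void sk_init_in_bank3(void) { sk_bank_seen = sk_current_bank; }
```

Calling `sk_trampoline(3, sk_init_in_bank3)` while bank 1 is selected runs the callee with bank 3 mapped in, then returns with bank 1 selected again.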

CC65 lets you define a trampoline with the wrapped-call pragma. For example:

// put any constant data and code in ROM0
#pragma code-name (push, "ROM0")
#pragma rodata-name (push, "ROM0")

// wrap all functions with the "bankselect" routine
#pragma wrapped-call (push, bankselect, bank)

const char init_data[] = { ... };

void init() {
  // ... you can use init_data here ...
}

#pragma code-name (pop)
#pragma rodata-name (pop)
#pragma wrapped-call (pop)

CC65 defines these segment types:

  • code - C functions and other kinds of code.

  • rodata - Anything defined as const, for example a byte array.

  • bss - Uninitialized variables.

  • data - Variables with an initial value… or arrays where you forgot to add const :)

Additionally, any code that has to be page-aligned must go in the rodata segment.

You don’t have to add wrapped-call to functions if they will only be called from a wrapped-call function in the same segment. (But you do have to surround them in #pragma code-name to place them in the same segment.)

How to select extended RAM#

To select extended RAM into the bank-switched area, you have to use ramselect instead of bankselect:

#pragma wrapped-call (push, ramselect, 0)

The extended RAM bank cannot be used for normal variables, at least not when it comes to writing. Due to VCS constraints, the “write port” for the RAM is 1024 bytes above the “read port”. Thus you must add $400 to any address in extended RAM you wish to write, something that C cannot do for you. From C, you can instead use the xram functions to write to extended RAM (albeit slowly):

    xramset(dest);
    xramwrite(0xff);
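The read-port/write-port split can be modeled on a host machine. The sketch below assumes a 1024-byte RAM read at $F000 and written at $F400, which follows from the addresses given in the text; the `sk_` names and the simulated bus functions are hypothetical and are not part of VCSLib.

```c
#include <stdint.h>

#define SK_XRAM_BASE   0xF000u  /* start of the read port (assumed from the text) */
#define SK_XRAM_SIZE   0x0400u  /* 1024 bytes of RAM */
#define SK_WRITE_DELTA 0x0400u  /* write port sits 1024 bytes above the read port */

static uint8_t sk_xram[SK_XRAM_SIZE];  /* simulated RAM contents */

/* Convert a read-port address into the matching write-port address. */
static uint16_t sk_write_addr(uint16_t read_addr) {
    return (uint16_t)(read_addr + SK_WRITE_DELTA);
}

/* Simulated bus write: only write-port addresses reach the RAM. */
static void sk_bus_write(uint16_t addr, uint8_t value) {
    if (addr >= SK_XRAM_BASE + SK_WRITE_DELTA &&
        addr <  SK_XRAM_BASE + SK_WRITE_DELTA + SK_XRAM_SIZE) {
        sk_xram[addr - SK_XRAM_BASE - SK_WRITE_DELTA] = value;
    }
}

/* Simulated bus read: only read-port addresses return RAM contents. */
static uint8_t sk_bus_read(uint16_t addr) {
    if (addr >= SK_XRAM_BASE && addr < SK_XRAM_BASE + SK_XRAM_SIZE) {
        return sk_xram[addr - SK_XRAM_BASE];
    }
    return 0xFF;  /* arbitrary value for addresses outside the read port */
}
```

A plain C assignment compiles to a store at the read-port address, which this model ignores; that is why normal variables in extended RAM cannot be written directly.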

Since the RAM bank cannot co-exist with switched ROM banks, any extended RAM usage must be either done from PERM or from code copied to the XDATA segment.

If you put any code or initialized data into extended RAM, you must call copyxdata() from your main function.

Also: If you have a wrapped-call function in the XDATA segment, you cannot call any wrapped-call functions in ROM banks. Upon return, the wrapper will forget you have RAM selected and instead select the last selected ROM bank.

Two kinds of scoreboards#

VCSLib supports scoreboards in BCD format. First, you have to define a variable in your program, 2 or 3 bytes long depending on whether you want 4-digit or 6-digit scores:

byte bcd_score[3];	// support 6-digit score (3 bytes)

4-digit scores#

The 4-digit routine uses the playfield, and is pretty simple. Call scorepf_build() from your preframe or postframe routine, then call scorepf_kernel() from your kernel routine. It’ll draw two digits on the left side, and two digits on the right, for a total of 12 scanlines.

You can change registers in your code, like setting SCORE mode in CTRLPF to make the digits two separate colors:

  TIA.ctrlpf = PF_SCORE;
  do_wsync();
  TIA.colubk = COLOR_CONV(0xa2);
  TIA.colup0 = COLOR_CONV(0x2e);
  TIA.colup1 = COLOR_CONV(0x8e);
  scorepf_kernel();
  TIA.wsync = 0;
  TIA.colubk = 0;
  TIA.ctrlpf = PF_REFLECT;

6-digit scores and 48-pixel bitmaps#

The 6-digit scores are more complex. They rely on a 48-pixel bitmap routine that uses both player sprites, mirrored 3 times. But they’re pretty easy to use.

First, call score6_build() from your preframe or postframe (or kernel, if you must.) Then call bitmap48_kernel(8) in your kernel routine to draw the 6-digit scoreboard, 8 lines high.

You can call score6_add() to add a 4-digit BCD value to your 6-digit score:

    score6_add(0x0199);

You can also use the BCD_ADD macro to add to a two-digit BCD score:

BCD_ADD(bcd_score[0], 1);
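For readers unfamiliar with packed BCD, here is a portable C sketch of what adding a 4-digit value to a 6-digit score involves. It is not VCSLib's implementation (the 6502 can do this with its decimal mode); the `sk_` names are hypothetical, and the byte order, with element 0 holding the lowest two digits, is an assumption.

```c
#include <stdint.h>

/* Add two packed-BCD bytes plus a carry-in; returns the BCD sum and
   sets *carry to 1 if the result exceeded 99, otherwise 0. */
static uint8_t sk_bcd_add_byte(uint8_t a, uint8_t b, uint8_t *carry) {
    uint8_t lo = (uint8_t)((a & 0x0F) + (b & 0x0F) + *carry);
    uint8_t hi = (uint8_t)((a >> 4) + (b >> 4));
    if (lo > 9) { lo -= 10; hi++; }                             /* decimal carry into tens */
    if (hi > 9) { hi -= 10; *carry = 1; } else { *carry = 0; }  /* carry into next byte */
    return (uint8_t)((hi << 4) | lo);
}

/* Add a 4-digit BCD value to a 6-digit score.
   Assumption: score[0] holds the lowest two digits. */
static void sk_score6_add(uint8_t score[3], uint16_t delta) {
    uint8_t carry = 0;
    score[0] = sk_bcd_add_byte(score[0], (uint8_t)(delta & 0xFF), &carry);
    score[1] = sk_bcd_add_byte(score[1], (uint8_t)(delta >> 8), &carry);
    score[2] = sk_bcd_add_byte(score[2], 0, &carry);
}
```

Starting from a score of 009899, adding 0x0199 carries through all three bytes and yields 010098.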

Using bitmap48 functions#

You can use the bitmap48 functions directly to draw arbitrary bitmaps. For example, you can draw 12-character text using a 30 x 5 bitmap. You can even store multiple bitmaps in extended RAM:

  // in your init() function
  tinyfont48_build(font_bitmap[i], "HELLO WORLD!");

  // in your kernel function
  bitmap48_setheight(5); // must call before bitmap48_setaddress()
  bitmap48_setaddress(font_bitmap[0]);
  bitmap48_setup();
  bitmap48_kernel(5); // 5 lines high

Unless you have lots and lots of strings, you may just want to build 30 x 5 bitmaps ahead of time and store them in your ROM. The bitmaps must be either in the XDATA or PERM segment, because the bitmap48 routines run out of the XDATA segment.

Sound effect library#

VCSLib has a simple sound effect library.

A sound effect is defined as a string of bytes. The bytes are processed in reverse, one byte per frame. Each byte sets one of the volume, control, or frequency registers. For example, this sets the volume and control registers, followed by a short frequency sweep:

const byte sound_2[] = {
  0x02,0x03,0x04,0x08,0x10,0x20,0x10,0x20,0x10,AUDC(4),AUDV(8)
};

From C, you play a sound effect like this:

    sound_play(1);

You must call sound_update() once per frame in your pre/post frame routine.

You can play two sounds simultaneously. If both channels are busy, the library chooses the channel whose sound is closer to finishing.

Playing music#

VCSLib can play music in tandem with sound effects. The music format is similar to that used in the demos and described in the books, converted from MIDI with the midi2song.py script.

When playing music, sound effect #0 must be reserved for the “music envelope”. This is a sound effect that doesn’t change frequency or control register, but just modifies the volume:

// music note envelope
const byte sound_1[] = {
  AUDV(1), AUDV(2), AUDV(3), AUDV(4), AUDV(6),	// decay
  AUDV(8), AUDV(8), AUDV(8),	// sustain
  AUDV(10), AUDV(5), AUDV(3),	// attack
};

To play a note, the music routine will set the frequency and control registers, then pass control of the volume to the sound effect library. This allows sound effects and music to co-exist.

To start a song, pass a pointer to the start of the music data:

music_play(music_2);

You must call music_update() once per frame.

Limitations of VCSLib#

Okay, we’ve got a demo, but can we write a full game? What further problems might we face?

Not a lot of zeropage RAM left over#

There’s not a lot of PIA (zeropage) RAM left over after CC65 and VCSLib take their cut. The default demo leaves only 12 bytes remaining! There aren’t many solutions to this:

  • Creatively reuse variables. For example, instead of several timer variables, have a single counter and use a bitmask to trigger at different intervals.

  • Creatively reuse the tmp1-4 and ptr1-4 variables in assembly routines. Many VCSLib routines already do this, but remember that these variables may be overwritten by generated C code.

  • Use the XDATA segment more often. Just remember that when writing to extended RAM you must add 1024 to any address, or use the VCSLib xram functions.
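The first tip, one counter with bitmasks in place of several timers, might look like the following sketch. The variable names and the intervals are hypothetical examples.

```c
#include <stdint.h>

static uint8_t sk_frame_counter;  /* a single byte replaces several timer variables */
static uint8_t sk_anim_ticks;     /* counts "every 8 frames" events (for the demo) */
static uint8_t sk_spawn_ticks;    /* counts "every 64 frames" events (for the demo) */

/* Call once per frame. Each mask must be a power of two minus one. */
static void sk_tick(void) {
    sk_frame_counter++;  /* wraps from 255 to 0, which keeps the masks periodic */
    if ((sk_frame_counter & 0x07) == 0) sk_anim_ticks++;   /* every 8 frames */
    if ((sk_frame_counter & 0x3F) == 0) sk_spawn_ticks++;  /* every 64 frames */
}
```

The trade-off is that intervals are limited to powers of two, and all events share the same phase.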

Watch out when defining constant arrays. You might see them defined like this:

const char* const CAPTIONS[NCAPTIONS] = { ...

Why? The first const makes the characters that each pointer refers to read-only, while the second const makes the array of pointers itself read-only. It’s the second const that tells the compiler to put the array into ROM, not the first.

Not a lot of CPU cycles either#

CC65 doesn’t always produce the most optimized code. Simple assignment to registers is fine, but anything more complex may be expanded into potentially dozens of instructions and calls to support routines. In our simple demo, half of the preframe CPU cycles are already used.

Follow these tips:

  • Avoid passing more than one parameter to functions so that they use the fastcall convention. This saves both CPU time and stack space.

  • Avoid pointers and structs. Prefer direct array accesses with an 8-bit index.

  • Avoid deep function nesting, since you may overflow the tiny stacks.

  • Try rearranging complex C expressions into simpler expressions. For example, if we rewrite the code we saw earlier that sets k_colormap we can improve the speed by 50%. This shows how finicky CC65’s code generator can be.

  • Try to put your wrapped-call functions at the top level, since there’s a lot of overhead for switching banks. Place related functions in the same bank so you can remove wrapped-call from them.

  • You could potentially watch RIOT.intim (the PIA timer) to see if there’s enough time left to run all of your offscreen routines. Or run some routines on even frames, and some on odd frames.

  • Use assembly or inline asm where necessary.

  • Conserve zeropage RAM first, then CPU cycles, then ROM space.
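As an illustration of the "direct array accesses with an 8-bit index" tip, here is a sketch that keeps object state in parallel byte arrays rather than an array of structs accessed through pointers. The names and the object count are hypothetical, and how much this helps depends on the code CC65 generates for your particular program.

```c
#include <stdint.h>

#define SK_NOBJS 4  /* hypothetical object count */

/* Parallel arrays: each field is a plain byte array indexed by object number. */
static uint8_t sk_xpos[SK_NOBJS];
static int8_t  sk_xvel[SK_NOBJS];

/* Move every object by its velocity, using only an 8-bit index. */
static void sk_move_all(void) {
    uint8_t i;
    for (i = 0; i < SK_NOBJS; i++) {
        sk_xpos[i] = (uint8_t)(sk_xpos[i] + sk_xvel[i]);  /* wraps at 256 */
    }
}
```

An access like `sk_xpos[i]` with a byte-sized index maps naturally onto the 6502's indexed addressing, whereas a struct pointer typically requires 16-bit pointer arithmetic.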

Missing Features#

There are many missing features that are present in batariBASIC, like:

  • Multisprite kernel (sorting and reusing player objects)

  • Modifiable RAM playfield kernel (pixel-drawing)

  • Paddle support

  • Lives indicator

  • Support for other mappers like DPC

  • Visual tools

I don’t see a strong impediment to these features, except for the RAM limitations. The multisprite kernel would potentially use XRAM, since it assumes you’ll be storing data for many different objects.

What have we learned?#

Yes, we can develop Atari 2600 games in C with CC65. But we’ve revealed some real limitations:

  • The 128 bytes of zeropage (PIA) RAM get used up very quickly.

  • CPU cycles get used up quickly.

  • The permanent ROM bank gets filled up quickly, and so we must rely on bank-switching.

  • Bank-switching incurs CPU overhead.

  • Using extended RAM incurs CPU overhead.

  • C code usually takes up more RAM, ROM, and CPU cycles than hand-coded assembly.

  • We still have to write lots of assembly code. The VCSLib demo is 839 lines of C, and 1801 lines of assembly.

  • The code we write can be vastly different depending on our choice of kernel, banking scheme, extended RAM usage, and other decisions.

Is it worth it? Hard to say – the inefficiencies resulting from C code require a lot of compromises. It would be interesting to port VCSLib to SDCC-6502 or LLVM-6502 to see if we can improve its performance and memory footprint.

But there are still several things that cannot be solved to satisfaction in C alone. For now, assembly language (or something equally low-level like Wiz) is still the only way to go for many aspects of 8-bit programming.