Will ChatGPT replace retro programmers?#

An LLM (Large Language Model) is an artificial neural network that statistically models the structure of text. It’s typically trained on a huge corpus of textual data, and tuned for a specific task, like an AI chat assistant (ChatGPT) or a programming helper (Copilot).

I created 8bitworkshop to make it easy for people to explore obsolete technologies. So I wondered: is the latest AI technology going to make the programmers of obsolete technologies obsolete themselves?

tl;dr: Not yet.

Can an LLM write a complete game?#

Can we use LLMs to write a complete game from a single sentence?

How about Wordle in a dialect of BASIC?#

ChatGPT-3.5

Write a Dartmouth BASIC program that implements the game of Wordle.

As a language model, I cannot write code in a specific dialect of BASIC since there are many versions of it. However, I can provide you with a general idea of how to implement the game of Wordle in BASIC…

ChatGPT seems to emit this “I can’t code” disclaimer semi-randomly. If it manages to continue, it usually generates a BASIC program in a mixed dialect which resembles QBasic.

Lunar lander is out, Hohmann Transfer is in#

Let’s omit the BASIC dialect for the next question.

ChatGPT-3.5

Write a BASIC game that simulates a Hohmann transfer.

It generated the following:
10 REM HOHMANN TRANSFER SIMULATOR
20 PRINT "WELCOME TO THE HOHMANN TRANSFER SIMULATOR"
30 PRINT "========================================="
40 PRINT "YOUR MISSION IS TO TRANSFER A SPACECRAFT FROM AN ORBIT AROUND EARTH TO AN ORBIT AROUND MARS"
50 PRINT
60 PRINT "INSTRUCTIONS:"
70 PRINT "1. ENTER THE SEMI-MAJOR AXIS OF THE INITIAL ORBIT (IN KM)"
80 PRINT "2. ENTER THE SEMI-MAJOR AXIS OF THE FINAL ORBIT (IN KM)"
90 PRINT "3. ENTER THE ORBITAL PERIOD OF THE INITIAL ORBIT (IN HOURS)"
100 PRINT "4. ENTER THE ORBITAL PERIOD OF THE FINAL ORBIT (IN HOURS)"
110 PRINT
120 INPUT "ENTER THE SEMI-MAJOR AXIS OF THE INITIAL ORBIT (IN KM): ", a1
130 INPUT "ENTER THE SEMI-MAJOR AXIS OF THE FINAL ORBIT (IN KM): ", a2
140 INPUT "ENTER THE ORBITAL PERIOD OF THE INITIAL ORBIT (IN HOURS): ", T1
150 INPUT "ENTER THE ORBITAL PERIOD OF THE FINAL ORBIT (IN HOURS): ", T2
160 PRINT
170 PRINT "CALCULATING THE REQUIRED DELTA-V..."
180 REM CALCULATE THE REQUIRED DELTA-V
185 PI = 3.141592
190 mu = 3.986E5 ' GRAVITATIONAL PARAMETER OF EARTH (IN KM^3/S^2)
200 r1 = a1 * (2 ^ (2/3) - 1) ' RADIUS OF THE INITIAL ORBIT (IN KM)
210 r2 = a2 * (2 ^ (2/3) - 1) ' RADIUS OF THE FINAL ORBIT (IN KM)
220 v1 = 2 * pi * a1 / T1 ' VELOCITY OF THE SPACECRAFT IN THE INITIAL ORBIT (IN KM/S)
230 v2 = 2 * pi * a2 / T2 ' VELOCITY OF THE SPACECRAFT IN THE FINAL ORBIT (IN KM/S)
240 delta_v1 = sqr(mu / r1) * (sqr(2 * r2 / (r1 + r2)) - 1) ' DELTA-V REQUIRED TO REACH THE TRANSFER ORBIT (IN KM/S)
250 delta_v2 = sqr(mu / r2) * (1 - sqr(2 * r1 / (r1 + r2))) ' DELTA-V REQUIRED TO ENTER THE FINAL ORBIT (IN KM/S)
260 total_delta_v = delta_v1 + delta_v2 ' TOTAL DELTA-V REQUIRED FOR THE HOHMANN TRANSFER (IN KM/S)
270 PRINT "THE REQUIRED DELTA-V IS: "; total_delta_v; " KM/S"
280 PRINT
290 REM ASK THE PLAYER TO ENTER THE DELTA-V REQUIRED
300 INPUT "ENTER THE REQUIRED DELTA-V TO COMPLETE THE HOHMANN TRANSFER (IN KM/S): ", player_delta_v
310 PRINT
320 IF player_delta_v > total_delta_v THEN
330     PRINT "SORRY, THAT'S TOO MUCH DELTA-V!"
340     PRINT "YOU CRASHED YOUR SPACECRAFT INTO MARS..."
350 ELSE IF player_delta_v < total_delta_v THEN
360     PRINT "SORRY, THAT'S NOT ENOUGH DELTA-V!"
370     PRINT "YOU MISSED YOUR TARGET AND ARE NOW LOST IN SPACE..."
380 ELSE
390     PRINT "CONGRATULATIONS, YOU SUCCESSFULLY COMPLETED THE HOHMANN TRANSFER!"
400 END IF
410 PRINT
420 END

Taking inspiration from the classic Lunar Lander game, I wanted to see if it could take the same approach to another astrophysical maneuver, the Hohmann transfer. It almost made a functional program with this one, though it mixed dialects (you can tell when it switches between UPPER and lower case) and it always forgets to define PI.

Here’s a sample transcript from the game:

WELCOME TO THE HOHMANN TRANSFER SIMULATOR
=========================================
YOUR MISSION IS TO TRANSFER A SPACECRAFT FROM AN ORBIT AROUND EARTH TO AN ORBIT 
AROUND MARS

INSTRUCTIONS:
1. ENTER THE SEMI-MAJOR AXIS OF THE INITIAL ORBIT (IN KM)
2. ENTER THE SEMI-MAJOR AXIS OF THE FINAL ORBIT (IN KM)
3. ENTER THE ORBITAL PERIOD OF THE INITIAL ORBIT (IN HOURS)
4. ENTER THE ORBITAL PERIOD OF THE FINAL ORBIT (IN HOURS)

ENTER THE SEMI-MAJOR AXIS OF THE INITIAL ORBIT (IN KM): ? 300
ENTER THE SEMI-MAJOR AXIS OF THE FINAL ORBIT (IN KM): ? 10000
ENTER THE ORBITAL PERIOD OF THE INITIAL ORBIT (IN HOURS): ? 2
ENTER THE ORBITAL PERIOD OF THE FINAL ORBIT (IN HOURS): ? 24

CALCULATING THE REQUIRED DELTA-V...
THE REQUIRED DELTA-V IS: 24.962612455 KM/S

ENTER THE REQUIRED DELTA-V TO COMPLETE THE HOHMANN TRANSFER (IN KM/S): ? 25

SORRY, THAT'S TOO MUCH DELTA-V!
YOU CRASHED YOUR SPACECRAFT INTO MARS...

The object of this game is to guess the exact delta-V value after the game gives you the answer. Because the answer is not printed to full precision, the equality comparison fails—even if you type in the exact value. Not a very fun game…but I guess that’s what you get for not play-testing!
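The root cause is easy to demonstrate in any language: the program prints a rounded value, then compares the player's input against the full-precision number. Here's a quick JavaScript illustration of the mismatch (my own sketch, not ChatGPT output):

const answer = 2 * Math.PI;      // stand-in for the computed delta-V: 6.283185307179586...
const shown = answer.toFixed(9); // the rounded printout the player sees: "6.283185307"

// Typing the printed digits back in can never pass an exact-equality check:
console.log(Number(shown) === answer); // false

A tolerance check, or rounding both sides before comparing, would have made the game winnable.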

How about a nice game of Dice?#

Instead of a general concept, I presented it with the rules for a simple dice game, Pig, copied from Wikipedia. Here’s a sample transcript from the game:

Welcome to Pig Dice Game!
The first player to reach 100 points wins.
Press any key to start.
Let's begin!
Player 1, it's your turn.
Rolling dice...
You rolled a 5.
Your current score is 5.
Do you want to roll again? (Y/N)
? Y

It came very close to making a working game, except for some dialect confusion, uninitialized variables, and CALL -875 statements scattered throughout (“clear to end of line” on the Apple II).

Here's the code it generated:
10 REM Pig Dice Game
20 DIM score(2)
30 FOR player = 1 TO 2
40   score(player) = 0
50 NEXT player
60 PRINT "Welcome to Pig Dice Game!"
70 PRINT "The first player to reach 100 points wins."
80 PRINT "Press any key to start."
90 REM CALL -875
100 PRINT "Let's begin!"
110 player = 1
120 WHILE score(player) < 100
130   PRINT "Player "; player; ", it's your turn."
140   REM CALL -875
150   PRINT "Rolling dice..."
160   dice = INT(RND(1) * 6) + 1
170   PRINT "You rolled a "; dice; "."
180   IF dice = 1 THEN
190     PRINT "Sorry, you rolled a 1 and lost all your points."
200     score(player) = 0
210     player = 3 - player
220   ELSE
230     score(player) = score(player) + dice
240     PRINT "Your current score is "; score(player); "."
250     PRINT "Do you want to roll again? (Y/N)"
260     REM REPEAT
270       INPUT response$
280     REM UNTIL response$ = "Y" OR response$ = "N"
290     IF response$ = "N" THEN
300       player = 3 - player
310     END IF
320   END IF
330 WEND
340 PRINT "Player "; player; " wins with a score of "; score(player); "!"
350 PRINT "Thanks for playing!"
360 END

This is about as complex as I was able to get with an LLM and a simple request. Let’s try something different.

Can an LLM modify existing code?#

The previous examples were just simple questions. Can an LLM read and modify code?

FastBasic: Add a second sprite#

ChatGPT-3.5

Here is an Atari 800 FastBasic program that shows an animated bouncing sprite. Rewrite this program so that it displays two flashing animated bouncing sprites. [[source of pmtest.bas]]

It was able to do this almost perfectly, adding this new function:

proc MovePm2
x = xPos2 / 128 : y = P0Mem + yPos2 / 128
poke $D01A,$74 ' Change background color
pause 0
pmhpos 1, x ' Set new horizontal position for sprite 2
mset oldPos2, 5, 0 ' Clear old sprite 2
move adr(PMdata2), y, 5 ' Draw sprite 2 at new vertical
oldPos2 = y
endproc

There was one small problem: it duplicated the pause 0 (vertical sync) and the background color change, causing the frame rate to drop to half. It probably should have moved these lines to the calling function. It also seemed to ignore the "flashing" requirement, unless you count the accidental flashing of the background color.

Here's the full program:
' P/M test program

graphics 0          ' Setups graphics mode
pmgraphics 2        ' And P/M mode
P0Mem = pmadr(0)    ' Get player 0 address
oldPos1 = P0Mem     ' and into "old position" for sprite 1
oldPos2 = P0Mem + 5 ' and into "old position" for sprite 2

mset P0Mem, 128, 0  ' Clears P/M 0 Memory
setcolor -4, 1, 15

' P/M data and blank (to clear P/M)
DATA PMdata1() byte = $38,$44,$54,$44,$38
DATA PMdata2() byte = $E8,$D4,$A4,$D4,$E8

' Initial Conditions
xPos1 = 6400 : yPos1 = 2560
xSpd1 = 64 : ySpd1 = 0
xPos2 = 12800 : yPos2 = 5120
xSpd2 = -64 : ySpd2 = 0

repeat
  xPos1 = xPos1 + xSpd1 : yPos1 = yPos1 + ySpd1
  ySpd1 = ySpd1 + 2
  if (ySpd1 > 0) and (yPos1 > 12800)
    ySpd1 = -ySpd1
    xSpd1 = Rand(512) - 256
  endif
  if xSpd1 > 0
    if xPos1 > 25600 Then xSpd1 = -xSpd1
  else
    if xPos1 < 6400 Then xSpd1 = -xSpd1
  endif

  xPos2 = xPos2 + xSpd2 : yPos2 = yPos2 + ySpd2
  ySpd2 = ySpd2 + 2
  if (ySpd2 > 0) and (yPos2 > 12800)
    ySpd2 = -ySpd2
    xSpd2 = Rand(512) - 256
  endif
  if xSpd2 > 0
    if xPos2 > 25600 Then xSpd2 = -xSpd2
  else
    if xPos2 < 6400 Then xSpd2 = -xSpd2
  endif

  exec MovePm1 ' Move sprite 1
  exec MovePm2 ' Move sprite 2
until Key()

graphics 0

END

proc MovePm1
  x = xPos1 / 128 : y = P0Mem + yPos1 / 128
  poke $D01A,$74 ' Change background color
  pause 0
  pmhpos 0, x ' Set new horizontal position
  mset oldPos1, 5, 0 ' Clear old sprite 1
  move adr(PMdata1), y, 5 ' Draw sprite 1 at new vertical pos.
  oldPos1 = y
endproc

proc MovePm2
  x = xPos2 / 128 : y = P0Mem + yPos2 / 128
  poke $D01A,$74 ' Change background color
  pause 0
  pmhpos 1, x ' Set new horizontal position for sprite 2
  mset oldPos2, 5, 0 ' Clear old sprite 2
  move adr(PMdata2), y, 5 ' Draw sprite 2 at new vertical
  oldPos2 = y
endproc

NES: Sprites to Metasprites#

ChatGPT-3.5

This is a NES demo in C that animates 64 sprites independently. Rewrite this program so that it animates 16 metasprites, each metasprite composed of 4 hardware sprites. [[source of NES sprite demo]]

Impressive! It actually did this right, on the very first try, with no errors. It created its own metasprite table rather than using the NESLib metasprite functions, which were not featured in the sample code contained in the prompt.

ChatGPT reached its token limit halfway through the program, so I also had to type “Continue the code” to get the full output.

Here's the full program it eventually produced:
/*
Metasprite demo.
Animate 16 metasprites, each composed of 4 hardware sprites.
*/

#include <stdlib.h>
#include <string.h>

// include NESLIB header
#include "neslib.h"

// link the pattern table into CHR ROM
//#link "chr_generic.s"

/*{pal:"nes",layout:"nes"}*/
const char PALETTE[32] = {
  0x03,               // screen color

  0x11,0x30,0x27,0x0, // background palette 0
  0x1c,0x20,0x2c,0x0, // background palette 1
  0x00,0x10,0x20,0x0, // background palette 2
  0x06,0x16,0x26,0x0, // background palette 3

  0x16,0x35,0x24,0x0, // sprite palette 0
  0x00,0x37,0x25,0x0, // sprite palette 1
  0x0d,0x2d,0x3a,0x0, // sprite palette 2
  0x0d,0x27,0x2a      // sprite palette 3
};

// setup PPU and tables
void setup_graphics() {
  // clear sprites
  oam_clear();
  // set palette colors
  pal_all(PALETTE);
  // turn on PPU
  ppu_on_all();
}

// number of metasprites
#define NUM_METASPRITES 16

// metasprite definitions
const unsigned char metasprite_def[NUM_METASPRITES][4] = {
  {0, 0, 0, 0}, {1, 1, 1, 1}, {2, 2, 2, 2}, {3, 3, 3, 3},
  {4, 4, 4, 4}, {5, 5, 5, 5}, {6, 6, 6, 6}, {7, 7, 7, 7},
  {8, 8, 8, 8}, {9, 9, 9, 9}, {10, 10, 10, 10}, {11, 11, 11, 11},
  {12, 12, 12, 12}, {13, 13, 13, 13}, {14, 14, 14, 14}, {15, 15, 15, 15}
};

// metasprite x/y positions
byte metasprite_x[NUM_METASPRITES]; // horizontal coordinates
byte metasprite_y[NUM_METASPRITES]; // vertical coordinates

// metasprite x/y deltas per frame (signed)
sbyte metasprite_dx[NUM_METASPRITES]; // horizontal velocity
sbyte metasprite_dy[NUM_METASPRITES]; // vertical velocity

// main program
void main() {
  char i, j;   // metasprite index and sprite index
  char oam_id; // sprite ID

  // initialize metasprites with random values
  for (i=0; i<NUM_METASPRITES; i++) {
    metasprite_x[i] = rand();
    metasprite_y[i] = rand();
    metasprite_dx[i] = (rand() & 7) - 3;
    metasprite_dy[i] = (rand() & 7) - 3;
  }
  // initialize PPU
  setup_graphics();
  // loop forever
  while (1) {
    // start with OAMid/sprite 0
    oam_id = 0;
    // draw and move all metasprites
    for (i=0; i<NUM_METASPRITES; i++) {
      // draw all sprites in the metasprite
      for (j=0; j<4; j++) {
        oam_id = oam_spr(metasprite_x[i] + (j % 2) * 8,
                         metasprite_y[i] + (j / 2) * 8,
                         metasprite_def[i][j], 0, oam_id);
      }
      // move the metasprite
      metasprite_x[i] += metasprite_dx[i];
      metasprite_y[i] += metasprite_dy[i];
    }
    // hide rest of sprites
    // if we haven't wrapped oam_id around to 0
    if (oam_id!=0) oam_hide_rest(oam_id);
    // wait for next frame
    ppu_wait_frame();
  }
}

Show, don’t tell#

These are cherry-picked examples; many other similar experiments failed. ChatGPT was not successful in augmenting a program in Wiz, a low-level niche language for 8-bit platforms. It’s also really bad at assembly language.

But it could not have come up with these two programs via a simple prompt. Once we gave it some context (the original source code) and a request, it was able to infer the details without being explicitly told. “Show” often works better than “tell” with LLMs.

Can an LLM write interactive fiction?#

LLMs can generate natural language and roughly model basic physical concepts, so applying them to interactive fiction seems a good fit. The results are often amusing.

One issue is that the top interactive fiction frameworks (Inform, TADS, Ink) are not well-represented in the training set, so the code that it generates in these languages has errors. However, it’s fascinating to see LLMs create puzzles out of whole cloth. Here’s a ChatGPT attempt to create an Inform 7 puzzle (no guarantees that it will compile):

The Dude's Apartment is a room. "You are in The Dude's apartment, surrounded by a mixture of worn furniture, empty White Russian glasses, and various pieces of clothing."

The TV is a device in the apartment. The description is "The TV is broken and won't turn on. Maybe there's something you can do to fix it." The TV is switched off. The TV is not fixed.

The remote control is a thing in the apartment. The description is "A remote control for the TV is lying on the coffee table."

Instead of turning on the TV when the TV is not fixed: say "You try to turn on the TV, but nothing happens. It looks like it's still broken.";

Instead of turning on the TV when the TV is fixed: say "You turn on the TV, and it springs to life!";

Instead of pushing the power button on the remote control when the TV is not fixed: say "You press the power button on the remote control, but nothing happens.";

Instead of pushing the power button on the remote control when the TV is switched off: say "You press the power button on the remote control, and the TV turns on! It looks like it was just switched off.";

Instead of pushing the power button on the remote control when the TV is switched on: say "The TV is already on.";

Instead of examining the TV when the TV is not fixed: say "The TV looks like it's in pretty bad shape. Maybe there's something you can do to fix it.";

Instead of fixing the TV with the screwdriver when the screwdriver is not held by the player: say "You'll need to pick up the screwdriver first.";

Instead of fixing the TV with the screwdriver when the screwdriver is held by the player and the TV is not fixed: now the TV is fixed; say "You use the screwdriver to open up the TV and fix the broken parts. It looks like it's working now!"; now the TV is switched off.

Instead of examining the TV when the TV is fixed: say "The TV looks like it's in good shape now.";

The screwdriver is a thing in the apartment. The description is "A screwdriver is lying on the kitchen counter." The screwdriver can be held.

Test me with "take screwdriver, fix tv, turn on tv, push power button".

To get around the language barrier, I decided to try a popular language. I created a toy JavaScript interactive fiction framework, and fed ChatGPT some sample code using the framework. I then asked it to create a magic wand that poofs an object out of existence if you zap it three times. It generated the following:

const wand = Item.newEntity("magic wand", ["wand"]);
wand.moveTo(player);
wand.zaps = 0;
Item.newVerb('zap', (actor, obj) => {
  if (obj.zappable) {
    obj.zaps++;
    if (obj.zaps === 3) {
      print(`You zap the ${obj.name} and it disintegrates!`);
      obj.parent.children.splice(obj.parent.children.indexOf(obj), 1);
    } else {
      print(`You zap the ${obj.name}.`);
    }
  } else {
    print(`The ${obj.name} can't be zapped.`);
  }
});
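Nothing in the generated snippet marks any object as zappable, so presumably you'd have to wire that up yourself. Something like this hypothetical target, reusing the toy framework's calls from the sample above:

// Hypothetical target for the wand, using the same framework API:
const vase = Item.newEntity("porcelain vase", ["vase"]);
vase.moveTo(player);   // or wherever the object belongs
vase.zappable = true;  // opt in to the zap verb
vase.zaps = 0;         // counter must start at 0 (undefined++ yields NaN)

Amusingly, ChatGPT initialized wand.zaps, even though its verb handler counts zaps on the target object.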

Can LLMs help us write CPU emulators?#

ChatGPT-3.5

Write a JavaScript program that emulates the 6502’s ADC instruction.

function adc(memoryLocation, carryFlag, accumulator) {
  let sum = memoryLocation + accumulator + carryFlag;
  let carry = (sum > 255) ? 1 : 0;
  let result = sum & 0xff; // truncate to 8 bits
  let overflow = ((~(memoryLocation ^ accumulator) & (memoryLocation ^ result)) & 0x80) >> 7;
  // calculate overflow flag
  return {result, carry, overflow};
}

At first glance, it forgot a few things, like the zero and negative flags and BCD mode. We weren't very specific in the prompt, so it may have assumed these were implemented elsewhere, or simply left them out. Let's try again, but include the full documentation for the ADC instruction.

ChatGPT-3.5

Read the following documentation and write a JavaScript program that emulates the described instruction. [[documentation for ADC instruction]]

function ADC(operand, carryFlag, decimalFlag) {
  let result = A + operand + carryFlag;
  if (decimalFlag) {
    // BCD addition
    let temp = (A & 0x0f) + (operand & 0x0f) + carryFlag;
    if (temp > 9) temp += 6;
    result = (A & 0xf0) + (operand & 0xf0) + temp;
    if (result > 0x99) {
      result += 96;
      setCarryFlag(true);
    }
    setOverflowFlag(((A ^ operand) & 0x80) == 0 && ((A ^ result) & 0x80) != 0);
  } else {
    // binary addition
    setCarryFlag(result > 0xff);
    setOverflowFlag(((A ^ operand) & 0x80) == 0 && ((A ^ result) & 0x80) != 0);
  }
  A = result & 0xff;
  setNegativeFlag((A & 0x80) != 0);
  setZeroFlag(A == 0);
  return getCycleCount();
}

That looks a lot more comprehensive! We didn’t specify a function prototype or helper functions, so we get what we get. If we had provided an example implementation of another instruction, it would have tried to follow the template.

But wait… do you spot an error?

    if (result > 0x99) {
      result += 96;
      setCarryFlag(true);
    }

In BCD mode, the Carry flag is set, but never cleared. This is unfortunately the kind of subtle error that pops up often in LLM-generated code, and it's hard to detect without comprehensive testing. We never explicitly told it to clear the Carry flag in BCD mode, but we never told it not to, either.
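For the record, the patch is small: compute the Carry in both directions instead of only setting it. A sketch, keeping the generated code's own helper and surrounding BCD logic:

// Corrected BCD branch: the Carry flag is cleared as well as set.
let bcdCarry = (result > 0x99);
if (bcdCarry) result += 96;
setCarryFlag(bcdCarry); // was: setCarryFlag(true), executed only when result > 0x99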

Another problem I've witnessed has been described as shadowing or eclipsing. This is when the model has many examples of a concept A, but few or no examples of a sub-concept B, and it's very hard to convince the model to "believe" in B. For example, I can describe an instruction called ADC2, which is a slight variation of ADC, but it will often ignore the differences and just use its knowledge of ADC. Maybe there's some prompt-crafting technique that can override this behavior, but I haven't found it.

Can LLMs help us with documentation?#

Writing reference documentation is often a chore. I wanted to see if LLMs could document each instruction of the 6502 CPU.

To help the LLM, I wanted to provide as much context as possible. I created a shell script that generates a prompt for a specific 6502 instruction. The prompt includes an example of the desired documentation and some technical data about the target instruction. Here’s a sample prompt:

Generate a Markdown documentation page for the 6502's BIT instruction,
using the following information.

Instruction Table:

| Mode | Syntax | Bytes | Cycles |
|------|--------|-------|--------|
| [ZeroPage](../modes/zeropage) | BIT zp | `24 nn` | 3 |
| [Absolute](../modes/absolute) | BIT addr | `2C ll hh` | 4 |

Internal Data:

  {mn:"BIT",am:"aa",nb:2,il:0,c1:3,c2:0,nw:0,br:0,mod:"ZVN"}, // 24
  {mn:"BIT",am:"AAAA",nb:3,il:0,c1:4,c2:0,nw:0,br:0,mod:"ZVN"}, // 2c
(nb=byte length il=illegal? c1=min cycles c2=extra cycles nw=write cycles br=branching mod=modified regs/flags)

Template for Markdown documentation page:

{{Markdown page for ADC instruction}}
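For illustration, here's roughly what that prompt generator might look like. This is a JavaScript sketch rather than the actual shell script, and the data-file names are hypothetical:

// Hypothetical prompt generator: stitches per-instruction data into a prompt.
const fs = require('fs');

function makePrompt(mnemonic) {
  const table = fs.readFileSync(`tables/${mnemonic}.md`, 'utf8');       // Markdown instruction table
  const internal = fs.readFileSync(`internal/${mnemonic}.txt`, 'utf8'); // raw opcode records
  const template = fs.readFileSync('docs/ADC.md', 'utf8');              // example page to imitate

  return `Generate a Markdown documentation page for the 6502's ${mnemonic} instruction,
using the following information.

Instruction Table:

${table}

Internal Data:

${internal}

Template for Markdown documentation page:

${template}`;
}

console.log(makePrompt('BIT'));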

This prompt will generate a Markdown page. Here’s a sample:

BIT (BIT Test)

[operand] & A

BIT performs a bitwise AND operation between the accumulator register (A) and an operand. However, the result is not stored in the accumulator. Instead, the Zero (Z) and Negative (N) flags are updated based on the result, and the Overflow (V) flag is updated based on the 6th bit of the operand.

| Mode | Syntax | Bytes | Cycles |
|------|--------|-------|--------|
| ZeroPage | BIT zp | `24 nn` | 3 |
| Absolute | BIT addr | `2C ll hh` | 4 |

  • N (Negative) – Set if the result has the most significant bit set (i.e., is negative); otherwise, it is cleared.

  • Z (Zero) – Set if the result is zero; otherwise, it is cleared.

  • V (Overflow) – Set if the 6th bit of the operand is set; otherwise, it is cleared.

Example - Test a Bit in a Variable:

    LDA #32    ; load A with 32 (0010 0000)
    BIT var    ; test bit 5 of var and set flags
    BVS bit5Set; branch if bit 5 of var is set

How do we know the generated documentation is correct? (spoiler: it’s not)

Verifying documentation correctness#

I asked GPT-4, a more powerful model, to review the previously generated documentation for the BIT instruction:

ChatGPT-4

Review the statement about the 6502 CPU for errors.

The statement about the 6502 CPU and the BIT instruction is mostly accurate, with a small error related to the Overflow (V) flag. Here is a corrected version of the statement:

BIT performs a bitwise AND operation between the accumulator register (A) and an operand. However, the result is not stored in the accumulator. Instead, the Zero (Z) and Negative (N) flags are updated based on the result, and the Overflow (V) flag is updated based on the 7th bit (not the 6th bit) of the operand.

GPT-4 found GPT-3's error. Bit numbering is a tricky problem for both humans and LLMs: bit positions are usually zero-indexed in documentation, so "bit 0" means the 1st bit.

Let’s see if it can find a more subtle flaw. I took the human-written documentation page for the ADC instruction, and tweaked it a bit to make it factually incorrect. I then asked GPT-3 to review it.

ChatGPT-3.5

Is this statement accurate? “ADC performs addition on the accumulator register (A), an operand, and the Carry flag. The result it truncated to 8 bits, and then stored in the accumulator. The Carry flag is set if the unsigned result exceeds $80 (128), otherwise it remains unchanged.”

Most of the statement is accurate, but there’s one small error. The Carry flag is set if the unsigned result exceeds 255 (not 128). Here’s the corrected statement:

ADC performs addition on the accumulator register (A), an operand, and the Carry flag. The result is truncated to 8 bits, and then stored in the accumulator. The Carry flag is set if the unsigned result exceeds 255, otherwise it remains unchanged.

Any other errors?

No, there are no other errors in the statement. It accurately describes the behavior of the ADC instruction on the 6502 processor.

Is the Carry flag really unchanged?

Yes, you’re correct. The Carry flag is actually updated by the ADC instruction, but it’s updated differently depending on whether the addition causes a carry or not.

GPT-4, however, found both errors:

ChatGPT-4

The statement about the 6502 CPU and the ADC instruction contains some errors. Here is a corrected version of the statement:

ADC performs addition on the accumulator register (A), an operand, and the Carry flag. The result is truncated to 8 bits and then stored in the accumulator. The Carry flag is set if the unsigned result exceeds $FF (255), not $80 (128), otherwise, it is cleared.

However, both GPT-3 and GPT-4 have already been trained on 6502 documentation. I’d expect their performance to drop if presented with a novel domain, or a more complex problem.

We can’t rely on the accuracy of the LLM’s generated documentation, and we can’t rely on the LLM to fact-check itself. Unlike code, we can’t run the documentation and see if it crashes, or write unit tests against it. For now, we’ll probably need humans to review documentation – after all, they’re the target audience!

Documenting without context#

LLMs can often find patterns in things with minimal context. For example, I fed ChatGPT a JSON object from a 6502 processor test database.

ChatGPT-3.5

Decode this test data and analyze.

{ "name": "8b ff a1", "initial": { "pc": 3531, "s": 186, "a": 163, "x": 5, "y": 20, "p": 42, "ram": [ [3531, 139], [3532, 255], [3533, 161]]}, "final": { "pc": 3533, "s": 186, "a": 5, "x": 5, "y": 20, "p": 40, "ram": [ [3531, 139], [3532, 255], [3533, 161]]}, "cycles": [ [3531, 139, "read"], [3532, 255, "read"]] }

The test data appears to be describing a computer program that executes on a machine with an 8-bit processor. The program has a starting state, described in the “initial” field, and a final state, described in the “final” field, and the “cycles” field gives a log of the memory access cycles that occurred during the program’s execution.

The “name” field seems to be an arbitrary identifier for the program. The “pc” field in the “initial” and “final” states indicates the value of the program counter (PC), which is the memory address of the next instruction to be executed. The “s”, “a”, “x”, and “y” fields indicate the values of the stack pointer, accumulator, X register, and Y register, respectively. The “p” field indicates the value of the processor status register, which contains flags that indicate various conditions, such as whether the last arithmetic operation resulted in a zero or carry.

The “ram” field gives the state of the machine’s memory. It is a list of 2-element arrays, where the first element is the memory address and the second element is the value stored at that address.

The “cycles” field gives a log of the memory access cycles that occurred during the program’s execution. Each element in the list is a 3-element array, where the first element is the memory address being accessed, the second element is the value being read or written, and the third element is a string indicating whether the access was a read or a write.

It seems to describe this test data pretty well without any documentation, other than the names of the JSON properties!
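The nice thing about this kind of test data is that it can be validated mechanically. Here's a sketch of how one record might drive an emulator test; the cpu object and its methods are hypothetical stand-ins for whatever emulator API you have:

// Hypothetical harness: run one test record against a 6502 emulator.
function runTest(test, cpu) {
  // Load the initial CPU state and memory.
  const { pc, s, a, x, y, p, ram } = test.initial;
  cpu.setState({ pc, s, a, x, y, p });
  for (const [addr, value] of ram) cpu.writeMem(addr, value);

  // Execute one instruction.
  cpu.step();

  // Compare registers and memory against the expected final state.
  const got = cpu.getState();
  for (const reg of ['pc', 's', 'a', 'x', 'y', 'p']) {
    if (got[reg] !== test.final[reg]) {
      console.log(`${test.name}: ${reg}=${got[reg]}, expected ${test.final[reg]}`);
      return false;
    }
  }
  for (const [addr, value] of test.final.ram) {
    if (cpu.readMem(addr) !== value) return false;
  }
  return true;
}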

Can LLMs help us brainstorm ideas?#

One of my favorite ways to brainstorm is to ask a chatbot to generate an instruction manual or a press release. Here’s an example:

ChatGPT-3.5

Write the instruction manual for an award-winning Atari 2600 homebrew game about a pickle factory.

Objective:

Your objective is to produce as many high-quality pickles as possible while avoiding hazards. The game is divided into levels, and your goal is to reach the end of each level to progress to the next one.

Gameplay:

You start the game with a set number of jars. Use the joystick to move your character left or right to catch the falling cucumbers. Once you catch a cucumber, move your character to the chopping area and press the button to chop the cucumber. Move your character to the jar and press the button to fill the jar with chopped cucumbers. Once the jar is full, move it to the conveyor belt to be transported to the next area. Along the way, you’ll encounter hazards such as rats and spoiled cucumbers. Avoid these hazards by moving your character out of harm’s way. You have a limited amount of time to complete each level, so work quickly and efficiently.

Scoring:

Your score increases with each jar of high-quality pickles you produce. You’ll receive a bonus score for completing each level within the allotted time. Avoiding hazards will also earn you bonus points.

Game Over:

The game ends when you run out of jars or fail to complete a level within the allotted time. Your final score will be displayed on the screen.

Write the top 5 potential names for this game.

  • Pickle Pursuit

  • Cucumber Catcher

  • Dill Dilemma

  • Vinegar Venture

  • Brine Brigade

It can also help with brainstorming plots and puzzles. Some sample prompts:

  • Consider an interactive fiction title based on The Big Lebowski movie. Write an outline of the game in three acts, three scenes per act.

  • Write a room and object list for Act 1, Scene 1.

  • What are the puzzles in The Dude’s Apartment?

  • Write the Inform 7 code for the “Fix the Broken TV” puzzle.

Conclusion#

LLMs have limitations:

  • LLMs are based on statistics, not logical rules.

  • LLMs make mistakes, and often, we have no visibility into the root cause, nor any definite remedy.

  • LLMs also have a limited “context window” that puts a cap on the complexity of the information they can ingest without re-training.

Therefore, it makes sense to use LLMs in applications where their output can be validated, or where correctness is not vital.

From best to worst (IMHO):

  1. Boilerplate, brainstorming, poetry and amusement

  2. Coding assist, coding from examples, test case generation

  3. Large projects, niche languages, technical documentation

Many of these examples were done with GPT-3.5; GPT-4 and its successors will be more powerful. There will also likely be new techniques for priming an LLM with domain-specific information, like the particulars of an 8-bit platform, CPU, language, or library. OpenAI has just released plugins which will open up new possibilities.

In conclusion, while LLMs are powerful tools for natural language processing tasks, including AI chat assistants and programming helpers, they are not yet capable of making programmers of obsolete technologies obsolete themselves.

(FYI: This final paragraph was written by ChatGPT.)