Saturday, June 27, 2020

Weak bits floppy disc protection: an alternate origins story on 8-bit

Floppy disc copy protection schemes are varied and fascinating. They pose an interesting intellectual challenge: what schemes can be created whereby home computers could reliably read a given disc, but not be able to easily (or at all) write that data back in the same format?

When talking about floppy disc copy protection code, I find it useful to split into two separate pieces:
  • The on-disc bits that are tricky to replicate.
  • The loader code that obfuscates the check of the on-disc bits and the game code, making the check hard to "patch out".
This post concerns the former: on-disc bits. Some of my favorite on-disc schemes include the so-called "Spiradisc" protection scheme and the fuzzy bits scheme used by Dungeon Master for the Atari ST, as explained in excellent detail in this article. I also recommend this great overview of Commodore 64 disc protection schemes. To be charitable, the Commodore 64 had a... "quirky" disc drive setup.

I recently encountered an on-disc scheme on a BBC Micro disc that was more sophisticated than I was expecting: weak bits. Weak bits and similar schemes were celebrated in the 16-bit era, but here we have them in the 8-bit era.

Floppy what now?

Only the left two are actually "floppy", and the middle 5.25" disc is actually called a "mini" floppy!

Floppy disc drives and controllers are likely simpler than you think. The job of the floppy drive is to take the analog magnetic information on a disc surface and turn it into a series of digital pulses. The job of the floppy controller is to take the series of digital pulses, discern timing, and generate a string of data bytes. The floppy controller usually also has the responsibility of spotting special marker bytes in the pulse stream so that distinct sectors can be identified.

This oscilloscope trace from a 1980s disc drive may help you visualize things better:

Blue line is the analog amplified read head signal; yellow pulses are the digital read output from the drive

Every peak in the blue line is the drive head sensing a magnetic flux reversal on the disc surface. Upper and lower peaks are treated identically and result in a digital pulse getting sent to the disc controller, which is the yellow line.
Note that tools such as an OmniFlop, KryoFlux or GreaseWeazle might occasionally refer to "raw flux" reads or dumps, but beware that they only get to see the yellow line in the trace above, which is a lossy view of the blue line.

Looking at the yellow line, you may notice that there's a fairly regular cadence to the peaks. In fact, every 4 microseconds, there is a timing "slot" and there will either be a pulse or no pulse. If there's a pulse, that's a 1 bit. No pulse and that's a 0 bit. And as simple as that, there's a bit stream for the disc controller to interpret and hand off to the host computer.

Different computers use different encoding schemes. We're focusing on the BBC Micro in this post, where most discs used FM encoding, aka single density. This is a very simple encoding, as evidenced by the fact that we can eyeball the 0 and 1 bits in the scope trace above. MFM encoding was generally more common in the era.

Sectors in the stream

The disc controller takes the pulse stream and makes sense of it. As mentioned above, in FM (sometimes called DFM) encoding, each pulse or non-pulse represents a 1 or 0 bit. These bits are a mix of clock bits and actual data bits. Clock bits are needed for a couple of reasons: as a source of timing to sync to, and also to prevent the disc drive from thinking it has lost the signal. There must be a pulse at least every 8us to keep everything reliable. FM encoding takes the simple route: every other on-disc bit is a clock bit and they will all be 1 almost all of the time.
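To make the interleaving concrete, here's a minimal C sketch (my illustration, not code from any real filing system) that splits the 16 on-disc bits of one FM-encoded byte back into its clock byte and data byte:

   #include <stdint.h>

   /* One FM-encoded byte occupies 16 on-disc bit cells, interleaved
      c7 d7 c6 d6 ... c0 d0: every other bit is a clock bit. */
   static void fm_split(uint16_t cells, uint8_t* p_clocks, uint8_t* p_data) {
     *p_clocks = 0;
     *p_data = 0;
     for (int i = 7; i >= 0; --i) {
       *p_clocks = (uint8_t) ((*p_clocks << 1) | ((cells >> (i * 2 + 1)) & 1));
       *p_data = (uint8_t) ((*p_data << 1) | ((cells >> (i * 2)) & 1));
     }
   }

A normal data byte will split out with clocks == 0xFF; a couple of special marker bytes, described below, will not.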

Now that we know how to separate data bits from clock bits, let's have a quick look at what a sequence of data bytes on the disc surface might look like. There's a simple protocol for describing sectors on the disc, called the "IBM Diskette" format, which is described in the 8271 datasheet.

It's fairly simple and it looks like this in hex:

FF FF FF 00 00 00 FE 00 00 00 01 F1 D3 FF FF FF 00 00 00 FB 01 02 03 04 05 ..

The runs of FF and 00 bytes are padding between sectors, or between sector headers and sector data. They help the controller maintain correct synchronization, and to re-gain synchronization at the correct point. The padding sequences are typically longer on a real disc, but are shortened above for clarity; they would still suffice.

The sector header identification byte is the FE, which is followed by 4 bytes of sector header and a 2-byte CRC. The sector is declaring it is on track 0, head 0, and it is sector 0, sized 256 bytes. The CRC is correct. The sector data identification byte for non-deleted data is FB, and 256 bytes of data are expected to follow, then another 2-byte CRC. A typical track might contain 10 such padding + sector header + padding + sector data sequences.

The astute reader would ask: "oh! but what if the FE or FB bytes occur in actual sector data, as they are bound to do from time to time?"
That is a great question and the answer lies in the clock bits. Every byte above has 0xFF for its clock bits (all set) except 0xFE and 0xFB, which have 0xC7 for their clock bits, i.e. some of the clock bits are missing! That makes it possible to identify the sector header and sector data markers unambiguously. Note that the combination of special data byte plus clock byte is chosen so that the invariant of never having two 0 bits in a row on the disc surface is still kept.
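That invariant is easy to check with a small C program (again my own sketch, not anything from a real DFS): interleave the marker data bytes with their 0xC7 clocks and scan for two adjacent 0 bits:

   #include <assert.h>
   #include <stdint.h>
   #include <stdio.h>

   /* Interleave a data byte with its clock byte into the 16 on-disc
      FM bit cells, c7 d7 c6 d6 ... c0 d0. */
   static uint16_t fm_interleave(uint8_t data, uint8_t clocks) {
     uint16_t cells = 0;
     for (int i = 7; i >= 0; --i) {
       cells = (uint16_t) ((cells << 2) | (((clocks >> i) & 1) << 1) |
                           ((data >> i) & 1));
     }
     return cells;
   }

   /* Look for any two adjacent 0 bits, i.e. more than 8us between
      pulses. (Byte boundaries also matter on a real track; ignored
      here for simplicity.) */
   static int has_adjacent_zeros(uint16_t cells) {
     for (int i = 0; i < 15; ++i) {
       if (((cells >> i) & 3) == 0) {
         return 1;
       }
     }
     return 0;
   }

   int main(void) {
     assert(!has_adjacent_zeros(fm_interleave(0xFE, 0xC7)));
     assert(!has_adjacent_zeros(fm_interleave(0xFB, 0xC7)));
     printf("marker bytes keep the pulse invariant\n");
     return 0;
   }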

BBC Micro disc protection journey

The BBC Micro disc protection journey is a fairly meandering one. It seems to have been based on a classical arms race approach: a software publisher would publish a new disc with some new quirk on it, and then a new disc copy program would come out which understood the quirk. And repeat a few iterations!
The journey is also complicated by two very different disc controller chips being used during the machine's lifetime: the Intel 8271 and the Western Digital 1770/1772. Both chips had different capabilities and quirks. A good copy protection on one of the chips might be trivial to copy on the other. Also, some earlier BBC Micro discs were made that did not end up being compatible with 1770 based systems.

To enumerate some of the techniques seen:
  • Early Micro Power titles, such as Ghouls, used non-standard numbers of sectors on many tracks, such as 5 sectors of 512 bytes each instead of 10 sectors of 256 bytes each. This is easy to copy if you have a copy program that checks the sector headers first and then decides how many there are and what sizes based on that. It would be easy to scoff at such a simple protection but it's likely it did what it needed to at the time of launch: prevent casual disc copying using the built-in DFS (Disc Filing System) commands *COPY or *BACKUP (both of which expected well-formatted DFS discs, which meant 10 sectors per track).
  • The Superior Software classic, Citadel, used "deleted data". Every sector on a disc can be marked as either normal or deleted by the one byte mark that occurs directly before the sector data bytes. This is again a simple protection that a copier can handle as long as it knows about the concept of deleted data. What's interesting is that Superior did release a few titles with more advanced protection -- most notably the legendary Exile. However there are stories about the protection not loading correctly with some setups, so Superior discs from then on can all be seen using just the simple deleted data protection. Presumably: once bitten, twice shy.
  • The most iconic game on the BBC Micro may well be Elite. One of the tricks it used was an "unformatted track" in the middle of the disc. An unformatted track does not have any recognizable sector headers. This is perhaps the first protection that made disc copy programs sweat. Particularly on the 8271 floppy controller, there is no "unformat" command, only a "format" command. This was resolved decisively in favor of the disc copy programs, though, with a clever trick: if you format a track as one giant 4096 byte sector, you win immediately. What happens is that the single sector header is written at the start of the track, but then the 4096 bytes of sector data wrap around across the end of the track (which is 3125 bytes long) and trash the only sector header! So the track isn't really unformatted, but all that is needed is that the disc controller doesn't see any sector headers -- mission accomplished.
  • Another fairly ubiquitous protection was mismatched physical / logical track IDs in the sector header. This was also used by Elite, which also used unusually large logical sector IDs. This is easy to copy if you know you're looking for the situation. Normally, the disc controller (particularly the 8271) will get upset if it sees a track mismatch but you can fake the controller out by setting its internal track register to match, with a special command.
  • Later on in the BBC Micro's life, some publishers of software on disc upped their game. It became common to see extra data bytes "hidden" in between sectors. Simplistic attempts to copy these bytes would fail by overwriting a following sector. 1770-based disc copiers can handle this situation easily unless the hidden bytes are "reserved" in the 1770's write track protocol. It is possible to copy these situations well with both an 8271 or 1770 based copier, but I haven't yet found one that makes a decent effort. It is necessary to do things like take direct control of the disc controller chip, and issue commands with precise timing, and abort the controller mid-command with similarly precise timing.
Another favorite pastime of disc protection authors was a creative way of reducing copying without necessarily improving the underlying protection mechanism: attack the copy program itself. The copy programs needed a bunch of logic to work out what sector setup they were encountering. It's not easy logic, so breaking it was fruitful. Here are a few shots (under emulation, but real hardware behaves identically :) of Exile breaking Vector 2:

Loading Vector 2

Second track correctly identified as having 18 sectors

A beeeeeep and a crash / hang processing the second track

Sherston Software weak bits protection: introducing "soft lock"

Given the copy protection arms race described for games above, it was a surprise to find a smaller educational software house with an on-disc protection format light years ahead. Sherston Software had a great catalog of software that my kids still enjoy today, with our current favorite being Space Mission Mada.

Many Sherston Software discs use weak bits protection. We should briefly define terms because they are not consistently used; the most sensible and broadly agreed terms appear to be:
  • Flaky bits. Any on-disc bits that do not read back consistently. They are flaky.
  • Weak bits. Flaky bits caused by a weak signal or non-existent magnetic signal on the disc surface. You might also see the term no-flux area (NFA), which is the same as a non-existent signal. Weak bits are almost always a non-existent signal, as opposed to a weak signal. The flaky nature of weak bits actually comes out of the drive electronics: when there are no clear flux changes, the drive just amplifies harder until it starts seeing and signaling ghosts within the noise.
  • Fuzzy bits. Flaky bits caused by a strong, clear signal but where the timings of the read pulses are borderline. The borderline timing means the disc controller chip cannot be sure whether a pulse is supposed to be a 1 bit or a 0 bit. It'll change its mind from read to read.
The use of weak bits is advanced because it's one of the on-disc protections that arises naturally from a first-principles analysis:
  • Weak bits give a "reliable" read result: you can reliably depend on them to read back differently from read-to-read!
  • Weak bits cannot be written by the standard disc controllers. The disc controllers do everything in their power to lay down bits that read back deterministically. That's their job! There's no "write weak signal" flag and there's no "write timing violations" flag.
I wrote to the author of Space Mission Mada, Simon Hosler. It turns out he also devised the weak bits protection, along with his electronics geek next door neighbor! In Simon's words:

"Soft lock (was what we called it) was actually my system, so what I remember… This came about because I lived next door to an electronics geek! 😊 So break the write data line of the parallel disk cable. Add a bit of electronics to this line. (thank you Mike) Most of the time this electronics does nothing – lets the data go through as normal. If you turn it on (I think I did this through the serial port) and write to a single sector - it would count the bits going through say 256 – and then stop the next 256 bits going through"

I happen to have an original Sherston disc with weak bits protection, Animated Alphabet. Here's what the weak bits patch of disc looks like with the drive wired up to an oscilloscope:

The blue line shows the on-disc magnetic signal come and go as Simon's widget toggles on and off. The yellow line shows the drive still emitting read pulses to the floppy controller (quasi-randomly) even in patches of no signal

The effect of Simon's widget in creating batches of weak bits can clearly be seen. This really was genius for the time, being one of the earlier swings at creating disc surfaces fundamentally uncopyable without special hardware. Not only that, but it works the same on both 8271 and 1770 based systems, since it's the drive electronics that are being induced to create the quasi-randomness. This leads to a simpler, more reliable setup with just a single code path. It is also very compatible with the myriad different DFS (Disc Filing System) variants because the code to check the copy protection doesn't need fancy DFS calls. It just needs to read a sector -- very standard! -- a few times and see if the bytes coming back vary or not.
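Here's roughly what that check could look like, sketched in C. To be clear, this is my illustration of the idea, not Sherston's actual loader, and read_sector() is a hypothetical stand-in for whatever standard sector-read call (e.g. OSWORD &7F on the BBC) the real code would use:

   #include <stdint.h>
   #include <string.h>

   /* Hypothetical helper: read one 256 byte sector into a buffer. */
   extern void read_sector(uint8_t track, uint8_t sector, uint8_t* p_buf);

   static int looks_like_original(uint8_t track, uint8_t sector) {
     uint8_t first[256];
     uint8_t again[256];
     read_sector(track, sector, first);
     for (int attempt = 0; attempt < 4; ++attempt) {
       read_sector(track, sector, again);
       if (memcmp(first, again, sizeof(first)) != 0) {
         /* The bytes vary from read to read: weak bits present. */
         return 1;
       }
     }
     /* Stable reads every time: almost certainly a copy. */
     return 0;
   }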

State of the art

I don't know when the first flaky bits based disc protection was released. There's probably someone out there who can point to an example from the 1970s! But it's worth comparing dates on some examples we do know about.

Dating the first Sherston Software title to use weak bits protection is tough. I have an image of Mr. Yog and the Nippet, (c) 1984, with weak bits protection. There's even an image of Short Vowel Sounds, (c) 1983, with weak bits protection. However, we have also seen the same Sherston title released across multiple different disc revisions with different copy protection systems. We can establish an upper bound on the date of the weak bits protection with an image of The Wizard's Revenge that happens to use the weak bits protection and also includes a commercial duplicator fingerprint on its one-past-the-end track, as seen here in this hex editor view:

This data, starting with 01 02 03 04 05 on the first line, appears to be added by a commercial duplicator from the era

Fingerprints of this nature contain a date and time: 87 01 05, or Jan 5th, 1987 in this case.

So in terms of release timing of early flaky bits based protection, we have:
  • Weak bits: The Wizard's Revenge, Sherston Software, Jan 5th 1987.
  • Fuzzy bits: Dungeon Master, Dec 15th 1987.
So it seems likely that the Sherston weak bits went to production almost a year prior to the Dungeon Master fuzzy bits. Although these disc protections are not identical, they have very similar properties and capabilities.

What's further interesting is that the Dungeon Master fuzzy bits were the subject of a patent filed in 1986. This could be one of those cases where other clever people had come up with prior art. It happens a lot.

Copying weak bits with original 1980s BBC Micro drives and controllers

No retro research would be complete without an attempt to push the bounds of what was thought possible back in the day. Accordingly, would it be possible to create a software only solution to write weak bits to discs? Another way to ask the same question is: would it have been possible to create a disc copier that copied weak bits correctly?

It turns out the answer is yes! Or more specifically, the answer is a double yes. Using two different tricks, it is possible to create actual weak bits with the 8271 disc controller, and non-deterministic reads (resulting in the same sort of read effects as weak bits) with the 1770 disc controller.

8271

The 8271 is a slightly strange chip to program. One thing it has is the concept of "special registers" which can be read and written. One such special register is the "Drive Control Output Port":


The special registers aren't particularly well covered in the data sheet, perhaps because they aren't supposed to be necessary for usage. That said, most 8271 driver implementations write this register to spin up the drive (by setting the LOAD HEAD bit and appropriate drive SELECT bit) in order to work around the 8271's propensity to fail read/write commands with "drive not ready".

The trick is to use this register to set the WRITE ENABLE bit as well as the LOAD HEAD and SELECT bits. This can only be done outside any other command, because Write Special Register is itself a command and selecting it will abort any other in-progress command. When WRITE ENABLE is active outside of a command, the disc drive's write head will be energized but no data pulses will be transmitted on the write data pin to the disc drive. The result is that the write head sweeps the disc surface clean of flux transitions. That creates weak bits / a no flux area. With a bit of careful timing, the write head can be energized and de-energized at any point(s) needed on a track to create weak bits where desired.
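Here's a sketch of the trick in C (cc65-style, running on the BBC Micro itself). To be clear about assumptions: the SHEILA addresses are the standard BBC 8271 locations, but the special register index and bit masks are my reading of the datasheet and should be verified against your copy; the status polling that real code needs between register accesses is also elided:

   #include <stdint.h>

   #define P_8271_CMD    ((volatile uint8_t*) 0xFE80)
   #define P_8271_PARAM  ((volatile uint8_t*) 0xFE81)

   /* Assumed values -- check against the datasheet. */
   #define CMD_WRITE_SPECIAL_REGISTER  0x3A
   #define REG_DRIVE_CONTROL_OUTPUT    0x23
   #define BIT_SELECT_0                0x40
   #define BIT_LOAD_HEAD               0x08
   #define BIT_WRITE_ENABLE            0x01

   static void write_special_register(uint8_t reg, uint8_t val) {
     /* Real code must poll the status register before each of these
        writes; elided for brevity. */
     *P_8271_CMD = CMD_WRITE_SPECIAL_REGISTER;
     *P_8271_PARAM = reg;
     *P_8271_PARAM = val;
   }

   static void erase_weak_bits_patch(void) {
     /* Energize the write head with no data pulses: this sweeps the
        surface clean of flux transitions as it passes under the head. */
     write_special_register(REG_DRIVE_CONTROL_OUTPUT,
                            (BIT_SELECT_0 | BIT_LOAD_HEAD |
                             BIT_WRITE_ENABLE));
     /* ... busy-wait for however long a patch of weak bits is wanted;
        timing relative to the index pulse positions it on the track ... */
     write_special_register(REG_DRIVE_CONTROL_OUTPUT,
                            (BIT_SELECT_0 | BIT_LOAD_HEAD));
   }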

1770

The trick on the 1770 is different. It may be possible to directly create weak bits by creatively programming the 1770 -- but if it is, I haven't found it. Instead, we focus on the BBC Micro's 1770 control register. This is a register external to the 1770 that is necessary to control external 1770 pins. For example, unlike the 8271, the 1770 selects disc drive 0 vs. drive 1 via an external pin instead of as parameters passed to controller commands. The "Drive Control Register" is documented well at the bottom of this document.

Since this register is external to the 1770, we can mess with it while a command is in progress. The specific trick we use is to start a single density write command and then flip to double density some number of bytes in. Double density (MFM) timing is completely different to single density timing, so reading back the resulting patterns in single density confuses the disc controller significantly, to the point that non-deterministic read results are returned. This is not weak bits on the disc, but the effect is the same as far as the copy protection check is concerned: read the sector twice and check the result varies! The bits definitely end up flaky.
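As a sketch, the sequence could look something like the C below. The 1770 register locations are the usual BBC 1770 interface ones, but the drive control register bit assignments vary by board, so treat every constant here as an assumption to check against the documentation linked above:

   #include <stdint.h>

   #define P_DRIVE_CONTROL  ((volatile uint8_t*) 0xFE24)
   #define P_1770_STATUS    ((volatile uint8_t*) 0xFE28)
   #define P_1770_CMD       ((volatile uint8_t*) 0xFE28)
   #define P_1770_DATA      ((volatile uint8_t*) 0xFE2B)

   /* Assumed values -- check your interface's documentation. */
   #define CTRL_SELECT_0        0x01
   #define CTRL_SINGLE_DENSITY  0x08
   #define CMD_WRITE_SECTOR     0xA0
   #define STATUS_DRQ           0x02

   /* Assumes the track / sector registers are already set up and the
      head is on the right track. */
   static void write_flaky_sector(void) {
     *P_DRIVE_CONTROL = (CTRL_SELECT_0 | CTRL_SINGLE_DENSITY);
     *P_1770_CMD = CMD_WRITE_SECTOR;
     for (int i = 0; i < 256; ++i) {
       while (!(*P_1770_STATUS & STATUS_DRQ)) {
         /* Wait for the controller to request the next byte. */
       }
       *P_1770_DATA = 0x00;
       if (i == 128) {
         /* Flip to MFM mid-sector. The remaining bytes go down with
            double density timing and will read back
            non-deterministically in single density. */
         *P_DRIVE_CONTROL = CTRL_SELECT_0;
       }
     }
   }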

Does it work? Yes. Here's a video of me using my work-in-progress "discbeast" utility. We fix a failed copy of Sherston weak bits by using commands to directly and precisely create flaky bits at the correct point on the disc.


Tuesday, June 23, 2020

A wild bug: 1970s Intel 8271 disc chip ate my data!

If you must suffer data loss then at least make sure you collect a good story from the experience!

I believe I have such a story.

Recently, I've been enjoying keeping my inner engineer sharp by playing around with retro software and hardware. I've taken a particular interest in floppy disc drives and old-school floppy disc protection on the BBC Micro. The BBC Micro had a couple of different popular floppy drive controllers: the older Intel 8271 and the newer Western Digital 1770/1772. Both controllers have different capabilities, and quirks by the bucketload!

My data loss occurred while experimenting with the Intel 8271, a 1970s chip and design. It's a good size and runs a little hot.

Intel 8271 (left) in a BBC Micro model B

There is not a lot about this chip on the internet, but there's a pretty good datasheet, which includes this description regarding the reset register:


It is no doubt my security background that caused me to read this description and then promptly write the value 100 (not 1 or 0) to this register.

It is at this point that my drive became possessed. Re-trying showed that setting bit D2 alone by writing the value 4 is sufficient to cause the demonic infestation:


Different drives behave differently. This drive, an older single head Chinon F-051MD, seems to activate all of its widgets: the head wobbles wildly and the motor sometimes spins. My newer Mitsubishi drive makes a terrible, awful high pitch screech. It is my favorite drive and I do not wish to hear that sound ever again so there is no video.

Here is where it gets crazy: when I first did this, there was a disc in my Chinon drive and the data on that disc became toast... despite the hardware write protect tab being present.

What?

Indeed. This is sufficiently bizarre that it is worth investigation! What signals is the floppy drive receiving when the poltergeist is present? As a reminder, the BBC Micro uses a Shugart style (not PC style!) connection to the drive:

From the controller to the drive, we have Select 0, Select 1, notDIR (step direction), pin 20 (step; not labeled in the service manual diagram above), Write/DATA, WRite/ENable, Side/SELect. To sample a couple of these:

Pin 16, MOTOR

Pin 24, WRite/ENable

There are similar signals on other pins from the controller to the drive -- and they are not normal! Flipping the motor on and off at 375kHz is not an expected controller behavior and it may even be bad for a motor; I don't know.

The disc corruption comes from the abnormal signal above on the write enable pin. This is how the controller requests the drive to power up its write head, and the drive is being asked to do so at 300+kHz. There's also one of these crazy signals on the step and step direction pins, so the drive is seeking all over a range of tracks while being asked to activate the write head. Not good.

It is probably reasonable for us to fault the controller here. Under the heading "Write Protect" on page 8-124 of the data sheet, it says "The 8271 will not write to a disk when this input pin is active". This contract is clearly being violated as the write protect is being signaled by the drive.

It takes two to tango

But there is more to this! Most floppy drives also have a last-line-of-defense. For example, the application notes for my Mitsubishi MF503B drive clearly state "This signal goes to ‘0’ when a write protected disk is inserted into the drive. Writing is inhibited even if the write gate is active." My Chinon drive seems to get this right most of the time, but not in the presence of wild oscillating signals. There appears to be a race condition -- perhaps if the drive is selected while the write enable gate is already active -- where a bit of current will go through the write head before the write protect condition is noticed.

Looks like I got unlucky -- wrong controller, wrong drive.

Two of my more colorful BBC Micro original discs, silver write protect tabs on display. Luckily neither of these is the disc I trashed!

Intel 8271 speculation

There isn't a whole lot on the internet about the Intel 8271. It's a slightly older floppy controller chip from the 1970s. Some of its capabilities hint at design tastes and directions that were retired in the 80s. For example, the chip is able to use host memory DMA to asynchronously scan within disc sectors for arbitrary byte strings. Perhaps a relic from the mainframe era? It wasn't used on the BBC Micro; the DMA capability wasn't wired up at all.

No other 1980s home computer that I'm aware of used the 8271. There were good reasons not to. Probably most significantly, the 8271 doesn't support double density aka. MFM, which the Western Digital 1770/1772 did. So using the 8271 carried a hit on storage capacity. The BBC Micro lived with this hit even as 1770 based machines became common. Commercial software needed to work on all machines, including 8271 based ones.

One suspicion I have is that the 8271 could be based on the Intel MCS-48 microcontroller. A few things line up, such as dates of design and manufacture. But there are also clues in the API offered. The "Read/Write Special Registers" commands are of particular interest:

As you can see there are a lot of special registers, with only a sparse few documented. The highest documented index is 0x22, or 34 in decimal.
One theory would be that the read/write special register simply reads or writes one of the 64 bytes of RAM present in an 8048 (or variant). There's plenty of evidence that most special register indexes seem to do something. For example, the seek and settle timing parameters end up in special registers 0x0D and 0x0E even though it is not documented anywhere. Also undocumented, 0x1D appears to mirror the formal data register. And so on.
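If you have an 8271 to hand, probing this theory is straightforward using the documented Read Special Register command. A quick C sketch; the three helpers are hypothetical stand-ins for driving the chip with proper status polling:

   #include <stdint.h>
   #include <stdio.h>

   /* Hypothetical helpers: issue a command byte / parameter byte to
      the 8271 and fetch a result byte, with status register polling. */
   extern void do_8271_command(uint8_t cmd);
   extern void do_8271_param(uint8_t param);
   extern uint8_t get_8271_result(void);

   static void dump_special_registers(void) {
     /* If the MCS-48 theory is right, these might simply be the 64
        bytes of internal microcontroller RAM. */
     for (uint8_t reg = 0; reg < 64; ++reg) {
       do_8271_command(0x3D); /* Read Special Register */
       do_8271_param(reg);
       printf("%02x: %02x\n", reg, get_8271_result());
     }
   }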

And if the 8048 theory holds water, perhaps the undocumented and crazy bits of the reset register correspond to one of the test modes, single step mode, or external memory access mode?

From the archived MCS-48 users' manual

Summary

This is the most fun I've had while sustaining serious data corruption.

If you have any knowledge or memories of how the 8271 is actually architected, or details of undocumented behaviors, please get in touch!

Tuesday, June 2, 2020

Sampled sound 1980s style: from SN76489 square waves to samples

In the early 1980s, the Texas Instruments SN76489 sound chip was found in a variety of personal computers, consoles and arcade cabinets.

The legend, the chip itself

It is a fairly simple chip, consisting of 3 "tone channels", and one "noise channel", typically mapped to 3 note channels and one drum channel. The tone channels play square waves and the noise channel can emit either white noise or "periodic" noise. Despite this simplicity, human creativity -- as usual -- came to the rescue to bang out some good sounding tunes.

A couple of examples for you: Galaforce title music and Icarus title music (both BBC Micro).

Programmers even worked out how to coax sampled sound out of this beast back in the day! Sampled sound, from square waves?! The (very brief) sampled speech intro to Spy Hunter is a good example.

This blog post concerns how the mechanics of the sampled sound actually work, but first a quick demo of my own attempt to advance the art beyond the sample discs produced back in the day. My own sample disc playback looks like this:

"sampstream" running in beebjit
 
Amusingly, the sample playback quality on a 1980 chip is obviously not too badly mangled because YouTube content ID recognizes emulator recordings of the commercial songs I chose as demos 😆 But you can click here or click here to play these sampstream demo discs in the jsbeeb emulator. It sounds similar to real hardware, although real hardware sounds less harsh and louder (you may need to crank your volume up).

"sampstream" features:
  • Near instant start and uninterrupted 400k of sample playback. It uses a custom asynchronous disc loader that auto-detects the Intel 8271 vs. Western Digital 1770 disc controller and drives it directly and asynchronously.
  • A real-time 50Hz oscilloscope view.
  • Noise bands that show disc read bandwidth vs. sample consumption bandwidth
sampstream does not feature best-in-class sample pre-processing and playback. It also does not use compression, despite samples being pretty compressible. So it's a tech demo that could be made to sound better and play for longer. The pcmenc project is the state-of-the-art I'm aware of for good quality playback. (A lot of what it does is fight against the poor 4-bit, non-linear per-sample resolution, using clever techniques such as multi-channel balancing and mid-point rejigging.)

[sampstream, and the quicdisc library it is based on, are open source and hosted on GitHub here]

So how does sample playback actually work? The sampstream example keeps things simple and uses only one tone channel. For a SN76489 tone channel, we control:
  • The channel period, i.e. the frequency of the square wave output -- 1024 possible values.
  • The channel volume -- 16 possible values.
  • A special register that outputs a byte of sampled sound. (Hahaha no.)
There's a few good references on the SN76489 online, including how to get sampled sound out of the thing. I recommend this resource over at SMS Power, as well as the MAME SN76489 driver source code to give you an idea of the flavors and variants.

However, some sources hand-wave regarding how it actually works, so let's cover that here by getting out our trusty oscilloscope and applying it to the SN76489 in the BBC Micro.

The usual basis of sampled sound playback is driving the SN76489 as a PCM device, i.e. a sequence of sample amplitude values. Everyone seems to agree this is initiated by setting a tone channel to a very high frequency, or very short period. There is disagreement on what period value to use, and why it works.

In the BBC Micro, the clock input to the sound chip is 2MHz and it has an internal divide-by-8, making it a 250kHz device. If we set a tone channel period to 1, it will invert that channel's output every clock tick (high, low, high, low...), forming a square wave. A full waveform cycle will need to have a high segment and a low segment, meaning that the output waveform should be a 125kHz square wave.

Investigating the output with an oscilloscope is easy because the SN76489 pinout is trivial:
Not too many pins!

The pin of interest is pin 7, "SND OUT". What does that look like with all channels silent, except for one tone channel, set to period 1 and maximum volume?

SND OUT from the SN76489 in blue

In this scope capture, the blue line is the SND OUT output from the SN76489. As can be seen it is a square wave as expected. It's at a 125kHz frequency, 720mV from peak to peak with its upper peak around 3.3v. Right away we notice a surprise: we might have expected an audio signal to center around 0v, but this signal is centered around 3v or so! The chip works like this:
  • Each audio channel contributes two alternating voltages (peak and trough) to the SND OUT total voltage.
  • Neither of the alternating voltages are negative.
  • For silent channels, the two alternating voltages will in fact be the same (about 0.8v).
  • For max volume channels, the two alternating voltages will be 0v and about 0.8v.
  • For this particular SN variant, it is "inverted" compared to what you might expect: a silent channel will output a constant voltage of 0.8v or so, not 0v. This inversion is interesting but does not affect ability to play sampled sound.
The yellow line above is the output a couple of components later in the audio chain: an LM386N-1 audio amplifier, after traversing an LM324N op-amp. We can see that already, the 125kHz frequency has been partially filtered out by the LM chips. This is why we use such a high frequency as a PCM carrier: it won't affect the audible output. The audio chain will filter it out; the speaker won't be able to respond fast enough to any remnants; and at any rate, the human ultimately at the end of the audio chain cannot hear it.

The PCM stream is modulated simply by rapidly altering the volume of the 125kHz wave, resulting in a voltage output looking like this (for our 1 tone channel case):


The above graph shows a few samples at max volume at the start, some intermediate values, and a few at silent volume (the flat top line) at the end. The red trend line shows what the actual PCM waveform will look like with the 125kHz carrier filtered out. The key for sample playback is likely that the mid-point voltage differs for each volume. Note how its amplitude is half that of the carrier wave; this is probably one reason sampled sound playback is relatively quiet.
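Putting it together, playback amounts to parking a tone channel at period 1 and then hammering its volume register at the sample rate. A hedged C sketch follows; sn_write() is a stand-in for the platform-specific bus write (on the BBC Micro, writes to the SN76489 go through the 6522 VIA), but the latch byte formats are the standard SN76489 ones:

   #include <stdint.h>

   /* Hypothetical platform helper: put one byte on the SN76489 bus. */
   extern void sn_write(uint8_t value);

   static void sn_setup_carrier(void) {
     /* Channel 0 tone period = 1: latch byte %1000dddd carries the
        low 4 bits, a following data byte carries the high 6 bits.
        This gives the 125kHz carrier. */
     sn_write(0x80 | 0x01);
     sn_write(0x00);
   }

   static void sn_play(const uint8_t* p_samples, uint32_t num_samples) {
     for (uint32_t i = 0; i < num_samples; ++i) {
       /* 4-bit sample -> attenuation: 0 is loudest, 15 is silent. */
       uint8_t atten = (uint8_t) (15 - (p_samples[i] & 0x0F));
       /* Channel 0 volume latch is %1001dddd. */
       sn_write(0x90 | atten);
       /* ... pace the writes at the sample rate, e.g. off a VIA
          timer ... */
     }
   }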

So now we've seen how a PCM waveform can emerge from just a 125kHz square wave and volume changes. There's some extra circuitry between the LM386N-1 and the speaker, to keep the speaker happy. It normalizes the oscillation around the positive voltage point to be an oscillation around 0v.

It's important to recap how the SN76489 voltage output does not behave. It does not oscillate evenly around a fixed 0v point, which would look like this:


The above would not be conducive to sampled sound playback as the average signal after eliminating the 125kHz carrier would be a constant 0v!

Although we're done with explaining sampled sound output, it's also worth looking at a few other oscilloscope traces to understand SN76489 behavior. Here's a look at the other extreme: the longest period this particular model can output, giving a 121Hz sound:

SN76489 output in green, LM386N-1 in blue

Some references state that the SN76489 cannot hold a steady square wave output, particularly for longer periods. However, this is clearly not true -- at least for this variant. The square wave output is very clean and it is likely the LM324N or LM386N-1 that is introducing interesting voltage decays into the output waveforms.

Finally, here's another SN76489 vs. LM386N-1 output for an 8kHz square wave:


The output is close to a sine wave, not a square wave! This interesting distortion at 8kHz may be explained by the 8kHz low-pass filter that some sources reference as part of the LM324N setup.

Monday, April 13, 2020

Clocking a 6502 to 15GHz (!)

6-5-0-who?

The 6502 is an iconic processor that dominated home computing in the late 70s and early to mid 80s. It was used in machines ranging from the Apple II to the Atari 2600 to the Commodore 64 to the Nintendo Entertainment System. In pop culture, it powered Bender, not to mention the Terminator!



It often clocked at a modest 1MHz, with faster variants available. In the UK, a 2MHz variant powered the legendary BBC Micro, with a 3MHz 6502 co-processor box available to bolt on the side if you so desired.



Introducing beebjit

It was not the intention of this side project to produce a working, accurate emulator, but one appears to have popped out. The core of beebjit -- and original side project -- is a just-in-time style translation of 6502 code blocks to quasi-equivalent Intel x64 code blocks. Surrounding that core is emulation for the various supporting chips of the BBC Micro, such as the Hitachi 6845 CRTC for display; the 6522 VIA for timing and I/O; the Intel 8271 floppy disc controller; the Texas Instruments SN76489 for sound; etc.

beebjit is an open-source project on GitHub.

In tests, the translated x64 code has been able to demonstrate up to 15GHz of effective 6502 speed, for purely computational tasks such as BBC BASIC programs. For games, things can be a lot slower due to heavy interactions with the graphics, keyboard and timing hardware. To run games accurately (or in some cases at all) requires cycle perfect synchronization of different virtual hardware chips. Despite these challenges, speeds in the GHz range can still be expected.

What does a multi-GHz BBC Micro system look like? It’s a lot of fun, perhaps best illustrated with a video. Here, we see the classic space shooter Galaforce. You can play it online here, in a JavaScript BBC Micro emulator called jsbeeb. In the video, we see Galaforce loading up from a disc simulated at real-time -- don’t forget to watch out for the track-to-track floppy disc seeks, visible as stutters in the loading of the title screen image! Then, after we get a feel for the speed at which the demo screens are running and cycling, we switch up into high gear. Will you be able to tell at which moment that occurs? (Hint: yes.)



Translating 6502 to x64

When starting out, I had no idea what sort of speed to expect. A naive initial guess would be “fast” because the 6502 instruction set is simple. There are only 256 available combinations of opcode type and addressing mode and only 4 or so 8-bit registers. This is a tiny subset of what the x64 can do in its instruction set, so you’d expect a trivial, fast mapping.

As is often the case, things are rarely so simple. A bit more thinking reveals a ton of factors why it might be fast and a ton why it might be slow:

Things that would tend to make 6502 -> x64 faster

  • In the “captain obvious” category, the 6502 in the BBC runs at 2MHz where a modern Intel chip is clocked in the GHz range.
  • The simplest 6502 instructions take 2 clock cycles. One example would be LDA #10, which sets register A to 10. The Intel processor can dispatch such a triviality in 1 clock cycle.
  • Depending on the level of parallelism inherent in the code, modern Intel processors can execute multiple instructions per clock cycle.
  • If you want to get into it, trivial optimizations are possible. For example, the 6502 can only shift its A register by 1 bit at a time with its LSR and ASL instructions, taking 2 clock cycles apiece. It’s common to see a few of these in a row. By contrast, modern Intel processors can shift a register by an arbitrary number of bits in just one clock cycle!
  • The x64 architecture has an abundance of registers compared to 6502, perhaps offering further opportunities to speed things up.
  • With 32KB RAM on the base model BBC model B, you could start to hope for a workload that fits into L1 caches.


Things that would tend to make 6502 -> x64 slower

  • Some 6502 instructions actually don’t map cleanly to a single x64 equivalent. There are some less common ones such as PHP (push processor flags), or SED (set decimal flag, eww!). A couple of extremely common ones, especially in critical loops, are LDA (),Y and STA (),Y. These read / write memory indirectly. The final read / write address is calculated based on issuing two preceding 8-bit reads to make a 16-bit pointer and then adding the Y register. Long story short, the “1 cycle per instruction” nirvana alluded to above certainly does not apply here. Even servicing reads from L1 cache, modern Intel processors take ~4 cycles.
  • The 6502 is from an era where performance was quite deterministic. Memory ran at the same speed (or in some cases faster!) than the processor, so caches and pipelines and reordering and pain had not yet been inflicted upon us. So, the 5-cycle 6502 instruction LDA (),Y that reads 2 bytes of instructions and 3 other bytes always takes 5 cycles. You won’t get icache misses, dcache misses, DTLB misses, ITLB misses, stalls for a million other reasons, etc. You get the point.
  • Self-modifying code. This was not only supported, but actually very common, particularly in the most performance critical code. I have to suspect that coders were enjoying themselves. Anyway, this poses huge problems. As we execute any 6502 instruction, we have to consider if it was blown away from under us. If we try to optimize anything, we have to beware of crazy stuff like an optimized-away instruction getting self-modify-nuked.
  • Synchronization and interrupts. At every 6502 instruction, we have to consider whether an interrupt condition exists that would cause us to need to execute the IRQ sequence instead. Related, there may be other hardware chips requiring cycle perfect synchronization with the 6502.
  • Annoying 6502 variable cycle counts. Some 6502 instructions that would otherwise translate very simply to x64 have a quirk where they might take a cycle longer depending on the X or Y register value. LDA ,Y and LDA (),Y are both affected. In a cycle perfect emulation, this needs accounting for.
  • Paging and hardware register access. Certain memory regions may not be stable. A lot of 6502 systems circumvented the 64KB address space restriction by allowing parts of it to be paged. Additional parts of the address space are typically given over to hardware registers. Hitting these under emulation is guaranteed to be painfully slow as some hardware chip or another will need to be emulated. ROM memory is another headache. Games can and do write to ROM memory, which must be detected and ignored!


We’ve already revealed that it’s possible to attain GHz speeds. Re-reading the above lists, it would seem that the “slows” outweigh the “fasts” so it’s worth a look at some tricks pulled to do it!

Tricks for fast 6502 -> x64 translation

Trick #1: make good use of x64’s 16 registers

To get us going, we can take a simple 6502 sample and x64 it!

   0x0000000020100031: mov    $0xa,%eax LDA #10
   0x0000000020100036: test   %al,%al
   0x0000000020100038: mov    %al,%bl TAX
   0x000000002010003a: test   %bl,%bl
   0x000000002010003c: mov    %al,0x40(%rbp) STA $C0
   0x000000002010003f: jmpq   0x20400000 JMP $4000

In the above (unoptimized) x64, we see that we’ve (permanently!) assigned the al register to the 6502 A register and bl to the X register. Note how the 6502 LDA instruction sets zero and negative flags but the x64 mov instruction does not, so the x64 code needs an extra test instruction -- another potential source of slowdown but see below. Some 6502 flags are actually stored in the x64 flags register. The rbp register is used to point to the 6502 memory (actually at offset 0x80 so that every zero page address can be accessed with a 1-byte displacement for a much shorter x64 instruction; the JIT code is performance sensitive to code size). Finally, note how the jmpq is jumping to a specific code location. By using a simple fixed mapping of 6502 code addresses to x64 code addresses, we can keep the generated code small and fast.

Trick #2: use the host’s memory management for ROM and paging

As mentioned above, a lot of software seems to like to write to ROM. In the BBC, the region $C000 - $FFFF is always the OS ROM (with some hardware registers mapped towards the end). The region $8000 - $BFFF is usually ROM, with the paged-in ROM selectable at runtime. (For example, BASIC might usually be paged in but the Disc Filing System ROM would be paged in when doing I/O to disc.) However, a lot of BBCs were upgraded to have some extra RAM that could be paged into this region.
Obviously, writes to ROM must be ignored. But we really do not want to be adding test and branch instructions to every JIT memory write! Memory management and paging in the host comes to the rescue. JIT reads and JIT writes are aimed at two different 64KB mapping regions, representing views of the emulated BBC address space. For the JIT reads, ROMs are present as expected. For JIT writes, ROM regions are replaced with “dummy” host pages that are writable but write to a scratch area of memory. (An earlier version of beebjit used fault + fixup for ROM writes instead but this was too slow for Galaforce, which just loves to write to ROM.)
This all works nicely because the ROM boundaries are at 4KB alignments, which joyously happens to match the host granularity! There’s a minor wart caused by Windows having worse alignment granularity than other operating systems, but creative straddling of boundaries saves the day.
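The mechanism is easy to demonstrate on Linux with two mmap() views of the same buffer; here's a self-contained sketch (an illustrative layout, not beebjit's actual code):

   #define _GNU_SOURCE
   #include <stdint.h>
   #include <stdio.h>
   #include <sys/mman.h>
   #include <unistd.h>

   int main(void) {
     /* One 64KB buffer holds the emulated address space. */
     int fd = memfd_create("bbc_mem", 0);
     ftruncate(fd, 0x10000);
     /* The read view sees everything, including "ROM". */
     uint8_t* p_read = mmap(NULL, 0x10000, PROT_READ, MAP_SHARED, fd, 0);
     uint8_t* p_write =
         mmap(NULL, 0x10000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
     /* In the write view, replace the ROM half ($8000+) with
        anonymous scratch pages: writes land there and are never seen
        by the read view. (Error checking elided.) */
     mmap((p_write + 0x8000), 0x8000, PROT_READ | PROT_WRITE,
          (MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED), -1, 0);
     p_write[0x9000] = 0xAA;            /* A "write to ROM". */
     printf("%02x\n", p_read[0x9000]);  /* Still 00: write swallowed. */
     close(fd);
     return 0;
   }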

Trick #3: do simple self-modify invalidation on write

One of the most important decisions is how to handle self-modifying code. We don’t want to add in code to perform a check before executing every 6502 opcode, so the alternative is to consider doing something on writes. This is what we do and here is some code:

   0x000000002010003c: mov    %al,0x3001040(%rbp) STA $10C0
   0x0000000020100042: mov    0x4370(%rdi),%r8d
   0x0000000020100049: movw   $0x17ff,(%r8)

As can be seen, after the actual write to $10C0 is dispatched via the rbp register, there is a follow-up read and write. rdi permanently points to a context object that contains various important things.
In this instance, the JIT code is looking up a compiler maintained map of 6502 addresses to host code. The final x64 instruction writes to host code for the invalidation. But what is this magic constant 0x17ff? It is the 2 byte sequence for the x64 instruction callq *(%rdi). This ensures that if a 6502 address is invalidated, the x64 code representing it will cleanly call out of JIT and into some fix up and recompile routine.
It’s tempting to try and avoid the extra read and write: what if we can look up whether a given write address contains code or not? While these sorts of optimizations may be possible, adding the check adds a branch (extra code; bloats branch caches; very slow if not a predictable branch) and will almost certainly be slower than just unconditionally doing the write. It’s worth noting that nothing depends on the extra read / write and that’s exactly the sort of case where the single core parallelism of modern Intel processors can give a boost.

Trick #4: monitor excessive recompiles

If you consider the above scheme for handling self-modification, you’ll immediately notice a serious performance issue: every self-modification appears to incur a huge cost! Firstly, although Intel processors are documented as supporting self-modifying code, it is not fast. You’ll invalidate all sorts of caches and pipelines. Secondly, calling out of JIT to some C routine involves all sorts of register and calling convention fixups. Thirdly, the act of compiling is slow relative to the act of executing.
Fortunately, this can all be cleared up with a compiler that generates code differently based on monitoring what is going on. If a given 6502 opcode is being repeatedly recompiled, we can maintain some metadata about the exact situation and deal with it. To give a concrete example, here’s a couple of 6502 instructions from the Galaforce sprite routine at $B00:

   0B00: STA $0B4F
   …
   0B4E: LDA $2FA0,X

This sort of pattern covers many performance critical self-modifying situations and the compiler can make it go very fast. Firstly, the hard-coded $2FA0 value is replaced by a fetch from $0B4F and $0B50 so that the latest in-memory contents are used as the operand. Secondly, the write invalidation address table is set up so that 6502 writes to $0B4F and $0B50 no longer invalidate any x64 code, because it no longer needs invalidation. And then execution proceeds, almost at 100% original speed but supporting self-modified operands!
beebjit doesn’t yet handle all self-modification situations in a recompile-free manner, but there is no fundamental reason it can’t / won’t in the future.

Trick #5: compile to blocks and check timing once per block

Looking at the JIT code for the above Galaforce sprite routine at $0B00, it starts off like this:

   0x00000000200b0000: lea    -0x33(%r15),%r15
   0x00000000200b0004: bt     $0x3f,%r15
   0x00000000200b0009: jb     0x3100b000
   0x00000000200b000f: mov    %al,0x3000acf(%rbp) STA $0B4F
   0x00000000200b0015: mov    0x2dac(%rdi),%r8d
   0x00000000200b001c: movw   $0x17ff,(%r8)

What are those three instructions at the start? The register r15 permanently stores a variable called “countdown”. Here it is being decremented by 51 (0x33) and checked to see if it goes negative, with a branch if so.
The whole timing model revolves around this countdown. As the 6502 code interacts with hardware and sets up timer IRQs and vsync IRQs, timers tick down towards 0, at which point some event must occur (perhaps raising an IRQ). There’s also a timer ticking that fires every now and again to check the status of the physical keyboard for input. At any time, r15 tracks how soon the soonest timer is due to fire. Since we’re running a 2MHz processor, chances are that it could be thousands of processor ticks between interesting events.
In the case of the above block, it contains 51 6502 cycles worth of 6502 instructions. In the very common case, nothing interesting is due to happen in these 51 cycles, so the r15 decrement will not go negative and the branch will not be taken. This is great news for the performance of the block! Since nothing interesting is happening, we know that nothing will be raising any IRQ and the whole block can proceed without having to check for an interrupt condition every instruction. We also gain freedom to implement many optimizations without fear of some interrupt landing in the middle of a couple of coalesced 6502 instructions, for example. It’s also worth noting that nothing else in the block depends upon r15 so the Intel processor can do the check in parallel with other work.

Trick #6: optimize zero page writes until you can’t

This one is simple but highly effective. Zero page is the memory region $0000 - $00FF on the 6502. Reads / writes here are faster than accessing addresses >= $0100. So, a lot of 6502 code puts fast path variables in the zero page. We should make zero page handling as fast as possible. By default, the assumption is that no 6502 code is executed in the zero page, enabling zero page writes to go faster by not worrying about if they might be modifying code. In the unusual case that code is ever compiled in the zero page, it’s simple enough to handle it by invalidating all JIT code and regenerating on-demand with appropriate invalidation for zero page writes. (For an example of a game that triggers this, there’s Wizadore. It could be making the clever observation that it’s slightly faster to modify code in the zero page because zero page writes are faster.)
Note that very similar approaches apply to the stack page, which is $0100 - $01FF on the 6502. These are not yet implemented in beebjit.
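Here's a sketch of the policy as a compile-time decision (names are illustrative, not beebjit's actual internals): the invalidation bookkeeping is simply never emitted for zero page stores, unless and until code is seen executing there:

   #include <stdint.h>

   /* Hypothetical JIT emit helpers. */
   extern void emit_store_to_6502_addr(uint16_t addr);
   extern void emit_invalidation(uint16_t addr);
   extern void invalidate_all_jit_code(void);

   static int s_zero_page_has_code;

   static void compile_store(uint16_t addr) {
     emit_store_to_6502_addr(addr);
     if ((addr >= 0x100) || s_zero_page_has_code) {
       emit_invalidation(addr);
     }
     /* Otherwise: fast path, no self-modification bookkeeping. */
   }

   static void compile_block(uint16_t pc_6502) {
     if ((pc_6502 < 0x100) && !s_zero_page_has_code) {
       /* Rare: code is executing in the zero page. Flip the policy
          and regenerate everything with the checks included. */
       s_zero_page_has_code = 1;
       invalidate_all_jit_code();
     }
     /* ... normal compilation ... */
   }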

Trick #7: use slow path faults to turbocharge some fast paths

Let’s look at a couple of 6502 reads from the same address -- a hardware register at $FE00 -- but with different addressing modes.

   0x0000000020100031: [1] lea    0xc0(%rbx),%edx LDA ($C0,X)
   0x0000000020100037: [2] movzbl %dl,%edx
   0x000000002010003a: [3] mov    0x12017f01(%rdx),%r9b
   0x0000000020100041:     mov    %rdx,%r8
   0x0000000020100044: [4] mov    -0x7f(%rdx,%rbp,1),%dh
   0x0000000020100048:     mov    -0x80(%r8,%rbp,1),%dl
   0x000000002010004d: [5] movzbl -0x80(%rdx,%rbp,1),%eax
   0x0000000020100052: [6] test   %al,%al
   0x0000000020100054:     mov    $0x12009010,%r10d LDA $FE00
   0x000000002010005a:     jmpq   0x42d613

There is plenty going on here. Unfortunately, the LDA (,X) addressing mode is not a trivial translation to x64 and implementing it correctly, as beebjit tries to do, incurs cost to handle the corner cases. It is worth a brief digression to break down the different parts corresponding to the numbers in blue above:

  1. Calculate 6502 X + the constant $C0.
  2. Corner case alert! Truncate the calculation result to 8-bits, as a 6502 does.
  3. Corner case alert! Arrange for a host access violation / SIGSEGV if the truncated result is exactly $FF. This is because the upcoming 16-bit pointer read would cross a page boundary and should wrap.
  4. Load a 16-bit 6502 pointer. NOTE! This is done in two 8-bit x64 reads. If done instead as a single 16-bit read, which you might assume to be faster, you likely assume wrong, as I did initially. You run the risk of colliding with store-to-load forwarding because of mixing 16-bit reads with 8-bit writes at the same memory address.
  5. Load the actual result into 6502 A from the 16-bit 6502 pointer we constructed.
  6. 6502 LDA updates zero and negative flags, so do the same in x64.

By contrast, look at the compiled code for LDA $FE00. The JIT engine has essentially given up on compiling this and is throwing it over to the slower but very capable interpreter. For this 6502 addressing mode, the address of the read is statically known at compile time, and it is also known to be a hardware register. So we jump to the interpreter to handle this complicated case.
However, for the indirect read of the LDA (,X) addressing mode, we don’t know what memory address will be eventually read so we don’t know if a hardware register will be hit or not. But wait -- there isn’t a check in the compiled code on the final address before it is read. So how does the hardware register work at all?
The answer is fault + fixup! The observation here is that it’s actually very uncommon for an indirect read to hit a hardware register, so we don’t check for that in the common fast path. However, the host virtual address space for the read is a special “indirect mapping” version of the main 6502. This mapping will cause a fault if the hardware register space is touched. The fault is slow and needs fixing up but it almost never happens so it’s a win.
You can also see the fault + fixup concept used to trap the exceptionally unusual case where the LDA (,X) instruction can fetch a 16-bit pointer from $FF, which wraps back to $00. We don’t want to pollute every (,X) mode instruction with code to compare against $FF and possibly branch so we can replace that with a single read which will fault approximately never. (The game Camelot does manage to somehow hit this.)

Trick #8: optimize!

The goal here is not to write a modern optimizing compiler, but to look for and address a couple of common patterns:

  • Common 6502 sequences that compile to x64 sequences with opportunities for coalescing at the x64 level.
  • Common 6502 sequences that can be expressed differently to take advantage of known details in the sequence.

In both cases, the word “sequence” is key. Whereas a 6502 interpreter has to consider each instruction in isolation, a 6502 JIT compiler can look across sequences and make use of knowledge of concrete values and concrete nearby instructions. Optimizing across sequences is not as gnarly as it could be because we know interrupts won’t be coming in at arbitrary points, but self-modification is still a hazard, as is the fault + fixup usage described above. To briefly get a flavor of the sort of optimizations possible, let’s look at one last 6502 compilation:

  LDA #1
   0x0000000020100034: movb   $0x1,0x3000000(%rbp) STA $80
   0x000000002010003b: mov    $0x2,%eax LDA #2
   0x0000000020100040: or     0x1(%rbp),%al ORA $81
ASL
ASL
   0x0000000020100043: shl    $0x3,%al ASL
CLC
   0x0000000020100046: add    0x2(%rbp),%al ADC $82
   0x0000000020100049: setb   %r14b
   0x000000002010004d: seto   %r12b

As can be seen, entire 6502 instructions have been eliminated; some have been coalesced; and the annoying test instructions after the LDA instructions have been eliminated because those flags are never observable on account of subsequent flag overwrites.


While the above tricks are enough to turbocharge a wide range of BBC software, some corner cases remain where speed drops below the GHz range, sometimes precipitously. These corner cases typically relate to the interaction between the 6502 and BBC hardware, not the 6502 -> x64 JIT compilation. There are also a few significant optimizations left on the TODO list that may raise performance for broad swathes of games. These optimizations will be undertaken as time and interest dictate.


Other beebjit tricks

Aside from the boosted 6502 core, beebjit is a reasonably capable BBC Micro emulator. Currently in beta-ish status, it should be able to emulate any BBC Micro game including classics such as Exile, Elite, Revs, etc.
It is cycle accurate enough to run the horror show known as Kevin Edwards’ copy protection for the Nightshade tape.

There are some major things beebjit doesn’t do well or at all yet. To satisfy these needs, you may wish to try jsbeeb or one of the many other BBC emulators.

  • Portability. beebjit is Linux native. An early attempt at a Windows build does exist. Have a look in the beebjit builds directory.
  • Broad model support. beebjit only emulates a BBC model B with a few optional bells and whistles such as sideways RAM support. Notably, BBC Master support is not (yet?) implemented.
  • GUI. beebjit renders to a window but does not have a GUI. There’s a fair amount of configurability but you will need to apply yourself to the command line!
  • Peripherals. beebjit does not emulate peripherals unless they were very common back in the day (e.g. disc drives and tape players). Things like joysticks, mice, second processors etc. are not (yet?) emulated because a vast majority of software does not need them. Other emulators do fill this need well, though.


There are also some things beebjit does that are not so common and may interest you:

  • Speed. As covered in this post.
  • Protected disc support. beebjit has a more accurate emulation of the disc, disc drive and disc controller than you will typically find. This means it will load images of original, protected discs. (The archival status of original protected BBC discs is not satisfactory and is something I am working on with others in the community.)
  • Capture and replay. Like most emulators, beebjit executes deterministically. Only external input -- currently just the keyboard -- affects the execution path. Therefore, beebjit has a mode to record external input and replay it (deterministically). This feature combines well with speed. You can replay very quickly, effectively providing an alternative to save states but with very small files and the ability to exit a replay early for a many-in-one save state file.


15GHz

Courtesy of Matt Godbolt (@mattgodbolt), here’s beebjit hitting 15GHz on a couple of the “CLOCKSP” BBC Basic tests, and 12GHz+ overall. This is on an i9-9980XE, which has robust single-core performance: 9th gen Intel core, boost clock @ 4.5GHz. No-one has tried on a 5GHz beast yet, or any 10th gen parts.



Collage of the GHz



With thanks

Special thanks to the following communities for their help, support and encouragement:

  • bitshifters. Pushing the BBC Micro to perform feats that no-one worked out how to do “back in the day”. I particularly recommend Twisted Brain.
  • StarDot.