Tuesday, June 2, 2020

Sampled sound 1980s style: from SN76489 square waves to samples

In the early 1980s, the Texas Instruments SN76489 sound chip was found in a variety of personal computers, consoles and arcade cabinets.

The legend, the chip itself

It is a fairly simple chip, consisting of 3 "tone channels", and one "noise channel", typically mapped to 3 note channels and one drum channel. The tone channels play square waves and the noise channel can emit either white noise or "periodic" noise. Despite this simplicity, human creativity -- as usual -- came to the rescue to bang out some good sounding tunes.

A couple of examples for you: Galaforce title music and Icarus title music (both BBC Micro).

Programmers even worked out how to coax sampled sound out of this beast back in the day! Sampled sound, from square waves?! The (very brief) sampled speech intro to Spy Hunter is a good example.

This blog post concerns how the mechanics of the sampled sound actually work, but first a quick demo of my own attempt to advance the art from sample discs produced back it the day. My own sample disc playback looks like this:

"sampstream" running in beebjit
Amusingly, the sample playback quality on a 1980 chip is obviously not too badly mangled because YouTube content ID recognizes emulator recordings of the commercial songs I chose as demos 😆 But you can click here or click here to play these sampstream demo discs in the jsbeeb emulator. It sounds similar to real hardware, although real hardware sounds less harsh and louder (you may need to crank your volume up).

"sampstream" features:
  • Near instant start and uninterrupted 400k of sample playback. It uses a custom asynchronous disc loader that auto-detects the Intel 8271 vs. Western Digital 1770 disc controller and drives it directly and asynchronously.
  • A real-time 50Hz oscilloscope view.
  • Noise bands that show disc read bandwidth vs. sample consumption bandwidth
sampstream does not feature best-in-class sample pre-processing and playback. It also does not use compression, despite samples being pretty compressible. So it's a tech demo that could be made to sound better and play for longer. The pcmenc project is the state-of-the-art I'm aware of for good quality playback. (A lot of what it does it fight against poor 4-bit and non-linear per-sample resolution, using clever techniques such as multi-channel balancing and mid-point rejigging.)

[sampstream, and the quicdisc library is it based, are open source and hosted on GitHub here]

So how does sample playback actually work? The sampstream example keeps things simple and uses only one tone channel. For a SN76489 tone channel, we control:
  • The channel period, i.e. the frequency of the square wave output -- 1024 possible values.
  • The channel volume -- 16 possible values.
  • A special register that outputs a byte of sampled sound. (Hahaha no.)
There's a few good references on the SN76489 online, including how to get sampled sound out of the thing. I recommend this resource over at SMS Power, as well as the MAME SN76489 driver source code to give you an idea of the flavors and variants.

However, some sources hand-wave regarding how it actually works, so let's cover that here by getting out our trusty oscilloscope and applying it to the SN76489 in the BBC Micro.

The usual basis of sampled sound playback is driving the SN76489 as a PCM device, i.e. a sequence of sample amplitude values. Everyone seems to agree this is initiated by setting a tone channel to a very high frequency, or very short period. There is disagreement on what period value to use, and why it works.

In the BBC Micro, the clock input to the sound chip is 2MHz and it has an internal divide-by-8, making it a 250kHz device. If we set a tone channel period to 1, it will invert that channel's output every clock tick (high, low, high, low...), forming a square wave. A full waveform cycle will need to have a high segment and a low segment, meaning that the output waveform should be a 125kHz square wave.

Investigating the output with an oscilloscope is easy because the SN76489 pinout is trivial:
Not too many pins!

The pin of interest is pin 7, "SND OUT". What does that look like with all channels silent, except for one tone channel, set to period 1 and maximum volume?

SND OUT from the SN76489 in blue

In this scope capture, the blue line is the SND OUT output from the SN76489. As can be seen it is a square wave as expected. It's at a 125kHz frequency, 720mV from peak to peak with its upper peak around 3.3v. Right away we notice a surprise: we might have expected an audio signal to center around 0v, but this signal is centered around 3v or so! The chip works like this:
  • Each audio channel contributes two alternating voltages (peak and trough) to the SND OUT total voltage.
  • Neither of the alternating voltages are negative.
  • For silent channels, the two alternating voltages will in fact be the same (about 0.8v).
  • For max volume channels, the two alternating voltages will be 0v and about 0.8v.
  • For this particular SN variant, it is "inverted" compared to what you might expect: a silent channel will output a constant voltage of 0.8v or so, not 0v. This inversion is interesting but does not affect ability to play sampled sound.
The yellow line above is the output a couple of components later in the audio chain: an LM386N-1 audio amplifier, after traversing an LM324N op-amp. We can see that already, the 125kHz frequency has been partially filtered out by the LM chips. This is why we use such a high frequency as a PCM carrier: it won't affect the audible output. The audio chain will filter it out; the speaker won't be able to respond fast enough to any remnants; and at any rate, the human ultimately at the end of the audio chain cannot hear it.

The PCM stream is modulated simply by rapidly altering the volume of the 125kHz wave, resulting in a voltage output looking like this (for our 1 tone channel case):

The above graph shows a few samples at max volume at the start, some interim, and a few at silent volume (the flat top line) at the end. The red trend line shows what the actual PCM waveform will look like with the 125kHz carrier filtered out. The key for sample playback is likely that the mid-point voltage differs for each volume. Note how its amplitude is half that of the carrier wave. This is likely one reason sampled sound playback is relatively quiet.

So now we've seen how a PCM waveform can emerge from just a 125kHz square wave and volume changes. There's some extra ciruitry between the LM386N-1 and the speaker, to keep the speaker happy. It normalizes the oscillation around the positive voltage point to be an oscillation around 0v.

It's important to recap how the SN76489 voltage output does not behave. It does not oscillate evenly around a fixed 0v point, which would look like this:

The above would not be conducive to sampled sound playback as the average signal after eliminating the 125kHz carrier would be a constant 0v!

Although we're done with explaining sampled sound output, it's also worth looking at a few other oscilloscope traces to understand SN76489 behavior. Here's a look at the other extreme, the longest period this particular model can output for a 121Hz sound:

SN76489 output in green, LM386N-1 in blue

Some references state that the SN76489 cannot hold a steady square wave output, particularly for longer periods. However, this is clearly not true -- at least for this variant. The square wave output is very clean and it is likely the LM324N or LM386N-1 that is introducing interesting voltage decays into the output waveforms.

Finally, here's another SN76489 vs. LM386N-1 output for an 8kHz square wave:

The output is close to a sine wave, not a square wave! This interesting distortion at 8kHz may be because some sources reference an 8kHz low-pass filter as a part of the LM324N setup.

Monday, April 13, 2020

Clocking a 6502 to 15GHz (!)


The 6502 is an iconic processor that dominated home computing in the late 70s and early to mid 80s. It was used in machines ranging from the Apple II to the Atari 2600 to the Commodore 64 to the Nintendo Entertainment System. In pop culture, it powered Bender, not to mention the Terminator!

It often clocked at a modest 1MHz, with faster variants available. In the UK, a 2MHz variant powered the legendary BBC Micro, with a 3MHz 6502 co-processor box available to bolt on the side if you so desired.

Introducing beebjit

It was not the intention of this side project to produce a working, accurate emulator, but one appears to have popped out. The core of beebjit -- and original side project -- is a just-in-time style translation of 6502 code blocks to quasi-equivalent Intel x64 code blocks. Surrounding that core is emulation for the various supporting chips of the BBC Micro, such as the Hitachi 6845 CRTC for display; the 6522 VIA for timing and I/O; the Intel 8271 floppy disc controller; the Texas Instruments SN76489 for sound; etc.

beebjit is an open-source project on GitHub.

In tests, the translated x64 code has been able to demonstrate up to 15GHz of effective 6502 speed, for purely computational tasks such as BBC BASIC programs. For games, things can be a lot slower due to heavy interactions with the graphics, keyboard and timing hardware. To run games accurately (or in some cases at all) requires cycle perfect synchronization of different virtual hardware chips. Despite these challenges, speeds in the GHz range can still be expected.

What does a mutli-GHz BBC Micro system look like? It’s a lot of fun, perhaps best illustrated with a video. Here, we see the classic space shooter Galaforce. You can play it online here, in a JavaScript BBC Micro emulator called jsbeeb. In the video, we see Galaforce loading up from a disc simulated at real-time -- don’t forget to watch out for the track-to-track floppy disc seeks, visible as stutters in the loading of the title screen image! Then, after we get a feel for the speed at which the demo screens are running and cycling, we switch up into high gear. Will you be able to tell at which moment that occurs? (Hint: yes.)

Translating 6502 to x64

When starting out, I had no idea what sort of speed to expect. A naive initial guess would be “fast” because the 6502 instruction set is simple. There are only 256 available combinations of opcode type and addressing mode and only 4 or so 8-bit registers. This is a tiny subset of what the x64 can do in its instruction set, so you’d expect a trivial, fast mapping.

As is often the case, things are rarely so simple. A bit more thinking reveals a ton of factors why it might be fast and a ton why it might be slow:

Things that would tend to make 6502 -> x64 faster

  • In the “captain obvious” category, the 6502 in the BBC runs at 2MHz where a modern Intel chip is clocked in the GHz range.
  • The simplest 6502 instructions take 2 clock cycles. One example would be LDA #10, which sets register A to 10. The Intel processor can dispatch such a triviality in 1 clock cycle.
  • Depending on the level of parallelism inherent in the code, modern Intel processors can execute multiple instructions per clock cycle.
  • If you want to get into it, trivial optimizations are possible. For example, the 6502 can only shift its A register by 1 bit at a time with its LSR and ASL instructions, taking 2 clock cycles apiece. It’s common to see a few of these in a row. By contrast, modern Intel processors can shift a register by an arbitrary number of bits in just one clock cycle!
  • The x64 architecture has an abundance of registers compared to 6502, perhaps offering further opportunities to speed things up.
  • With 32KB RAM on the base model BBC model B, you could start to hope for a workload that fits into L1 caches.

Things that would tend to make 6502 -> x64 slower

  • Some 6502 instructions actually don’t map cleanly to a single x64 equivalent. There are some less common ones such as PHP (push processor flags), or SED (set decimal flag, eww!). A couple of extremely common ones, especially in critical loops, are LDA (),Y and STA (),Y. These read / write memory indirectly. The final read / write address is calculated based on issuing two preceding 8-bit reads to make a 16-bit pointer and then adding the Y register. Long story short, the “1 cycle per instruction” nirvana alluded to above certainly does not apply here. Even servicing reads from L1 cache, modern Intel processors take ~4 cycles.
  • The 6502 is from an era where performance was quite deterministic. Memory ran at the same speed (or in some cases faster!) then the processor, so caches and pipelines and reordering and pain had not yet been inflicted upon us. So, the 5-cycle 6502 instruction LDA (),Y that reads 2 bytes of instructions and 3 other bytes always takes 5 cycles. You won’t get icache misses, dcache misses, DLTB misses, ITLB misses, stalls for a million other reasons, etc. You get the point.
  • Self-modifying code. This was not only supported, but actually very common, particularly in the most performance critical code. I have to suspect that coders were enjoying themselves. Anyway, this poses huge problems. As we execute any 6502 instruction, we have to consider if it was blown away from under us. If we try to optimize anything, we have to beware of crazy stuff like an optimized-away instruction getting self-modify-nuked.
  • Synchronization and interrupts. At every 6502 instruction, we have to consider whether an interrupt condition exists that would cause us to need to execute the IRQ sequence instead. Related, there may be other hardware chips requiring cycle perfect synchronization with the 6502.
  • Annoying 6502 variable cycle counts. Some 6502 instructions that would otherwise translate very simply to x64 have a quirk where they might take a cycle longer depending on the X or Y register value. LDA ,Y and LDA (),Y are both affected. In a cycle perfect emulation, this needs accounting for.
  • Paging and hardware register access. Certain memory regions may not be stable. A lot of 6502 systems circumvented the 64KB address space restriction by allowing parts of it to be paged. Additional parts of the address space are typically given over to hardware registers. Hitting these under emulation is guaranteed to be painfully slow as some hardware chip or another will need to be emulated. ROM memory is another headache. Games can and do write to ROM memory, which must be detected and ignored!

We’ve already revealed that it’s possible to attain GHz speeds. Re-reading the above lists, it would seem that the “slows” outweigh the “fasts” so it’s worth a look at some tricks pulled to do it!

Tricks for fast 6502 -> x64 translation

Trick #1: make good use of x64’s 16 registers

To get us going, we can take a simple 6502 sample and x64 it!

   0x0000000020100031: mov    $0xa,%eax LDA #10
   0x0000000020100036: test   %al,%al
   0x0000000020100038: mov    %al,%bl TAX
   0x000000002010003a: test   %bl,%bl
   0x000000002010003c: mov    %al,0x40(%rbp) STA $C0
   0x000000002010003f: jmpq   0x20400000 JMP $4000

In the above (unoptimized) x64, we see that we’ve (permanently!) assigned the al register to the 6502 A register and bl to the X register. Note how the 6502 LDA instruction sets zero and negative flags but the x64 mov instruction does not, so the x64 code needs an extra test instruction -- another potential source of slowdown but see below. Some 6502 flags are actually stored in the x64 flags register. The rbp register is used to point to the 6502 memory (actually at offset 0x80 so that every zero page address can be accessed with a 1-byte displacement for a much shorter x64 instruction; the JIT code is performance sensitive to code size). Finally, note how the jmpq is jumping to a specific code location. By using a simple fixed mapping of 6502 code addresses to x64 code addresses, we can keep the generated code small and fast.

Trick #2: use the host’s memory management for ROM and paging

As mentioned above, a lot of software seems to like to write to ROM. In the BBC, the region $C000 - $FFFF is always the OS ROM (with some hardware registers mapped towards the end). The region $8000 - $BFFF is usually ROM, with the paged-in ROM selectable at runtime. (For example, BASIC might usually be paged in but the Disc Filing System ROM would be paged in when doing I/O to disc.) However, a lot of BBCs were upgraded to have some extra RAM that could be paged into this region.
Obviously, writes to ROM must be ignored. But we really do not want to be adding a test and branch instructions to every JIT memory write! Memory management and paging in the host comes to the rescue. JIT reads and JIT writes are aimed at two different 64KB mapping regions, representing views of the emulated BBC address space. For the JIT reads, ROMs are present as expected. For JIT writes, ROM regions are replaced with “dummy” host pages that are writable but write to a scratch area of memory. (An earlier version of beebjit used fault + fixup for ROM writes instead but this was too slow for Galaforce, which just loves to write to ROM.)
This all works nicely because the ROM boundaries are at 4KB alignments, which joyously happens to match the host granularity! There’s a minor wart caused by Windows having worse alignment granularity than other operating systems, but creative straddling of boundaries saves the day.

Trick #3: do simple self-modify invalidation on write

One of the most important decisions is how to handle self-modifying code. We don’t want to add in code to perform a check before executing every 6502 opcode, so the alternative is to consider doing something on writes. This is what we do and here is some code:

   0x000000002010003c: mov    %al,0x3001040(%rbp) STA $10C0
   0x0000000020100042: mov    0x4370(%rdi),%r8d
   0x0000000020100049: movw   $0x17ff,(%r8)

As can be seen, after the actual write to $10C0 is dispatched via the rbp register, there is a follow-up read and write. rdi permanently points to a context object that contains various important things.
In this instance, the JIT code is looking up a compiler maintained map of 6502 addresses to host code. The final x64 instruction writes to host code for the invalidation. But what is this magic constant 0x17ff? It is the 2 byte sequence for the x64 instruction callq *(%rdi). This ensures that if a 6502 address is invalidated, the x64 code representing it will cleanly call out of JIT and into some fix up and recompile routine.
It’s tempting to try and avoid the extra read and write: what if we can look up whether a given write address contains code or not? Whereas these sorts of optimization may be possible, adding the check adds a branch (extra code; bloats branch caches; very slow if not a predictable branch) and will almost certainly be slower than just unconditionally doing the write. It’s worth noting that nothing depends on the extra read / write and that’s exactly the sort of case where the single core parallelism of modern Intel processors can give a boost.

Trick #4: monitor excessive recompiles

If you consider the above scheme for handling self-modification, you’ll immediately notice a serious performance issue: every self-modification appears to incur a huge cost! Firstly, although Intel processors are documented as supporting self-modifying code, it is not fast. You’ll invalidate all sorts of caches and pipelines. Secondly, calling out of JIT to some C routine involves all sorts of register and calling convention fixups. Thirdly, the act of compiling is slow relative to the act of executing.
Fortunately, this can all be cleared up with a compiler that generates code differently based on monitoring what is going on. If a given 6502 opcode is being repeatedly recompiled, we can maintain some metadata about the exact situation and deal with it. To give a concrete example, here’s a couple of 6502 instructions from the Galaforce sprite routine at $B00:

   0B00: STA $0B4F
   0B4E: LDA $2FA0,X

This sort of pattern covers many performance critical self-modifying situations and the compiler can make it go very fast. Firstly, the hard-coded $2FA0 value is replaced by a fetch from $0B4F and $0B50 so that the latest in-memory contents are used as the operand. Secondly, the write invalidations address table is set up so that 6502 writes to $0B4F and $0B50 no longer invalidate any x64 code because it no longer needs invalidation. And then execution proceeds, almost at 100% original speed but supporting self-modified operands!
beebjit doesn’t yet handle all self-modification situations in a recompile-free manner, but there is no fundamental reason it can’t / won’t in the future.

Trick #5: compile to blocks and check timing once per block

Looking at the JIT code for the above Galaforce sprite routine at $0B00, it starts off like this:

   0x00000000200b0000: lea    -0x33(%r15),%r15
   0x00000000200b0004: bt     $0x3f,%r15
   0x00000000200b0009: jb     0x3100b000
   0x00000000200b000f: mov    %al,0x3000acf(%rbp) STA $0B4F
   0x00000000200b0015: mov    0x2dac(%rdi),%r8d
   0x00000000200b001c: movw   $0x17ff,(%r8)

What are those three instructions at the start? The register r15 permanently stores a variable called “countdown”. Here it is being decremented by 51 (0x33) and checked to see if it goes negative, with a branch if so.
The whole timing model revolves around this countdown. As the 6502 code interacts with hardware and sets up timer IRQs, vsync IRQs, timers tick down towards 0 when some event must occur (raise IRQ perhaps). There’s also a timer ticking that fires every now and again to check the status of the physical keyboard for input. At any time, r15 tracks how soon the soonest timer is due to fire. Since we’re running a 2MHz processor, chances are that it could be thousands of processor ticks between interesting events.
In the case of the above block, it contains 51 6502 cycles worth of 6502 instructions. In the very common case, nothing interesting is due to happen in these 51 cycles, so the r15 decrement will not go negative and the branch will not be taken. This is great news for the performance of the block! Since nothing interesting is happening, we know that nothing will be raising any IRQ and the whole block can proceed without having to check for an interrupt condition every instruction. We also gain freedom to implement many optimizations without fear of some interrupt landing in the middle of a couple of coalesced 6502 instructions, for example. It’s also worth noting that nothing else in the block depends upon r15 so the Intel processor can do the check in parallel with other work.

Trick #6: optimize zero page writes until you can’t

This one is simple but highly effective. Zero page is the memory region $0000 - $00FF on the 6502. Reads / writes here are faster than accessing addresses >= $0100. So, a lot of 6502 code puts fast path variables in the zero page. We should make zero page handling as fast as possible. By default, the assumption is that no 6502 code is executed in the zero page, enabling zero page writes to go faster by not worrying about if they might be modifying code. In the unusual case that code is ever compiled in the zero page, it’s simple enough to handle it by invalidating all JIT code and regenerating on-demand with appropriate invalidation for zero page writes. (For an example of a game that triggers this, there’s Wizadore. It could be making the clever observation that it’s slightly faster to modify code in the zero page because zero page writes are faster.)
Note that very similar approaches apply to the stack page, which is $0100 - $01FF on the 6502. These are not yet implemented in beebjit.

Trick #7: use slow path faults to turbocharge some fast paths

Let’s look at a couple of 6502 reads from the same address -- a hardware register at $FE00 -- but with different addressing modes.

   0x0000000020100031: [1] lea    0xc0(%rbx),%edx LDA ($C0,X)
   0x0000000020100037: [2] movzbl %dl,%edx
   0x000000002010003a: [3] mov    0x12017f01(%rdx),%r9b
   0x0000000020100041:     mov    %rdx,%r8
   0x0000000020100044: [4] mov    -0x7f(%rdx,%rbp,1),%dh
   0x0000000020100048:     mov    -0x80(%r8,%rbp,1),%dl
   0x000000002010004d: [5] movzbl -0x80(%rdx,%rbp,1),%eax
   0x0000000020100052: [6] test   %al,%al
   0x0000000020100054:     mov    $0x12009010,%r10d LDA $FE00
   0x000000002010005a:     jmpq   0x42d613

There is plenty going on here. Unfortunately, the LDA (,X) addressing mode is not a trivial translation to x64 and implementing it correctly, as beebjit tries to do, incurs cost to handle the corner cases. It is worth a brief digression to break down the different parts corresponding to the numbers in blue above:

  1. Calculate 6502 X + the constant $C0.
  2. Corner case alert! Truncate the calculation result to 8-bits, as a 6502 does.
  3. Corner case alert! Arrange for a host access violation / SIGSEGV if the truncated result is exactly $FF. This is because the upcoming 16-bit pointer read would cross a page boundary and should wrap.
  4. Load a 16-bit 6502 pointer. NOTE! This is done in two 8-bit x64 reads. If done instead as a single 16-bit read, which you assume could be faster, you likely assume wrong as I did initially. You run the risk of colliding with store-to-load forwarding because of mixing 16-bit reads with 8-byte writes at the same memory address.
  5. Load the actual result into 6502 A from the 16-bit 6502 pointer we constructed.
  6. 6502 LDA updates zero and negative flags, so do the same in x64.

By contrast, look at the compiled code for LDA $FE00. The JIT engine has essentially given up on compiling this and is throwing it over to the slower but very capable interpreter. For this 6502 addressing mode, the address of the read is statically known at compile time, and it is also known to be a hardware register. So we jump to the interpreter to handle this complicated case.
However, for the indirect read of the LDA (,X) addressing mode, we don’t know what memory address will be eventually read so we don’t know if a hardware register will be hit or not. But wait -- there isn’t a check in the compiled code on the final address before it is read. So how does the hardware register work at all?
The answer is fault + fixup! The observation here is that it’s actually very uncommon for an indirect read to hit a hardware register, so we don’t check for that in the common fast path. However, the host virtual address space for the read is a special “indirect mapping” version of the main 6502. This mapping will cause a fault if the hardware register space is touched. The fault is slow and needs fixing up but it almost never happens so it’s a win.
You can also see the fault + fixup concept used to trap the exceptionally unusual case where the LDA (,X) instruction can fetch a 16-bit pointer from $FF, which wraps back to $00. We don’t want to pollute every (,X) mode instruction with code to compare against $FF and possibly branch so we can replace that with a single read which will fault approximately never. (The game Camelot does manage to somehow hit this.)

Trick #8: optimize!

The goal here is not to write a modern optimizing compiler, but to look for and address a couple of common patterns:

  • Common 6502 sequences that compile to x64 sequences with opportunities for coalescing at the x64 level.
  • Common 6502 sequences that can be expressed differently to take advantage of known details in the sequence.

In both cases, the word “sequence” is key. Whereas a 6502 interpreter has to consider each instruction in isolation, a 6502 JIT compiler can look across sequences and make use of knowledge of concrete values and concrete nearby instructions. Optimizing across sequences is not as gnarly as it could be because we know interrupts won’t be coming in at arbitrary points, but self-modification is still a hazard, as is the fault + fixup usage described above. To briefly get a flavor of the sort of optimizations possible, let’s look at one last 6502 compilation:

  LDA #1
   0x0000000020100034: movb   $0x1,0x3000000(%rbp) STA $80
   0x000000002010003b: mov    $0x2,%eax LDA #2
   0x0000000020100040: or     0x1(%rbp),%al ORA $81
   0x0000000020100043: shl    $0x3,%al ASL
   0x0000000020100046: add    0x2(%rbp),%al ADC $82
   0x0000000020100049: setb   %r14b
   0x000000002010004d: seto   %r12b

As can be seen, entire 6502 instructions have been eliminated; some have been coalesced; and the annoying test instructions after the LDA instructions have been eliminated because those flags are never observable on account of subsequent flag overwrites.

While the above tricks are enough to turbocharge a wide range of BBC software, some corner cases remain where speed drops below the GHz range, sometimes precipitously. These corner cases typically relate to the interaction between the 6502 and BBC hardware, not the 6502 -> x64 JIT compilation. There are also a few significant optimizations left on the TODO list that may raise performance for broad swathes of games. These optimizations will be undertaken as time and interest dictate.

Other beebjit tricks

Aside from the boosted 6502 core, beebjit is a reasonably capable BBC Micro emulator. Currently in beta-ish status, it should be able to emulate any BBC Micro game including classics such as Exile, Elite, Revs, etc.
It is cycle accurate enough to run the horror show known as Kevin Edwards’ copy protection for the Nightshade tape.

There are some major things beebjit doesn’t do well or at all yet. To satisfy these needs, you may wish to try jsbeeb or one of the many other BBC emulators.

  • Portability. beebjit is Linux native. An early attempt at a Windows build does exist. Have a look in the beebjit builds directory.
  • Broad model support. beebjit only emulates a BBC model B with a few optional bells and whistles such as sideways RAM support. Notable, BBC Master support is not (yet?) implemented.
  • GUI. beebjit renders to a window but does not have a GUI. There’s a fair amount of configurability but you will need to apply yourself to the command line!
  • Peripherals. beebjit does not emulate peripherals unless they were very common back in the day (e.g. disc drives and tape players). Things like joysticks, mice, second processors etc. are not (yet?) emulated because a vast majority of software does not need them. Other emulators do fill this need well, though.

There are also some things beebjit does that are not so common and may interest you:

  • Speed. As covered in this post.
  • Protected disc support. beebjit has a more accurate emulation of the disc, disc drive and disc controller than you will typically find. This means it will load images of original, protected discs. (The archival status of original protected BBC discs is not satisfactory and is something I am working on with others in the community.)
  • Capture and replay. Like most emulators, beebjit executes deterministically. Only external input -- currently just the keyboard -- affects the execution path. Therefore, beebjit has a mode to record external input and replay it (deterministically). This feature combines well with speed. You can replay very quickly, effectively providing an alternative to save states but with very small files and the ability to exit a replay early for a many-in-one save state file.


Courtesy of Matt Godbolt (@mattgodbolt), here’s beebjit hitting 15GHz on a couple of the “CLOCKSP” BBC Basic tests, and 12GHz+ overall. This is on an i9-9980XE, which has robust single-core performance: 9th gen Intel core, boost clock @ 4.5GHz. No-one has tried on a 5GHz beast yet, or any 10th gen parts.

Collage of the GHz

With thanks

Special thanks to the following communities for their help, support and encouragement:

  • bitshifters. Pushing the BBC Micro to perform feats that no-one worked out how to do “back in the day”. I particularly recommend Twisted Brain.
  • StarDot.

Tuesday, September 19, 2017

Excited to join Dropbox!

I’m excited to announce that I’ve joined Dropbox as their new Head of Security. Truth be told, I’ve been here a little while and I’ve been enjoying on-boarding too much to make the announcement. If you were wondering why my blog has been quiet for a while, now you know why!

I exited a fun period of semi-retirement to take up this challenge. What attracted me to Dropbox enough to make the switch? Many things but briefly:
  • Scale and sensitivity of the data. Half a billion users storing sensitive files is a worthy stash to protect.
  • The excellent caliber and decent size of the existing security team. Working with strong leaders and team members is a big draw.
  • Perhaps above all else, the warmth of the people and the culture. This is the friendliest, most collaborative company I’ve worked at. I fully expect to become less of a jerk by imbibing the vibe! :)

The assertion about the warmth of the people and culture deserves some supporting evidence. This is a little story from before I joined. As you may recall, I was researching server-side usage of ImageMagick and one of my discoveries affected Dropbox in a fairly minor way. The response was spectacular -- and warm, and competent. Of course, the foundations you expect from a solid security program were present: a public bug bounty program with a fast response time. Beyond that, upon submission of what was considered an interesting bug, I was…. invited up to Dropbox HQ for chai(!), a snack, and a chat with Dev (@frgx) plus the author of the sandboxing for this area. What a great experience.

It would be remiss of me to not mention that Dropbox is hiring for all types of security roles. The team is already a decent size, but we are growing. This job req is what you are looking for.

On a social note, this move means that I’m now up in SF city a lot of the time. Hit me up if you want to grab a drink and talk about security.

Monday, June 19, 2017

Introducing Qualys Project Zero?

Google's Project Zero team was announced in July 2014. Since then, it has become very well known for publishing offensive security research of exceptional quality. This is especially welcome to defenders at a time where top quality offensive security research is drying up. For most important software targets, it's getting harder to find and exploit bugs. And for those individuals who remain in the pool of people still capable of playing this game, many have been attracted by positions that seek to abuse vulnerabilities rather than get them fixed.

Perhaps, too few positions exist where top-tier researchers:
  1. Are left alone to self-select research on whatever they deem important.
  2. Get to work alongside other top-tier researchers.
  3. Are expected to openly publish everything they discover.
  4. Are never given bounds or direction from management.
  5. Are supported by their organization when navigating the minefield of vulnerability disclosure.
  6. Are given good resources and top tier corporate pay.
Google deserves a lot of props for hiring a decently sized team that broadly hits the above points.

And there is nothing to stop a different company from doing the same.

Has Qualys done the same?

Since the inception of Project Zero, I've personally seen it as a team called Project Zero, that just happens to be generously funded by Google. Also, an umbrella term under which any team could operate if they adhere to the principles of openness and freedom of employed researchers.

Qualys keeps pumping out research that I find surprisingly cool. Today, we had The Stack Clash, a great read, which finally tipped me over the edge to write this post. Previously, I had noted and enjoyed attacks against the OpenSSH client and a glibc resolver buffer overflow. Three strikes and you're out, perhaps?

So Qualys clearly have one or more very talented hackers on the payroll. (We don't know who because the publications in question are not credited with individual names. I suspect some old-school hacker.) How does Qualys' observable behavior stack up against the list above?
  1. Left alone to self-select research: YES. Hard to imagine management saying "hey, how about you go find, like, 10+ CVEs across all the most popular UNIXes, covering crashing the stack into the heap, and don't forget about 64-bit?".
  2. Top-tier team: YES. This is defined by output. As one example, Stack Clash is top-tier output. The writing in that report uses "we", suggesting a top-tier team instead of a single top-tier individual.
  3. Open publication: YES. (A presumption that all things of significance are published, but seems likely.)
  4. Lack of management direction: MAYBE. Hard to tell whether the Qualys Labs team is left alone to research 100% of the time, or if they are directed to take certain audits at certain times, or even contracted out. Some of the research targets are a bit weird, such as "Trend Micro Interscan Web Security Virtual Appliance" (link). This does not sound like a research target that a top-tier researcher would self-select.
  5. Organizational support for disclosure: MAYBE. We haven't yet seen if Qualys would be willing to stand up to a (e.g.) Microsoft or Apple trying to bully Qualys or one of their individual recruiters. There's also the worrying signal that Qualys unilaterally extended a disclosure embargo, going back on an agreement with Solar (see oss-sec post here). There's also the question of whether "customers" get access to details before patches are available -- which would be counter to Project Zero principles.
  6. Good resources / pay: YES. (A presumption that Qualys pays reasonably relative to the hot market for security professionals.)

With just a few public clarifications and maybe tweaks, Qualys would be able to use the term "Qualys Project Zero" without violating the spirit behind the name.

I'd be delighted at any such outcome: multiple Project Zeros, each funded by some benefactor with an interest in security, is one way to scale this thing.

Friday, May 19, 2017

*bleed, more powerful: dumping Yahoo! authentication secrets with an out-of-bounds read


In my previous post on Yahoobleed #1 (YB1), we saw how an uninitialized memory vulnerability could lead to disclosure of private images belonging to other users. The resulting leaked memory bytes were subject to JPEG compression, which is not a problem for image theft, but is somewhat lacking if we wanted to steal memory content other than images.

In this post, we explore an alternative *bleed class vulnerability in Yahoo! thumbnailing servers. Let's call it Yahoobleed #2 (YB2). We'll get around the (still present) JPEG compression issue to exfiltrate raw memory bytes. With subsequent usage of the elite cyberhacking tool "strings", we discover many wonders.

Yahoo! fixed YB2 at the same time as YB1, by retiring ImageMagick. See my previous post for a more detailed list of why Yahoo! generally hit this out of the park with their response.


The above patch of noise is a PNG (i.e lossless) and zoomed rendering of a 64x4 subsection of a JPEG image returned from Yahoo! servers when we trigger the vulnerability.

There are many interesting things to note about this image, but for now we'll observe that... is that a couple of pointers we see at the top?? I have a previous Project Zero post about pointer visualization and once you've seen a few pointers, it gets pretty easy to spot them across a variety of leak situations. Well, at least for Linux x86_64, where pointers often have a common prefix of 0x00007f. The hallmarks of the pointers above are:
  • Repeating similar structures with the same alignment (8 bytes on 64-bit).
  • In blocks of pointers, vertical stripes of white towards where the most significant bytes are aligned, representing 0x00. (Note that the format of the leaked bytes is essentially "negated" because of the input file format.)
  • Again in blocks of pointers, vertical stripes of black next to that, representing 7 bits set in a row of the 0x7f value.
  • And again in blocks of pointers, a thin vertical stripe of white next to that, representing the 1 bit not set in the 0x7f value.
But we still see JPEG compression artifacts. Will that be a problem for precise byte-by-byte data exfiltration? We'll get to it later.

The vulnerability

The hunt for this vulnerability starts with the realization that we don't know if Yahoo! is running an uptodate ImageMagick or not. Given that we know Yahoo! supports the RLE format, perhaps we can use the same techniques outlined in my previous post about memory corruption in Box.com? Interestingly enough, the crash file doesn't seem to crash anything but all the test files do seem to render instead of failing cleanly as would be expected with an uptodate ImageMagick. After some pondering, the most likely explanation is that the Yahoo! ImageMagick is indeed old and vulnerable, but the different heap setup we learned out about with YB1 means that the out-of-bounds heap write test case has no effect due to alignment slop.

To test this hypothesis, we make an out-of-bounds heap write RLE file that substantially overflows a smaller chunk (64 bytes, with a 16 byte or so overflow), and upload that to Yahoo!
And sure enough, hitting the thumbnail fetch URL fairly reliably (50% or so) results in:

This looks like a very significant backend failure, and our best guess is a SIGSEGV due to the presence of the 2+ years old RLE memory corruption issue.

But our goal today is not to exploit an RCE memory corruption, although that would be fun. Our goal is to exfiltrate data with a *bleed attack. So, we have an ImageMagick that is about 2.5 years old. In that timeframe, surely lots of other interesting vulnerabilities have been fixed? After a bit of looking around, we settle on an interesting candidate: this 2+ years old out-of-bounds fix in the SUN decoder. The bug fix appears to be taking a length check and applying it more thoroughly so that it includes images with a bit depth of 1. Looking at the code slightly before this patch, and tracing through the decode path for an image with a bit depth of 1, we get (coders/sun.c):

    number_pixels=(MagickSizeType) image->columns*image->rows;
    if ((sun_info.type != RT_ENCODED) && (sun_info.depth >= 8) &&
        ((number_pixels*((sun_info.depth+7)/8)) > sun_info.length))
    sun_data=(unsigned char *) AcquireQuantumMemory((size_t) sun_info.length,
    count=(ssize_t) ReadBlob(image,sun_info.length,sun_data);
    if (count != (ssize_t) sun_info.length)

    if (sun_info.depth == 1)
      for (y=0; y < (ssize_t) image->rows; y++)
        if (q == (Quantum *) NULL)
        for (x=0; x < ((ssize_t) image->columns-7); x+=8)
          for (bit=7; bit >= 0; bit--)
            SetPixelIndex(image,(Quantum) ((*p) & (0x01 << bit) ? 0x00 : 0x01),

So in the case of an image with a depth of 1, we see a fairly straightforward problem:
  • Let's say we have width=256, height=256, depth=1, length=8.
  • Image of depth 1 bypasses check of number_pixels against sun_info.length.
  • sun_data is allocated to be a buffer of 8 bytes, and 8 bytes are read from the input file.
  • Decode of 1 bit per pixel image proceeds. (256*256)/8 == 8192 bytes are required in sun_data but only 8 are present.
  • Massive out-of-bounds read ensues; rendered image is based largely on out-of-bounds memory.

The exploit

The exploit SUN file is only 40 bytes, so we might as well show a gratuitous screenshot of the file in a hex editor, and then dissect the exact meaning of each byte:

59 A6 6A 95:             header
00 00 01 00 00 00 01 00: image dimensions 256 x 256
00 00 00 01:             image depth 1 bit per pixel
00 00 00 08:             image data length 8
00 00 00 01:             image type 1: standard
00 00 00 00 00 00 00 00: map type none, length 0
41 41 41 41 41 41 41 41: 8 bytes of image data

The most interesting variable we can twiddle in this exploit file is the image data length. As long as we keep it smaller than (256 * 256) / 8, we'll get the out of bounds read. But where will this out of bounds read start? It will start from the end of the allocation of the image data. By changing the size of the image data, we may end up occupying different relative locations within the heap area (perhaps more towards the beginning or end with certain sizes). This flexibility gives us a greater chance of being able to read something interesting.


Exfiltration is where the true usefulness of this exploit becomes apparent. As noted above in the vizualization section, the result of our exfiltration attempt is a JPEG compressed file. We actually get a greyscale JPEG image out of ImageMagick, since the 1 bit per pixel SUN file generates a black and white drawing. A greyscale JPEG is a good start for reliable exfiltration, because JPEG compression is designed to hack human perception. Human visualization is more sensitive to color brightness than it is to actual color, so color data is typically compressed more lossily than brightness data. (This is achieved via the YCbCr colorspace.) But with greyscale JPEGs there is only brightness.

But looking at the exfiltrated JPEG image above, we still see JPEG compression artifacts. Are we doomed? Actually, no. Since our SUN image was chosen to be a 1 bit per pixel (black or white) image, then we only need to preserve 1 bit of accurate entropy per pixel in the exfiltrated JPEG file! Although some of the white pixels are a little bit light grey instead of fully white, and some of the black pixels are a little bit dark grey instead of fully black, each pixel is still very close to black or white. I don't have a mathematical proof of course, but it appears that for the level of JPEG compression used by the Yahoo! servers, every pixel has its 1 bit of entropy intact. The most deviant pixels from white still appear to be about 85% white (i.e. very light grey) whereas the threshold for information loss would of course be 50% or below.

We can attempt to recover raw bytes from the JPEG file with an ImageMagick conversion command like this:

convert yahoo_file.jpg -threshold 50% -depth 1 -negate out.gray

For the JPEG file at the top of this post, the resulting recovered original raw memory bytes are:

0000000 d0 f0 75 9b 83 7f 00 00 50 33 76 9b 83 7f 00 00
0000020 31 00 00 00 00 00 00 00 2c 00 00 00 00 00 00 00

Those pointers, 0x00007f839b75f0d0 and 0x00007f839b763350, look spot on.

The strings and secrets

So now that we have a reliable and likely byte perfect exfiltration primitive, what are some interesting strings in the memory space of the Yahoo! thumbnail server? With appropriate redactions representing long high entropy strings,

SSLCOOKIE: SSL=v=1&s=redacted&kv=0

Yahoo-App-Auth: v=1;a=yahoo.mobstor.client.mailtsusm2.prod;h=;t=redacted;k=4;s=redacted


Yeah, it's looking pretty serious.

In terms of interesting strings other than session secrets, etc., what else did we see? Well, most usefully, there's some paths, error messages and versions strings that would appear to offer a very precise determination that indeed, ImageMagick is here and indeed, it is dangerously old:

ImageMagick 6.8.9-6 Q16 x86_64 2014-07-25 http://www.imagemagick.org
unrecognized PerlMagick method

Obviously, these strings made for a much more convincing bug report. Also note the PerlMagick reference. I'm not familiar with PerlMagick but perhaps PerlMagick leads to an in-process ImageMagick, which is why our out-of-bounds image data reads can read so much interesting stuff.


This was fun. We found a leak that encoded only a small amount of data per JPEG compressed pixel returned to us, allowing us to reliably reconstruct original bytes of exfiltrated server memory.

The combination of running an ImageMagick that is both old and also unrestricted in the enabled decoders is dangerous. The fix of retiring ImageMagick should take care of both those issues :)

Thursday, May 18, 2017

*bleed continues: 18 byte file, $14k bounty, for leaking private Yahoo! Mail images


*bleed attacks are hot right now. Most notably, there's been Heartbleed and Cloudbleed. In both cases, out-of-bounds reads in server side code resulted in private server memory content being returned to clients. This leaked sensitive secrets from the server process' memory space, such as keys, tokens, cookies, etc. There was also a recent client-side bleed in Microsoft's image libraries, exposed through Internet Explorer. One of the reason *bleed attacks are interesting is that they are not affected by most sandboxing, and they are relatively easy to exploit.

Presented here is Yahoobleed #1 (YB1), a way to slurp other users' private Yahoo! Mail image attachments from Yahoo servers.

YB1 abuses an 0-day I found in the ImageMagick image processing software. This vulnerability is now a so-called 1-day, because I promptly reported it to upstream ImageMagick and provided a 1-line patch to resolve the issue, which landed here. You can refer to it as CESA-2017-0002.

The previous *bleed vulnerabilities have typically been out-of-bounds reads, but this one is the use of uninitialized memory. An uninitialized image decode buffer is used as the basis for an image rendered back to the client. This leaks server side memory. This type of vulnerability is fairly stealthy compared to an out-of-bounds read because the server will never crash. However, the leaked secrets will be limited to those present in freed heap chunks.

Yahoo! response

The Yahoo! response has been one of best I've seen. In order:
  1. They have a bug bounty program, which encourages and rewards security research and fosters positive hacker relations, etc.
  2. Upon receiveing a bug, they serve a 90-day response deadline on themselves. This is very progressive and the polar opposite of e.g. Microsoft, who occasionally like to turn reasonable disclosure deadlines into a pointless fight.
  3. And indeed the bug was fixed well within 90 days.
  4. The communication was excellent, even when I was sending far too many ping requests.
  5. The fix was particularly thorough: ImageMagick was retired.
  6. A robust bounty of $14,000 was issued (for this combined with a similar issue, to be documented separately). $778 per byte -- lol!
  7. I'm donating this reward to charity. Upon being asked about charitable matching, Yahoo! accepted a suggestion to match (i.e. double) the reward to $28,000.
  8. :sunglasses_like_a_boss:

The attack vector for these demos was to attach the 18-byte exploit file (or a variant) as a Yahoo! Mail attachment, send it to myself, and then click on the image in the received mail to launch the image preview pane. The resulting JPEG image served to my browser is based on uninitialized, or previously freed, memory content.

The following three images have had entropy stripped from them via a variety of transforms. Originals have been destroyed.

In image #1, you can see the capital letter A inside a black circle. Imagine suddenly and unexpectedly seeing this in a returned image while investigating a vulnerability you expected to be boring! There are various possible reasons for the repeated nature of the recovered image -- assuming the original in-memory image does not consist of a repeated nature.

Most obviously, the in-memory image dimensions probably don't match the dimensions of our uninitialized canvas, 1024x1024 in this particular case. Depending on how the in-memory image is stored, this will lead to repetition and / or offsetting as both seen here.

Also, the in-memory image representation might not match the colorspace, colorspace channel order or alpha channel status (yes or no) of the uninitialized RLE decode canvas. The thumbnail decode and re-encode pipeline will leave all sorts of different in memory artifacts in the course of doing its job.

Don't be in any doubt though: correct reconstruction of the original image would be possible, but that's a non-goal.

In image #2, you might still be able to make out the remains of a human face. Perhaps a forehead, perhaps a nose and even a forehead and cheekbone? Or maybe not, because of the stripping of entropy and transforms applied. But you can appreciate that at the time, seeing a random face was a shock and illustrated the severity of the leak. At that point, I ceased, desisted, destroyed all files based on uninitialized memory and reported the bug.

In image #3, the color has been left in because the image shows interesting vertical bands of magenta, cyan and yellow. What in-memory structure leads to this pattern? I have no idea. At first I was thinking some CMYK colorspace representation but I don't think that makes sense. JPEGs are typically coded in the YCbCr colorspace, so perhaps a partially encoded or decoded JPEG is involved.

The vulnerability

The vulnerability exists in the obscure RLE (Utah Raster Toolkit Run Length Encoded) image format, which previously featured in my blog post noting memory corruption in box.com. The new ImageMagick 0-day is in this code snippet here, from coders/rle.c:

    pixels=(unsigned char *) GetVirtualMemoryBlob(pixel_info);
    if ((flags & 0x01) && !(flags & 0x02))
          Set background color.
        for (i=0; i < (ssize_t) number_pixels; i++)
              for (j=0; j < (ssize_t) (number_planes-1); j++)
              *p++=0;  /* initialize matte channel */
      Read runlength-encoded image.
      switch (opcode & 0x3f)
    } while (((opcode & 0x3f) != EOFOp) && (opcode != EOF));

It's a tricky vulnerability to spot because of the abstraction and also because this is a vulnerability caused by the absence of a necessary line of code, not the presence of a buggy line of code. The logic is approximately:
  1. Allocate a suitably sized canvas for the image decode. Note that the GetVirtualMemoryBlob call does NOT guarantee zero-filled memory, as you would expect if it were backed by mmap(). It's just backed by malloc().
  2. Depending on some image header flags, either initialize the canvas to a background color, or don't.
  3. Iterate a loop of RLE protocol commands, which may be long or may be empty.
  4. After this code snippet, the decoded canvas is handed back to the ImageMagick pipeline for whatever processing is underway.
As you can now see, the attacker could simply create an RLE image that has header flags that do not request canvas initialization, followed by an empty list of RLE protocol commands. This will result in an uninitialized canvas being used as the result of the image decode. Here's an RLE file that accomplishes just that. It's just 18 bytes!

And these 18 bytes parse as follows:

52 CC:       header
00 00 00 00: top, left at 0x0.
00 04 00 04: image dimensions 1024 x 1024
02:          flags 0x02 (no background color)
01:          1 plane (i.e. grayscale image)
08:          8 bits per sample
00 00 00:    no color maps, 0 colormap length, padding
07:          end of image (a protocol command is consumed pre-loop)
07:          end of image (end the decode loop for real)

There are a few bytes that are important for experimentation of exploitation: the number of planes (i.e. a choice between a greyscale vs. a color image), and the image dimensions, which determine the size of the unininitialized malloc() chunk.


Exploitation is an interesting discussion. Here are some of the factors that affect exploitation of this vulnerability, both generally and in the Yahoo! case specifically:
  • Decoder lockdown. Yahoo! did not appear to implement any form of whitelisting for only sane ImageMagick decoders. RLE is not a sane decoder on the modern web. Anyone using ImageMagick really needs to lock down the decoder list to just the ones needed.
  • Sandboxing. As noted above, sandboxing doesn't have much of an effect on this vulnerability type. Although it doesn't make much difference, my best guess is that the server in question might be using Perl and PerlMagick, as well as handling various network traffic, suggesting that a tight sandbox policy is difficult without extensive refactoring.
  • Process isolation. This is critical. *bleed bugs for most usages of ImageMagick are mostly harmless because it's typical to kick off e.g. a thumbnail request by launching the ImageMagick convert binary, once per thumbnail request. This means that the process memory space of the exploited process is only likely to contain the attacker's image data, making exploitation much less interesting (but not irrelevant as we saw in my recent Box and DropBox blog posts).
    The Yahoo! Mail process which is handling thumbnailing has an unusual property, though: it appears to be long lived and to process images from a variety of different users. Suddenly, a memory content leak is a very serious vulnerability.
  • Heap implementation. The heap implementation used is significant. For the input RLE file decomposed above, the canvas allocation is 1024 x 1024 x 4, or 4MB. This is large allocation and for example, the default malloc() tuning on Linux x86_64 leads to such a large allocation being satisfied by mmap(), which zero-fills memory and will not lead to interesting memory content leakage! However, we definitely see leaked heap content with such a large allocation in the Yahoo! context, so a non-default heap setup is clearly in use. With our ability to leak heap content, we could likely fingerprint the exact heap implementation if we cared to do so. Who knows, maybe we'd find tcmalloc or jemalloc, both popular allocators with large service providers.
  • Thumbnail dimensions. It's worth a brief note on thumbnail dimensions. If we're in a situation where our uninitialized canvas gets resized down to a smaller thumbnail, this would compress out detail from the original leaked memory content. This would spoil certain exploitation attempts. However, Yahoo! Mail's preview panel seems to display very large images (tested to 2048x2048) verbatim, so no worries here.
  • Thumbnail compression. This one is interesting. Yahoo! Mail returns thumbnails and image previews as JPEG images. As we know, JPEG compression is lossy. This is not a particular issue if we're only interested in pulling image data out of Yahoo! servers. But if we're interested in looking at raw bytes of Yahoo! server memory, then lossy compression is losing us data.
    As an example of this, I tried to extract concrete pointer values from memory dumps -- perhaps we want to remotely defeat ASLR? I used greyscale images because JPEG compresses color data more than it compresses intensity data. But still, with a 16-byte sized data exfiltration I see pointer values such as 0x020081c5b0be476f and 0x00027c661ac2722a. Hmm. You can see that these may be trying to be Linux x86_64 user space pointers (0x00007f....) but there's a lot of information loss there.
    I think it would be possible to overcome this exfiltration problem. Most likely, there's some thumbnailing endpoint that returns (or can be asked to return with some URL parameter) lossless PNG images, which would do the trick. Or for the brave, mathematical modelling of JPEG compression??


Broadly: in a world of decreasing memory corruption and increasing sandboxing, *bleed bugs provide a compelling option for easily stealing information from servers.

On design: taking a long running process and linking in ImageMagick as a library has to be a discouraged design choice, as it takes bugs that might not be very serious and makes them very serious indeed. Limiting ImageMagick decoders to a minimal required set is a must for any processing of untrusted input.

GraphicsMagick vs. ImageMagick, again. Well, well, look at this :) GraphicsMagick fixed this issue in March 2016, for the v1.3.24 release, tucked away in a changeset titled "Fix SourceForge bug #371 "out-of-bounds read in coders/rle.c:633:39" (see the second memset()). This is another case where tons of vulnerabilities are being found and fixed in both GraphicsMagick and ImageMagick with little co-ordination. This seems like a waste of effort and a risk of 0-day (or is it 1-day?) exposure. It goes both ways: the RLE memory corruption I referenced in my previous blog post was only fixed in GraphicsMagick in March 2016, having been previously fixed in ImageMagick in Dec 2014.

Linux distributions. This vulnerability is now 1-day in the sense that it is broadly unpatched by entities that repackage upstream ImageMagick, such as Linux distributions. (As a side note, it's worth noting that the RLE decoder in Ubuntu 16.04 LTS is totally borked due to a bogus check that can be seen getting removed here.)

But most significantly for this conclusion, I wanted to highlight some questions about ecosystem responsibility.
Researcher responsibility: what is a researcher's responsibility and where does it end? As a researcher who becomes aware of a software risk, I believe I have a responsibility to let the "owner" of that software immediately know about the issue and the fact that I think it's a security risk. Let's assume that the owner makes a fix in a reasonable time. Are we good now? Well, what about all the downstream usages (eventually cascading to end users) that don't yet have the fix? So is it my responsibility to go and find and harangue every cloud provider that is affected? No, probably not. We might let one or two providers with bounty programs know as a positive feedback for encouraging security research, but the broader problem remains.
Upstream vendor responsibility: how should the upstream vendor respond to a security report other than identifying it (if not already clearly flagged) and fixing it promptly? Personally, I think there should be some form of notification in a well defined place (mailing list; web site announcement area; etc.) and the upstream vendor could also take the burden of getting the CVE.
Consumer responsibility: what should consumers such as Yahoo!, Box, DropBox, Ubuntu, etc. do? Well, if the upstream vendor is doing a good job, the consumer security teams can all be subscribed to the same authoritative source and take action whenever a new announcement is made. (Probably less trivial than it sounds; both Box and Yahoo! appear to have been running old versions of ImageMagick with known vulnerabilities.)