It seems, though, that these various techniques require the user to choose between "fast" or "thorough protection". This isn't ideal. Shortly, I'll document my own idea for how to try and get both fast and thorough. But first, a recap on what we mean by fast and thorough.
Fast vtable protection
Protecting vtables typically involves inserting machine instructions around vtable pointer load or virtual calls. Going fast is simple: only insert a very small number of fast instructions (i.e. no hard-to-predict branches). This is the approach taken by vtguard. If you look at page 14 in the vtguard PDF linked above, you'll see that there's just a single cmp and a single jne (short, and never taken in normal execution) added to the hot path.
Tangentially, another task commonly undertaken when adding vtable protections to a given program is to remove as many virtual calls as possible, by annotating classes and methods with the "final" keyword and/or applying whole-program optimizations.
Thorough vtable protection
Describing what we want in a thorough vtable protection is a little more involved. We want:
- Defeating ASLR does not defeat the vtable check. (vtguard lacks this property, whereas the GCC implementation has it.)
- Only a valid vtable pointer can be used.
- Furthermore, only a vtable pointer corresponding to the correct hierarchy for the call site can be used.
- Ideally, only a vtable pointer corresponding to the correct hierarchy level for the call site can be used.
A fast solution for thorough vtable protection?
How can we get all of the protections above and get them fast? My idea revolves around separating the problem into two pieces:
- Work out whether we can trust the vtable pointer or not.
- Validate that the class type represented by the vtable pointer is appropriate for the call site.
To trust or not to trust?
Current schemes trust the vtable pointer or not, based either on an some secret (vtguard, xor-based LLVM approach), a fixed table of valid values (GCC, some LLVM approaches) or by constraining values that might appear in the vtable position (heap partitioning).
The new scheme would be to reserve a certain portion of the address space for vtables. We know that nothing else can be mapped there, so by suitably masking any proposed vtable pointer, we know it is valid. I haven't fully thought this through for 32-bit, but look at this 64-bit variant:
- Host vtables in the lower 4GB of address space.
- Use the dereference of a 32-bit register to load the vtable entry. This provides masking for free and even saves a byte in the instruction sequence. It works because loading 4-bytes into a 64-bit register zero extends the result.
- Optionally, save memory by having the compiler use 4-byte vtables.
This scheme is approximately free, maybe even performance positive in some situations. Furthermore, one possible implementation is to stop somewhere around here for a very fast protection scheme that is "ok" in thoroughness.
On the downside, you've lost the 64-bit invariant that "nothing is mapped in the bottom 4GB", but the percentage of space used is going to be small. If that bothers us, then we can use the same trick to load a 4-byte vtable pointer and then "or" it with 0x100000000 (use bts if you dare) or some other value.
Validating class type
Once you know you trust your vtable pointer, validating the class type becomes a lot simpler. Instead of messing with secrets inside the vtable, you can just store a compact representation of the class type inside the vtable, with the aim of satisfying validation needs with a single compare.
The one trick we want to play is to make it easy to validate various different positions in a class hierarchy with minimal work. To do this, we can store class details in a hierarchical format. To take a simple case, imagine that we have the following classes in the system:
A1, A1:B1, A2, A2:B1, A2:B1:C1
We encode these using one byte per hierarchy level, the basemost class being the LSB: 00000001, 00000101, 00000002, 00000102, 00010102. (Note that this will be an approximation. For example, if you have more than 256 basemost classes with virtual functions, you would need to represent the first level with 2 or more bytes.)
Finally, our "is this object of the correct type for the callsite?" check becomes a simple compare. Depending on the position in the hierarchy, we may be able to achieve the compare with no masking and therefore a single instruction.
For example, for a call site expecting an object of type A1, it's just "cmpb $1, (%eax)". That's a 4-byte sequence, which is much shorter than the 10-byte sequence noted in the vtguard PDF. For a call site expecting an object of type A2:B1, it's "cmpw $0x102, (%eax)".
Closing notes
Will it work well? Who knows. I haven't had time to implement this, nor am I likely to in the near future. Feel free to take this and run with it.
Note that this idea doesn't cover what to do with raw function pointer calls. If you want to head towards complete control flow integrity, you'll want to look at protecting those, as well as return addresses (the current canary-based stack defenses do nothing against an arbitrary read/write primitive).