Bochspwn Reloaded
June 2018
Abstract
One of the responsibilities of modern operating systems is to enforce
privilege separation between user-mode applications and the kernel. This
includes ensuring that the influence of each program on the execution en-
vironment is limited by the defined security policy, but also that programs
may only access the information they are authorized to read. The latter
goal is especially difficult to achieve considering that the properties of C
– the main programming language used in kernel development – make
it highly challenging to securely pass data between different security do-
mains. There is a significant risk of disclosing sensitive leftover kernel
data hidden amidst the output of otherwise harmless system calls, unless
special care is taken to prevent the problem. Issues of this kind can help
bypass security mitigations such as KASLR and StackGuard, or retrieve
information processed by the kernel on behalf of the system or other users,
e.g. file contents, network traffic, cryptographic keys and so on.
In this paper, we introduce the concept of employing full system em-
ulation and taint tracking to detect the disclosure of uninitialized kernel
stack and heap/pool memory to user-space, and discuss how it was suc-
cessfully implemented in the Bochspwn Reloaded project based on the
open-source Bochs IA-32 emulator. To date, the tool has been used to
identify over 70 memory disclosure vulnerabilities in the Windows kernel,
and more than 10 lesser bugs in Linux. Further in the document, we
evaluate alternative ways of detecting such information leaks, and outline
data sinks other than user-space where uninitialized memory may also
leak from the kernel. Finally, we provide suggestions on related research
areas that haven’t been fully explored yet. Appendix A details several
further ideas for system-wide instrumentation (implemented using Bochs
or otherwise), which can be used to discover other programming errors in
OS kernels.
Contents
1 Introduction
5 Alternative detection methods
5.1 Static analysis
5.2 Manual code review
5.3 Cross-version kernel binary diffing
5.4 Differential syscall fuzzing
5.5 Taintless Bochspwn-style instrumentation
7 Future work
8 Conclusion
9 Acknowledgments
1 Introduction
Modern operating systems running on x86/x86-64 CPUs are multi-threaded
and use a client-server model, where user-mode applications (clients) execute
independently of each other, and only call into the kernel (server) once they
intend to operate on a resource managed by the system. The mechanism used
by code running in ring 3 to call into a predefined set of ring 0 functions is
known as system calls or syscalls in short. While CPU registers are used to
return the exit code and to pass the first few arguments on 64-bit platforms,
the primary I/O data exchange channel is user-mode memory. Consequently,
the kernel continuously operates on ring 3 memory during its activity. The life
of a typical system call is illustrated in Figure 1.
[Figure 1: The life of a typical system call, showing the user-mode program,
the shared (user-mode) memory, and the syscall logic in the system kernel.]
The execution flow is simple in principle, but the fact that user-mode mem-
ory is a shared resource that can be read or written to at any point in time by
another thread makes the kernel’s interactions with the memory prone to race
conditions and other errors, if not implemented carefully. One way to achieve
a secure implementation is to have every system call handler adhere to the
following two rules (in that order):
1. Fetch data from each user-mode memory location at most once, with an
active exception handler in place, and save a local copy in the kernel space
for further processing.
2. Write to each user-mode memory location at most once, with an active
exception handler in place, only with data intended for the user-mode
client.
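A minimal user-space sketch of the two rules follows. The request_t and response_t layouts, and the fetch_from_user and store_to_user helpers, are hypothetical stand-ins (plain memcpy wrappers); a real kernel would validate the pointers and perform the accesses under an exception handler.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layouts shared between the "client" and the "kernel". */
typedef struct { uint32_t length; uint32_t value; } request_t;
typedef struct { uint32_t status; uint32_t result; } response_t;

/* Stand-in for a guarded read of client memory (assumption: cannot fail). */
static int fetch_from_user(void *dst, const void *user_src, size_t n) {
    memcpy(dst, user_src, n);
    return 0;
}

/* Stand-in for a guarded write to client memory (assumption: cannot fail). */
static int store_to_user(void *user_dst, const void *src, size_t n) {
    memcpy(user_dst, src, n);
    return 0;
}

int handle_request(const request_t *user_req, response_t *user_resp) {
    request_t req;    /* kernel-side snapshot of the input */
    response_t resp;  /* kernel-side copy of the output */

    assert(user_req != NULL && user_resp != NULL);

    /* Rule 1: read the client memory exactly once, into a local copy. */
    if (fetch_from_user(&req, user_req, sizeof(req)) != 0) return -1;

    /* Validate and compute on the snapshot only, never on user_req. */
    if (req.length > 16) return -1;

    memset(&resp, 0, sizeof(resp));  /* no stale bytes in the output */
    resp.result = req.value * req.length;

    /* Rule 2: write the client memory exactly once, with intended data only. */
    return store_to_user(user_resp, &resp, sizeof(resp));
}
```

Because the handler validates and computes on its private snapshot, a concurrent thread that modifies the request after the fetch cannot influence the result.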
The usage of pointer annotations (such as __user in Linux) and dedicated
functions for operating on client memory (e.g. copy_from_user, copy_to_user,
get_user, put_user in Linux) helps maintain a healthy user-mode interaction
hygiene, by forcing developers to make conscious decisions about when to
perform these operations. On the other hand, the looser approach observed in the
Windows kernel (direct pointer manipulation) seems to provoke the presence of
more security issues, as demonstrated later in the paper.
Breaking each of the above requirements has its own unique consequences,
with varying degrees of impact on system security:
The above bug classes are addressed in more detail in Appendix A. The
primary subject of this paper is the breaking of the following rule:
2 Memory disclosure in operating systems
A disclosure of privileged system memory occurs when the kernel returns a larger
region of data than is necessary to store the relevant information contained
within. Frequently, the redundant bytes originate from a kernel memory region
which used to store data previously processed in another context, and are not
pre-initialized to ensure that the old values are not propagated to new data
structures.
In some cases, the fault can be clearly attributed to insufficient initialization
of certain variables or structure fields in the code. At other times, information
leaks occur in the kernel even though the corresponding source code is seemingly
correct (sometimes only on specific CPU architectures). In either case, memory
disclosure between different privilege levels running C code is a problem hardly
visible to the naked eye.
6.7.9 Initialization
...
10 If an object that has automatic storage duration is not initialized
explicitly, its value is indeterminate.
The part most applicable to system code is the one concerning stack-allocated
objects, as kernels typically have dynamic allocation interfaces with their own
semantics (not necessarily consistent with the C standard library, as discussed
in Section 2.2.1 “Memory reuse in dynamic allocators”).
To the best of our knowledge, none of the three most popular C compilers for
Windows and Linux (Microsoft C/C++ Compiler, gcc, LLVM) produce code
which pre-initializes otherwise uninitialized objects on the stack in release mode
(or equivalent). There are compiler switches enabling the poisoning of stack
frames with marker bytes (e.g. /RTCs in Microsoft Visual Studio), but they are
not used in production builds for performance reasons. As a result, uninitialized
objects on the stack inherit old values of their corresponding memory areas.
Let’s consider an example of a fictional Windows system call implementation,
which multiplies the input integer by two and returns the product (Listing 1).
It is evident that in the corner case of InputValue=0, the OutputValue variable
remains uninitialized and is copied in that form back to the client. Such a bug
would enable the disclosure of four bytes of kernel stack memory on each syscall
invocation.
While possible, information leaks through standalone variables are not very
common in practice, as modern compilers will often detect and warn about
such problems, and being functional bugs, they may also be identified during
development or testing. However, a second example in Listing 2 shows that a
leak may also take place through a structure field.
In this case, the Reserved field is never explicitly used in the code, but is still
copied back to user-mode, and thus also discloses four bytes of kernel memory
to the caller. This example distinctly shows that having every field of every
structure returned to the client initialized on every code path is a difficult task,
and in many cases it is plainly counterintuitive to do so, if the field in question
does not play any practical role in the code.
Overall, the fact that uninitialized variables on the stack and in dynamic
allocations take the contents of data previously stored in their memory regions
lies at the core of the kernel memory disclosure problem.
    typedef struct _SYSCALL_OUTPUT {
        DWORD Sum;
        DWORD Product;
        DWORD Reserved;   /* never written by the handler below */
    } SYSCALL_OUTPUT, *PSYSCALL_OUTPUT;

    NTSTATUS NTAPI NtArithOperations(
        DWORD InputValue,
        PSYSCALL_OUTPUT OutputPointer
    ) {
        SYSCALL_OUTPUT OutputStruct;   /* uninitialized stack object */

        OutputStruct.Sum = InputValue + 2;
        OutputStruct.Product = InputValue * 2;

        /* Copies the whole structure, including the stale Reserved field. */
        RtlCopyMemory(OutputPointer, &OutputStruct, sizeof(SYSCALL_OUTPUT));

        return STATUS_SUCCESS;
    }
Practically speaking, C compilers for the x86(-64) architectures apply natu-
ral alignment to structure fields of primitive types, which means that each such
field is N-byte aligned, where N is the given field’s width. Furthermore, entire
structures and unions are also aligned such that when they are declared in an
array, the alignment requirements of the nested fields are still met. In order to
accommodate the alignment, padding bytes are artificially inserted into
structures where necessary¹. While not directly accessible in the source code, these
bytes also inherit old values from the underlying memory regions and may leak
information to user-mode, if they are not reset in time.
In the example shown in Listing 3, a SYSCALL_OUTPUT structure is passed
back to the caller. It contains a 4-byte and an 8-byte field, separated by 4 bytes
of padding required to align LargeSum to an 8-byte boundary. Even though
both fields are properly initialized, the padding bytes are not explicitly set,
which again leads to a kernel stack memory disclosure. The specific layout of
the structure in memory is illustrated in Figure 2.
¹ [...] if they are of type char and long long (or equivalent), respectively. In some rare and extreme
cases, the padding can be as long as 15, 31 or even 63 bytes, for such esoteric types as 80-bit
long double, __m256 or __m512.
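[Figure 2: Memory layout of the SYSCALL_OUTPUT structure. Sum occupies offsets
0x0-0x3 (3B 05 00 00), offsets 0x4-0x7 hold the uninitialized padding, and
LargeSum occupies offsets 0x8-0xF (00 00 00 00 00 00 00 00).]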
region prior to initializing any of its fields and copying it to user-mode, like so:
memset(&OutputStruct, 0, sizeof(OutputStruct));
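The effect of the padding bytes, and of the memset-based fix, can be reproduced in user space. The sketch below is an illustration under stated assumptions, not this paper's Listing 3: the stack frame is simulated with a byte buffer pre-filled with a hypothetical 0xAA marker standing in for stale kernel data, and build_output and count_stale_bytes are names made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STALE_MARKER 0xAA  /* hypothetical stand-in for leftover stack data */

typedef struct {
    uint32_t Sum;       /* 4 bytes */
    uint64_t LargeSum;  /* 8 bytes, naturally aligned on most 64-bit ABIs */
} SYSCALL_OUTPUT;

/* Builds, in `frame`, the bytes a handler would copy back to user-mode.
 * With zero_first != 0 the region is reset before the fields are set. */
void build_output(unsigned char *frame, int zero_first) {
    uint32_t sum = 0x53B;
    uint64_t large_sum = 0;

    assert(frame != NULL);

    /* Simulate a stack slot that still holds old data. */
    memset(frame, STALE_MARKER, sizeof(SYSCALL_OUTPUT));

    if (zero_first) {
        memset(frame, 0, sizeof(SYSCALL_OUTPUT));
    }

    /* Field-by-field initialization touches only the fields themselves. */
    memcpy(frame + offsetof(SYSCALL_OUTPUT, Sum), &sum, sizeof(sum));
    memcpy(frame + offsetof(SYSCALL_OUTPUT, LargeSum), &large_sum,
           sizeof(large_sum));
}

/* Counts the bytes of the output that still carry the stale marker. */
size_t count_stale_bytes(const unsigned char *frame) {
    size_t stale = 0;

    assert(frame != NULL);
    for (size_t i = 0; i < sizeof(SYSCALL_OUTPUT); i++) {
        if (frame[i] == STALE_MARKER) {
            stale++;
        }
    }
    return stale;
}
```

Without the initial reset, the number of stale bytes equals the amount of padding the compiler inserted (sizeof(SYSCALL_OUTPUT) minus the combined width of the two fields); with the reset, no stale byte survives.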
6.2.5 Types
...
20 Any number of derived types can be constructed from the ob-
ject and function types, as follows: [...] A union type describes an
overlapping nonempty set of member objects, each of which has an
optionally specified name and possibly distinct type.
The problem here is that if a union consists of several fields of various widths,
and only one of the smaller fields is set, then the remaining bytes allocated to
accommodate the larger ones are left uninitialized. Let’s examine an example
of an imaginary system call handler presented in Listing 4, together with the
memory layout of the SYSCALL_OUTPUT union illustrated in Figure 3.
As can be seen, the total size of the SYSCALL_OUTPUT union is 8 bytes, due to
the width of the larger LargeSum field. However, the function only sets the value
of the smaller field, leaving the 4 trailing bytes uninitialized and subsequently
disclosed to the client application.
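[Figure 3: Memory layout of the SYSCALL_OUTPUT union. Sum occupies offsets
0x0-0x3 (3B 05 00 00) and overlaps the first half of the 8-byte LargeSum field;
offsets 0x4-0x7 remain uninitialized.]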
A safe implementation should only set the Sum field in the user address space,
instead of copying the entire object with potentially unused regions of memory.
Another frequently seen fix is the usage of the memset function to reset the
kernel’s copy of the union prior to setting any of its fields and passing it back
to user-mode.
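The first of these fixes can be sketched in user space as follows. This is an illustration rather than this paper's Listing 4: the kernel's stack slot is simulated with a byte buffer filled with a hypothetical 0xBB marker, and the type and function names (SYSCALL_OUTPUT_UNION, fill_kernel_union, copy_whole_union, copy_sum_only, count_union_marker_bytes) are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UNION_STALE_MARKER 0xBB  /* hypothetical leftover stack contents */

typedef union {
    uint32_t Sum;
    uint64_t LargeSum;
} SYSCALL_OUTPUT_UNION;

/* Simulates the handler's stack slot: stale data, then only Sum is set. */
void fill_kernel_union(unsigned char *slot, uint32_t sum) {
    assert(slot != NULL);
    memset(slot, UNION_STALE_MARKER, sizeof(SYSCALL_OUTPUT_UNION));
    memcpy(slot + offsetof(SYSCALL_OUTPUT_UNION, Sum), &sum, sizeof(sum));
}

/* Leaky variant: copies the entire union, including the unused tail. */
void copy_whole_union(unsigned char *user_buf, const unsigned char *slot) {
    memcpy(user_buf, slot, sizeof(SYSCALL_OUTPUT_UNION));
}

/* Safer variant: writes only the field that was actually set. */
void copy_sum_only(unsigned char *user_buf, const unsigned char *slot) {
    memcpy(user_buf + offsetof(SYSCALL_OUTPUT_UNION, Sum),
           slot + offsetof(SYSCALL_OUTPUT_UNION, Sum), sizeof(uint32_t));
}

/* Counts the bytes in the client's buffer that carry the stale marker. */
size_t count_union_marker_bytes(const unsigned char *buf, size_t len) {
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == UNION_STALE_MARKER) {
            count++;
        }
    }
    return count;
}
```

Copying the whole union hands the trailing bytes of the stack slot to the client, whereas the field-only copy leaves the rest of the client's buffer untouched.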
2.2 System-specific properties
There are certain kernel design decisions, programming practices and code pat-
terns which affect how prone operating systems are to memory disclosure vul-
nerabilities. They are briefly discussed in the following subsections.
Windows The core of the Windows kernel pool allocator is the
ExAllocatePoolWithTag function [82], which may be used directly, or via several
available wrappers: ExAllocatePool{∅, Ex, WithQuotaTag, WithTagPriority}. None
of these routines reset the returned memory regions, either by default or
through any input flags. On the contrary, all of them have the following warning
in their corresponding MSDN documentation entries:
There are six primary pool types the callers can choose from: NonPagedPool,
NonPagedPoolNx, NonPagedPoolSession, NonPagedPoolSessionNx, PagedPool
and PagedPoolSession. Each of them has a designated region in the virtual
address space, and so memory allocations may only be reused within the scope
of their specific pool type. The reuse rate of memory chunks is very high, and
zeroed out areas are generally only returned when no suitable entry is found in
the lookaside lists, or the request is so large that fresh memory pages need to be
mapped to facilitate it. In other words, there are currently hardly any factors
preventing pool memory disclosure in Windows, and almost every such bug can
be exploited to leak sensitive data from various parts of the kernel.
Linux The Linux kernel offers three main interfaces for dynamically allocating
memory:
• The kmalloc function has a kzalloc counterpart, which makes sure that
the returned memory is cleared.
• An optional __GFP_ZERO flag can be passed to kmalloc, kmem_cache_alloc
and several other functions to achieve the same result.
• The kmem_cache_create routine accepts a pointer to an optional constructor
function, called to pre-initialize each object before returning it to the
requestor. The constructor may be implemented as a wrapper around
memset to reset the given memory area.
As can be expected, large buffers are rarely used to their full capacity, and
the remaining storage space is often not reset. This may lead to particularly
severe leaks of long, continuous areas of kernel memory. In the fictional example
shown in Listing 5, a system call uses a RtlGetSystemPath function to load the
system path into a local buffer, and if the call succeeds, all of the 260 bytes are
passed to the caller, regardless of the actual length of the string.
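[Figure: Contents of the local buffer, which holds the string
C:\Windows\System32\0 followed by uninitialized bytes up to the 260-byte
boundary.]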
    NTSTATUS NTAPI NtMagicValues(LPDWORD OutputPointer, DWORD OutputLength) {
        if (OutputLength < 3 * sizeof(DWORD)) {
            return STATUS_BUFFER_TOO_SMALL;
        }

        /* The allocation size is controlled by the caller. */
        LPDWORD KernelBuffer = Allocate(OutputLength);

        /* Only the first 12 bytes of the buffer are ever initialized. */
        KernelBuffer[0] = 0xdeadbeef;
        KernelBuffer[1] = 0xbadc0ffe;
        KernelBuffer[2] = 0xcafed00d;

        /* The whole buffer is copied, including any uninitialized tail. */
        RtlCopyMemory(OutputPointer, KernelBuffer, OutputLength);
        Free(KernelBuffer);

        return STATUS_SUCCESS;
    }
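[Figure: Contents of the kernel buffer. The first 12 bytes hold the three magic
values (EF BE AD DE FE 0F DC BA 0D D0 FE CA); any bytes beyond them are
uninitialized.]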
subsequent requests of the same length, it becomes possible to have the
leaked allocations overlap with ones that had contained a specific kind of
confidential data in the past.
As such, this is one of the most dangerous types of memory disclosure. It can
be addressed by keeping track of the number of bytes written to the temporary
kernel buffer, and passing only this amount of data back to the client.
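A sketch of that fix, written as portable C rather than kernel code, is shown below. It is an assumption-laden illustration: malloc and free stand in for the fictional Allocate and Free routines, memcpy stands in for the copy to user-mode, and produce_magic_values and magic_values_fixed are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Fills the temporary buffer and reports how many bytes were written. */
static size_t produce_magic_values(uint32_t *kernel_buf) {
    kernel_buf[0] = 0xdeadbeef;
    kernel_buf[1] = 0xbadc0ffe;
    kernel_buf[2] = 0xcafed00d;
    return 3 * sizeof(uint32_t);
}

/* Returns the number of bytes copied to `output`, or 0 on failure. */
size_t magic_values_fixed(void *output, size_t output_length) {
    assert(output != NULL);

    if (output_length < 3 * sizeof(uint32_t)) {
        return 0;  /* output buffer too small */
    }

    uint32_t *kernel_buf = malloc(output_length);
    if (kernel_buf == NULL) {
        return 0;
    }

    /* Track the amount of data actually produced... */
    size_t written = produce_magic_values(kernel_buf);

    /* ...and copy back only that much, not the requested length. */
    memcpy(output, kernel_buf, written);
    free(kernel_buf);

    return written;
}
```

Regardless of how large a buffer the client requests, only the 12 initialized bytes ever leave the temporary allocation.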
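[Figure: A user-mode system API layer sits between the user-mode program and
the system kernel. It extracts the meaningful data from the output of the
syscall logic and returns only the specific requested values, so the disclosed
memory is lost at that layer.]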
Stacks The distinction between stacks and heaps/pools is that stacks gener-
ally store information directly related to the control flow, such as addresses of
kernel modules, dynamic allocations, stacks, and the secret values of stack
cookies installed by exploit mitigations such as StackGuard on Linux [24] and /GS
on Windows [83]. These are consistent pieces of information immediately useful
for potential attackers who intend to combine them with memory corruption
exploits. However, the variety of data they offer is limited², which leads us to
believe that they don’t present much value as standalone vulnerabilities.
case in macOS in October 2017 [34], and in Linux until April 2018 [41].
CPU emulator, and memory taint tracking was not implemented. More infor-
mation on this approach can be found in Section 5.5 “Taintless Bochspwn-style
instrumentation”. To the best of our knowledge, aside from Bochspwn Reloaded, this
was the first attempt to automatically identify kernel memory disclosure in a
closed-source operating system.
While not directly related to memory disclosure, it is also worth noting that
in 2010/2011, several types of ring 0 addresses were found to be leaked through
uninitialized return values of several win32k system calls [104, 51]. The problem
was supposed to be mitigated in Windows 8 [42], but in 2015, Matt Tait spotted
that the fix had been incomplete and discovered three further issues [74].
Mitigation One of the few generic mitigations we are aware of is that since
June 2017, the Windows I/O manager resets the output memory regions for all
buffered I/O operations handled by kernel drivers [39]. This change killed an
entire class of infoleaks where the IOCTL handlers left uninitialized chunks of
bytes in the buffer, or didn’t initialize the buffer at all.
Another minor security improvement is the fact that in Visual Studio 15.5
and later, POD (plain old data) structures that are initialized at declaration
using a “= {0}” directive are filled with zeros as a whole. This is different from
the previous behavior, where the first member was set-by-value to 0 and the
rest of the structure was reset starting from the second field, thus potentially
leaking the contents of the padding bytes between the first and second member.
2.5.2 Linux
In contrast to Windows, the Linux community has been vocal about memory
disclosure for many years, with the biggest spike of interest starting around 2010.
Since then, a number of research projects have been devised, focusing on au-
tomatic detection of existing kernel infoleaks and on reducing or completely
nullifying the impact of presumed, yet unknown issues. We believe that the gap
in the current state of the art between Windows and Linux is primarily caused
by the open-source nature of the latter, which enables easy experimentation
with a variety of approaches – static, dynamic, and a combination of both.
Detection Throughout the last decade, there have been dozens of kernel in-
foleak fixes committed to the Linux kernel every year. According to Chen
et al. [22], there were 28 disclosures of uninitialized kernel memory fixed in the
period from January 2010 to March 2011. An updated study by Lu K. [44] shows
that a further 59 such vulnerabilities were reported between January 2013 and
May 2016. A large portion of the findings can be attributed to a small group of
researchers. For example, Rosenberg and Oberheide collectively discovered more
than 25 memory disclosure vulnerabilities in Linux in 2009-2010 [25, 26, 35],
mostly from the kernel stack. They subsequently demonstrated the usefulness
of such disclosures in the exploitation of grsecurity/PaX-hardened Linux ker-
nels in 2011 [37, 36]. Kulikov found over 25 infoleaks in 2010-2011 with manual
analysis and Coccinelle [107]. Similarly, Krause identified and patched 21 kernel
memory disclosure issues in 2013 [73], and more than 50 such bugs overall.
There are several tools readily available to detect infoleaks and other uses
of uninitialized memory in Linux, mostly designed with kernel developers in
mind. The most basic one is the -Wuninitialized compiler flag supported
by both gcc and LLVM, capable of detecting simple instances of uninitialized
reads within the scope of single functions. A more advanced option is the
kmemcheck debugging feature [108], which can be considered a kernel-mode
equivalent of Valgrind’s memcheck. At the cost of a significant CPU and mem-
ory overhead, the dynamic checker detects all uses of uninitialized memory
occurring in the code. The feature was recently removed from the mainline
kernel [17, 76], as the project is now considered inferior to the new and more
powerful KernelAddressSanitizer [12] and KernelMemorySanitizer checkers [19].
In the last few months, KMSAN coupled with the coverage-based syzkaller sys-
tem call fuzzer [11] has identified 19 reads of uninitialized memory [7], including
three leaks of kernel memory to userspace.
There have also been some notable efforts to use static analysis to detect
Linux kernel infoleaks. In 2014-2016, Peiró et al. demonstrated successful use
of model checking with the Coccinelle engine [3] to identify stack-based mem-
ory disclosure in Linux kernel v3.12 [97, 98]. The model checking was based on
taint tracking objects in memory from stack allocation to copy_to_user calls,
and yielded eight previously unknown vulnerabilities. In 2016, Lu et al. imple-
mented a project called UniSan [45, 44] – advanced, byte-granular taint tracking
performed at compile time to determine which stack and dynamic allocations
could leak uninitialized memory to one of the external data sinks (user-mode
memory, files and sockets). While the tool was primarily meant to mitigate
infoleaks by clearing all potentially unsafe allocations, the authors randomly
chose and analyzed a sample of about 20% of them (350 of about 1800), and
reported 19 new vulnerabilities in the Linux and Android kernels as a result.
Finally, several authors have proposed a technique of multi-variant program
execution to identify use of uninitialized memory. The basic premise of the ap-
proach is to concurrently run several replicas of the same software, capture their
output and compare it. If all legitimate sources of entropy are virtualized to re-
turn stable data across all replicas, then any differences in the output may only
be caused by leaking or using uninitialized memory. The non-determinism may
originate from entropy introduced by ASLR, or from different marker bytes used
to initialize new stack/heap allocations. The method was implemented for user-
mode programs in the DieHard [21] and BUDDY [44] projects in 2006 and 2017,
respectively. A similar approach was discussed by North in 2015 [96]. Lastly,
authors of SafeInit [94] also stated that their tool was meant as a software hard-
ening mechanism, but could be combined with a multi-variant execution system
to achieve bug detection capability. The technique was extensively evaluated
for client applications, but to our knowledge it hasn’t been successfully imple-
mented for the Linux kernel. In Section 5.4 “Differential syscall fuzzing”, we
present how a similar concept was shown to be effective in identifying infoleaks
in a subset of Windows system calls.
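The premise of the multi-variant approach can be illustrated with a toy example. In the sketch below, toy_syscall is a hypothetical handler that initializes only the first four bytes of an eight-byte output, and the two "replicas" differ solely in the marker byte used to pre-fill the scratch frame. It shows the idea only and is not the implementation discussed in Section 5.4.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_OUT_SIZE 8

/* Hypothetical handler under test: sets a 4-byte value in its scratch
 * frame and then returns all 8 bytes of the frame to the caller. */
void toy_syscall(unsigned char *frame, unsigned char *out) {
    uint32_t sum = 0x53B;
    memcpy(frame, &sum, sizeof(sum));   /* bytes 0-3 are initialized */
    memcpy(out, frame, TOY_OUT_SIZE);   /* bytes 4-7 are stale */
}

/* Runs the handler twice over frames poisoned with different markers and
 * flags every output byte that differs between the two runs. Returns the
 * number of flagged bytes; leak_mask[i] is 1 for each suspicious offset. */
size_t diff_leak_bytes(unsigned char *leak_mask) {
    unsigned char frame[TOY_OUT_SIZE];
    unsigned char out_a[TOY_OUT_SIZE];
    unsigned char out_b[TOY_OUT_SIZE];
    size_t leaked = 0;

    assert(leak_mask != NULL);

    memset(frame, 0x41, sizeof(frame));  /* replica A: marker 0x41 */
    toy_syscall(frame, out_a);

    memset(frame, 0x5A, sizeof(frame));  /* replica B: marker 0x5A */
    toy_syscall(frame, out_b);

    for (size_t i = 0; i < TOY_OUT_SIZE; i++) {
        leak_mask[i] = (unsigned char)(out_a[i] != out_b[i]);
        leaked += leak_mask[i];
    }
    return leaked;
}
```

Since the handler's legitimate output is deterministic, only the four stale bytes differ between the runs, which pinpoints both the presence and the offsets of the leak.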
Mitigation Generic mitigations against kernel memory disclosure generally
revolve around zeroing old memory regions to prevent leftover data from being
inherited by new, unrelated objects. The ultimate advantage of this method is
that it addresses the problem on a fundamental level, completely eliminating the
threat of uninitialized memory, and killing existing and future kernel infoleaks
at once. Only a small fraction of memory allocated by the kernel is ever copied
to user-mode; on the other hand, resetting all memory areas prior to or after use
incurs a significant overhead. Finding the optimal balance between system
performance and the degree of protection against memory disclosure is currently
the main point of discussion.
In the mainline kernel, the CONFIG_PAGE_POISONING and CONFIG_DEBUG_SLAB
options can be set to enable the poisoning of all freed dynamic allocations with
marker bytes. The result is that each piece of data is overwritten at the mo-
ment it is discarded by the caller, which prevents it from being leaked back
to user-mode later on. Since all allocations are subject to poisoning, the op-
tions come with a considerable performance hit, and they don’t protect stack
allocations, which seem to constitute a majority of Linux kernel infoleaks.
The grsecurity/PaX project [4, 9] provides further hardening features.
Setting the PAX_MEMORY_SANITIZE flag causes the kernel to erase memory pages and
slab objects when they are freed, except for low-risk slabs whitelisted for
performance reasons. Furthermore, the PAX_MEMORY_STRUCTLEAK option is designed to
zero-initialize stack-based objects such as structures, if they are detected to be
copied to userland. It may prevent leaks through uninitialized fields and padding
bytes, but it is a relatively lightweight feature that may be subject to false
negatives. A more exhaustive, but also more costly option is PAX_MEMORY_STACKLEAK,
which erases the used portion of the kernel stack before returning from each
system call. This eliminates any disclosure of stack memory shared between two
subsequent syscalls, but doesn’t affect leaks of data put on the stack by the
vulnerable syscall itself. Currently, the Kernel Self Protection Project is making
efforts to port the STACKLEAK feature to the mainline kernel [18, 38].
Other researchers have also proposed various variants of zeroing objects at
allocation and deallocation times in the Linux kernel. Secure deallocation [23]
(Chow et al., 2005) reduces the lifetime of data in memory by zeroing all regions
at deallocation or within a short, predictable period of time. A prototype of the
concept was implemented for Linux user-mode applications and the kernel page
allocator. Split Kernel [43] (Kurmus and Zippel, 2014) protects the system from
exploitation by untrusted processes by clearing entire kernel stack frames upon
entering each hardened function. SafeInit [94] (Milburn et al., 2017) clears
all stack and dynamic allocations before they are used in the code, to ulti-
mately eliminate information leaks and use of uninitialized memory. UniSan [45]
(Lu et al., 2016) reduces the overhead of SafeInit by performing advanced mem-
ory taint tracking at compile time, to conservatively determine which allocations
are safe to be left without zero-initialization, while still clearing the remaining
stack and heap-based objects.
As shown above, Linux has been subject to extensive experimentation in the
area of data lifetime and kernel memory disclosure.
3 Bochspwn Reloaded – detection with software x86 emulation
Bochs [1] is an open-source IA-32 (x86) PC emulator written in C++. It in-
cludes emulation of the Intel x86 CPU, common I/O devices and a custom
BIOS, making up a complete, functional virtual machine fully emulated in soft-
ware. It can be compiled to run in a variety of configurations, and similarly
it can correctly host most common operating systems, including Windows and
GNU/Linux. In our research, we ran Bochs on Windows 10 64-bit as the host
system.
Among many of the project’s qualities, Bochs provides an extensive instru-
mentation API, which makes it a prime choice for performing DBI (Dynamic
Binary Instrumentation) against OS kernels. At the time of this writing, there
are 31 supported instrumentation callbacks invoked by the emulator on many
occasions during the emulated computer’s run time, such as:
3.1 Core logic – kernel memory taint tracking
The fundamental idea behind Bochspwn Reloaded is memory taint tracking
performed over the entire kernel address space of the guest system. In this case,
taint is associated with every kernel virtual address, and represents information
about the initialization state of each byte residing in ring 0.
The high-level logic used to detect disclosure of uninitialized (tainted) kernel
memory is as follows:
3.1.1 Shadow memory representation
The taint information of the guest kernel address space is stored in so-called
shadow memory, a memory region in the Bochs process that maps each byte
(or chunk of bytes) to the corresponding metadata. As the scope of the metadata
and its low-level representation is different for 32-bit and 64-bit target systems,
we address them separately in the following paragraphs.
Each of the values is 32 bits wide. In total, the above fields together with the
taint boolean consume 17 bytes. If a separate descriptor entry was allocated for
each byte of the kernel address space, the overhead would be 34 GB for Windows
guests and 17 GB for Linux. To optimize the memory usage, we increased the
granularity of the above allocation-related metadata from 1 to 8 bytes, which
reduced the effective overhead from a factor of 17 to 3. This was sufficient to run
the instrumentation on a typical modern workstation equipped with a standard
amount of RAM. A summary of the information classes is shown in Table 1.
Information class      Type     Granularity   Memory usage   Memory usage
                                              (Windows)      (Linux)
Taint                  uint8    1 byte        2 GB           1 GB
Allocation size        uint32   8 bytes       1 GB           512 MB
Allocation base        uint32   8 bytes       1 GB           512 MB
Allocation tag/flags   uint32   8 bytes       1 GB           512 MB
Allocation origin      uint32   8 bytes       1 GB           512 MB
Total                                         6 GB           3 GB

Table 1: Summary of metadata information classes stored for x86 guest systems
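The totals quoted above follow from simple arithmetic, sketched below. The 2 GB and 1 GB kernel address space sizes are inferred from the taint row of Table 1, and shadow_bytes is a hypothetical helper written for this illustration, not part of the tool.

```c
#include <assert.h>
#include <stdint.h>

#define GIB (1ULL << 30)

/* Computes the shadow memory needed for `kernel_bytes` of guest kernel
 * address space: one taint byte per guest byte, plus four 32-bit
 * allocation fields (size, base, tag/flags, origin) stored once per
 * `alloc_granularity` guest bytes. */
uint64_t shadow_bytes(uint64_t kernel_bytes, uint64_t alloc_granularity) {
    assert(alloc_granularity != 0);

    uint64_t taint = kernel_bytes;
    uint64_t alloc_info =
        4 * sizeof(uint32_t) * (kernel_bytes / alloc_granularity);

    return taint + alloc_info;
}
```

With a granularity of 1 byte, each guest byte costs 1 + 16 = 17 shadow bytes, which gives 34 GB for a 2 GB kernel space and 17 GB for a 1 GB one. With a granularity of 8 bytes, the cost drops to 1 + 16/8 = 3 shadow bytes per guest byte, giving the 6 GB and 3 GB totals listed in the table.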
Shadow memory in x86-64 Representing taint information for 64-bit sys-
tems is more difficult, primarily because the address range subject to shad-
owing is as large as the user-mode address space of the Bochs emulator it-
self. Valid kernel memory addresses fall between 0xffff800000000000 and
0xffffffffffffffff, which is a 128 terabyte area that cannot be mapped to a
statically allocated region, both due to physical memory and virtual addressing
limitations. As a result, this technical challenge called for a substantial rework
of the metadata storage.
To start things off, we removed some of the less critical information related
to memory allocations, specifically the size, base address and tag/flags. While
certainly useful to have, we could still triage all potential reports without these
information classes, and maintaining them for x86-64 targets would be a sig-
nificant burden. However, we deemed the allocation origin important enough
to stay, as it was often the only way to track the control flow back to the vul-
nerable code area, e.g. when universal interfaces (such as ALPC in Windows)
were used to send data to user space. As allocating the data structure
statically was no longer an option, we converted the 0x10000000-item long array
to a std::unordered_map<uint64, uint64> hash map container, semantically
nearly identical to its 32-bit counterpart.
The last information class that needed to be considered was the taint itself.
It was possible to also use a hash map in this case, but it was not optimal and
would have significantly slowed the instrumentation down. Instead, we made
use of the fact that the taint state for each kernel byte could be represented
as a single bit. As a result, the taint information of the kernel address space
was packed into bitmasks, thus mapping the 0x800000000000-byte (128 TB)
region into a 0x100000000000-byte (16 TB) shadow memory. While an area of
this size still cannot be allocated all at once, it can be reserved in the virtual
address space. During the run time of the emulator, the specific pages accessed
by the instrumentation are mapped on demand, resembling a mechanism known
as memory overcommitment.
Memory overcommitment is not supported on Windows – our host system
of choice – but it is possible to implement it on one’s own using exception
handling:
1. Reserve the overall shadow memory area with a VirtualAlloc API call
and a MEM_RESERVE flag.
2. Set up an exception handler using the AddVectoredExceptionHandler
function.
3. In the exception handler, check if the accessed address falls within the
shadow memory, and the exception code equals EXCEPTION_ACCESS_VIOLATION.
If this is the case, commit the memory page in question and return with
EXCEPTION_CONTINUE_EXECUTION.
With the modifications explained above, the shadow memory was success-
fully ported to work with 64-bit guest systems.
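The bit-packed taint representation described above can be sketched as follows. To keep the example runnable anywhere, it shadows only a small window of kernel addresses with a heap-allocated bitmap, instead of reserving the full 16 TB region and committing pages on demand. The taint_bitmap type and its functions are hypothetical names, not the tool's actual interface.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define KERNEL_BASE 0xffff800000000000ULL

/* One taint bit per guest byte: the shadow is 1/8 the size of the range. */
uint64_t taint_shadow_size(uint64_t kernel_bytes) {
    return kernel_bytes / 8;
}

typedef struct {
    uint64_t window_base;  /* first shadowed kernel address */
    uint64_t window_size;  /* number of shadowed guest bytes */
    uint8_t *bits;         /* packed taint bits, initially all clear */
} taint_bitmap;

int taint_bitmap_init(taint_bitmap *t, uint64_t base, uint64_t size) {
    assert(t != NULL && base >= KERNEL_BASE);
    t->window_base = base;
    t->window_size = size;
    t->bits = calloc((size_t)((size + 7) / 8), 1);
    return t->bits != NULL ? 0 : -1;
}

void taint_bitmap_free(taint_bitmap *t) {
    free(t->bits);
    t->bits = NULL;
}

/* Marks a single guest byte as tainted (uninitialized) or clean. */
void taint_set(taint_bitmap *t, uint64_t addr, int tainted) {
    uint64_t off = addr - t->window_base;
    assert(off < t->window_size);

    uint8_t mask = (uint8_t)(1u << (off & 7));
    if (tainted) {
        t->bits[off >> 3] |= mask;
    } else {
        t->bits[off >> 3] &= (uint8_t)~mask;
    }
}

/* Returns 1 if the guest byte at `addr` is tainted, 0 otherwise. */
int taint_get(const taint_bitmap *t, uint64_t addr) {
    uint64_t off = addr - t->window_base;
    assert(off < t->window_size);
    return (t->bits[off >> 3] >> (off & 7)) & 1;
}
```

The byte index is the address offset divided by eight and the bit index is its remainder, so the 0x800000000000-byte kernel range maps onto 0x100000000000 bytes of shadow memory.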
Double-tainting In addition to setting taint on allocations in the shadow
memory, our instrumentation also fills the body of new memory regions with
fixed marker bytes³ – 0xaa for heap/pools and 0xbb for stack objects. While
not essential for the correct functioning of the infoleak detection, it is a useful
debugging feature.
First of all, the mechanism enables the instrumentation to verify the cor-
rectness of its own taint tracking by cross checking the information from two
different sources. If a specific region is uninitialized according to the shadow
memory, but in fact it no longer contains the marker bytes, this indicates that
the memory was overwritten in a way our tool was not aware of (e.g. by a disk
controller or another external device). In such cases, the instrumentation can
correct the taint early on, instead of propagating the wrong information further
in memory. Similarly, the behavior guarantees a nearly 100% true-positive ratio
of the reported bugs, as they are verified both against the taint information and
the guest virtual memory.
Furthermore, the markers are easily recognizable under kernel debuggers
attached to the guest systems, which often aids in understanding the current
system state and thus makes it easier to establish the root cause of the dis-
covered bugs. Lastly, the mechanism may potentially expose other types of
vulnerabilities in the process, such as use of uninitialized memory.
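The cross-check between the shadow memory and the marker bytes can be sketched as follows. This is only an illustration under simplifying assumptions: taint is kept as one flag per byte instead of a packed bitmap, the correction is applied per byte, and the name reconcile_taint is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint8_t kPoolMarker  = 0xaa;  // filler for heap/pool allocations
constexpr std::uint8_t kStackMarker = 0xbb;  // filler for stack objects

// For every byte the shadow still considers uninitialized, confirm that the
// guest memory really holds the marker. A mismatch means the byte was written
// behind the instrumentation's back, so its taint is cleared.
// Returns the number of corrected bytes.
std::size_t reconcile_taint(const std::vector<std::uint8_t>& guest,
                            std::vector<bool>& tainted,
                            std::uint8_t marker) {
    const std::size_t n =
        guest.size() < tainted.size() ? guest.size() : tainted.size();
    std::size_t corrected = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (tainted[i] && guest[i] != marker) {
            tainted[i] = false;
            ++corrected;
        }
    }
    return corrected;
}
```

A byte is reported as leaked only if both sources agree, which is what keeps the true-positive ratio high.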
³ Unless the caller explicitly requests a zeroed-out area, e.g. by passing the __GFP_ZERO flag
to kmalloc in Linux.
void bx_instr_before_execution(CPU cpu, instruction i) {
  if (!cpu.protected_mode ||
      !os::is_kernel_address(cpu.eip) ||
      !os::is_kernel_address(cpu.esp)) {
    return;
  }

  if (i.opcode == SUB || i.opcode == ADD || i.opcode == AND) {
    if (i.op[0] == ESP) {
      globals::esp_changed = true;
      globals::esp_value = cpu.esp;
    }
  }
}

void bx_instr_after_execution(CPU cpu, instruction i) {
  if (globals::esp_changed && cpu.esp < globals::esp_value) {
    set_taint(/*from=  */cpu.esp,
              /*to=    */globals::esp_value - 1,
              /*origin=*/cpu.eip);
  }

  globals::esp_changed = false;
}
.text:00548D94 public __chkstk
.text:00548D94 __chkstk proc near
.text:00548D94
[...]
.text:00548DA8
.text:00548DA8 cs10:
.text:00548DA8 cmp ecx, eax
.text:00548DAA jb short cs20
.text:00548DAC mov eax, ecx
.text:00548DAE pop ecx
.text:00548DAF xchg eax, esp
.text:00548DB0 mov eax, [eax]
.text:00548DB2 mov [esp+0], eax
.text:00548DB5 retn
.text:00548DB6
[...]
.text:00548DBD __chkstk endp
3.1.3 Tainting heap/pool allocations
Unlike stack frames and automatic objects, detecting and tainting dynamic
allocations is a highly system-specific task, which must be implemented
separately for each tested OS. As a general rule, the instrumentation should intercept
the addresses of all newly allocated regions and their lengths, optionally together
with further information such as the origin, tag/flags etc. The specifics of the
Windows and Linux kernel allocators are outlined in the paragraphs below.
PVOID AllocFreeTmpBuffer(SIZE_T Size) {
  PVOID Result;

  if (Size > 0x1000 ||
      (Result = InterlockedExchange(gpTmpGlobalFree, NULL)) == NULL) {
    Result = AllocThreadBufferWithTag(Size, 'pmTG');
  }

  return Result;
}
• kmem_cache_create
– Prologue: save the cache size and the constructor function pointer.
– Epilogue: if the function succeeded, save the address of the newly
created cache and set a breakpoint on the constructor function, if
present.
• kmem_cache_destroy – remove the cache from the internal structures and
clear the breakpoint on the cache’s constructor, if present.
• kmem_cache_alloc
On the x86 and x86-64 platforms, the instruction dedicated to copying
contiguous memory blocks is rep movsb⁴, which moves ECX bytes from address
DS:ESI to address ES:EDI (equivalent 64-bit registers are used on x86-64). It is
very frequently used in kernels, both as part of the standard library memcpy
implementation and as an inlined form of the function. From the perspective of
instrumentation, the instruction is very convenient, as it specifies the source,
destination and length of the copy all at the same time. This makes it trivial
to propagate taint in the shadow memory, and is the main reason why taint
propagation in our tool was built around the special handling of rep movs.
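To make the idea concrete, the sketch below mirrors a forward rep movs copy in a toy taint store. It is not the tool's implementation: TaintSet and propagate_rep_movs are hypothetical names, taint is held as a set of tainted addresses instead of a bitmap, and the length is assumed to be already converted to bytes (the count register multiplied by the element size).

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Toy taint store: the set of guest addresses holding uninitialized bytes.
using TaintSet = std::unordered_set<std::uint64_t>;

// Mirror a forward copy of `len` bytes from `src` to `dst`: each destination
// byte inherits the taint state of the matching source byte, so initialized
// data clears taint and uninitialized data carries it along.
void propagate_rep_movs(TaintSet& taint, std::uint64_t dst, std::uint64_t src,
                        std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        if (taint.count(src + i) != 0) {
            taint.insert(dst + i);
        } else {
            taint.erase(dst + i);
        }
    }
}
```

Because source, destination and length are all known before the instruction executes, the same hook is also a natural place to check whether a tainted source is about to be copied to a user-mode destination.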
In the paragraphs below, we explain how practical the core idea was with
regard to the actual binary code of each of the tested systems, and what further
steps we took to maximize the scope of the detected memory copying activity.
.text:00439260 _memcpy proc near
.text:00439260
.text:00439260 arg_0 = dword ptr 8
.text:00439260 arg_4 = dword ptr 0Ch
.text:00439260 arg_8 = dword ptr 10h
.text:00439260
.text:00439260 55 push ebp
.text:00439261 8B EC mov ebp, esp
[...]
.text:00439278 3B F8 cmp edi, eax
.text:0043927A 0F 82 7C+ jb CopyDown
.text:00439280
.text:00439280 CopyUp:
.text:00439280 F7 C7 03+ test edi, 3
.text:00439286 75 14 jnz short CopyLeadUp
.text:00439288 C1 E9 02 shr ecx, 2
.text:0043928B 83 E2 03 and edx, 3
.text:0043928E 83 F9 08 cmp ecx, 8
.text:00439291 72 29 jb short CopyUnwindUp
.text:00439293 F3 A5 rep movsd
.text:00439295 FF 24 95+ jmp ds:off_4393AC[edx*4]
.text:0000000140095720 mcpy90:
.text:0000000140095720 mov r9, [rdx+rcx]
.text:0000000140095724 mov r10, [rdx+rcx+8]
.text:0000000140095729 movnti qword ptr [rcx], r9
.text:000000014009572D movnti qword ptr [rcx+8], r10
.text:0000000140095732 mov r9, [rdx+rcx+10h]
.text:0000000140095737 mov r10, [rdx+rcx+18h]
.text:000000014009573C movnti qword ptr [rcx+10h], r9
.text:0000000140095741 movnti qword ptr [rcx+18h], r10
.text:0000000140095746 mov r9, [rdx+rcx+20h]
.text:000000014009574B mov r10, [rdx+rcx+28h]
.text:0000000140095750 add rcx, 40h
.text:0000000140095754 movnti qword ptr [rcx-20h], r9
.text:0000000140095759 movnti qword ptr [rcx-18h], r10
.text:000000014009575E mov r9, [rdx+rcx-10h]
.text:0000000140095763 mov r10, [rdx+rcx-8]
.text:0000000140095768 dec eax
.text:000000014009576A movnti qword ptr [rcx-10h], r9
.text:000000014009576F movnti qword ptr [rcx-8], r10
.text:0000000140095774 jnz short mcpy90
.text:0000000140189700 lcpy40:
.text:0000000140189700 movdqu xmm0, xmmword ptr [rdx+rcx]
.text:0000000140189705 movdqu xmm1, xmmword ptr [rdx+rcx+10h]
.text:000000014018970B movntdq xmmword ptr [rcx], xmm0
.text:000000014018970F movntdq xmmword ptr [rcx+10h], xmm1
.text:0000000140189714 add rcx, 40h
.text:0000000140189718 movdqu xmm0, xmmword ptr [rdx+rcx-20h]
.text:000000014018971E movdqu xmm1, xmmword ptr [rdx+rcx-10h]
.text:0000000140189724 movntdq xmmword ptr [rcx-20h], xmm0
.text:0000000140189729 movntdq xmmword ptr [rcx-10h], xmm1
.text:000000014018972E dec eax
.text:0000000140189730 jnz short lcpy40
.text:0000000140095600 ; Exported entry 1365. RtlCopyMemory
.text:0000000140095600 ; Exported entry 1541. RtlMoveMemory
.text:0000000140095600 ; Exported entry 2063. memcpy
.text:0000000140095600 ; Exported entry 2065. memmove
.text:0000000140095600
.text:0000000140095600 ; Attributes: library function
.text:0000000140095600
.text:0000000140095600 public memmove
.text:0000000140095600 memmove proc near
.text:0000000140095600
.text:0000000140095600 mov r11, rcx
.text:0000000140095603 sub rdx, rcx
.text:0000000140095606 jb mmov10
.text:000000014009560C cmp r8, 8
kernel images across the system is the same, meaning that we could successfully
identify all copies of memcpy in memory by recognizing a single unique signature
of the routine’s prologue. In our testing, we used the first 16 bytes of the
function code, which corresponded to the four initial assembly instructions.
Such a signature proved to uniquely identify the procedure in question, enabling
our tool to track the memory taint on 64-bit Windows platforms.
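A signature check of this kind might look like the sketch below. The text does not list the signature bytes, so the signature is left as a parameter; Signature and is_memcpy_entry are hypothetical names, and the code buffer stands in for bytes read from guest memory at the current instruction pointer.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// The text uses the first 16 bytes of the function as its signature.
constexpr std::size_t kSignatureLen = 16;
using Signature = std::array<std::uint8_t, kSignatureLen>;

// Returns true if the `available` bytes at `code` start with the signature.
bool is_memcpy_entry(const std::uint8_t* code, std::size_t available,
                     const Signature& sig) {
    return available >= kSignatureLen &&
           std::memcmp(code, sig.data(), kSignatureLen) == 0;
}
```

On a match, the instrumentation can read the destination, source and length from RCX, RDX and R8 and treat the call like a single large copy.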
One currently unresolved problem is the fact that with each newer version
of Windows, an increasing number of memcpy instances with constant length
are compiled as inlined sequences of mov instructions, instead of rep movs or
direct calls into the library function. This is clearly visible in the numbers of
references to the function – on Windows 7 64-bit (January 2018 patch), the
win32k.sys module calls into memcpy at 1133 unique locations in the code.
However, the combined drivers on Windows 10 Fall Creators Update 64-bit
(win32k.sys, win32kbase.sys and win32kfull.sys) only invoke the function
696 times. The remaining instances were replaced with mov instructions located
directly in the client functions, causing our tool to lose track of a large part of
the kernel memory taint. We hope that the issue was partially mitigated by the
fact that we performed the testing against Windows 7 and 10, so bugs dating
back to Windows 7 should have been successfully detected and fixed in both
versions, even if the reduced effectiveness of the taint tracking would prevent
their discovery on Windows 10. Nonetheless, this circumstance constitutes a
significant problem in the current scheme of the Bochspwn Reloaded project.
 /*
  * No 3D Now!
  */

-#ifndef CONFIG_KMEMCHECK
+#if 0

 #if (__GNUC__ >= 4)
 #define memcpy(t, f, n) __builtin_memcpy(t, f, n)
 #else
 #define memcpy(t, f, n) \
         (__builtin_constant_p((n)) \
          ? __constant_memcpy((t), (f), (n)) \
          : __memcpy((t), (f), (n)))
 #endif
 #else
 /*
  * kmemcheck becomes very happy if we use the REP instructions unconditionally,
  * because it means that we know both memory operands in advance.
  */
 #define memcpy(t, f, n) __memcpy((t), (f), (n))
 #endif
Linux. Because Linux is an open-source kernel, we had full control over how
memcpy is compiled. Following brief experimentation, we determined that only
minor code modifications were necessary to make it compliant with our tool,
as the function’s assembly code was largely influenced by the kernel
configuration flags. More specifically, we set the CONFIG_X86_GENERIC option to
y and CONFIG_X86_USE_3DNOW to n, and applied the patch shown in Listing 15 to
unconditionally redirect memcpy invocations to the __memcpy function, which
consists of the rep movsd and rep movsb instructions. These actions were
sufficient to ensure that the resulting memory-copying assembly worked with the
taint propagation mechanism used in our instrumentation.
There are no system-specific considerations related to bug detection in Win-
dows. In Linux, we additionally implemented support for identifying informa-
tion leaks through simple variable types, and more general detection of use of
uninitialized memory. Both efforts are documented in the paragraphs below.
Leaks through primitive types in Linux. In Linux, there are two interfaces
for writing data from the kernel into user space – copy_to_user
and put_user. The copy_to_user function is an equivalent of memcpy, and is
subject to the regular bug detection logic if the kernel is compiled with the
CONFIG_X86_INTEL_USERCOPY option set to n. On the other hand, put_user is
designed to allow the kernel to write values of simple types into ring 3, such as
characters, integers or pointers. It uses direct pointer manipulation, and at the
binary level, the data in question is copied through registers, so it is not subject
to kernel infoleak detection based on the rep movs instruction. Considering that
put_user is used extensively in the Linux kernel, and that we were able to
recompile the software to adjust it to our needs, we implemented an additional
bug detection mechanism dedicated to put_user, which required changes in the
source code of both the Linux kernel and the Bochs instrumentation.
The main problem related to the sanitization of data written with put_user
is the fact that it is passed by value and not by reference (pointer).
Consequently, the first argument of the macro may be a constant, variable,
structure/union field, array item, function return value, or an expression
involving components of any of the above types. Therefore, it is not clear which
particular memory region should be sanitized in each specific case, and it is
difficult to isolate only the simple cases on the level of either kernel code or
instrumentation.
While most CPU architectures supported by Linux have their own
implementation of the put_user macro, there is also a generic version declared in
include/asm-generic/uaccess.h. Listing 16 shows the body of __put_user,
an internal macro which sits at the core of put_user. As we can see in line 147,
the expression passed by the caller to be written to user space is evaluated and
stored in a local helper variable called __x. While analyzing the code, we
decided to sanitize all memory reads executed as part of the expression evaluation,
as long as they occurred in the context of the current function (i.e. not in nested
function calls). To that end, we modified the macro to wrap the initialization of
the __x variable with two assembly instructions – prefetcht1 and prefetcht2.
The diff of the change is presented in Listing 17. As a result, all accesses to
memory that was passed to user-mode through put_user were placed between the
two artificially inserted instructions. Examples of disassembled code snippets
from a Linux kernel compiled in this manner are shown in Listing 18.
In Bochs, the prefetcht1 and prefetcht2 instructions aren’t emulated with
any dedicated logic, but are instead handled by the BX_CPU_C::NOP method.
This makes them prime candidates to be used as hypercalls – special opcodes
that don’t have any effect on the execution of the guest system, but are used
to communicate with the emulator. In this case, our instrumentation detects
the execution of prefetcht1 and treats it as a signal to start sanitizing all
145 #define __put_user(x, ptr) \
146 ({ \
147 __typeof__(*(ptr)) __x = (x); \
148 int __pu_err = -EFAULT; \
149 __chk_user_ptr(ptr); \
150 switch (sizeof (*(ptr))) { \
151 case 1: \
152 case 2: \
153 case 4: \
154 case 8: \
155 __pu_err = __put_user_fn(sizeof (*(ptr)), \
156 ptr, &__x); \
157 break; \
158 default: \
159 __put_user_bad(); \
160 break; \
161 } \
162 __pu_err; \
163 })
.text:C1027F72 prefetcht1 byte ptr [eax]
.text:C1027F75 mov eax, [ebp+var_B4]
.text:C1027F7B mov [ebp+var_AC], eax
.text:C1027F81 prefetcht2 byte ptr [eax]
[...]
.text:C1035910 prefetcht1 byte ptr [eax]
.text:C1035913 mov eax, [ebp+var_14]
.text:C1035916 mov edx, edi
.text:C1035918 call getreg
.text:C103591D mov [ebp+var_10], eax
.text:C1035920 prefetcht2 byte ptr [eax]
[...]
.text:C1071AD7 prefetcht1 byte ptr [eax]
.text:C1071ADA mov edx, [ebp+var_1C]
.text:C1071ADD shl edx, 8
.text:C1071AE0 or edx, 7Fh
.text:C1071AE3 mov [ebp+var_10], edx
.text:C1071AE6 prefetcht2 byte ptr [eax]
Listing 18: Instances of the compiled put_user macro with added marker
instructions
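On the emulator side, the marker instructions can drive a small state machine that decides whether a memory read should be checked against the taint. The sketch below is an assumption about how such logic could be organized, not the tool's code: Op and PutUserWindow are hypothetical names, and nested calls are excluded with a simple call-depth counter, which is only one possible way to honor the "current function only" rule.

```cpp
// Instruction classes relevant to the marker window (hypothetical).
enum class Op { PrefetchT1, PrefetchT2, Call, Ret, Other };

class PutUserWindow {
public:
    // Feed every executed kernel instruction through this method.
    void on_instruction(Op op) {
        switch (op) {
            case Op::PrefetchT1:  // opening marker: start sanitizing reads
                active_ = true;
                depth_ = 0;
                break;
            case Op::PrefetchT2:  // closing marker: stop sanitizing reads
                active_ = false;
                break;
            case Op::Call:        // entering a nested function
                if (active_) ++depth_;
                break;
            case Op::Ret:         // returning from a nested function
                if (active_ && depth_ > 0) --depth_;
                break;
            case Op::Other:
                break;
        }
    }

    // True if a memory read executed now belongs to the put_user expression
    // itself and should be checked against the taint information.
    bool should_check_read() const { return active_ && depth_ == 0; }

private:
    bool active_ = false;
    int depth_ = 0;
};
```

Applied to the second snippet in Listing 18, the read of [ebp+var_14] would be checked, while reads performed inside getreg would be skipped.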
Considering that our instrumentation did not track whether the uninitialized
data had any influence on the kernel control flow, the output logs included
a number of false positives where leftover memory was read (e.g. while being
copied), but never actually used in a meaningful way. On the upside, the
volume of the reports turned out to be manageable, and the false positives were
relatively easy to filter out upon brief analysis of the kernel code.
It is also important to note that while uses of uninitialized memory are typ-
ically real bugs, they are often functional and not security problems. For exam-
ple, a majority of the issues identified by Bochspwn Reloaded in Linux had very
little to no security impact. Nonetheless, proposed patches for all discovered
bugs were submitted and subsequently accepted by the kernel developers.
• The syscall entrypoints are located at fixed offsets in the kernel images.
• Bochs provides a BX_INSTR_WRMSR instrumentation callback which receives
notifications about all MSR writes taking place in the emulated system.
Once the base address of the primary image is established, listing other
loaded modules is a matter of traversing through simple system-specific linked
lists of driver descriptors in the guest memory. In Bochspwn Reloaded, the
traversing is performed every time the instrumentation encounters an address
that cannot be associated with any of the currently known modules.
In Windows, a static nt!PsLoadedModuleList variable points to the head
of a doubly-linked list consisting of LDR_MODULE structures. In Linux, the
corresponding head pointer is named modules, and the list is made of module
structures. In both cases, each such structure describes a single kernel driver,
including its name, base address and size in memory. The layouts of these lists
are illustrated in Figures 7 and 8.
[Figures 7 and 8: kernel module list layouts (labels: ntoskrnl.exe, vmlinux)]
3.2.2 Unwinding stack traces
Collecting full stack traces upon detecting kernel infoleaks helps both dedupli-
cate the bugs (to avoid flooding logs with multiple instances of the same issue),
and understand how the execution flow reached the affected code. Depending
on the bitness of the guest system, the goal can be achieved in different ways.
On 32-bit builds of Windows and Linux, the stack trace has a simple and
consistent form – consecutive stack frames of nested functions are chained to-
gether through stack frame pointers (saved values of the EBP register), with the
current stack frame being pointed to by EBP itself. At any point of execution,
it is possible to iterate through this chain to unwind the call stack and save the
observed return addresses, as shown in Figure 9. This logic was implemented
in our tool for 32-bit guest systems.
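[Figure 9: stack layout, from lower to higher addresses. ESP points to the
locals of foo(); EBP points to the saved ebp of foo(), which is followed by its
return address. Next come the locals, saved ebp and return address of bar(),
then those of syscall(), and finally the trap frame.]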
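The frame-pointer walk described above for 32-bit guests can be sketched as follows. The guest memory is simulated with a map of dword values, and the names GuestMemory, read_dword and unwind_ebp_chain are hypothetical; the frame limit and the requirement that each saved EBP points to a higher address are added here as defensive assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Toy guest memory: address -> 32-bit value stored there.
using GuestMemory = std::map<std::uint32_t, std::uint32_t>;

// Reads one dword; returns false if the address is not mapped.
bool read_dword(const GuestMemory& mem, std::uint32_t addr,
                std::uint32_t& out) {
    const auto it = mem.find(addr);
    if (it == mem.end()) return false;
    out = it->second;
    return true;
}

// Follows the chain of saved EBP values and collects return addresses.
// [ebp] holds the caller's saved EBP and [ebp + 4] the return address.
std::vector<std::uint32_t> unwind_ebp_chain(const GuestMemory& mem,
                                            std::uint32_t ebp,
                                            std::size_t max_frames = 64) {
    std::vector<std::uint32_t> trace;
    while (trace.size() < max_frames) {
        std::uint32_t saved_ebp = 0;
        std::uint32_t ret_addr = 0;
        if (!read_dword(mem, ebp, saved_ebp) ||
            !read_dword(mem, ebp + 4, ret_addr)) {
            break;  // unreadable frame: stop unwinding
        }
        trace.push_back(ret_addr);
        if (saved_ebp <= ebp) break;  // the chain must move up the stack
        ebp = saved_ebp;
    }
    return trace;
}
```

The collected return addresses can then be symbolized and used as a key for deduplicating reports.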
In 64-bit Windows, the RBP register is not saved as part of the stack frame
creation anymore, and thus the callstack cannot be traversed by the instrumen-
tation without access to additional debug information. The necessary informa-
tion is provided by Microsoft’s debug symbols (.pdb files) corresponding to the
kernel modules. Regardless of stack trace unwinding, our tool loads symbol files
for every new detected driver for the purpose of address symbolization (see Sec-
tion 3.2.3). In the presence of these debug symbols, obtaining the full callstack
can be achieved using the StackWalk64 API function [90] from the Debug Help
Library (DbgHelp). As part of the input, the function expects to receive the
full CPU context, which is available through the internal BX_CPU_C object.
Another required primitive is a pointer to a custom ReadMemoryRoutine function,
invoked by DbgHelp to read the virtual memory of the target process/kernel.
In our case, it is a simple wrapper around read_lin_mem [72], a helper function
for reading guest system memory. After putting these pieces together,
retrieving the full call stack can be implemented as a straightforward loop over
the StackWalk64 call.
Support for 64-bit builds of Linux was not implemented, so we didn’t study
the problem in that configuration.
Microsoft Windows The symbols for nearly all Windows system files are
available on the Microsoft symbol server [85] and can be downloaded using the
SymChk tool (symchk.exe), shipped with Debugging Tools for Windows. The
command line syntax for downloading the symbols for a specific file to a chosen
directory is as follows:
3.2.4 Breaking into the kernel debugger
The textual reports generated by our tool are verbose, but not always sufficient
to understand the underlying vulnerabilities. One example of such a scenario
is when the leaked uninitialized data travels a long way (i.e. is copied across
multiple locations in memory) before arriving at the final memcpy to user space.
In cases like this, it is useful to attach a kernel debugger to the emulated system
and learn about the memory layout and contents, kernel objects involved in the
disclosure, the user-mode caller that invoked the affected system call, and any
other important details about the state of the execution environment.
Both Windows and Linux support remote kernel debugging through COM
ports. In the Bochs emulator on a Windows host, guest COM ports can be redi-
rected to named Windows pipes by including the following line in the bochsrc
configuration file:
With the above configuration set up, it is possible to attach the WinDbg or
gdb debuggers to the tested systems. However, this by itself is not enough to
put the debuggers to good use, as long as we can’t break the execution of the
OS precisely at the moment of each new detected disclosure. To achieve this,
we need assistance from the Bochs instrumentation.
In the x86(-64) architectures, breakpoints are installed by placing an int3
instruction (opcode 0xcc) at the desired location in code. This is no different
in the emulated environment. To stop the kernel execution and pass control
to the debugger, the instrumentation only needs to write the 0xcc byte at the
address held in EIP or RIP, depending on the system bitness. This suffices to
break into the kernel debugger, but since the overwritten first byte of the next
instruction is never restored in the above logic, the user has to do it themselves
by looking up the value of the original byte in the executable image of the driver
in question.
To address this inconvenience, our instrumentation declares a callback for
the bx_instr_interrupt event invoked by Bochs every time an interrupt is
generated, including the #BP trap triggered by the injected breakpoint. In
that handler, we restore the previously saved value of the overwritten byte, thus
bringing the original instruction back to its original form even before control
is transferred to the kernel debugger. From the perspective of the emulated
system, the #BP exception is generated for no apparent reason, as the int3
instruction only lasts in memory for as long as it is needed to be fetched and
executed by Bochs, and disappears shortly after.
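The inject-and-restore sequence can be sketched over a simulated guest memory. This is an illustration only: BreakpointInjector is a hypothetical class, the memory is a plain map from addresses to bytes, and the interrupt callback is reduced to its vector number (3 is the #BP vector on x86).

```cpp
#include <cstdint>
#include <map>

// Toy guest memory: address -> byte.
using ByteMemory = std::map<std::uint64_t, std::uint8_t>;

class BreakpointInjector {
public:
    explicit BreakpointInjector(ByteMemory& mem) : mem_(mem) {}

    // Overwrite the first byte of the next instruction with int3 (0xcc),
    // remembering the original byte so that it can be put back later.
    void inject(std::uint64_t pc) {
        if (pending_) return;  // only one outstanding breakpoint at a time
        saved_pc_ = pc;
        saved_byte_ = mem_[pc];
        mem_[pc] = 0xcc;
        pending_ = true;
    }

    // Interrupt hook: on the #BP trap caused by the injected int3, restore
    // the original byte before the guest debugger inspects the code.
    void on_interrupt(unsigned vector) {
        if (vector == 3 && pending_) {
            mem_[saved_pc_] = saved_byte_;
            pending_ = false;
        }
    }

private:
    ByteMemory& mem_;
    std::uint64_t saved_pc_ = 0;
    std::uint8_t saved_byte_ = 0;
    bool pending_ = false;
};
```

Unrelated interrupts leave the injected byte in place, so the breakpoint survives until the #BP trap is actually delivered.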
Listing 19 presents an example WinDbg log from a brief investigation of a
bug identified by Bochspwn Reloaded. The debugger informs us that a break-
point exception was triggered, but the disassembly of memory under RIP shows
the original sub rdx, rcx instruction, as further confirmed with the u com-
mand. To find out more about the circumstances of the leak, we check the
value of the RCX register (the destination argument of memmove), to make sure
that it points into user-mode memory. To follow up, we dump the memory
area between RDX and RDX+R8-1, where RDX is the source address and R8 is the
number of bytes to copy. At offsets 0x4 through 0x7, we can observe the 0xaa
values, the filler byte for pool allocations. We can therefore assume
that these bytes are the subject of the disclosure, and as we look closely at the
contents of the buffer, we can also deduce that it is a UNICODE_STRING structure
followed by the corresponding textual data. Lastly, we display the stack trace
to establish how the control flow reached the current point. After our analysis
is completed, we can continue the system execution with the g command.
kd> g
Break instruction exception - code 80000003 (first chance)
nt!memmove+0x3:
fffff800`026fc603 482bd1          sub     rdx,rcx
kd> u
nt!memmove+0x3:
fffff800`026fc603 482bd1          sub     rdx,rcx
fffff800`026fc606 0f829e010000    jb      nt!memmove+0x1aa
fffff800`026fc60c 4983f808        cmp     r8,8
fffff800`026fc610 7262            jb      nt!memmove+0x74
fffff800`026fc612 f6c107          test    cl,7
fffff800`026fc615 7437            je      nt!memmove+0x4e
fffff800`026fc617 f6c101          test    cl,1
fffff800`026fc61a 740c            je      nt!memmove+0x28
kd> ? rcx
Evaluate expression: 2554392 = 00000000`0026fa18
kd> k
 # Child-SP          RetAddr           Call Site
00 fffff880`03d0b8c8 fffff800`02a75319 nt!memmove+0x3
01 fffff880`03d0b8d0 fffff800`02938426 nt!IopQueryNameInternal+0x289
02 fffff880`03d0b970 fffff800`0294cfa8 nt!IopQueryName+0x26
03 fffff880`03d0b9c0 fffff800`0297713b nt!ObpQueryNameString+0xb0
04 fffff880`03d0bac0 fffff800`0271d283 nt!NtQueryVirtualMemory+0x5fb
05 fffff880`03d0bbb0 00000000`77589ada nt!KiSystemServiceCopyEnd+0x13
[...]
kd> g
Figure 10: Windows 7 kernel address space layout; green: stack pages, red:
pool pages
Figure 11: Windows 10 kernel address space layout; green: stack pages, red:
pool pages
Figure 12: Ubuntu 16.04 kernel address space layout; green: stack pages, red:
heap pages
3.3 Performance
In this section, we examine the general performance of the instrumentation by
comparing its CPU and memory usage to the same guest operating systems
run in a regular virtual machine (VirtualBox) and a non-instrumented Bochs
emulator. The aim of this section is to provide a general overview of the over-
head associated with the proposed approach; the numbers presented here are
approximate and should not be considered as accurate benchmarks.
The testing was performed on a workstation equipped with an Intel Xeon
E5-1650 v4 @ 3.60 GHz CPU, 64 GB of DDR4-2400 RAM, and a Samsung 850
PRO 512 GB SSD (SATA 3); the guest systems were assigned 1 processor core
and 2 GB of physical memory.
Table 2: Time from cold boot to log-on screen in tested configurations (mm:ss)
Windows 7 Windows 10 Ubuntu 16.10
x86 x64 x86 x64 x86
Bochs 2126
Bochspwn Reloaded 8300 4220 8290 6628 5205
3.4 Testing
The effectiveness of any instrumentation-based vulnerability detection is only as
good as the code coverage achieved against the tested software. With this in mind,
we attempted to maximize the coverage of the analyzed kernels using publicly
available tools and methodology, while running them inside our tool. In the
subsections below, we explain how the testing was carried out on the Windows
and Linux platforms.
3.4.1 Microsoft Windows
As part of the research project, we ran Bochspwn Reloaded against Windows 7
and Windows 10, both 32-bit and 64-bit builds. The middle version of the operating
system – Windows 8.1 – was excluded from the analysis. The assumption behind
this approach was that the testing of Windows 7 would reveal bugs that had
been internally fixed by Microsoft in newer systems but not backported to the
older ones, while instrumentation of Windows 10 would uncover bugs in the
most recently introduced kernel code. In this context, there was very little
attack surface in Windows 8.1 that wouldn’t be already covered by the testing
of the two other versions, and as such, we considered it redundant to analyze
all three major releases of the OS.
In terms of bitness, in our experience and by intuition, a majority of kernel
infoleaks are cross-platform and affect both x86 and x64 builds of the code.
This is related to the fact that most root causes of the bugs, such as explicitly
uninitialized variables, structure fields, arrays etc. are bitness-agnostic and re-
produce in both execution modes. The subset of issues limited to x86 is very
narrow, as there are no fundamental reasons for the presence of such 32-bit only
disclosures other than low level platform-specific code (e.g. exception handling).
On the other hand, x64-only bugs may exist due to the fact that the width of
certain data types (size_t, pointers etc.) and their corresponding alignment
requirements increase from 4 to 8 bytes, which in turn:
• creates new alignment holes in structures; one example being the standard
UNICODE_STRING structure,
• extends the size of unions, which may misalign their fields and introduce
new uninitialized bytes, as is the case in IO_STATUS_BLOCK.
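Both effects can be reproduced with simplified stand-ins for the two structures. The types below are assumptions modeled on the public Windows definitions, using fixed-width integers instead of the Windows typedefs; the names UnicodeString, IoStatusBlock, kUnicodeStringPadding and kStatusSlack are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for UNICODE_STRING: two 16-bit lengths followed by a pointer.
struct UnicodeString {
    std::uint16_t Length;
    std::uint16_t MaximumLength;
    wchar_t*      Buffer;  // 8-byte aligned on 64-bit builds
};

// Stand-in for IO_STATUS_BLOCK: a 32-bit status sharing storage with a
// pointer, followed by a pointer-sized integer.
struct IoStatusBlock {
    union {
        std::int32_t Status;
        void*        Pointer;
    };
    std::uintptr_t Information;
};

// Bytes of UnicodeString that belong to no field (the alignment hole).
constexpr std::size_t kUnicodeStringPadding =
    sizeof(UnicodeString) - (2 * sizeof(std::uint16_t) + sizeof(wchar_t*));

// Bytes of the union left unwritten when only Status is assigned.
constexpr std::size_t kStatusSlack = sizeof(void*) - sizeof(std::int32_t);
```

On a typical 32-bit build both constants are 0, while on a typical 64-bit build both are 4, which is exactly the room where stale kernel bytes can survive a field-by-field initialization.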
• Running and navigating through the code samples [32] for the “Windows
Graphics Programming: Win32 GDI and DirectDraw” book [31].
• Running around 30 NtQuery test suites developed specifically to uncover
memory disclosure in the specific subset of the system calls for querying
information about objects in the system. These test programs are further
discussed in Section 5.4 “Differential syscall fuzzing”.
• Shutting down the system.
As shown, Windows was tested on a best-effort basis, and there is much room
for improvement in terms of coupling existing dynamic binary instrumentation
schemes with new ways of exploring a more substantial portion of the kernel
code. However, we believe that the steps we took allowed us to identify most of
the easily discoverable bugs that other parties could likely run across.
3.4.2 Linux
The Linux platform subject to examination was Ubuntu Server 16.10 32-bit
with a custom-compiled kernel v4.8. Support for x64 builds of the kernel was
never implemented. As part of the testing, we executed the following actions in
the system:
3.5 Results
In this section, we present a summary of previously unknown vulnerabilities
discovered by running Bochspwn Reloaded against Windows and Linux.
3.5.1 Microsoft Windows
Throughout the development of the project in 2017 and early 2018, we ran mul-
tiple iterations of the instrumentation on the then-latest builds of Windows 7
and 10. All issues found in each session were promptly reported to Microsoft in
accordance with the Google Project Zero disclosure policy. The first identified
vulnerability was reported on March 1, 2017 [59], and the last one was sent
to the vendor on January 22, 2018 [53]. In that time period, we progressively
improved the tool and introduced new features, which enabled us to regularly
uncover new layers of infoleaks. This is reflected in the history of the bug re-
ports – for example, the first 12 reported issues were pool-based disclosures,
because the handling of stack allocations was added a few weeks later. Simi-
larly, we initially developed instrumentation for 32-bit guest systems, and only
implemented support for 64-bit platforms at the end of 2017. This explains why
all of the x64-specific leaks we discovered were patched between February and
April 2018.
In total, we filed 73 issues describing Windows kernel memory disclosure to
user-mode in the Project Zero bug tracker. Out of those, 69 were closed as
“Fixed”, 2 as “Duplicate” and 2 as “WontFix”. The duplicate reports were
caused by the fact that some leaks reported as separate issues were determined
by the vendor to be caused by a single vulnerability in the code. The WontFix
cases were valid bugs, but turned out to be only reachable from a privileged
dwm.exe process and hence didn’t meet the bar to be serviced in a security
bulletin.
Microsoft classified the problems as 65 unique security flaws. The discrep-
ancy between the number of CVEs and “Fixed” bug tracker entries stems from
the fact that the vendor combined several groups of issues into one, e.g. if they
considered the bugs to have a common logic root cause. In cases where this
was inconsistent with our assessment and we viewed them as separate bugs, we
marked the corresponding tracker issues as “Fixed” instead of “Duplicate”.
Two example reports generated for the CVE-2017-8473 and CVE-2018-0894
vulnerabilities are shown in Listings 20 and 21. A complete summary of mem-
ory disclosure vulnerabilities found by the tool is presented in Tables 4 and 5,
while Figure 13 illustrates the distribution of disclosed memory types. We ex-
pect that stack leaks are more prevalent than pool leaks due to the fact that
a majority of temporary objects constructed by the kernel in system call han-
dlers are allocated locally. Furthermore, Figure 14 shows a classification of the
bugs based on the kernel modules they were found in. The core ntoskrnl.exe
executable image was the most susceptible to infoleaks, likely due to its large
attack surface and our intensified testing of the NtQuery syscall family. The
graphical win32k.sys driver was also affected by a significant number of prob-
lems, both in regular syscall handlers and a mechanism known as user-mode
callbacks. Lastly, several individual issues were found in other drivers such as
partmgr.sys, mountmgr.sys or nsiproxy.sys, mostly in the handling of user-
accessible IOCTLs with complex output structures.
------------------------------ found uninit-access of address 94447d04
[pid/tid: 000006f0/00000740] { explorer.exe}
READ of 94447d04 (4 bytes, kernel--->user), pc = 902df30f
[ rep movsd dword ptr es:[edi], dword ptr ds:[esi] ]
Stack trace:
#0 0x902df30f (win32k.sys!NtGdiGetRealizationInfo+0000005e)
#1 0x8288cdb6 (ntoskrnl.exe!KiSystemServicePostCall+00000000)
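[Figure 13: distribution of disclosed memory types (kernel stack, kernel pools, both); values shown: 24, 39]
[Figure 14: distribution of bugs by kernel module (ntoskrnl.exe, win32k.sys, other drivers); values shown: 35, 23]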
CVE ID Component Fix Date Leaked bytes x64 only
CVE-2017-0258 ntoskrnl.exe May 2017 8
CVE-2017-0259 ntoskrnl.exe May 2017 60
CVE-2017-8462 ntoskrnl.exe June 2017 1
CVE-2017-8469 partmgr.sys June 2017 484
CVE-2017-8484 win32k.sys June 2017 5
CVE-2017-8488 mountmgr.sys June 2017 14
CVE-2017-84896 ntoskrnl.exe June 2017 6 or 72
CVE-2017-8490 win32k.sys June 2017 6672
CVE-2017-8491 volmgr.sys June 2017 8
CVE-2017-8492 partmgr.sys June 2017 4
CVE-2017-8564 nsiproxy.sys July 2017 13
CVE-2017-0299⁷ ntoskrnl.exe August 2017 2
CVE-2017-8680 win32k.sys September 2017 Arbitrary
CVE-2017-11784 ntoskrnl.exe October 2017 192
CVE-2017-11785 ntoskrnl.exe October 2017 56
CVE-2017-11831 ntoskrnl.exe November 2017 25
CVE-2018-0746 ntoskrnl.exe January 2018 12
CVE-2018-0810⁸ win32k.sys February 2018 4 X
CVE-2018-0813 win32k.sys March 2018 4 X
CVE-2018-0894 ntoskrnl.exe March 2018 4 X
CVE-2018-0898 ntoskrnl.exe March 2018 8 X
CVE-2018-0899 videoprt.sys March 2018 20 X
CVE-2018-0900 ntoskrnl.exe March 2018 40 X
CVE-2018-0926 win32k.sys March 2018 4 X
CVE-2018-0972 ntoskrnl.exe April 2018 8
CVE-2018-0973⁹ ntoskrnl.exe April 2018 4
⁶ The CVE was assigned to a generic mitigation of zeroing the Buffered I/O output
buffer [39]. It fixed two bugs in IOCTLs handled by the \Device\KsecDD and
\\.\WMIDataDevice devices, filed in the Google Project Zero bug tracker as issues 1147
and 1152.
⁷ A patch for the vulnerability first shipped in June 2017. After it was proven ineffective,
Microsoft released a revised version of the fix in August of the same year.
⁸ The CVE collectively covers four different memory disclosure bugs found in win32k.sys
user-mode callbacks – one pool-based and three stack-based leaks. They were filed in the
Google Project Zero bug tracker as issues 1467, 1468, 1485 and 1487.
⁹ The vulnerability disclosed uninitialized pool memory on Windows 7, and stack memory
on Windows 10.
CVE ID Component Fix Date Leaked bytes x64 only
CVE-2017-0167 win32k.sys April 2017 20
CVE-2017-0245 win32k.sys May 2017 4
CVE-2017-0300 ntoskrnl.exe June 2017 5
CVE-2017-8470 win32k.sys June 2017 50
CVE-2017-8471 win32k.sys June 2017 4
CVE-2017-8472 win32k.sys June 2017 7
CVE-2017-8473 win32k.sys June 2017 8
CVE-2017-8474 ntoskrnl.exe June 2017 8
CVE-2017-8475 win32k.sys June 2017 20
CVE-2017-8476 ntoskrnl.exe June 2017 4
CVE-2017-8477 win32k.sys June 2017 104
CVE-2017-8478 ntoskrnl.exe June 2017 4
CVE-2017-8479 ntoskrnl.exe June 2017 16
CVE-2017-8480 ntoskrnl.exe June 2017 6
CVE-2017-8481 ntoskrnl.exe June 2017 2
CVE-2017-8482 ntoskrnl.exe June 2017 32
CVE-2017-8485 ntoskrnl.exe June 2017 8
CVE-2017-8677 win32k.sys September 2017 8
CVE-2017-8678 win32k.sys September 2017 4
CVE-2017-8681 win32k.sys September 2017 128
CVE-2017-8684 win32k.sys September 2017 88
CVE-2017-8685 win32k.sys September 2017 1024
CVE-2017-8687 win32k.sys September 2017 8
CVE-2017-11853 win32k.sys November 2017 12
CVE-2018-0745 ntoskrnl.exe January 2018 4
CVE-2018-0747 ntoskrnl.exe January 2018 4
CVE-2018-0810 win32k.sys February 2018 4 or 8
CVE-2018-0832 ntoskrnl.exe February 2018 4
CVE-2018-0811 win32k.sys March 2018 4 X
CVE-2018-0814 win32k.sys March 2018 8 X
CVE-2018-0895 ntoskrnl.exe March 2018 4 X
CVE-2018-0896 msrpc.sys March 2018 8 X
CVE-2018-0897 ntoskrnl.exe March 2018 120 X
CVE-2018-0901 ntoskrnl.exe March 2018 4 X
CVE-2018-0968 ntoskrnl.exe April 2018 4 X
CVE-2018-0969 ntoskrnl.exe April 2018 4
CVE-2018-0970 ntoskrnl.exe April 2018 4 or 16
CVE-2018-0971 ntoskrnl.exe April 2018 4 X
CVE-2018-0973 ntoskrnl.exe April 2018 4 X
CVE-2018-0974 ntoskrnl.exe April 2018 8 X
CVE-2018-0975 ntoskrnl.exe April 2018 4 or 56
------------------------------ found uninit-access of address f5733f38
========== READ of f5733f38 (4 bytes, kernel--->kernel), pc = f8aaf5c5
[ mov edi, dword ptr ds:[ebx+84] ]
[Heap allocation not recognized]
Allocation origin: 0xc16b40bc: SYSC_connect at net/socket.c:1524
Shadow bytes: ff ff ff ff Guest bytes: bb bb bb bb
Stack trace:
#0 0xf8aaf5c5: llcp_sock_connect at net/nfc/llcp_sock.c:668
#1 0xc16b4141: SYSC_connect at net/socket.c:1536
#2 0xc16b4b26: SyS_connect at net/socket.c:1517
#3 0xc100375d: do_syscall_32_irqs_on at arch/x86/entry/common.c:330
(inlined by) do_fast_syscall_32 at arch/x86/entry/common.c:392
Listing 22: Report of a bug in llcp_sock_connect on Ubuntu 16.10 32-bit
3.5.2 Linux
We implemented support for 32-bit Linux kernels in April 2017, including the
same infoleak detection logic that had been successfully used for Windows. As
a result of instrumenting Ubuntu 16.10 for several days, we detected a single
minor bug – disclosure of 7 uninitialized kernel stack bytes in the processing
of specific IOCTLs in the ctl_ioctl function (drivers/md/dm-ioctl.c). The
routine handles requests sent to the /dev/mapper/control device, which is only
accessible to the root user, significantly reducing the severity of the issue. We
identified the problem on April 20, but before we were able to submit a patch,
we learned that it had been independently fixed by Adrian Salido in commit
4617f564c0 [16] on April 27.
The lack of success in identifying new infoleaks in Linux can be explained by
the vast extent of work done to secure the kernel in the past. Accordingly, we
decided to extend the detection logic to include all references to uninitialized
memory. By making this change, we intended to uncover other, possibly less
dangerous bugs, where uninitialized memory was used in a meaningful way, but
not directly copied to the user space. Thanks to this approach, we discovered
15 further bugs – one disclosure of uninitialized stack memory through AF_NFC
sockets (see the corresponding report presented in Listing 22) and 14 lesser,
functional issues in various subsystems of the kernel, with limited security
impact. The combined results of this effort are enumerated in Table 6.
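The idea of flagging every read of uninitialized memory can be illustrated with a minimal user-mode sketch of shadow memory. It is a simplified model and not the actual Bochspwn Reloaded instrumentation: the names guest_mem, shadow_mem, taint_region, guest_write and tainted_bytes_read, the 256-byte guest size and the 0xff/0x00 shadow encoding are assumptions made for illustration, callers are assumed to stay within GUEST_SIZE, and taint propagation on memory copies is omitted.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: one shadow byte per guest byte.
 * 0xff = uninitialized (tainted), 0x00 = initialized. */
#define GUEST_SIZE 256

uint8_t guest_mem[GUEST_SIZE];
uint8_t shadow_mem[GUEST_SIZE];

/* Called when a new stack frame or heap object is created:
 * the whole region starts out tainted. */
void taint_region(size_t addr, size_t len) {
    memset(&shadow_mem[addr], 0xff, len);
}

/* Called on every guest write: the written bytes become initialized. */
void guest_write(size_t addr, const void *src, size_t len) {
    memcpy(&guest_mem[addr], src, len);
    memset(&shadow_mem[addr], 0x00, len);
}

/* Called on every guest read: returns how many of the bytes being read
 * are still tainted. A nonzero result corresponds to a report. */
size_t tainted_bytes_read(size_t addr, size_t len) {
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (shadow_mem[addr + i] != 0x00) {
            count++;
        }
    }
    return count;
}
```

In this model, a 4-byte read that straddles an initialized and an uninitialized field yields a partial count, which is why reports benefit from listing the shadow bytes next to the guest bytes.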
During and after the triage of the output reports, we noticed that some
of our findings collided with the work of independent researchers and devel-
opers; i.e. several bugs had been fixed days or weeks prior to our discovery,
or reported by other parties shortly after we submitted patches. Most colli-
sions occurred with the KernelMemorySanitizer project [6], which was actively
developed and used to test Linux in the same period. Consequently, we only
submitted 11 patches for the 16 uncovered bugs, as the rest had already been
addressed. This example is very illustrative of the velocity of improvements
applied to Linux, and the scope of work done to eliminate entire vulnerability
classes.
File                               Fix commit   Collision   Mem.    Type
net/nfc/llcp_sock.c                608c4adfca   X           Stack   L
drivers/md/dm-ioctl.c              4617f564c0   X           Stack   L
net/bluetooth/l2cap_sock.c
net/bluetooth/rfcomm/sock.c        d2ecfa765d               Stack   U
net/bluetooth/sco.c
net/caif/caif_socket.c             20a3d5bf5e               Stack   U
net/iucv/af_iucv.c                 e3c42b61ff               Stack   U
net/nfc/llcp_sock.c                f6a5885fc4               Stack   U
net/unix/af_unix.c                 defbcf2dec               Stack   U
kernel/sysctl_binary.c             9380fa60b1   X           Stack   U
fs/eventpoll.c                     c857ab640c   X           Stack   U
kernel/printk/printk.c             5aa068ea40   X           Heap    U
net/decnet/netfilter/dn_rtmsg.c    dd0da17b20               Heap    U
net/netfilter/nfnetlink.c          f55ce7b024               Heap    U
fs/ext4/inode.c                    2a527d6858   X           Stack   U
net/ipv4/fib_frontend.c            c64c0b3cac   X           Heap    U
fs/fuse/file.c                     68227c03cb               Heap    U
arch/x86/kernel/alternative.c      fc152d22d6               Stack   U
One useful property of the mechanism is that it fills the body of every new
allocation with a unique, repeated marker byte. If leaked to user-mode, these
variable marker bytes can be seen at common offsets of the PoC program output.
The only requirement is to determine which kernel module requests the affected
allocation (which is usually answered by Bochspwn Reloaded) and enable special
pool for that module. Then, we should see output similar to the following after
starting the proof-of-concept twice (on the example of CVE-2017-8491):
D:\>VolumeDiskExtents.exe
00000000: 01 00 00 00 39 39 39 39 ....9999
00000008: 00 00 00 00 39 39 39 39 ....9999
00000010: 00 00 50 06 00 00 00 00 ..P.....
00000018: 00 00 a0 f9 09 00 00 00 ........
D:\>VolumeDiskExtents.exe
00000000: 01 00 00 00 2f 2f 2f 2f ....////
00000008: 00 00 00 00 2f 2f 2f 2f ....////
00000010: 00 00 50 06 00 00 00 00 ..P.....
00000018: 00 00 a0 f9 09 00 00 00 ........
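The comparison of two such dumps can be automated. The following is a minimal sketch, assuming both runs produce output of the same length; the function name diff_offsets and its parameters are hypothetical and not part of any tool described in this paper.

```c
#include <stddef.h>
#include <stdint.h>

/* Records the offsets at which two outputs of the same PoC differ.
 * Returns the total number of differing offsets; at most max_out of
 * them are stored in out. */
size_t diff_offsets(const uint8_t *run1, const uint8_t *run2, size_t len,
                    size_t *out, size_t max_out) {
    size_t found = 0;
    for (size_t i = 0; i < len; i++) {
        if (run1[i] != run2[i]) {
            if (found < max_out) {
                out[found] = i;
            }
            found++;
        }
    }
    return found;
}
```

Applied to the two dumps above, the function reports eight differing offsets: 4 to 7 and 12 to 15, which are exactly the positions of the 39 and 2f marker bytes.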
Candidate syscalls capable of spraying the kernel stack can be identified
by looking for functions with large stack frames whose names start with “Nt”.
In our reproducers, we used the following services:
D:\>NtGdiGetRealizationInfo.exe
00000000: 10 00 00 00 03 01 00 00 ........
00000008: 2e 00 00 00 69 00 00 46 ....i..F
00000010: 41 41 41 41 41 41 41 41 AAAAAAAA
[Figure 15 shows two panels: (1) spray the kernel stack with a recognizable pattern; (2) trigger the bug, and observe marker bytes at uninitialized offsets.]
Figure 15: Kernel stack spraying used for memory disclosure reproduction
5 Alternative detection methods
Kernel memory disclosure can be detected with a variety of techniques, which
complement each other in how they approach the problem. Full system emula-
tion is one of them, and while effective, it is limited by the reachable kernel code
coverage. In the following sections, we discuss alternative methods of identifying
infoleaks, outlining their feasibility, pros and cons, and the results of evaluating
some of them against the Windows kernel.
code review. In May 2017, we decided to test this assumption by performing
a cursory manual analysis of the Windows kernel in search of low-hanging fruit
issues that could have been missed by Bochspwn Reloaded due to incomplete
code coverage.
On Linux, a good starting point is to locate references to the copy_to_user
function and begin the audit from there, progressively going back through the
code to track the initialization of each part of the copied objects. On Windows,
there is no such dedicated function, and all user↔kernel memory operations are
achieved with direct pointer manipulation or regular memcpy calls. However,
usage of the ProbeForWrite function [89] may be a valuable hint, because pointer
sanitization often directly precedes writing data back to userland. To reduce
the scope of the review, we decided to focus on the top-level syscall handlers,
since they typically have a concise execution flow and an easy-to-understand
lifetime of objects in memory. Table 7 shows the combined numbers of direct and
inlined memcpy calls (as detected by IDA Pro) in Nt functions on 32-bit builds
of Windows 7 and 10, with the January 2018 patch installed. Assuming that
approximately half of the calls were used to copy output data, we assessed the
extent of required manual analysis to be manageable within several days.
                              Windows 7   Windows 10
Core (ntoskrnl.exe etc.)      139         165
Graphics (win32k.sys etc.)    288         405
DWORD NTAPI NtGdiGetGlyphOutline(
  ...,
  DWORD cbBuffer,
  LPVOID lpvBuffer
) {
  // Temporary kernel buffer of caller-controlled size; it is not zeroed here.
  LPVOID KernelBuffer = Allocate(cbBuffer);

  DWORD Status = GreGetGlyphOutlineInternal(..., KernelBuffer, cbBuffer);

  if (Status != GDI_ERROR) {
    ProbeForWrite(lpvBuffer, cbBuffer, 1);
    // All cbBuffer bytes are copied back to user mode, including any
    // that the internal routine did not write.
    memcpy(lpvBuffer, KernelBuffer, cbBuffer);
  }

  Free(KernelBuffer);

  return Status;
}
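The pattern in the listing above can be modeled in user mode. This is a hedged sketch, assuming that the internal routine may write fewer bytes than the buffer holds: produce_output and handler_copy_out are hypothetical stand-ins, the 0xAA byte simulates leftover memory contents, and the zero_first flag models a handler that clears the temporary buffer before use.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an internal routine that fills only part of
 * the buffer it is given. Returns the number of bytes written. */
size_t produce_output(uint8_t *buf, size_t size) {
    size_t written = size / 2;
    memset(buf, 0x11, written);
    return written;
}

/* Models the handler: copies `size` bytes to user_out. When zero_first
 * is nonzero, the temporary buffer is cleared before use. Returns the
 * number of stale (0xAA) bytes that reached user_out, or (size_t)-1 if
 * the allocation failed. */
size_t handler_copy_out(uint8_t *user_out, size_t size, int zero_first) {
    uint8_t *kernel_buf = malloc(size);
    if (kernel_buf == NULL) {
        return (size_t)-1;
    }
    memset(kernel_buf, 0xAA, size);      /* simulate leftover data */
    if (zero_first) {
        memset(kernel_buf, 0, size);     /* the defensive pattern */
    }
    produce_output(kernel_buf, size);
    memcpy(user_out, kernel_buf, size);  /* whole buffer is copied out */
    free(kernel_buf);

    size_t stale = 0;
    for (size_t i = 0; i < size; i++) {
        if (user_out[i] == 0xAA) {
            stale++;
        }
    }
    return stale;
}
```

With a 16-byte buffer, the unzeroed variant hands 8 stale bytes to the caller, while the zeroed variant hands out none.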
ms_exc.registration.TryLevel = -2;
if ( v16 )
  memset(v16, 0, a5);
v10 = v16;
v15 = GreGetGlyphOutlineInternal(a1, a2, a3, &v13, a5, v16, &v14, a8);
                       ntoskrnl.exe           win32k.sys
                       functions   syscalls   functions   syscalls
Windows 7 vs. 10       153         8          89          16
Windows 8.1 vs. 10     127         5          67          11
The diffs that enabled us to recognize the vulnerabilities are shown in List-
ings 26 and 27.
v12 = 0;
v13 = 0;
memset(&v14, 0, 0x5Cu);
v11 = 0;
ms_exc.registration.TryLevel = 0;

v16[0] = 0;
memset(&v16[1], 0, 0x3FCu);
v14 = 0;
if ( a2 > 0x10000 )
  return 0;
In July 2017, we learned that another flaw could also have been discovered
the same way – CVE-2017-11817, a leak of over 7 kB of uninitialized kernel
pool memory to NTFS metadata, discussed in Section 6.1.2. The vulnerability
was present in Windows 7, but a memset call had been added to
Ntfs!LfsRestartLogFile in Windows 8.1 and later, potentially exposing the
bug to researchers proficient in binary diffing. When reported to Microsoft, it
was patched in the October 2017 Patch Tuesday.
Interestingly, we have also observed the opposite situation, where a memset
call was removed from the kernel in newer systems, thus introducing a bug.
This was the case for CVE-2018-0832, a stack-based disclosure of four bytes
in nt!RtlpCopyLegacyContextX86 [64]. The buffer created with alloca was
correctly zero-initialized in Windows 7, but not in Windows 8.1 or 10. The root
cause of the regression is unclear.
As demonstrated above, most fixes for kernel infoleaks are obvious both
when seen in the source code and in assembly. The binary diffing required to
identify inconsistent usage of memset doesn’t require much low-level expertise
or knowledge of operating system internals. Therefore, we hope that these bugs
were some of the very few instances of such easily discoverable issues, and we
encourage software vendors to make sure of it by applying security improvements
consistently across all supported versions of their software.
1. The analyzed system may be run under the Bochs emulator, with an
instrumentation that sets the bytes of all new allocations once they are
requested.
2. As explained in Section 4 “Windows bug reproduction techniques”, the
desired effect is a part of the default behavior of the Special Pool option in
Driver Verifier [79]. When the feature is enabled for any specific module,
all pool regions returned to that module are filled with a marker byte
(which changes after each new allocation).
3. Low-level hooks may be installed on kernel functions such as
ExAllocatePoolWithTag to briefly redirect code execution and set the
allocation’s bytes. This option was used by fanxiaocao and pjf [27]. It may
be problematic on 64-bit Windows platforms due to the mandatory PatchGuard
mechanism.
1. Again, the analyzed system may be run under Bochs, with an instrumenta-
tion responsible for writing markers to allocated stack frames (as described
in Section 3.1.2 “Tainting stack frames”).
2. Stack-spraying primitives (as detailed in Section 4 “Windows bug repro-
duction techniques”) may be used prior to invoking the tested system calls,
each time with a different value used for the spraying.
3. Low-level hooks may be installed on system call entry points such as
nt!KiFastCallEntry, to poison the kernel stack before passing execution
to the syscall handlers. This option was used by fanxiaocao and pjf [27],
and similarly to their pool hooks, it may not work well with PatchGuard
on 64-bit system builds.
Options (1) have the best allocation coverage, especially in terms of stack
poisoning, but come at the cost of a significant slowdown. Options (2) have
marginally worse coverage, but can be used on bare metal or in virtual ma-
chines, without digging in low-level system internals. Options (3) can be useful
when a custom memory-poisoning mechanism is needed, and system perfor-
mance plays a substantial role. We have successfully tested methods (1) and (2)
and confirmed their effectiveness in discovering Windows security flaws.
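The common denominator of the pool-poisoning options is an allocator that fills each new region with a marker which changes between allocations. A minimal user-mode sketch of that behavior follows; poisoned_alloc and next_marker are hypothetical names, malloc stands in for the kernel allocator, and the choice to skip the zero byte is our assumption rather than a documented property of any of the mechanisms above.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Marker used for the next allocation. It is advanced after every
 * successful request so that consecutive allocations are filled with
 * different bytes. */
uint8_t next_marker = 0x01;

/* Hypothetical wrapper around malloc() that poisons the new region. */
void *poisoned_alloc(size_t size) {
    void *p = malloc(size);
    if (p != NULL) {
        memset(p, next_marker, size);
        next_marker++;
        if (next_marker == 0x00) {
            /* Skip zero, which is too common in legitimate output
             * to serve as a useful marker. */
            next_marker = 0x01;
        }
    }
    return p;
}
```

Because the marker differs between allocations, the same leak produces different bytes on each invocation of the tested syscall, which is the property that the differential approach relies on.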
Besides poisoning all newly requested allocations, it is necessary to develop
a user-mode harness to invoke the tested syscalls and analyze their output.
This is strongly related to the more general field of effective Windows system
call fuzzing, which is still an open problem. Windows 10 RS3 32-bit supports
as many as 460 native syscalls [66] and 1174 graphical ones [65]. Ideally, the
harness should be aware of the prototypes of all system calls in order to invoke
them correctly, reach their core functionality and interpret the output. To the
best of our knowledge, no existing framework currently enables this kind of
analysis. As a result, we decided to take a more basic
approach and focus on a specific subset of the system calls.
As we were looking for disclosure of uninitialized memory, we were primarily
interested in services designed to query information and return it back to the
caller. One such family of syscalls consists of kernel functions whose names
start with the NtQuery prefix. Their purpose is to obtain various types of
information regarding different objects and resources present in the system.
There is currently a total of 60 such syscalls, with each type of kernel object
having a corresponding service, e.g. NtQueryInformationProcess for processes,
NtQueryInformationToken for security tokens, NtQuerySection for sections
and so forth. Conveniently, a majority of these system calls share a common
definition. An example prototype of the NtQueryInformationProcess handler
is shown in Listing 28.
The parameters, in the original order, are:
While the basic premise is shared across all NtQuery services, some of them
may diverge slightly from the prototype shown above. For example, they may
accept textual paths instead of object handles, use the IO_STATUS_BLOCK
structure instead of a simple ReturnLength variable, or not take the information
class as an argument, because only one type of data is returned.
We briefly reverse-engineered all 60 system calls in question, and determined
that 31 of them were either simple enough that they didn’t require dynamic test-
ing, or could only be accessed by users with administrative rights. Accordingly,
we developed test cases (in the form of standalone test programs) for the remain-
ing 29 non-trivial syscalls. By running them on Windows 7 and 10 (32/64-bit)
in both regular VMs and in Bochspwn Reloaded, we discovered infoleaks in a
total of 14 services across 23 different information classes. The results of the
experiment are summarized in Table 9. A majority of the bugs were stack-based
leaks caused by uninitialized fields and padding bytes in the output structures.
We believe that the experiment demonstrates that differential syscall fuzzing
may be effective in identifying kernel memory disclosure, and that certain groups
of system calls are more susceptible to the problem than others due to their
design and purpose.
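Padding bytes are a frequent source of such leaks because they are invisible in the source code. The sketch below, built around a hypothetical query_output structure, shows how the amount of padding can be computed and how clearing the whole structure before setting its fields is conventionally used to avoid the problem. The exact layout is ABI-dependent, so the figures are an assumption about typical x86 compilers rather than a guarantee.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical output structure. On typical ABIs the compiler inserts
 * padding after `flag` so that `value` is naturally aligned. */
struct query_output {
    uint8_t  flag;
    uint32_t value;
};

/* Number of bytes in the structure that belong to no field. */
size_t padding_bytes(void) {
    return sizeof(struct query_output) - sizeof(uint8_t) - sizeof(uint32_t);
}

/* Fills the structure defensively: clear every byte first, then set
 * the individual fields. */
void fill_output(struct query_output *out, uint8_t flag, uint32_t value) {
    memset(out, 0, sizeof(*out));
    out->flag = flag;
    out->value = value;
}
```

On common x86 ABIs, padding_bytes() typically returns 3. Without the memset call, those bytes would keep whatever the stack or pool held before, even though every named field is assigned.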
System call (NtQuery...)         Information class
AttributesFile                   -
DirectoryFile                    FileBothDirectoryInformation (3)
                                 FileIdBothDirectoryInformation (37)
FullAttributesFile               -
InformationJobObject             JobObjectBasicLimitInformation (2)
                                 JobObjectExtendedLimitInformation (9)
                                 JobObjectNotificationLimitInformation (12)
                                 JobObjectMemoryUsageInformation (28)
InformationProcess               ProcessVmCounters (3)
                                 ProcessImageFileName (27)
                                 ProcessEnergyValues (76)
InformationResourceManager       ResourceManagerBasicInformation (0)
InformationThread                ThreadBasicInformation (0)
InformationTransaction           TransactionPropertiesInformation (1)
InformationTransactionManager    TransactionManagerRecoveryInformation (4)
InformationWorkerFactory         WorkerFactoryBasicInformation (7)
Object                           ObjectNameInformation (1)
SystemInformation                SystemPageFileInformation (18)
                                 MemoryTopologyInformation (138)
                                 SystemPageFileInformationEx (144)
VirtualMemory                    MemoryBasicInformation (0)
                                 MemoryMappedFilenameInformation (2)
                                 MemoryImageInformation (6)
                                 MemoryPrivilegedBasicInformation (8)
VolumeInformationFile            FileFsVolumeInformation (1)
5.5 Taintless Bochspwn-style instrumentation
A full system instrumentation similar to Bochspwn Reloaded is capable of de-
tecting some instances of disclosure of uninitialized memory even without the
notion of shadow memory and taint tracking, as long as it is able to examine
all user-mode memory writes originating from the kernel. One way to achieve
this is to analyze all syscall output data in search of information that shouldn’t
normally be found there and thus manifests infoleaks – for example, valid ring 0
addresses. As mentioned earlier in the paper, kernel addresses are among the
most common types of data found in uninitialized memory, and since they are
also relatively easy to recognize (especially on 64-bit platforms), they may be
used as highly reliable bug indicators. On the downside, the approach cannot
detect leaks that don’t contain any addresses, or leaks whose individual
contiguous chunks are smaller than the width of a pointer.
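A minimal sketch of such an address scan for 64-bit output is shown below. It assumes that kernel-mode addresses occupy the upper canonical half of the x86-64 address space (0xFFFF800000000000 and above) and that the host running the scan is little-endian like the guest. The function names are hypothetical, and excluding the all-ones value is our own heuristic for a common non-pointer constant.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: on x86-64, kernel-mode addresses lie in the upper
 * canonical half. The all-ones value (-1) is excluded because it is a
 * common legitimate constant. */
int looks_like_kernel_pointer(uint64_t value) {
    return value >= 0xFFFF800000000000ULL && value != UINT64_MAX;
}

/* Scans syscall output at pointer-aligned offsets and returns the
 * number of values that resemble kernel addresses. */
size_t count_kernel_pointers(const uint8_t *buf, size_t len) {
    size_t hits = 0;
    for (size_t off = 0; off + sizeof(uint64_t) <= len;
         off += sizeof(uint64_t)) {
        uint64_t v;
        memcpy(&v, buf + off, sizeof(v));  /* avoids unaligned access */
        if (looks_like_kernel_pointer(v)) {
            hits++;
        }
    }
    return hits;
}
```

The scan only inspects whole, aligned 8-byte values, which illustrates the limitation stated above: a leak shorter than a pointer, or one that carries no address, passes unnoticed.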
To improve the above design, it is possible to poison the stack and heap/pool
allocations with a specific marker byte, and search for sequences of that byte
in data written by the kernel to ring 3. This is trivial to achieve from the level
of Bochs instrumentation, but might be more difficult when running the tested
system in a regular virtual machine. The various available avenues of poisoning
newly created kernel memory areas are detailed in Section 5.4 “Differential
syscall fuzzing”. An important quality of this technique is that it overcomes
any potential limitations of taint tracking and propagation, as it is based solely
on the analysis of actual memory contents in the guest system. On the other
hand, it may be prone to false positives, in cases where legitimate syscall output
data happens to contain a sequence of the marker bytes. However, that risk can
be significantly reduced by running several instances of the instrumentation with
different marker bytes, and cross-checking the results to only analyze reports
which reproduce across all of the sessions.
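The cross-checking step can be sketched as follows, assuming that every session captured the output of the same syscall with the same length and that each session used its own marker byte. The names has_marker_run and first_confirmed_leak, as well as the minimum run length parameter, are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if buf[off .. off+min_run) consists solely of `marker`. */
int has_marker_run(const uint8_t *buf, size_t len, size_t off,
                   uint8_t marker, size_t min_run) {
    if (min_run == 0 || off > len || len - off < min_run) {
        return 0;
    }
    for (size_t i = 0; i < min_run; i++) {
        if (buf[off + i] != marker) {
            return 0;
        }
    }
    return 1;
}

/* Returns the first offset at which every session's output holds a run
 * of that session's own marker, or -1 if there is no such offset. */
long first_confirmed_leak(const uint8_t *const *outputs,
                          const uint8_t *markers, size_t sessions,
                          size_t len, size_t min_run) {
    if (sessions == 0) {
        return -1;
    }
    for (size_t off = 0; off + min_run <= len; off++) {
        size_t s;
        for (s = 0; s < sessions; s++) {
            if (!has_marker_run(outputs[s], len, off, markers[s], min_run)) {
                break;
            }
        }
        if (s == sessions) {
            return (long)off;
        }
    }
    return -1;
}
```

Legitimate data that merely resembles one marker stays the same in every session, so it fails the check for the sessions that used a different marker and is filtered out.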
The taintless approach was successfully employed by fanxiaocao and pjf of
IceSword Lab to discover 14 Windows kernel infoleaks in 2017 [27], and by the
grsecurity team to identify an unspecified number of bugs in the Linux kernel
in 2013 [33]. We also used a variant of this method to hunt for disclosure of
uninitialized memory to mass-storage devices, as discussed in Section 6.1.
6 Other data sinks
In addition to the user-mode address space, uninitialized kernel memory may
become available to unauthorized code through other data sinks, such as mass-
storage devices (internal and external hard drives, USB flash drives, DVDs etc.)
and the network. The UniSan tool [45] accounted for those possibilities in Linux
by including the sock_sendmsg and vfs_write functions in the list of sinks
together with copy_to_user, hence universally intercepting most exit points
where data escapes the kernel. The idea proved effective, as more than 50%
of new kernel vulnerabilities discovered by the project (10 out of 19) leaked
memory through the socket sink.
In our experimentation, we focused on recognizing disclosure of Windows
kernel memory through common file systems, using an enhanced variant of the
taintless technique discussed in Section 5.5. In the following subsections, we
outline the inner workings of this side project and present the results, includ-
ing one particularly interesting vulnerability found in the ntfs.sys file system
driver (CVE-2017-11817).
6.1.1 Detection
A canonical approach to sanitizing memory saved to disk in the Windows kernel
would be to develop a file system filter driver, and have it intercept all write
operations performed on the mounted volumes. Then, the driver would invoke
a specially designed hypercall, which would send a signal to the instrumentation
indicating that the taint of a specific memory area needs to be checked. However,
instead of going that route, we decided to test a more experimental method of
taintless detection, based on poisoning all new allocations in the system with
a known marker byte, and scanning the guest’s disk image in search of those
markers. The technique was potentially more effective as it was not subject
to the inaccuracy of taint propagation, but it came at the cost of losing some
valuable context information – we no longer learned about the call stack and
system state at the exact time of the leak taking place. The only information
available to us was the location of the disclosed bytes on disk, and the type of
the leaked memory, if we used different markers for the stack and the pools.
This put us in an inconvenient spot where we could detect leaks, but it was
difficult to analyze them and determine their root cause. In order to solve this
problem, we tried to use the marker bytes to encode more information than
just the fact that they originated from a kernel allocation – specifically, the
addresses of the code that made the allocations. As the testing was performed
on x86 versions of Windows, the pointers were 4 bytes wide. Thus, a 16-byte
pool buffer allocated at address 0x8b7ad4ab would no longer be padded with
the 0xaa byte:
aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
but instead would be filled with a repeated address of its origin in little-endian
format:
ab d4 7a 8b ab d4 7a 8b ab d4 7a 8b ab d4 7a 8b
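The encoding can be sketched in a few lines. The functions fill_with_origin and decode_origin are hypothetical, and the sketch assumes that the pattern starts at the beginning of the allocation.

```c
#include <stddef.h>
#include <stdint.h>

/* Fills buf with the 32-bit origin address, repeated in little-endian
 * byte order starting at the beginning of the buffer. */
void fill_with_origin(uint8_t *buf, size_t len, uint32_t origin) {
    for (size_t i = 0; i < len; i++) {
        buf[i] = (uint8_t)(origin >> (8 * (i % 4)));
    }
}

/* Decodes four consecutive little-endian bytes found on disk back into
 * the address of the code that requested the allocation. */
uint32_t decode_origin(const uint8_t *p) {
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Filling a 16-byte buffer with the origin 0x8b7ad4ab reproduces the byte sequence shown above. A fragment found on disk may begin in the middle of the 4-byte pattern, in which case the bytes have to be rotated before decoding.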
47 ac a7 b0 47 ac a7 b0 47 ac a7 b0 47 ac a7 b0
As a result of the testing, we didn’t identify any issues in the handling of file
systems from the FAT family. On the other hand, we discovered a multitude of
both stack-based and pool-based leaks to NTFS volumes, originating from 10
different locations in the ntfs.sys driver (see Table 10). The size of the leaks
was generally modest and ranged from 4 to 12 bytes in contiguous blocks of
uninitialized memory. Most of them affected all currently supported versions
of Windows. They were initially submitted to Microsoft in a single report on
July 7, 2017, followed by an update with more details on the bugs on August 25.
A summary of our communication with MSRC can be found in the correspond-
ing bug #1325 in the Project Zero bug tracker [58]. The issues were collectively
fixed in a single bulletin as CVE-2017-11880 in November 2017.
Leaked allocation origin Memory Windows
ntfs!NtfsInitializeReservedBuffer+0x20 Pool 7
ntfs!NtfsAddAttributeAllocation+0xb16 Pool 7-10
ntfs!NtfsCheckpointVolume+0xdcd Pool 7-10
ntfs!NtfsDeleteAttributeAllocation+0x12d Pool 7-10
ntfs!CreateAttributeList+0x1c Pool 7-10
ntfs!NtfsCreateMdlAndBuffer+0x95 Pool 7-10
ntfs!NtfsDeleteAttributeAllocation+0xf Stack 7-10
ntfs!NtfsWriteLog+0xf Stack 7-10
ntfs!NtfsAddAttributeAllocation+0xf Stack 7-10
ntfs!NtfsCreateAttributeWithAllocation+0xf Stack 7-10
Table 10: A summary of minor kernel infoleaks found in the ntfs.sys driver
7 Future work
Disclosure of uninitialized memory is far from a solved problem, and there
are multiple avenues for improvement on all levels of work – both specific to
Bochspwn Reloaded and the wider topic of mitigating this class of vulnerabili-
ties in software. They are discussed in the paragraphs below.
Other security domains. We discussed how the characteristics of the C pro-
gramming language contribute to the difficulty of securely passing data between
different security domains in Section 2.1. In principle, the outlined problems are
not specific to interactions between user space and the kernel, and may appear
in any software where low-level objects are passed through shared memory be-
tween components with different privilege levels. In particular, we expect that
both inter-process communication channels (used in sandboxing) and
virtualization software may also be susceptible to similar infoleaks, and the concepts
described in this paper should be applicable to these areas.
8 Conclusion
Information disclosure in general, and kernel memory disclosure in particular,
are a very specific class of software vulnerabilities, different from the more con-
ventional types such as buffer overflows or use-after-free conditions. They don’t
reveal themselves through crashes or hangs, even though they may be triggered
thousands of times every minute during normal system run time. As they are
usually a by-product of functional user↔kernel communication and very difficult
to spot in the code, they may remain unnoticed for many years.
In this paper, we elaborated on the different root causes and factors con-
tributing to memory disclosure in modern software, and showed that the Win-
dows operating system was affected by kernel infoleaks present in a range of
drivers and system calls. To address the problem, we designed and implemented
a Bochs-based instrumentation to automatically identify such issues during sys-
tem run time. We then proceeded to use it to discover and report over 70 unique
vulnerabilities in the Windows kernel and over 10 lesser bugs in Linux through-
out 2017 and the beginning of 2018. To expand more on the general subject, we
also evaluated alternative techniques for exposing similar leaks, and tackled the
problem of recognizing uninitialized memory escaping the kernel to mass-storage
devices. Finally, we reviewed other applications of system instrumentation to
uncover kernel bugs and sensitive parts of the code.
We are optimistic that with continued work in the areas of detection and
mitigation by OS vendors and compiler developers, the entire class of kernel
infoleaks may be completely eliminated in the foreseeable future.
9 Acknowledgments
We would like to thank Gynvael Coldwind, Jann Horn, Joe Bialek, Matt Miller,
Mathias Krause, Brad Spengler and Solar Designer for reviewing this paper or
parts of it and providing valuable feedback.
References
[1] bochs: The Open-Source IA-32 Emulation Project. http://bochs.sourceforge.net/.
[2] Building Modules. https://www.reactos.org/wiki/Building_Modules.
[3] Coccinelle. http://coccinelle.lip6.fr/.
[16] Adrian Salido. dm ioctl: prevent stack leak in dm ioctl call. https:
//git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
commit/?id=4617f564c06117c7d1b611be49521a4430042287.
[17] Alexander Levin. kmemcheck: remove annotations. https:
//git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
commit/?id=4950276672fce5c241857540f8561c440663673d.
[18] Alexander Popov. How STACKLEAK improves Linux kernel security.
https://linuxpiter.com/en/materials/2344. Linux Piter 2017.
[19] Alexander Potapenko. KernelMemorySanitizer against uninitialized mem-
ory. http://www.linuxplumbersconf.org/2017/ocw/proposals/4825.
[20] Alexandru Radocea, Georg Wicherski. Visualizing Page Ta-
bles for Local Exploitation: Hacking Like in the Movies.
https://media.blackhat.com/us-13/US-13-Wicherski-Hacking-
like-in-the-Movies-Visualizing-Page-Tables-Slides.pdf,
https://media.blackhat.com/us-13/US-13-Wicherski-Hacking-
like-in-the-Movies-Visualizing-Page-Tables-WP.pdf. Black Hat
USA 2013.
[21] Berger, Emery D and Zorn, Benjamin G. DieHard: probabilistic memory
safety for unsafe languages. In ACM SIGPLAN Notices, volume 41, pages
158–168. ACM, 2006.
[22] Chen, Haogang and Mao, Yandong and Wang, Xi and Zhou, Dong and
Zeldovich, Nickolai and Kaashoek, M Frans. Linux kernel vulnerabilities:
State-of-the-art defenses and open problems. In Proceedings of the Second
Asia-Pacific Workshop on Systems, page 5. ACM, 2011.
[23] Chow, Jim and Pfaff, Ben and Garfinkel, Tal and Rosenblum, Mendel.
Shredding Your Garbage: Reducing Data Lifetime Through Secure Deal-
location. In USENIX Security Symposium, pages 22–22, 2005.
[24] Cowan, Crispin and Pu, Calton and Maier, Dave and Walpole, Jonathan
and Bakke, Peat and Beattie, Steve and Grier, Aaron and Wagle, Perry
and Zhang, Qian and Hinton, Heather. Stackguard: Automatic adaptive
detection and prevention of buffer-overflow attacks. In USENIX Security
Symposium, volume 98, pages 63–78. San Antonio, TX, 1998.
[25] Dan Rosenberg. Subject: CVE request: multiple kernel stack memory
disclosures. http://www.openwall.com/lists/oss-security/2010/09/
25/2.
[26] Dan Rosenberg. Vulnerabilities. http://www.vulnfactory.org/vulns/.
Listed under “memory disclosure”.
[27] fanxiaocao and pjf of IceSword Lab (Qihoo 360). Automatically Dis-
covering Windows Kernel Information Leak Vulnerabilities. http:
//www.iceswordlab.com/2017/06/14/Automatically-Discovering-
Windows-Kernel-Information-Leak-Vulnerabilities_en/.
[28] fanxiaocao of IceSword Lab (Qihoo 360). great! I am also got multi
case of “double-write”. yet I report about 20 kernel pool address leak to
MS. but they change the bar. https://twitter.com/TinySecEx/status/
943410888119218176.
[29] fanxiaocao of IceSword Lab (Qihoo 360). new type of info-leak. https:
//twitter.com/TinySecEx/status/943417169953505282.
[30] fanxiaocao of IceSword Lab (Qihoo 360). you see , i am also
found this case . haha! https://twitter.com/TinySecEx/status/
943411731845410816.
[31] Feng Yuan. Windows Graphics Programming: Win32 GDI and DirectDraw.
Prentice Hall Professional, 2001.
[32] Feng Yuan. Source code for Windows Graphics Programming: Win32
GDI and DirectDraw. https://blogs.msdn.microsoft.com/fyuan/
2007/03/21/source-code-for-windows-graphics-programming-
win32-gdi-and-directdraw/.
[33] grsecurity. Probably didn’t find anything in the typical Linux userland
interface because 2013 we did some similar instrumentation to make some
leaks fall out - modified the magic for STACKLEAK/SANITIZE to a value
we told a fuzzer to never provide to the kernel,inspected copy_to_user for
it. https://twitter.com/grsecurity/status/991450745642905602.
[34] Jann Horn. MacOS getrusage stack leak through struct padding. https:
//bugs.chromium.org/p/project-zero/issues/detail?id=1405.
[35] Jon Oberheide. Advisories. https://jon.oberheide.org/advisories/.
Listed under “Stack Disclosure”.
[43] Kurmus, Anil and Zippel, Robby. A tale of two kernels: Towards ending
kernel hardening wars with split kernel. In Proceedings of the 2014 ACM
SIGSAC Conference on Computer and Communications Security, pages
1366–1377. ACM, 2014.
[44] Lu, Kangjie. Securing software systems by preventing information leaks.
PhD thesis, Georgia Institute of Technology, 2017.
[45] Lu, Kangjie and Song, Chengyu and Kim, Taesoo and Lee, Wenke.
UniSan: Proactive kernel memory initialization to eliminate data leak-
ages. In Proceedings of the 2016 ACM SIGSAC Conference on Computer
and Communications Security, pages 920–932. ACM, 2016.
[46] Lu, Kangjie and Walter, Marie-Therese and Pfaff, David and Nürnberger,
Stefan and Lee, Wenke and Backes, Michael. Unleashing use-before-
initialization vulnerabilities in the Linux kernel using targeted stack spray-
ing. In NDSS’17, Network and Distributed System Security Symposium,
2017.
[47] Mateusz Jurczyk. A story of win32k!cCapString, or unicode strings gone
bad. http://j00ru.vexillium.org/?p=1609.
[48] Mateusz Jurczyk. FreeType 2.5.3 CFF CharString parsing heap-based
buffer overflow in "cff_builder_add_point". https://bugs.chromium.org/
p/project-zero/issues/detail?id=185.
[49] Mateusz Jurczyk. FreeType 2.5.3 multiple unchecked function calls
returning FT_Error. https://bugs.chromium.org/p/project-zero/issues/
detail?id=197.
[50] Mateusz Jurczyk. nt!NtMapUserPhysicalPages and Kernel Stack-
Spraying Techniques. http://j00ru.vexillium.org/?p=769.
[51] Mateusz Jurczyk. Subtle information disclosure in WIN32K.SYS syscall
return values. http://j00ru.vexillium.org/?p=762.
[52] Mateusz Jurczyk. Using Binary Diffing to Discover Windows Kernel Mem-
ory Disclosure Bugs. https://googleprojectzero.blogspot.com/2017/
10/using-binary-diffing-to-discover.html.
[53] Mateusz Jurczyk. Windows Kernel 64-bit stack memory disclosure
in nt!NtQueryVirtualMemory (MemoryImageInformation). https://
bugs.chromium.org/p/project-zero/issues/detail?id=1519.
[54] Mateusz Jurczyk. Windows Kernel Local Denial-of-Service
#1: win32k!NtUserThunkedMenuItemInfo (Windows 7-10).
http://j00ru.vexillium.org/?p=3101.
[55] Mateusz Jurczyk. Windows Kernel Local Denial-of-Service #2:
win32k!NtDCompositionBeginFrame (Windows 8-10). http:
//j00ru.vexillium.org/?p=3151.
[56] Mateusz Jurczyk. Windows Kernel Local Denial-of-Service #3:
nt!NtDuplicateToken (Windows 7-8). http://j00ru.vexillium.org/?p=
3187.
[57] Mateusz Jurczyk. Windows Kernel Local Denial-of-Service
#4: nt!NtAccessCheck and family (Windows 8-10). http:
//j00ru.vexillium.org/?p=3225.
[58] Mateusz Jurczyk. Windows Kernel multiple stack and pool memory dis-
closures into NTFS file system metadata. https://bugs.chromium.org/
p/project-zero/issues/detail?id=1325.
[59] Mateusz Jurczyk. Windows Kernel pool memory disclosure due to output
structure alignment in win32k!NtGdiGetOutlineTextMetricsInternalW.
https://bugs.chromium.org/p/project-zero/issues/detail?id=
1144.
[60] Mateusz Jurczyk. Windows Kernel pool memory disclosure in
win32k!NtGdiGetGlyphOutline. https://bugs.chromium.org/p/
project-zero/issues/detail?id=1267.
[61] Mateusz Jurczyk. Windows Kernel pool memory disclosure into
NTFS metadata ($LogFile) in Ntfs!LfsRestartLogFile. https://
bugs.chromium.org/p/project-zero/issues/detail?id=1352.
[68] Mateusz Jurczyk, Gynvael Coldwind. Bochspwn: Exploiting Ker-
nel Race Conditions Found via Memory Access Patterns. http://
j00ru.vexillium.org/slides/2013/syscan.pdf. SyScan 2013.
[69] Mateusz Jurczyk, Gynvael Coldwind. Bochspwn: Identifying 0-
days via system-wide memory access pattern analysis. http://
j00ru.vexillium.org/slides/2013/bhusa.pdf. Black Hat USA 2013.
[70] Mateusz Jurczyk, Gynvael Coldwind. Identifying and Exploiting Win-
dows Kernel Race Conditions via Memory Access Patterns. http://
vexillium.org/dl.php?bochspwn.pdf.
[71] Mateusz Jurczyk, Gynvael Coldwind. kfetch-toolkit. https://
github.com/j00ru/kfetch-toolkit.
[72] Mateusz Jurczyk, Gynvael Coldwind. read_lin_mem() function.
https://github.com/j00ru/kfetch-toolkit/blob/master/instrumentation/mem_interface.cc.
[73] Mathias Krause. CVE Requests (maybe): Linux kernel: various info
leaks, some NULL ptr derefs. http://www.openwall.com/lists/oss-
security/2013/03/05/13.
[74] Matt Tait. Google Project Zero Bug Tracker. https:
//bugs.chromium.org/p/project-zero/issues/list?can=1&q=
id%3A390%2C435%2C453.
[75] Matt Tait. Kernel-mode ASLR leak via uninitialized memory returned
to usermode by NtGdiGetTextMetrics. https://bugs.chromium.org/p/
project-zero/issues/detail?id=480.
[76] Michal Hocko. kmemcheck: rip it out for real. https://git.kernel.org/
pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=
f335195adf043168ee69d78ea72ac3e30f0c57ce.
[77] Michal Zalewski. American Fuzzy Lop. http://lcamtuf.coredump.cx/
afl/.
[78] Microsoft. Acknowledgments 2015. https://docs.microsoft.com/en-
us/security-updates/Acknowledgments/2015/acknowledgments2015.
[79] Microsoft. Special Pool (MSDN). https://docs.microsoft.com/en-us/
windows-hardware/drivers/devtest/special-pool.
[80] Microsoft. Windows lifecycle fact sheet. https://
support.microsoft.com/en-us/help/13853/windows-lifecycle-
fact-sheet.
[81] Microsoft (MSDN). ExAllocatePoolWithQuotaTag function.
https://docs.microsoft.com/en-us/windows-hardware/drivers/
ddi/content/wdm/nf-wdm-exallocatepoolwithquotatag.
[82] Microsoft (MSDN). ExAllocatePoolWithTag function. https:
//docs.microsoft.com/en-us/windows-hardware/drivers/ddi/
content/wdm/nf-wdm-exallocatepoolwithtag.
[83] Microsoft (MSDN). /GS (Buffer Security Check). https://
msdn.microsoft.com/en-us/library/8dbf701c.aspx.
[95] MITRE. CWE-252: Unchecked Return Value. https://cwe.mitre.org/
data/definitions/252.html.
[96] North, John. Identifying Memory Address Disclosures. 2015.
[97] Peiró, Salva and Muñoz, M and Masmano, Miguel and Crespo, Alfons.
Detecting stack based kernel information leaks. In International Joint
Conference SOCO14-CISIS14-ICEUTE14, pages 321–331. Springer, 2014.
[98] Peiró, Salva and Muñoz, M and Crespo, Alfons. An analysis on the impact
and detection of kernel stack infoleaks. Logic Journal of the IGPL,
24(6):899–915, 2016.
[99] pj4533. MemSpyy. https://www.codeproject.com/Articles/21090/
MemSpyy.
[100] Robert C. Seacord. The CERT C Coding Standard, Second Edition: 98
Rules for Developing Safe, Reliable, and Secure Systems. Addison-Wesley
Professional, 2014. DCL39-C. Avoid information leakage when passing a
structure across a trust boundary.
[101] Sebastian Apelt. Pwn2Own 2014 - Escaping the sandbox through
AFD.sys. http://siberas.blogspot.de/2014/07/pwn2own-2014-
escaping-sandbox-through.html.
[102] Solar Designer. Finally went through the Bochspwn Reloaded slides.
Kudos! Feedback: rather than “Remove taint on free”, you could re-
taint & detect UAF+leak. https://twitter.com/solardiz/status/
879288169174315008.
[103] StatCounter GlobalStats. Desktop Windows Version Market Share
Worldwide. http://gs.statcounter.com/os-version-market-share/
windows/desktop/worldwide.
[104] Tavis Ormandy. Fun Rev. Challenge: On 32bit Windows7, explain where
the upper 16bits of eax come from after a call to NtUserRegisterClassEx-
WOW(). https://twitter.com/taviso/status/16853682570.
[105] Tavis Ormandy. iknowthis Linux System Call Fuzzer. https://
github.com/rgbkrk/iknowthis.
[106] Tess Ferrandez. Show me the memory: Tool for visualizing virtual mem-
ory usage and GC heap usage. https://blogs.msdn.microsoft.com/
tess/2009/04/23/show-me-the-memory-tool-for-visualizing-
virtual-memory-usage-and-gc-heap-usage/.
[107] Vasiliy Kulikov. Re: Linux kernel proactive security hardening. http:
//seclists.org/oss-sec/2010/q4/129.
[108] Vegard Nossum. Getting started with kmemcheck. https://
www.kernel.org/doc/html/v4.12/dev-tools/kmemcheck.html.
[109] Wandering Glitch. Leaking Windows Kernel Pointers. https:
//ruxcon.org.au/assets/2016/slides/RuxCon%20-%20Leaking%
20Windows%20Kernel%20Pointers.pdf.
[110] Weimin Wu. An Analysis of A Windows Kernel-Mode Vulnerabil-
ity (CVE-2014-4113). https://blog.trendmicro.com/trendlabs-
security-intelligence/an-analysis-of-a-windows-kernel-mode-
vulnerability-cve-2014-4113/.
A Other system instrumentation schemes
During the development of Bochspwn and Bochspwn Reloaded, we have consid-
ered a number of alternative ways in which full system instrumentation could
help identify security flaws, or at least signal sensitive areas of code that should
receive more attention. For some of the ideas, we implemented functional proto-
types which did uncover new bugs in the Windows kernel. While none of these
experimental tools had as much success as the two main projects, we decided
to discuss them in this appendix for completeness. We hope that the concepts
outlined in the following subsections may serve as a source of inspiration for
researchers aiming to take up the subject of kernel instrumentation through
software emulation. They are mostly discussed in the context of Windows, but
many of them also apply to Linux and other operating systems.
internal kernel functions with operating on user-mode pointers. With the lack
of explicit pointer annotations in Windows, no kernel routine can know for
certain whether a pointer it receives as an argument is a user-mode one or not.
Passing along a user-controlled pointer into nested functions may blur the
understanding of which part of the code is accountable for sanitizing the
address. This may have dire consequences for system security, as illustrated in
Listing 29.
In this example, the Output parameter is never validated before being writ-
ten to. The top-level NtMagicValues handler doesn’t sanitize the pointer be-
cause it doesn’t directly operate on it. The Foo() function doesn’t do it, because
it assumes that any argument it receives will already have been checked by the
caller. Finally, the Bar() function doesn’t do it, because it is a simple internal
function that has no notion of different types of pointers. As a whole, this re-
sults in an arbitrary kernel memory overwrite – an easily exploitable security
flaw – all because of the ambiguity caused by passing user-mode pointers to
internal kernel functions which do not expect it.
Potential issues of this kind may be flagged by logging all user-mode memory
references taking place within relatively deep callstacks. While not all instances
of such behavior manifest actual bugs, heavily nested accesses to the ring 3
address space may suggest that the relevant code is not aware of the nature of
the referenced pointer. This in turn increases the likelihood of the presence of
a vulnerability, and warrants manual follow-up analysis.
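
The heuristic described above can be sketched in C as follows. This is a minimal illustration rather than the code of an actual instrumentation: it assumes that the emulator reports system call entries, call and ret instructions, and kernel references to memory. All names (depth_tracker, DEEP_ACCESS_THRESHOLD and the tracker_* functions) and the threshold value are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical threshold: a user-mode access made this many calls below
 * the top-level syscall handler is treated as "deeply nested". */
#define DEEP_ACCESS_THRESHOLD 3

/* Per-thread state kept for the duration of a single system call. */
struct depth_tracker {
    unsigned depth; /* 0 = executing in the top-level syscall handler */
};

void tracker_on_syscall_entry(struct depth_tracker *t) { t->depth = 0; }

void tracker_on_call(struct depth_tracker *t) { t->depth++; }

void tracker_on_ret(struct depth_tracker *t) {
    if (t->depth > 0)
        t->depth--;
}

/* Returns 1 if a kernel memory reference should be logged for manual
 * review: it targets user space and happens deep in the call stack. */
int tracker_should_log(const struct depth_tracker *t, uint32_t address,
                       uint32_t user_space_limit) {
    assert(t != NULL);
    return address < user_space_limit && t->depth >= DEEP_ACCESS_THRESHOLD;
}
```

A real tool would additionally record the stack trace of each flagged access, since the report is only a hint that has to be confirmed by manual analysis.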
A.3 Unprotected accesses to user-mode memory
Kernels should typically never assume the validity of user-mode pointers, unless
the address range in question is explicitly locked in memory. In Windows, this
imposes the need to wrap each user-mode memory reference with an adequate
exception handler. The absence of such a handler at any place where a controlled
pointer is accessed may be exploited to trigger an unhandled exception and crash
the operating system. The need to safely handle all exceptions arising from the
usage of ring 3 memory is reflected in the documentation of the ProbeForWrite
function [88]:
try {
    ...
    ProbeForWrite(Buffer, BufferSize, BufferAlignment);
    /* Note that any access (not just the probe, which must come first,
     * by the way) to Buffer must also be within a try-except.
     */
    ...
} except (EXCEPTION_EXECUTE_HANDLER) {
    /* Error handling code */
    ...
}
struct _EH3_EXCEPTION_REGISTRATION {
    struct _EH3_EXCEPTION_REGISTRATION *Next;
    PVOID ExceptionHandler;
    PSCOPETABLE_ENTRY ScopeTable;
    DWORD TryLevel;
};
The structures reside in the stack frames of their corresponding functions and
are initialized by calling the _SEH_prolog4(_GS) procedures. During execution,
entering the try{} blocks is denoted by writing their zero-based indexes to the
TryLevel fields in the aforementioned structures, and later overwriting them
with −2 (0xfffffffe) when execution leaves the blocks and exception handling
is disabled. Below is an example of a try/except block encapsulating the
writing of a single DWORD value to user space:
Table 11: Summary of DoS bugs caused by unprotected access to user space
A.4 Broad exception handlers
In Windows, for every try{} block of code modifying any global data structures
in the kernel, there should be a corresponding except{} block which reverts all
persistent changes made prior to the exception. This is relatively easy to achieve
with a flat structure of the code, when all relevant operations are explicitly vis-
ible inside the try{}. On the other hand, the rule is more difficult to enforce
when nested calls are used, thus obfuscating the code flow and potentially fa-
cilitating the interruption of functions which don’t anticipate being preempted
in the middle of execution. In certain situations, this can lead to leaving global
objects in the system in an inconsistent state, which may open up security
vulnerabilities. One example of such a bug is CVE-2014-1767 [101], a dangling
pointer flaw in the afd.sys network driver, which was used during the Pwn2Own
competition to elevate privileges in the system as part of a longer exploit chain.
Let’s consider an example shown in Listing 32. The NtAddValue top-level
syscall handler validates the UserValuePtr pointer but doesn’t read its value,
instead passing it to the internal KeAddValueInternal function. The latter
routine is not aware of the type of the pointer it receives, so it simply imple-
ments its self-contained logic – allocates a new object in memory, inserts it into
a doubly-linked list and initializes it with the input data. A problem arises
when accessing ValuePtr in line 14 fails and generates an ACCESS_VIOLATION
exception, thus effectively aborting the execution of KeAddValueInternal and
jumping straight into the handler in line 28. Due to the fact that the nested
function was interrupted, it has already saved the newly allocated object as the
head of the list, but hasn’t yet initialized the object’s LIST_ENTRY structure.
On the other hand, NtAddValue doesn’t know the internal logic of the functions
it invokes, so it doesn’t revert any changes made by them. Consequently, the
linked list becomes corrupted with uninitialized pointers, leaving the system
unstable and prone to privilege escalation attacks.
Candidates for such issues can be automatically identified by system
instrumentation by examining all kernel accesses to user-mode memory and checking
the location of the first enabled exception handler in the stack trace for each
of them. If the closest handler is not in the current function (or even worse,
several functions below in the call stack), it is an indicator of a broad exception
handler that could be unprepared to correctly restore the kernel to the state
before the exception. That said, not all broad handlers are bugs, so each case
needs to be examined on its own to determine if any changes made up to the
current point of execution are not accounted for in the corresponding exception
handler.
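
The classification described above can be illustrated with a short C sketch. It is a simplified model and not the actual instrumentation: each stack frame is reduced to a flag saying whether it holds an exception registration record and the current TryLevel of that record, with 0xfffffffe (−2) meaning that no try block is active. The frame_info structure, the enum and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRYLEVEL_NONE 0xfffffffeu /* -2: no try block active in the frame */

/* Simplified, hypothetical view of one stack frame. */
struct frame_info {
    int has_registration; /* frame contains an exception registration */
    uint32_t try_level;   /* meaningful only if has_registration != 0 */
};

enum handler_class { HANDLER_MISSING, HANDLER_LOCAL, HANDLER_BROAD };

/* frames[0] is the function performing the user-mode access and higher
 * indexes are its callers. Returns the index of the first frame with an
 * enabled handler, or -1 if there is none. */
int first_enabled_handler(const struct frame_info *frames, size_t count) {
    assert(frames != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (frames[i].has_registration &&
            frames[i].try_level != TRYLEVEL_NONE)
            return (int)i;
    }
    return -1;
}

/* A handler outside the current function is reported as "broad". */
enum handler_class classify_access(const struct frame_info *frames,
                                   size_t count) {
    int index = first_enabled_handler(frames, count);
    if (index < 0)
        return HANDLER_MISSING;
    return index == 0 ? HANDLER_LOCAL : HANDLER_BROAD;
}
```

In this model, HANDLER_MISSING corresponds to the unprotected accesses discussed in Section A.3, while HANDLER_BROAD marks the candidates that require manual review of the enclosing exception handler.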
ExAllocatePoolWithQuotaTag [81].
if ((Address & (Alignment - 1)) != 0) {
    ExRaiseDatatypeMisalignment();
}
• Intercept all cmp instructions with a register or memory as the first operand.
• Resolve the values of both operands of the instruction.
• If the value of the second operand is equal to MmUserProbeAddress (for
example 0x7fff0000 in x86 builds), mark the address in the first operand
as sanitized in the scope of the current syscall.
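
The steps above can be sketched in C as follows. The example is a minimal illustration under stated assumptions: operand values are taken to be already resolved by the emulator, the set of sanitized addresses is a small fixed-size array, and the probe_* names and MAX_SANITIZED are hypothetical. The value of MmUserProbeAddress is the x86 example quoted in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MM_USER_PROBE_ADDRESS 0x7fff0000u /* example value for x86 builds */
#define MAX_SANITIZED 64                  /* arbitrary capacity for the sketch */

/* Addresses considered sanitized within the current system call. */
struct probe_state {
    uint32_t sanitized[MAX_SANITIZED];
    size_t count;
};

/* The sanitized set is only valid in the scope of one syscall. */
void probe_on_syscall_entry(struct probe_state *s) { s->count = 0; }

/* Called for every cmp whose first operand is a register or memory,
 * with the values of both operands resolved by the caller. */
void probe_on_cmp(struct probe_state *s, uint32_t first, uint32_t second) {
    assert(s != NULL);
    if (second == MM_USER_PROBE_ADDRESS && s->count < MAX_SANITIZED)
        s->sanitized[s->count++] = first;
}

/* Returns 1 if the address was compared against MmUserProbeAddress. */
int probe_is_sanitized(const struct probe_state *s, uint32_t address) {
    for (size_t i = 0; i < s->count; i++) {
        if (s->sanitized[i] == address)
            return 1;
    }
    return 0;
}
```

Note that the sketch records only the start of the probed region, so it cannot tell how many bytes the probe covered.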
win32k!RegisterLogonProcess). However, there is much room for improve-
ment in this field, and we believe that the instrumentation could successfully
detect security issues provided an effective way to determine both ends of each
probed ring 3 address range.
In lines 6-8, the syscall handler initializes the output UNICODE_STRING
structure as an empty string backed by a user-mode buffer. Further on, it calls the
RtlAppendUnicodeString API on that structure to fill it with a textual
representation of the system version. The problem in the code is that the latter
routine assumes that it receives a non-volatile kernel UNICODE_STRING object,
while in fact it is passed a user-mode pointer whose data may change asyn-
chronously at any point of the system call execution. A malicious program could
use a concurrent thread to exploit the race condition by changing the value of
UnicodeString->Buffer, to point it into the kernel address space within the
short time window between the initialization of the pointer and its usage in the
API function.
By nature, the bug class is very similar to double fetches, with the main
difference being that the affected code trusts the contents of user-mode memory
not because it has already read it once, but because it has explicitly initialized
it to a specific value. The detection of such issues is also almost identical to the
logic implemented in the original Bochspwn project [70], with the addition of
instrumenting not only kernel→user memory reads, but also writes. We expect
read-after-write conditions to be mostly specific to Windows, as they seem to be
strongly tied to direct user-mode pointer manipulation and the lack of clear
distinction between ring 3 and ring 0 pointers.
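
A minimal C sketch of the read-after-write check is shown below. It is an illustration under stated assumptions and not the code of the original tool: kernel writes to user-mode memory are stored as address ranges in a fixed-size per-syscall log, and every later kernel read of user-mode memory is tested for overlap with that log. The raw_* names and RAW_MAX_WRITES are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RAW_MAX_WRITES 128 /* arbitrary capacity for the sketch */

struct raw_range {
    uint32_t start;
    uint32_t length;
};

/* Kernel writes to user-mode memory seen in the current system call. */
struct raw_state {
    struct raw_range writes[RAW_MAX_WRITES];
    size_t count;
};

void raw_on_syscall_entry(struct raw_state *s) { s->count = 0; }

/* Record a kernel write to user-mode memory. */
void raw_on_user_write(struct raw_state *s, uint32_t start, uint32_t length) {
    assert(s != NULL);
    if (length != 0 && s->count < RAW_MAX_WRITES) {
        s->writes[s->count].start = start;
        s->writes[s->count].length = length;
        s->count++;
    }
}

/* Returns 1 if a kernel read of user-mode memory overlaps a region the
 * kernel wrote earlier in the same system call. */
int raw_on_user_read(const struct raw_state *s, uint32_t start,
                     uint32_t length) {
    uint64_t read_end = (uint64_t)start + length; /* 64-bit: no wraparound */
    for (size_t i = 0; i < s->count; i++) {
        uint64_t write_end =
            (uint64_t)s->writes[i].start + s->writes[i].length;
        if (start < write_end && s->writes[i].start < read_end)
            return 1;
    }
    return 0;
}
```

In the scenario discussed earlier, the write would be the initialization of the Buffer field in the user-mode UNICODE_STRING structure, and the flagged read would be the later use of that field inside the API function.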
Let’s examine the example illustrated in Listing 34. The USERNAME structure
is a self-contained object that includes both the UNICODE_STRING structure and
the corresponding textual buffer. A local object of this type is first initialized
in lines 11-12 and later copied to the client in line 14. Since the Buffer pointer
passed back to user-mode still contains a kernel address, it is overwritten with a
reference to the ring 3 buffer in line 15. The time window available for another
thread to capture the disclosed kernel pointer lasts between lines 14 and 15.
Detection of double writes is again very similar to that of double fetches – the
instrumentation should catch all kernel→userland memory writes, and signal a
bug every time a specific address is written to with non-zero data more than
once in the context of a single system call. As an additional feature, the tool can
put a special emphasis on cases where the original bytes resemble a kernel-mode
address, and the new data appears to be a user-mode pointer. This should help
highlight the reports most likely to represent actual information disclosure bugs.
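
The detection logic described above can be sketched in C as follows. This is a simplified illustration, not the prototype itself: it tracks only 32-bit writes matched by exact address, keeps them in a fixed-size per-syscall table, and assumes a 2 GB user/kernel split (USER_SPACE_LIMIT) to decide whether a value resembles a kernel-mode or a user-mode pointer. All dw_* names and constants are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define USER_SPACE_LIMIT 0x80000000u /* assumed 2 GB/2 GB address split */
#define DW_MAX_ENTRIES 128           /* arbitrary capacity for the sketch */

struct dw_entry {
    uint32_t address; /* user-mode address written by the kernel */
    uint32_t value;   /* last non-zero value written there */
};

struct dw_state {
    struct dw_entry entries[DW_MAX_ENTRIES];
    size_t count;
};

enum dw_result { DW_NONE, DW_DOUBLE_WRITE, DW_LIKELY_POINTER_LEAK };

void dw_on_syscall_entry(struct dw_state *s) { s->count = 0; }

/* Called for each 32-bit kernel write to user-mode memory. */
enum dw_result dw_on_user_write32(struct dw_state *s, uint32_t address,
                                  uint32_t value) {
    assert(s != NULL);
    if (value == 0)
        return DW_NONE; /* only non-zero data is considered */
    for (size_t i = 0; i < s->count; i++) {
        if (s->entries[i].address == address) {
            uint32_t old = s->entries[i].value;
            s->entries[i].value = value;
            /* Emphasize kernel-looking data replaced by a user pointer. */
            if (old >= USER_SPACE_LIMIT && value < USER_SPACE_LIMIT)
                return DW_LIKELY_POINTER_LEAK;
            return DW_DOUBLE_WRITE;
        }
    }
    if (s->count < DW_MAX_ENTRIES) {
        s->entries[s->count].address = address;
        s->entries[s->count].value = value;
        s->count++;
    }
    return DW_NONE;
}
```

A byte-granular shadow map would catch partially overlapping writes that this exact-address comparison misses, at the cost of more bookkeeping.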
To test the above idea, we implemented a simple prototype of the instru-
mentation and ran it on Windows 7 and 10 32-bit. As a result, we discovered
three double-write conditions, all leaking addresses of objects in ring 0 memory:
• A bug in nt!IopQueryNameInternal, in the copying of a UNICODE_STRING
structure. The flaw is reachable through the nt!NtQueryObject and
nt!NtQueryVirtualMemory system calls, and was filed in the Project Zero
bug tracker with the corresponding proof of concept as issue #1456 [63].
• A bug in nt!PspCopyAndFixupParameters (UNICODE_STRING structures
nested in RTL_USER_PROCESS_PARAMETERS).
• A bug in win32k!NtUserfnINOUTNCCALCSIZE (NCCALCSIZE_PARAMS structure).
The first of the above problems was reported to Microsoft in December 2017,
but the vendor replied that the report and all similar issues didn’t meet the bar
for a security bulletin and would be instead targeted to be fixed in the next
version of Windows. Upon publishing the details of the double-write conditions,
other researchers publicly claimed that they were also aware of the bug class
and collided with some of our findings [28, 30, 29].
97
It is important to note that unchecked return values are a canonical type
of a problem that can be effectively detected using static analysis, especially
if the source code is available. As a very basic example, after finding the
aforementioned FreeType bug caused by an ignored error code, we added an
__attribute__((warn_unused_result)) directive to the declaration of the
internal FT_Error type and compiled the project with the -Wattributes flag,
which caused gcc to warn about all instances of unchecked return values of type
FT_Error [49]. The output of this experiment motivated the project’s
maintainer to submit a series of patches to fix many potential, related bugs. While
this is a simple example, more advanced analyzers should be able to pinpoint
such behavior accurately, without the need to apply any special changes to the
tested code.
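
For illustration, the snippet below shows how the warn_unused_result attribute behaves in a small, self-contained C program. It is only a sketch: unlike the FreeType experiment, which annotated the error type, it attaches the attribute to a single function, which is the usage documented for GCC and Clang. The status_t type and both functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef int status_t; /* hypothetical error type; 0 means success */

/* The compiler warns when a caller discards the result of a function
 * declared with this attribute. */
__attribute__((warn_unused_result))
status_t parse_header(const unsigned char *data, size_t size);

status_t parse_header(const unsigned char *data, size_t size) {
    assert(data != NULL);
    return size < 4 ? 1 : 0; /* a header needs at least four bytes */
}

/* Correct caller: the status is inspected before execution continues.
 * Writing just "parse_header(data, size);" here would instead trigger
 * an "ignoring return value" diagnostic at compile time. */
status_t load_checked(const unsigned char *data, size_t size) {
    status_t error = parse_header(data, size);
    if (error != 0)
        return error;
    return 0;
}
```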
On a binary level, static analysis is more difficult, as some information is
inevitably lost during compilation. However, it is still possible to achieve by
tracking operations on the EAX or RAX registers to determine if each of the
evaluated functions has a return value, and verifying if all of their callers check
that value accordingly. A big advantage of the static approach is that it is able
to process an entire code base without the need to execute it, and hence it is not
limited by the reachable code coverage. Nonetheless, this can also be considered
a drawback, as reports regarding unchecked return values tend to be flooded
with false-positives and non-issues. In this context, the results of dynamic
analysis are easier to triage and understand, because they are supplemented
with complete information about the system state, including traces of the control
flow, the actual values returned by the functions and so on.
The execution pattern indicating potential problems in the code is straight-
forward. Excluding minor corner cases and assuming 32-bit execution mode, it
is as follows: if two instructions set the value of the EAX register and they are
separated by at least one ret instruction but no reads from EAX in between,
this suggests that the second write discards a return value that should have
been checked first. The only major problem with the above scheme is the fact
that Bochs doesn’t provide instrumentation callbacks for register operations.
On the upside, references to the emulated CPU registers are achieved through
general-purpose macros such as BX_READ_32BIT_REG and BX_WRITE_32BIT_REGZ
defined in cpu/cpu.h (Listings 35 and 36). For a demonstration of how the
macros are used in the software implementation of the mov r32, r32
instruction, see Listing 37. Thanks to this detail, we were able to introduce two
custom callbacks named bx_instr_genreg_read and bx_instr_genreg_write,
invoked on every access to any register; their prototypes are shown in Listing 38.
We subsequently added calls to these instrumentation-defined functions in all
register-related macros found in cpu/cpu.h.
With the capability of intercepting all references to EAX taking place in the
kernel, we implemented the high-level logic of the instrumentation. While
testing the tool, we learned that several instructions required special handling:
xor eax, eax had to be treated as a write operation instead of the theoretical
r/w, while movzx eax, ax and similar instructions are effectively no-ops in
the sense of our logic, even though they operate on various parts of EAX.
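
The execution pattern can be expressed as a small state machine, sketched below in C. This is an illustration of the logic only: the eax_* functions are hypothetical handlers assumed to be driven by register-access and ret events, and they do not reproduce the prototypes of the actual callbacks. Under the special cases mentioned above, xor eax, eax would be reported through eax_on_write alone, and movzx eax, ax would not be reported at all.

```c
#include <assert.h>
#include <stddef.h>

/* Per-thread state describing the most recent write to EAX. */
struct eax_tracker {
    int pending_write; /* EAX was written and has not been read since */
    int ret_seen;      /* a ret was executed after that write */
};

void eax_reset(struct eax_tracker *t) {
    t->pending_write = 0;
    t->ret_seen = 0;
}

/* Any read of EAX means the previous value was consumed. */
void eax_on_read(struct eax_tracker *t) { eax_reset(t); }

/* A ret following an unread write marks EAX as a likely return value. */
void eax_on_ret(struct eax_tracker *t) {
    if (t->pending_write)
        t->ret_seen = 1;
}

/* Returns 1 if this write discards a return value that was never read:
 * an earlier write, then at least one ret, and no read in between. */
int eax_on_write(struct eax_tracker *t) {
    assert(t != NULL);
    int discarded = t->pending_write && t->ret_seen;
    t->pending_write = 1;
    t->ret_seen = 0;
    return discarded;
}
```

An instruction that both reads and writes EAX, such as add eax, 1, would be reported as a read followed by a write and therefore never flagged.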
#define BX_READ_8BIT_REGL(index) (BX_CPU_THIS_PTR gen_reg[index].word.byte.rl)
#define BX_READ_16BIT_REG(index) (BX_CPU_THIS_PTR gen_reg[index].word.rx)
#define BX_READ_32BIT_REG(index) (BX_CPU_THIS_PTR gen_reg[index].dword.erx)
We evaluated the instrumentation against Windows 7 32-bit and collected
over 2700 unique reports of unchecked return values. Due to the excessive output
volume we were only able to review about 20% of the reports, which did not
manifest any high-severity bugs. Nevertheless, we believe the technique shows
great potential and can be successfully used to uncover new bugs, with more
effort put into reducing the number of flagged non-issues.
All of the above and many other types of API-related issues could be uncov-
ered by instrumenting known sensitive functions and validating their security-
related requirements.