Introduction To QEMU
Date: 2010/04/09, rednoah
Outline
Introduction to QEMU
QEMU
Reference
James Smith and Ravi Nair, Virtual Machines: Versatile Platforms for Systems and Processors
QEMU internals http://wiki.qemu.org/Main_Page
VM Overview
Why virtual memory?
Why a Java virtual machine?
Why virtual I/O?
Why a Virtual Private Network (VPN)?
VM Overview
Common VMs:
IBM VM/CMS
VMware GSX
Xen
Virtual PC
JVM, MS CLI (Common Language Infrastructure)
Dalvik virtual machine
IA-32 EL, Apple Rosetta, HP PA-Aries
Transmeta Crusoe
QEMU
VM Overview
Virtualization
OS: a machine is defined by the ISA
Compiler: a machine is defined by the ABI (user ISA + OS calls)
Application: a machine is defined by the API (user ISA + library calls)
VM Overview
Virtual Machines
Add virtualization software to a host platform to support a guest process or system on a VM.
VM Overview
Guest processes may intermingle with host processes
Execute applications with an ISA different from the HW platform
Couple at the ABI level via a runtime system
As a practical matter, the guest OS and host OS are often the same
  Ex. FX!32 (allows x86 Win32 programs to execute on Alpha-based systems running Windows NT)
Different OS: e.g. Wine (enables running Win32 programs on Linux)
VM Overview
Server consolidation
Secure partitioning
Fault isolation
Support for software development and deployment
Cloud computing bandwagon
VM Overview
QEMU User-mode
QEMU System-mode
Interpretation
Emulation? Simulation?
Emulation: to be you. (A method for enabling a (sub)system to present the same interface and characteristics as another.)
Simulation: to be like you.
Guest and Host: refer to platforms
Source and Target: refer to ISAs
Source ISA: the original instruction set or binary
Target ISA: the instruction set executed by the processor performing the emulation
Interpretation
Interpretation: one instruction at a time
Binary translation: one block at a time, optimized for repeated instruction executions
Interpretation
Interpretation State
target-arm/cpu.h
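A decode-and-dispatch interpreter keeps the whole emulated CPU in a state structure and handles one source instruction per loop iteration. The sketch below uses a hypothetical three-opcode toy ISA; ToyCPUState, ToyInsn, and toy_interpret are illustrative names and are not taken from target-arm/cpu.h.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical toy source ISA: three-register ADD/SUB plus HALT. */
enum { OP_ADD, OP_SUB, OP_HALT };

typedef struct {
    uint8_t op, rd, rs1, rs2;
} ToyInsn;

/* Interpreter state: everything the emulated CPU owns. */
typedef struct {
    uint32_t regs[16];
    uint32_t pc;            /* source PC, counted in instructions */
} ToyCPUState;

/* Fetch, decode, and execute one instruction per iteration.
 * Returns the number of instructions executed. */
static size_t toy_interpret(ToyCPUState *env, const ToyInsn *code, size_t len)
{
    size_t executed = 0;

    while (env->pc < len) {
        const ToyInsn *i = &code[env->pc++];   /* fetch */
        uint32_t a = env->regs[i->rs1 & 15];   /* decode operands */
        uint32_t b = env->regs[i->rs2 & 15];

        executed++;
        switch (i->op) {                       /* dispatch */
        case OP_ADD:
            env->regs[i->rd & 15] = a + b;
            break;
        case OP_SUB:
            env->regs[i->rd & 15] = a - b;
            break;
        case OP_HALT:
        default:                               /* unknown opcode: stop */
            return executed;
        }
    }
    return executed;
}
```

Every executed instruction pays for the fetch, the decode, and the switch again, which is the overhead binary translation tries to remove.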
Interpretation
Compiled Emulation
Replace each source instruction by a sequence of emulation functions in high-level language. The in-lined program can then be compiled and optimized by the compiler. After redundancy elimination optimization, the generated code may be similar to the result of static translation
e.g.
add(r1,r2,r3); sub(r4,r5,r6);
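A minimal sketch of the compiled-emulation idea, assuming the source registers are modeled as a global array r[] (an assumption made here for illustration). Each source instruction becomes one call to a small emulation function that the C compiler is free to inline and optimize.

```c
#include <stdint.h>

/* Source register file, modeled as a plain array (assumption). */
static uint32_t r[8];

/* One emulation function per source opcode. */
static inline void add(int d, int a, int b) { r[d] = r[a] + r[b]; }
static inline void sub(int d, int a, int b) { r[d] = r[a] - r[b]; }

/* The "translated" program: one call per source instruction. */
static void emulated_program(void)
{
    add(1, 2, 3);   /* source: add r1, r2, r3 */
    sub(4, 5, 6);   /* source: sub r4, r5, r6 */
}
```

After inlining, the compiler sees two plain array updates with no decode or dispatch left, which is why the result can approach statically translated code.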
Binary Translation
Generate custom code for every source instruction. For example, a load instruction in source code could be translated into a respective load instruction in native code. Get rid of repeated parsing, decoding, and jumping overhead. Register mapping is needed to reduce load/stores significantly.
Register mapping
Easier if the number of target registers > the number of source registers (e.g. translating an x86 binary to RISC)
May be done on a per-block, per-trace, or per-loop basis
If the number of target registers is not enough, infrequently used source registers may not be mapped
Binary Translation
TPC (Target PC) is different from SPC (Source PC).
For indirect branches, the registers hold source PCs, so we must provide a way to map SPCs to TPCs.
Incorrect translation! /* jump indirect through ctr, but ctr contains an SPC */
Binary Translation
Dynamic Translation
First interpret
  Perform code discovery as a byproduct
Translate code
  Incrementally, as it is discovered
  Place translated blocks into the code cache
  Save the source-to-target PC mapping in an address lookup table
Emulation process
  Execute the translated block to its end
  Look up the next source PC in the table
  If translated, jump to the target PC
  Else interpret and translate
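The lookup step of dynamic translation can be sketched as a small hash table from source PC to target PC. This is a simplified illustration, not QEMU's data structure: ToyAddrMap, the table size, and the flush-everything policy are assumptions, and "translating" a block is reduced to reserving block_size bytes in an imaginary code cache.

```c
#include <stdint.h>
#include <string.h>

#define TOY_MAP_SIZE 64u            /* hypothetical size, power of two */
#define TOY_TPC_NONE UINT32_MAX

typedef struct {
    uint32_t spc[TOY_MAP_SIZE];     /* source PCs */
    uint32_t tpc[TOY_MAP_SIZE];     /* matching offsets in the code cache */
    uint8_t  used[TOY_MAP_SIZE];
    uint32_t count;
    uint32_t next_tpc;              /* next free code cache offset */
    unsigned translations;          /* number of blocks translated so far */
} ToyAddrMap;

static uint32_t toy_hash(uint32_t spc)
{
    return (spc >> 2) & (TOY_MAP_SIZE - 1);
}

/* Return the TPC for spc, or TOY_TPC_NONE if it was never translated. */
static uint32_t toy_map_lookup(const ToyAddrMap *m, uint32_t spc)
{
    uint32_t i = toy_hash(spc);

    for (uint32_t n = 0; n < TOY_MAP_SIZE; n++) {
        if (!m->used[i])
            return TOY_TPC_NONE;
        if (m->spc[i] == spc)
            return m->tpc[i];
        i = (i + 1) & (TOY_MAP_SIZE - 1);   /* linear probing */
    }
    return TOY_TPC_NONE;
}

/* Emulation-manager step: look up spc and translate it on a miss. */
static uint32_t toy_find_or_translate(ToyAddrMap *m, uint32_t spc,
                                      uint32_t block_size)
{
    uint32_t tpc = toy_map_lookup(m, spc);
    if (tpc != TOY_TPC_NONE)
        return tpc;                         /* already translated */

    if (m->count == TOY_MAP_SIZE - 1) {     /* table full: flush everything */
        memset(m->used, 0, sizeof m->used);
        m->count = 0;
        m->next_tpc = 0;
    }

    tpc = m->next_tpc;                      /* stand-in for real translation */
    m->next_tpc += block_size;
    m->translations++;

    uint32_t i = toy_hash(spc);
    while (m->used[i])
        i = (i + 1) & (TOY_MAP_SIZE - 1);
    m->used[i] = 1;
    m->spc[i] = spc;
    m->tpc[i] = tpc;
    m->count++;
    return tpc;
}
```

The same table serves indirect branches: the register holds an SPC, and the lookup turns it into the TPC to jump to.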
Binary Translation
Control is shifted as needed between the interpreter, the EM (emulation manager), and translated blocks in the code cache.
Each component must have a way to track the SPC.
The interpreter uses the SPC directly
The interpreter passes the next SPC to the EM
A translated block passes the next SPC to the EM using JAL (jump-and-link instruction) or by mapping the SPC to a register
Binary Translation
[Figure: control flow between the emulation manager, the interpreter, the emulator, and translation blocks in the code cache, with a context switch between them; steps numbered 0 to 4, input a.out]
Binary Translation
Case 1: Both source and target machines have condition codes (CC)
  The source machine's CC must be saved/restored
Case 2: Only the source machine has CC
  The target machine must simulate the CC
  Some source machines set many CCs in one instruction
Case 3: Only the target machine has CC
  Compare-and-branch is emulated by two instructions
Case 4: Neither target nor source has CC
  No issues
Binary Translation
Some rarely used CC (V,C) or flags (e.g. parity) can be simulated using lazy evaluation. Instead of saving/restoring the CC or the flag, save the instruction and its operands and re-compute the CC/flag when it is needed
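A minimal sketch of lazy condition-code evaluation, under stated assumptions: only 32-bit add and subtract are modeled, carry on subtraction follows the ARM convention (C set when there is no borrow), and all names (LazyCC, emu_add, cc_carry, ...) are hypothetical. The emulated instruction records its operation and operands; each flag is recomputed only when something asks for it.

```c
#include <stdint.h>

typedef enum { CC_OP_ADD, CC_OP_SUB } CCOp;

/* What the last flag-setting instruction did; the flags are not stored. */
typedef struct {
    CCOp     op;
    uint32_t src1, src2, result;
} LazyCC;

static uint32_t emu_add(LazyCC *cc, uint32_t a, uint32_t b)
{
    cc->op = CC_OP_ADD; cc->src1 = a; cc->src2 = b; cc->result = a + b;
    return cc->result;
}

static uint32_t emu_sub(LazyCC *cc, uint32_t a, uint32_t b)
{
    cc->op = CC_OP_SUB; cc->src1 = a; cc->src2 = b; cc->result = a - b;
    return cc->result;
}

/* Cheap flags come straight from the result. */
static int cc_zero(const LazyCC *cc)     { return cc->result == 0; }
static int cc_negative(const LazyCC *cc) { return (cc->result >> 31) & 1; }

/* Carry: recomputed on demand from the saved operands. */
static int cc_carry(const LazyCC *cc)
{
    if (cc->op == CC_OP_ADD)
        return cc->result < cc->src1;   /* unsigned wraparound occurred */
    return cc->src1 >= cc->src2;        /* ARM style: set when no borrow */
}

/* Signed overflow: recomputed on demand from the saved operands. */
static int cc_overflow(const LazyCC *cc)
{
    uint32_t a = cc->src1, b = cc->src2, res = cc->result;

    if (cc->op == CC_OP_ADD)            /* same-sign inputs, different-sign result */
        return ((~(a ^ b) & (a ^ res)) >> 31) & 1;
    return (((a ^ b) & (a ^ res)) >> 31) & 1;
}
```

If the next flag-setting instruction runs before any flag is read, the saved record is simply overwritten and no flag is ever computed.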
Binary Translation
CC Optimizations
Combine compare with branch
  ARM may use two instructions for a branch: a compare (or a TST or TEQ) instruction followed by a branch.
  For some simple cases, MIPS can simply use a compare-and-branch instruction.
  In rare cases, the translated code can have even fewer instructions than the original ARM code.
Map each flag to a dedicated register
  Example: N: R17, Z: R18, C: R19, V: R20
  This can reduce the instruction overhead of extracting/depositing target flags from/to the CPSR (Current Program Status Register).
  If the target architecture has a sufficient number of registers, this optimization should be considered. Otherwise, it may take away three more registers and cause register spilling.
Binary Translation
Process VM
Perform guest/host mapping at the ABI (ISA + system calls) level
Encapsulate the guest process in a process-level runtime
Example: QEMU Linux user mode
Issues
  Memory architecture
  Exception architecture
  OS call emulation
  Overall VM architecture
  High-performance implementation
  System environments
Process VM Implementation
Process VM
Loader
  A special loader writes the guest code and data into a region holding the guest's memory image, and loads the runtime code into memory.
Initialization
  Allocate memory for the code cache and other tables
  Initialize runtime data structures and invoke the OS to establish signal handlers
Emulation engine
  Emulate guest instructions with an interpreter or binary translation
  Code cache: which translations to flush?
OS Call Emulator
Exception Emulator
Process VM
Compatibility
  A strict definition of compatibility (e.g. bug-for-bug compatible) would exclude many useful process VMs.
Intrinsic compatibility
  Any software, even if written by the most devious programmer, will work in a compatible way
  Example: Intel strives for intrinsic compatibility when it produces a new x86 microprocessor
Extrinsic compatibility
  Many useful VM applications do not achieve intrinsic compatibility
  Limited application set: e.g. run Microsoft productivity tools (Office)
Process VM
Compatibility issues
State mapping: if the guest process uses all of the virtual address space, intrinsic compatibility cannot be achieved
Mapping of control transfers: some potentially trapping instructions may be removed
User-level instructions: the FP format may be different
OS operations: the host OS does not support exactly the same functions as the guest's native OS
Process VM
Software maintains the mapping table
Similar to a hardware page table/TLB
Slow, but always works
QEMU Overview
Faster than cycle-accurate simulators.
Good enough to use applications written for another CPU.
Just-in-time (JIT) compilation to achieve high performance (400 ~ 500 MIPS)
Supports many peripherals (VGA, serial, Ethernet, etc.)
Supports many hosts and targets (full system emulation)
  x86, arm, mips, sh4, cris, sparc, powerpc, nds32, ...
  qemu/hw/* contains all of the supported boards.
User mode emulation: can run applications compiled for another CPU.
QEMU overview
Update status
0.9.1 (Jan 6, 2008): stable release, followed by a long pause
0.10 (Mar 5, 2009): TCG support (a new general JIT framework)
0.11 (Sep 24, 2009): KVM support
0.12: more KVM support; code refactoring; new peripheral framework to support dynamic board configuration
QEMU JIT
TCG (Tiny Code Generator) began as a generic backend for a C compiler. It was simplified to be used in QEMU.
A TCG "basic block" corresponds to a list of instructions terminated by a branch instruction.
QEMU JIT
Prologue, Epilogue
QEMU JIT
cpu_exec() is called each time around the main loop.
The program executes until an unchained block is encountered.
Control returns to cpu_exec() through the epilogue.
Entering the code cache:
  Linux: set the buffer executable, then jump to the buffer and execute
tcg_liveness_analysis
Remove dead code.
Ex. and_i32 t0, t0, $0xffffffff
Ex. add_i32 t0, t1, t2
    add_i32 t0, t0, $1
    mov_i32 t0, $1
    (only the last instruction is kept)
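The dead-code part of a liveness pass can be sketched as a single backward walk over the ops of one block. This is a simplified illustration and not the code of tcg_liveness_analysis: ToyOp and toy_liveness are hypothetical, ops are assumed to have no side effects, and live_out is a bit mask of the temporaries still needed after the block.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    int dst;          /* temporary written by the op */
    int src1, src2;   /* temporaries read, or -1 for a constant/unused slot */
    int dead;         /* set by the pass */
} ToyOp;

/* Walk the block backwards, tracking which temporaries are live.
 * An op whose destination is not live is marked dead.
 * Returns the number of ops marked dead. */
static size_t toy_liveness(ToyOp *ops, size_t n, uint32_t live_out)
{
    uint32_t live = live_out;
    size_t removed = 0;

    for (size_t i = n; i-- > 0; ) {
        ToyOp *op = &ops[i];

        if (!(live & (1u << op->dst))) {
            op->dead = 1;               /* result is never read */
            removed++;
            continue;
        }
        op->dead = 0;
        live &= ~(1u << op->dst);       /* value is defined here */
        if (op->src1 >= 0)
            live |= 1u << op->src1;     /* inputs become live above */
        if (op->src2 >= 0)
            live |= 1u << op->src2;
    }
    return removed;
}
```

For the sequence add t0, t1, t2; add t0, t0, $1; mov t0, $1 with t0 live at the end, the walk keeps the mov and marks both adds dead, because the mov overwrites t0 without reading it.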
Register mapping
register struct CPUNDS32State *env asm("r14");  /* CPU state pointer pinned to a host register */
register target_ulong T0 asm("r15");
register target_ulong T1 asm("r12");
register target_ulong T2 asm("r13");
Block chaining: avoid context-switch overhead
Every time a block returns, try to chain it.
tb_add_jump(): back-patch the native jump address
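Block chaining can be sketched with a pointer in place of a patched native jump. In this simplified model (ToyTB, toy_tb_add_jump, and toy_cpu_exec are hypothetical names, and each block has exactly one successor), the dispatcher links a block to its successor the first time it sees the transition; later runs follow the link without returning to the dispatcher.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct ToyTB {
    uint32_t pc;              /* guest PC of the block */
    uint32_t next_pc;         /* guest PC the block jumps to */
    struct ToyTB *jmp_next;   /* chained successor, NULL = return to dispatcher */
} ToyTB;

/* Stand-in for back-patching the native jump address. */
static void toy_tb_add_jump(ToyTB *tb, ToyTB *next)
{
    tb->jmp_next = next;
}

static ToyTB *toy_tb_find(ToyTB *tbs, size_t n, uint32_t pc)
{
    for (size_t i = 0; i < n; i++)
        if (tbs[i].pc == pc)
            return &tbs[i];
    return NULL;
}

/* Execute `steps` blocks starting at pc.
 * Returns how many times control came back to the dispatcher. */
static unsigned toy_cpu_exec(ToyTB *tbs, size_t n, uint32_t pc, unsigned steps)
{
    unsigned dispatches = 0;
    ToyTB *prev = NULL;

    while (steps > 0) {
        ToyTB *tb = toy_tb_find(tbs, n, pc);   /* slow path: lookup */
        if (!tb)
            break;
        dispatches++;
        if (prev)
            toy_tb_add_jump(prev, tb);         /* chain prev -> tb */

        for (;;) {                             /* run inside the "code cache" */
            steps--;
            pc = tb->next_pc;
            prev = tb;
            if (steps == 0 || !tb->jmp_next)
                break;                         /* unchained: back to dispatcher */
            tb = tb->jmp_next;
        }
    }
    return dispatches;
}
```

With two blocks that jump to each other, the first run of ten steps needs three trips through the dispatcher to build the chain, and a second run of ten steps needs only one.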
Based on QEMU 0.10.5, emulating MIPS (little endian)
decode_opc translates MIPS assembly into micro-ops
Translation stops when a conditional branch is encountered.
gen_store_gpr stores the value to the emulated CPU's general-purpose register.
target-mips/translate.c
/qemu/Tcg.c
tcg_gen_code
opc
TCG emits 0xe8, the opcode of a call instruction on x86. It calls one of the functions in the array qemu_ld_helpers. The arguments to the function are passed in registers EAX, EDX, and ECX.
pc+4 (s->code_ptr += 4)
address.
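On x86, opcode 0xe8 is followed by a 32-bit displacement measured from the end of the instruction, which is why the code pointer advances by 4 after the displacement is written. The sketch below encodes such a call into a byte buffer. It is an illustration rather than TCG's emitter: toy_emit_call is a hypothetical helper, and addresses are passed as plain 32-bit numbers so the encoding can be checked without executable memory.

```c
#include <stdint.h>

/* Emit `call rel32` at code_ptr.
 * code_addr:   address at which the first byte (0xe8) will execute
 * target_addr: address of the function being called
 * Returns the pointer just past the emitted instruction. */
static uint8_t *toy_emit_call(uint8_t *code_ptr, uint32_t code_addr,
                              uint32_t target_addr)
{
    /* The displacement is relative to the next instruction:
     * 1 opcode byte + 4 displacement bytes after code_addr. */
    uint32_t rel = target_addr - (code_addr + 1 + 4);

    *code_ptr++ = 0xe8;                     /* opcode: call rel32 */
    code_ptr[0] = (uint8_t)(rel & 0xff);    /* little-endian displacement */
    code_ptr[1] = (uint8_t)((rel >> 8) & 0xff);
    code_ptr[2] = (uint8_t)((rel >> 16) & 0xff);
    code_ptr[3] = (uint8_t)((rel >> 24) & 0xff);
    return code_ptr + 4;                    /* pointer advances past rel32 */
}
```

A call from 0x1000 to 0x2000 therefore encodes as e8 fb 0f 00 00, since 0x2000 - 0x1005 = 0xffb; a backward call produces a negative displacement in two's complement.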
QEMU needs to find the PhysPageDesc entry in the table l1_phys_map (PhysPageDesc **) and get the phys_offset.
  guest_phy_addr[31:22]: first-level entry
  guest_phy_addr[21:12]: second-level entry
If the page is not found:
  cpu_register_physical_memory: QEMU creates a new entry (by mmap), updates its value, and inserts the entry into the l1_phys_map table.
Translate the guest physical address to a host virtual address.
  phys_offset == IO_MEM_RAM: guest RAM space
  phys_offset[31:12]: the offset of this page in emulated physical memory
  phys_offset + phys_ram_base = host virtual address
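The two-level lookup can be sketched as follows, assuming a 32-bit guest physical address split 10/10/12. This is not QEMU's implementation: the toy_ names are hypothetical, second-level tables are allocated with malloc instead of mmap, and the I/O versus RAM distinction carried by phys_offset is left out.

```c
#include <stdint.h>
#include <stdlib.h>

#define TOY_L2_SIZE    1024u        /* 10 bits: guest_phy_addr[21:12] */
#define TOY_L1_SIZE    1024u        /* 10 bits: guest_phy_addr[31:22] */
#define TOY_PAGE_MASK  0xfffu       /* 12-bit offset inside a page */
#define TOY_UNASSIGNED UINT32_MAX   /* never a page-aligned offset */

typedef struct {
    uint32_t phys_offset;           /* page offset in emulated RAM */
} ToyPhysPageDesc;

static ToyPhysPageDesc *toy_l1_map[TOY_L1_SIZE];

/* Find the descriptor for paddr; allocate the second level if asked to. */
static ToyPhysPageDesc *toy_page_find(uint32_t paddr, int alloc)
{
    uint32_t l1 = paddr >> 22;
    uint32_t l2 = (paddr >> 12) & (TOY_L2_SIZE - 1);
    ToyPhysPageDesc *tbl = toy_l1_map[l1];

    if (!tbl) {
        if (!alloc)
            return NULL;
        tbl = malloc(sizeof(*tbl) * TOY_L2_SIZE);
        if (!tbl)
            return NULL;
        for (uint32_t i = 0; i < TOY_L2_SIZE; i++)
            tbl[i].phys_offset = TOY_UNASSIGNED;
        toy_l1_map[l1] = tbl;
    }
    return &tbl[l2];
}

/* Register one guest physical page as RAM at the given offset. */
static void toy_register_ram(uint32_t paddr, uint32_t phys_offset)
{
    ToyPhysPageDesc *p = toy_page_find(paddr, 1);
    if (p)
        p->phys_offset = phys_offset & ~TOY_PAGE_MASK;
}

/* Guest physical address -> host virtual address, or NULL if unmapped. */
static uint8_t *toy_phys_to_host(uint8_t *phys_ram_base, uint32_t paddr)
{
    ToyPhysPageDesc *p = toy_page_find(paddr, 0);

    if (!p || p->phys_offset == TOY_UNASSIGNED)
        return NULL;
    return phys_ram_base + p->phys_offset + (paddr & TOY_PAGE_MASK);
}
```

The second level is allocated only for regions that are actually registered, so a sparse guest address space costs one small first-level array plus a table per used 4 MB region.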
Original way
1. Translate the guest virtual address to a guest physical address.
2. QEMU then finds the PhysPageDesc entry in the table l1_phys_map and gets the phys_offset.
3. phys_offset + phys_ram_base = host virtual address
TLB way
1. Search the TLB first.
2. Hit: guest_virtual_address + addend = host_virtual_address.
3. Miss: search the l1_phys_map table, then fill the corresponding entry into the TLB table.
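The hit and miss paths of a software TLB can be sketched with a direct-mapped table whose entries store an addend, so that a hit costs one compare and one add. The names below are hypothetical, and toy_slow_lookup is a placeholder for the real page-table walk: it simply pretends every guest page lives at a fixed host base.

```c
#include <stdint.h>

#define TOY_TLB_SIZE    256u
#define TOY_PAGE_BITS   12
#define TOY_PAGE_MASK   0xfffu
#define TOY_TLB_INVALID 0xffffffffu   /* not page aligned, so never matches */

typedef struct {
    uint32_t  vpage;    /* page-aligned guest virtual address */
    uintptr_t addend;   /* host address minus guest virtual address */
} ToyTLBEntry;

typedef struct {
    ToyTLBEntry e[TOY_TLB_SIZE];
    unsigned hits, misses;
} ToyTLB;

typedef uintptr_t (*ToySlowLookup)(uint32_t vpage);

/* Placeholder slow path (assumption): guest page v lives at host 0x10000000 + v.
 * Returns 0 for an unmapped page. */
static uintptr_t toy_slow_lookup(uint32_t vpage)
{
    return (uintptr_t)0x10000000u + vpage;
}

static void toy_tlb_flush(ToyTLB *t)
{
    for (uint32_t i = 0; i < TOY_TLB_SIZE; i++)
        t->e[i].vpage = TOY_TLB_INVALID;
    t->hits = t->misses = 0;
}

/* Guest virtual address -> host address, or 0 if the page is unmapped. */
static uintptr_t toy_tlb_translate(ToyTLB *t, uint32_t vaddr, ToySlowLookup slow)
{
    uint32_t vpage = vaddr & ~TOY_PAGE_MASK;
    ToyTLBEntry *e = &t->e[(vaddr >> TOY_PAGE_BITS) & (TOY_TLB_SIZE - 1)];

    if (e->vpage != vpage) {                 /* miss: walk and refill */
        uintptr_t host_page = slow(vpage);
        if (!host_page)
            return 0;
        t->misses++;
        e->vpage = vpage;
        e->addend = host_page - (uintptr_t)vpage;
    } else {
        t->hits++;                           /* hit: compare and add only */
    }
    return (uintptr_t)vaddr + e->addend;
}
```

Storing the addend rather than the host page address means the page offset never has to be split out and recombined on the fast path.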
[Figure: flowchart with a "No" branch leading to "Translate one TB"]
register struct CPUARMState *env asm("r14");  /* CPU state pointer pinned to a host register */
register target_ulong T0 asm("r15");
register target_ulong T1 asm("r12");
register target_ulong T2 asm("r13");
qemu/
  qemu-* : OS-dependent API wrappers (e.g. memory allocation or sockets)
  target-*/ : target porting
  tcg/ : new and unified JIT framework
  *-user/ : user-mode emulation on different OSes
  softmmu-* : target MMU acceleration framework
  hw/ : peripheral models
  fpu : softfloat FPU emulation library
  gdb : GDB stub implementation
TranslationBlock structure in translate-all.h
Translation cache is code_gen_buffer in exec.c
cpu_exec() in cpu-exec.c orchestrates translation and block chaining.
vl.c: main loop for system emulation.
Sample Demo
Using gdb to debug QEMU
Using QEMU to debug a guest OS
QEMU Linux user-mode emulation
QEMU system-mode emulation