Which instructions are sent to the VM versus the small kernel? Can you go back through the trap path that a virtual machine needs to generate?

More virtual machine confusion

So, let's say that we have the following (dig the ASCII art?):

+-----+      +-----+
| App |      | App |
+-----+      +-----+
| OS1 |      | OSn |
+-----+      +-----+
| VM1 |      | VMn |
+-----+------+-----+
|   Small Kernel   |
+------------------+
|  Bare Hardware   |
+------------------+

This is basically the same as http://www.cs.princeton.edu/courses/archive/fall02/cs318/lec4/slide16.html

I'll first explain the "logical" basis for what the VM does - this is a simplified explanation, but is basically correct.

Now, assume that App1 (the application running on OS1) wants to execute a system call. It goes through the standard steps, which are to push arguments onto the stack and then execute a software interrupt. This interrupt is delivered to the bare hardware, which then passes it to the small kernel. The small kernel only knows about the existence of the VMs, so VM1 gets told that a software interrupt was generated by something "owned" by it. At this point, VM1 knows it has to dispatch the interrupt to OS1, and OS1 is unmodified. So VM1 sets up the stack the same way that the small kernel would have seen it, and then jumps into OS1's interrupt handler. Since OS1 is assumed to be unmodified, it has to provide an array of interrupt handler pointers, so VM1 knows where to jump.

Now remember - this code is what OS1 would be using if it were handling real interrupts, and the exit point of this code has to be the special "return from interrupt" (IRET) instruction, rather than the "return from function" instruction. One might expect that "return from interrupt" is a privileged instruction, since regular user-space code isn't supposed to be handling interrupts directly. So, when the OS1 interrupt code finishes, it tries to execute IRET, which should cause another trap (but an illegal instruction trap this time).
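The path up to this point, from the application's software interrupt to the IRET that traps, can be sketched as a small simulation. This is only a toy model under stated assumptions: the class names, the `trace` list, and the vector number 0x80 are all invented for illustration and do not come from any real VM system.

```python
# Toy model of the trap path; all names here are hypothetical illustrations.

SYSCALL_VECTOR = 0x80  # assumed vector number for the system-call interrupt

trace = []  # records, in order, who handled the trap


class GuestOS:
    """Unmodified guest OS: it just supplies a table of handler pointers."""

    def __init__(self, name):
        self.name = name
        self.interrupt_table = {SYSCALL_VECTOR: self.syscall_handler}

    def syscall_handler(self, stack):
        trace.append(f"{self.name}: handling syscall with args {stack}")
        # A real handler ends with IRET. Run unprivileged, that instruction
        # traps, which this model reports by returning its name.
        return "IRET"


class VirtualMachine:
    """Knows which guest OS it hosts and how to enter its handlers."""

    def __init__(self, name, guest):
        self.name = name
        self.guest = guest

    def deliver(self, vector, stack):
        trace.append(f"{self.name}: dispatching vector {vector:#x} to {self.guest.name}")
        handler = self.guest.interrupt_table[vector]
        # Hand the guest a copy of the stack, laid out as the kernel saw it.
        return handler(list(stack))


class SmallKernel:
    """Only knows about VMs, not about the OSes or apps inside them."""

    def __init__(self, current_vm):
        self.current_vm = current_vm

    def trap(self, vector, stack):
        trace.append("small kernel: forwarding trap to the running VM")
        return self.current_vm.deliver(vector, stack)


def software_interrupt(kernel, vector, stack):
    """Stands in for the bare hardware passing the interrupt to the kernel."""
    trace.append("hardware: software interrupt raised")
    return kernel.trap(vector, stack)


kernel = SmallKernel(VirtualMachine("VM1", GuestOS("OS1")))
pending = software_interrupt(kernel, SYSCALL_VECTOR, ["arg0", "arg1"])
print("\n".join(trace))
print("guest handler ended with:", pending)
```

The printed trace lists the layers in the order the interrupt passes through them: hardware, small kernel, VM1, OS1. The returned "IRET" marks the point where the guest's handler would trap again.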
This illegal-instruction trap goes to the bare hardware, which passes it to the small kernel, which hands it off to VM1, which then realizes that OS1 is trying to exit from the interrupt handler. VM1 then cleans up OS1's stack in the same way that IRET would, and then does the actual "return from interrupt" so that App1 picks up where it normally would after the system call.

What's wrong with the above "simplified" explanation?

It's logically correct, but you may not want to do this in real life. There are several reasons, all relating to the ugly details.

Remember that OS1 is not "trusted" by the virtual machine system. What this means is that between VM1 and OS1, there should be a switch back from kernel mode to user mode, and we didn't show that in the above description.

Also note that in the above description, all of the code in the OS1 interrupt handler runs while the "real" interrupt is still being handled. What happens if OS1 has an infinite loop in its interrupt handler? This would be a bug in OS1, but it would also tie up the entire machine.

So how do you get around this? Basically, the VM system has to handle the "real" interrupt itself, and then later schedule the "fake" interrupt for OS1. In this way, the "fake" interrupt is really being done as just another part of user-mode processing, and if something goes wrong in the interrupt handler there, it can't take down the whole system - it would be just like any other process crashing or looping forever.

So how is this kind of VM different from the Java kind of VM?

In the IBM style of VM, the actual hardware and instruction set of the machine is what's being virtualized. This is the same approach that VMware takes, allowing it to run an off-the-shelf version of Windows under it. The Java style of VM presents a "standard" machine to the programs, no matter what the underlying physical machine really is.
They call it a "virtual" machine, not because it "virtualizes" the underlying machine, but because it presents a machine that's never really existed in a physical sense. The IBM VM approach, on the other hand, makes a copy of ("virtualizes") the physical machine on which it's running, and that machine is what the OSs see.
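
To make the Java-style idea concrete, here is a minimal sketch of a made-up stack machine. The instruction names (`push`, `add`, `mul`) and the `run` function are assumptions chosen for illustration; they are not Java bytecode. The point is only that the "machine" the program targets exists nowhere in hardware, so the same program runs wherever the interpreter does.

```python
# A hypothetical stack machine whose instruction set matches no physical CPU.
def run(program):
    """Interpret a list of (opcode, *operands) tuples and return the result."""
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(args[0])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown instruction: {op}")
    return stack.pop()


# Computes (2 + 3) * 4 on the imaginary machine.
program = [("push", 2), ("push", 3), ("add",), ("push", 4), ("mul",)]
print(run(program))  # prints 20
```

An IBM-style VM would have no interpreter loop like this for ordinary instructions: the guest's code runs directly on the real processor, and the VM only steps in when an instruction traps.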