[Virtualization] Virtualizing Memory

TLB Miss Flow with Software-Managed TLB

  • Without virtualization: extract the VPN from the VA, look up the VPN-PFN mapping, and fill the TLB
  • With virtualization: extract the VPN from the VA; the VMM interposes on the miss and rewrites the VPN-PFN mapping into a VPN-MFN mapping before filling the TLB
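
To make the interposition concrete, here is a minimal C sketch (types, table layouts, and the tlb_fill hook are all made up for illustration): the guest-visible VPN-to-PFN result is composed with the VMM's PPN-to-MFN pmap, so the hardware TLB only ever holds VPN-to-MFN entries.

    #include <stdint.h>

    /* Illustrative only: flat arrays stand in for the guest page table and
     * the VMM's pmap; real systems use multi-level structures.            */
    #define NUM_PAGES 1024
    static uint64_t guest_pt[NUM_PAGES];   /* VPN -> PPN (guest's view)    */
    static uint64_t vmm_pmap[NUM_PAGES];   /* PPN -> MFN (VMM's view)      */

    /* Hypothetical hook that writes one entry into the hardware TLB. */
    static void tlb_fill(uint64_t vpn, uint64_t mfn)
    {
        (void)vpn; (void)mfn;              /* privileged TLB write         */
    }

    /* On a TLB miss the VMM gets control last: guest handler gives
     * VPN -> PPN, the VMM adds PPN -> MFN, and the TLB gets VPN -> MFN.   */
    void vmm_tlb_miss(uint64_t vaddr)
    {
        uint64_t vpn = vaddr >> 12;        /* assume 4 KiB pages           */
        uint64_t ppn = guest_pt[vpn];      /* result of guest's handler    */
        uint64_t mfn = vmm_pmap[ppn];      /* VMM translates PPN to MFN    */
        tlb_fill(vpn, mfn);                /* TLB never sees guest PPNs    */
    }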

Difficulty in Virtualizing Hardware-Managed TLB

  • Hardware-managed TLB
    – Hardware does page table walk on each TLB miss and fills TLB with the found PTE
  • Hypervisor doesn't get a chance to intercept on TLB misses
  • Solution-1: shadow paging
  • Solution-2: direct paging (para-virtualization)
  • Solution-3: new hardware

Shadow Paging

  • VMM intercepts the guest OS setting its (virtual) CR3
  • VMM iterates over the guest page table, constructs a corresponding shadow page table
  • In shadow PT, every guest physical address is translated into host physical address (machine address)
  • Finally, VMM sets the real CR3 to point to the shadow page table

The guest can't be allowed to access the hardware page tables directly, because that would effectively give it control of the machine. So, while a guest is executing, the hypervisor keeps the "real" mappings (guest virtual VPN -> host machine MFN) in the hardware page tables, and keeps its representation of the page tables the guest thinks it is using "in the shadows."
This collapses the VPN -> PFN -> MFN chain into a single hardware translation. As far as page faults go, nothing changes from the hardware's point of view (remember, the hypervisor ensures the page tables the hardware walks contain VPN -> MFN mappings): a page fault simply raises an exception and vectors to the registered handler. However, when a page fault occurs while a VM is running, that exception can be "forwarded" to the hypervisor, which then handles it appropriately.
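
A rough sketch of that construction under simplifying assumptions (single-level tables and a flat pmap array; names are illustrative): the VMM walks the guest table and rewrites every present PPN into the corresponding MFN.

    #include <stdint.h>
    #include <stddef.h>

    #define PTE_PRESENT 0x1ULL
    #define NUM_PTES    512               /* entries per (single-level) table */

    /* Flat array standing in for the VMM's pmap: PPN -> MFN. */
    static uint64_t pmap[NUM_PTES];

    /* Build a shadow table from a guest table: same VPN index, but every
     * present entry's PPN is replaced by the machine frame number (MFN),
     * so the hardware walker only ever sees VPN -> MFN mappings.           */
    void build_shadow_table(const uint64_t *guest_pt, uint64_t *shadow_pt)
    {
        for (size_t i = 0; i < NUM_PTES; i++) {
            uint64_t gpte = guest_pt[i];
            if (!(gpte & PTE_PRESENT)) {
                shadow_pt[i] = 0;               /* not present in guest PT  */
                continue;
            }
            uint64_t ppn   = gpte >> 12;        /* guest physical frame     */
            uint64_t mfn   = pmap[ppn];         /* host machine frame       */
            uint64_t flags = gpte & 0xFFFULL;   /* copy permission bits     */
            shadow_pt[i]   = (mfn << 12) | flags;
        }
    }
    /* Finally, the VMM loads the real CR3 with the machine address of shadow_pt. */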

Question

  • Assume that:
    – There are 10 VMs running on a machine
    – Each VM contains 10 applications
  • How many shadow page tables in total? -> 110?
    – Shadow page tables are per application
    – Guest page tables are per application
    – pmaps are per VM

What if Guest OS Modifies Its Page Table?

  • Should not allow it to happen directly
    – Since CR3 is not pointing to the shadow page table
    – Need to synchronize the shadow page table with guest page table
  • VMM needs to intercept when the guest OS modifies its page table, and update the shadow page table accordingly
    1. Mark the guest table pages as read-only (in the shadow page table)
    2. If the guest OS tries to modify its page tables, it triggers a page fault
    3. VMM handles the page fault by updating the shadow page table
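
A hedged sketch of this interception path (helper names are invented): because the guest's page-table pages are mapped read-only in the shadow table, a guest write to its own PTEs faults into the VMM, which emulates the write and refreshes the affected shadow entry.

    #include <stdint.h>
    #include <stdbool.h>

    /* Invented helpers standing in for real VMM internals. */
    extern bool     is_guest_pt_page(uint64_t guest_va);
    extern uint64_t emulate_guest_write(uint64_t guest_va);  /* returns new gPTE */
    extern void     shadow_update(uint64_t guest_va, uint64_t new_gpte);
    extern void     forward_fault_to_guest(uint64_t guest_va);

    /* Called when a write fault hits a read-only shadow mapping. */
    void vmm_write_fault(uint64_t faulting_va)
    {
        if (is_guest_pt_page(faulting_va)) {
            /* Guest tried to modify its own page table: emulate the write
             * against the guest table, then resync the shadow entry.      */
            uint64_t new_gpte = emulate_guest_write(faulting_va);
            shadow_update(faulting_va, new_gpte);
        } else {
            /* An ordinary protection fault: let the guest OS handle it.   */
            forward_fault_to_guest(faulting_va);
        }
    }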

Dealing with Page Faults

  • When page fault occurs, traps to VMM
  • If present bit is 0 in the guest page table entry, guest OS needs to handle the fault
    – Guest OS loads the page from its virtual disk into guest physical memory and sets the present bit to 1
    – Guest OS returns from page fault, which traps into VMM again
    – VMM sees that present is 1 in guest PTE and creates entry in shadow page table
    – VMM returns from the original page fault
  • If the present bit is 1: the guest OS thinks the page is present (but the VMM may have swapped it out), so the VMM handles the fault transparently
    – VMM locates the corresponding physical page, loads it in memory if needed
    – VMM creates entry in shadow page table
    – VMM returns from the original page fault
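
The two cases above can be summarized in a short sketch (the guest-PTE accessors and resident-page helper are hypothetical):

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical VMM helpers. */
    extern uint64_t guest_pte_of(uint64_t guest_va);           /* walk guest PT   */
    extern bool     pte_present(uint64_t pte);
    extern void     inject_fault_into_guest(uint64_t guest_va);
    extern uint64_t ensure_resident(uint64_t ppn);             /* swap in -> MFN  */
    extern void     shadow_map(uint64_t guest_va, uint64_t mfn, uint64_t flags);

    void vmm_page_fault(uint64_t guest_va)
    {
        uint64_t gpte = guest_pte_of(guest_va);

        if (!pte_present(gpte)) {
            /* True guest fault: the guest OS must page it in from its
             * virtual disk; a second fault will arrive after it returns.  */
            inject_fault_into_guest(guest_va);
            return;
        }

        /* Present in the guest PTE: the VMM may have swapped the machine
         * page out, or the shadow entry simply isn't there yet. Handle it
         * transparently and resume the guest without involving it.        */
        uint64_t ppn = gpte >> 12;
        uint64_t mfn = ensure_resident(ppn);
        shadow_map(guest_va, mfn, gpte & 0xFFFULL);
    }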

What if a Guest App Accesses Kernel Memory?

  • How do we selectively allow/deny access to kernel-only pages?
  • One solution: split a shadow page table into two tables
    – Two shadow page tables, one for user, one for kernel
    – When the guest OS switches to a guest application, the VMM switches the shadow page table as well (and vice versa), as sketched below
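
A minimal illustration of that split (structure and hook names are made up): the VMM keeps two shadow roots per guest address space and swaps the real CR3 whenever it observes a guest user/kernel transition.

    #include <stdint.h>

    struct shadow_space {
        uint64_t user_cr3;   /* shadow PT with kernel pages marked not-present */
        uint64_t kern_cr3;   /* shadow PT with kernel pages accessible         */
    };

    /* Hypothetical privileged write of the real CR3. */
    extern void write_real_cr3(uint64_t cr3);

    /* Called by the VMM when the guest switches privilege level. */
    void on_guest_mode_switch(struct shadow_space *ss, int to_kernel_mode)
    {
        write_real_cr3(to_kernel_mode ? ss->kern_cr3 : ss->user_cr3);
    }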

What about Memory for Translation Cache (BT)?

  • Translation cache intermingles guest and monitor memory accesses
    – Need to distinguish these accesses
    – Monitor accesses have full privileges
    – Guest accesses have lesser privileges
  • On x86 can use segmentation
    – Monitor lives in high memory
    – Guest segments truncated to allow no access to monitor
    – Binary translator uses guest segments for guest accesses and monitor segments for monitor accesses

Pros and Cons of Shadow Paging

  • Pros
    – When shadow PT is established, memory accesses are very fast
  • Cons
    – Maintaining consistency between guest PTs and shadow PTs involves VMM traps and can be costly
    – TLB flush on every “world switch”
    – Memory space overhead to maintain pmap

Hardware-Assisted Memory Virtualization

  • Hardware support for memory virtualization
    – Intel EPT (Extended Page Table) and AMD NPT (Nested Page Table)
    – EPT: a per-VM table translating PPN -> MPN, referenced by the EPT base pointer
    – EPT controlled by the hypervisor, guest page table (GPT) controlled by guest OS (both exposed to hardware)
    – Hardware directly walks GPT + EPT (for each PPN access during GPT walk, needs to walk the EPT to determine MPN)
    – No VM exits due to page faults, INVLPG, or CR3 accesses
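
A much-simplified sketch of the two-dimensional walk (single-level tables for brevity; real GPTs and EPTs are 4-level radix trees): every guest-physical frame touched during the guest walk, including the final data frame, is itself translated through the EPT.

    #include <stdint.h>

    #define ENTRIES 512
    static uint64_t ept[ENTRIES];       /* PPN -> MPN, owned by the hypervisor */
    static uint64_t guest_pt[ENTRIES];  /* VPN -> PPN, owned by the guest OS   */

    /* Hardware-style nested translation: VA -> (guest PT) -> PA -> (EPT) -> MA.
     * Illustrative single-level version of what EPT-capable MMUs do.          */
    uint64_t nested_translate(uint64_t vaddr)
    {
        uint64_t vpn  = vaddr >> 12;
        uint64_t gpte = guest_pt[vpn];   /* in reality, each guest-table       */
        uint64_t ppn  = gpte >> 12;      /* access also goes through the EPT   */
        uint64_t mpn  = ept[ppn] >> 12;  /* final PPN -> MPN via EPT           */
        return (mpn << 12) | (vaddr & 0xFFF);
    }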

Pros and Cons of EPT

  • Pros
    – Simplified VMM design (all handled by hardware)
    – Guest PT changes do not trap, minimize VM exits
    – Lower memory space overhead (no need for pmap in memory)
  • Cons
    – TLB miss is costly: it can involve many memory accesses to finish the nested walk!
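
As a rough worked example of that cost, assuming 4-level guest page tables and a 4-level EPT: each of the 4 guest-table references needs its own EPT walk (4 loads) plus the guest entry load itself, and the final guest PPN needs one more EPT walk, giving roughly 4 x (4 + 1) + 4 = 24 memory references on a worst-case TLB miss, versus 4 for a native walk.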

Reclaiming Memory

  • ESX (and other hypervisors) allow overcommitment of memory
    – Total memory size of all VMs can exceed actual machine memory size
    – ESX must have some way to reclaim memory from VMs (and swap to disk)
  • Traditional: add transparent swap layer
    – Requires “meta-level” decisions: which page from which VM to swap
    – Best data to guide decision known only by guest OS
    – Guest and meta-level policies may clash, resulting in double paging
  • Alternative: implicit cooperation
    – Coax guest OS into doing its own page replacement
    – Avoid meta-level policy decisions

Ballooning Details

  • Guest drivers
    – Inflate: driver allocates pinned guest pages (PPNs); the VMM reclaims the backing MPNs (see the sketch after this list)
    – Use standard Windows/Linux/BSD kernel APIs
  • Performance benchmark
    – Linux VM, memory-intensive dbench workload
    – Compares a 256MB VM with balloon sizes of 32-128MB against VMs statically configured with the correspondingly smaller sizes
    – Overhead 1.4%-4.4%
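
As referenced above, a hedged guest-side sketch of the inflate path, assuming a generic kernel allocator and a hypothetical balloon-lock hypercall (real drivers use the standard Windows/Linux/BSD page allocation and pinning APIs):

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical guest-kernel and hypercall interfaces. */
    extern void    *alloc_pinned_page(void);              /* non-swappable page  */
    extern uint64_t page_to_ppn(void *page);
    extern void     hypercall_balloon_lock(uint64_t ppn); /* VMM may reclaim MPN */

    /* Inflate the balloon by n pages: the guest gives up pages it can no longer
     * use, effectively doing its own page replacement to free machine memory.  */
    int balloon_inflate(size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            void *page = alloc_pinned_page();
            if (page == NULL)
                return -1;               /* guest itself is under memory pressure */
            hypercall_balloon_lock(page_to_ppn(page));
        }
        return 0;
    }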

Memory Sharing

  • Motivation
    – Multiple VMs running same OS, apps
    – Collapse redundant copies of code, data, zeros
  • Transparent page sharing
    – Map multiple PPNs to single MPN (copy-on-write)
    – Pioneered by Disco, but required guest OS hooks
  • New twist: content-based sharing
    – General-purpose, no guest OS changes
    – Background activity saves memory over time
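
A simplified sketch of content-based sharing (the hash table and COW-mapping helpers are hypothetical): the VMM hashes page contents in the background and, on a full-content match, remaps the duplicate PPN copy-on-write onto the already-shared MPN.

    #include <stdint.h>
    #include <string.h>
    #include <stdbool.h>

    #define PAGE_SIZE 4096

    /* Hypothetical VMM helpers. */
    extern uint64_t page_hash(const void *data);                    /* 64-bit hash */
    extern bool     hash_lookup(uint64_t hash, uint64_t *out_mpn);  /* prior page  */
    extern void     hash_insert(uint64_t hash, uint64_t mpn);
    extern void    *mpn_to_ptr(uint64_t mpn);
    extern void     map_cow(uint64_t ppn, uint64_t mpn);    /* remap PPN read-only */
    extern void     free_mpn(uint64_t mpn);

    /* Background scan step: try to share one guest page (PPN backed by MPN). */
    void try_share_page(uint64_t ppn, uint64_t mpn)
    {
        uint64_t hash = page_hash(mpn_to_ptr(mpn));
        uint64_t existing;

        if (hash_lookup(hash, &existing) &&
            memcmp(mpn_to_ptr(existing), mpn_to_ptr(mpn), PAGE_SIZE) == 0) {
            map_cow(ppn, existing);   /* both PPNs now share one machine page */
            free_mpn(mpn);            /* redundant copy reclaimed             */
        } else {
            hash_insert(hash, mpn);   /* remember this content for later hits */
        }
    }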

Memory Allocation

  • Min size
    – Guaranteed, even when overcommitted
    – Enforced by admission control
  • Max size
    – Amount of “physical” memory seen by guest OS
    – Allocated when undercommitted
  • Shares
    – Specify relative importance
    – Proportional-share fairness

Allocation Policy

  • Traditional approach
    – Optimize aggregate system-wide metric
    – Problem: no QoS guarantees, VM importance varies
  • Pure share-based approach
    – Revoke from VM with min shares-per-page ratio
    – Problem: ignores usage, unproductive hoarding
  • Desired behavior
    – VM gets full share when actively using memory
    – VM may lose pages when working set shrinks

Reclaiming Idle Memory

  • Tax on idle memory
    – Charge more for idle page than active page
    – Idle-adjusted shares-per-page ratio
  • Tax rate
    – Explicit administrative parameter
    – 0% ~ plutocracy … 100% ~ socialism
  • High default rate
    – Reclaim most idle memory
    – Some buffer against rapid working-set increases
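
A small sketch tying the share-based policy and the idle tax together: each VM gets an idle-adjusted shares-per-page ratio and reclamation targets the minimum. The formula follows my reading of the ESX paper (rho = S / (P * (f + k*(1 - f))) with idle-page cost k = 1 / (1 - tax rate)); field names are illustrative.

    #include <stddef.h>

    struct vm_alloc {
        double shares;        /* S: configured shares                     */
        double pages;         /* P: machine pages currently allocated     */
        double active_frac;   /* f: estimated fraction of pages active    */
    };

    /* Idle-adjusted shares-per-page ratio: idle pages are "charged" k times
     * as much as active ones, where k = 1 / (1 - tax_rate).               */
    static double adjusted_ratio(const struct vm_alloc *vm, double tax_rate)
    {
        double k = 1.0 / (1.0 - tax_rate);   /* tax_rate in [0, 1)         */
        double f = vm->active_frac;
        return vm->shares / (vm->pages * (f + k * (1.0 - f)));
    }

    /* Revocation policy: reclaim from the VM with the lowest adjusted ratio. */
    size_t pick_victim(const struct vm_alloc *vms, size_t n, double tax_rate)
    {
        size_t victim = 0;
        for (size_t i = 1; i < n; i++)
            if (adjusted_ratio(&vms[i], tax_rate) <
                adjusted_ratio(&vms[victim], tax_rate))
                victim = i;
        return victim;
    }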

Dynamic Reallocation

  • Reallocation events
  • Enforcing target allocations
    – Ballooning: common-case optimization
    – Swapping: dependable fallback, try sharing first
  • Reclamation states
    – High: background sharing
    – Soft: mostly balloon
    – Hard: mostly swap
    – Low: swap and block VMs above target
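
A hedged sketch of the state-driven reclamation loop; the free-memory thresholds and action helpers are illustrative placeholders rather than ESX's exact values.

    enum reclaim_state { STATE_HIGH, STATE_SOFT, STATE_HARD, STATE_LOW };

    /* Map the fraction of free machine memory to a reclamation state.
     * Threshold values below are illustrative, not ESX's actual numbers.  */
    static enum reclaim_state classify(double free_frac)
    {
        if (free_frac >= 0.06) return STATE_HIGH;
        if (free_frac >= 0.04) return STATE_SOFT;
        if (free_frac >= 0.02) return STATE_HARD;
        return STATE_LOW;
    }

    /* Hypothetical reclamation actions. */
    extern void run_background_sharing(void);
    extern void balloon_toward_targets(void);
    extern void swap_toward_targets(void);
    extern void block_vms_above_target(void);

    void reclaim_step(double free_frac)
    {
        switch (classify(free_frac)) {
        case STATE_HIGH: run_background_sharing(); break;  /* no reclamation    */
        case STATE_SOFT: balloon_toward_targets(); break;  /* prefer ballooning */
        case STATE_HARD: swap_toward_targets();    break;  /* forcibly swap     */
        case STATE_LOW:  swap_toward_targets();             /* also throttle     */
                         block_vms_above_target(); break;
        }
    }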

Conclusion

  • Software and hardware solutions for memory virtualization both have pros and cons
  • More things to take care of besides the basic mechanism of memory virtualization
    – Allocation, sharing, overcommitment and reclamation
