Category Archives: [UCSD] Virtualization

[Virtualization] Library OS

Why build for clouds as we do for desktops?

  • More layers -> tricky config
  • Duplication -> inefficiency
  • Large sizes -> long boot times
  • More stuff -> larger attack surface
  • Instead: disentangle applications from the OS
  • Break up OS functionality into modular libraries
  • Link only the system functionality your app needs
  • Target alternative platforms from a single codebase

Unikernels

  • Unikernels are specialized virtual machine images built from a modular stack adding system libraries and configurations to application code
  • Every application is compiled into its own specialized OS that runs on the cloud or embedded devices

Traditional Library OS

  • Most OS functionalities implemented in the user space as libraries
  • The kernel-space OS part only ensures protection and multiplexing
  • Applications get to access hardware resources directly (faster)
  • But isolation is hard and a lot of software (esp. device drivers) needs to be rewritten

Comparison

Unikernel Designs

  • Integrating configurations into the compilation process
  • Single-purpose libOS VMs perform only what the application needs and rely on hypervisor for isolation and resource multiplexing
  • Within a unikernel VM, there’s no privilege difference between application and libOS (single address space)

Unikernel Benefits

  • Lightweight: only what the application uses is compiled and deployed
  • Better security: isolated libOS on hypervisor, small attack surface, single type-safe language, …
  • Fits many new cloud environments well: serverless, microservices, NFV

[Virtualization] Serverless Computing

Serverless Computing

  • Computing without servers?
  • Running applications without the need to manage servers?
  • Running functions instead of containers/VMs?
  • Infinite scaling?
  • The truth: no clear, agreed definition, i.e., no one really knows

One Perspective: How Cloud and Virtualization Evolved

Classic -> Virtual Machines -> Containers -> Serverless Computing

Decreasing concern (and control) over stack implementation; increasing focus on business logic

A Related Topic: Microservice

  • A software architecture that develops an application as a suite of small services, each of which can be deployed and scaled independently
  • When one (micro)service is in large demand, can scale it up
  • Different (micro)services can be written and managed by different teams
  • Changing one (micro)service will not affect the others

What is the essence of “Serverless Computing”?

  • Management-free
  • Autoscaling
  • Only pay for what you use

What is Today’s Serverless Computing Like?

  • Largely offered as Function as a Service (FaaS)
    • Cloud users write functions and ship them
    • Cloud provider runs and manages them
  • Still runs on servers
  • Have attractive features but also many limitations

Basic Architecture

(Architecture diagram; event sources include BigQuery, …)

AWS Lambda

  • An event-driven, serverless FaaS platform introduced in 2014
  • Functions can be written in Node.js, Python, Java, Go, Ruby, C#, PowerShell
  • Each function is allowed 128MB – 3GB of memory and up to 15 minutes of execution time
  • Max 1000 concurrent functions
  • Connected with many other AWS services

Lambda Function Triggering and Billing Model

  • Run user handlers in response to events: RPC handlers, triggers, cron jobs (a minimal handler sketch follows this list)
  • Pay per function invocation: no charge when nothing is run
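
To make the FaaS programming model concrete, here is a minimal sketch of a Go handler using the aws-lambda-go runtime library; the Event shape and the greeting it returns are illustrative assumptions, not part of the lecture.

```go
// A minimal Lambda-style handler sketch, assuming the standard aws-lambda-go SDK.
package main

import (
	"context"
	"fmt"

	"github.com/aws/aws-lambda-go/lambda"
)

// Event is a hypothetical input payload; real triggers (S3, API Gateway, ...)
// each define their own event structures.
type Event struct {
	Name string `json:"name"`
}

// handler runs once per invocation; the platform scales instances up and down
// and bills only for invocations that actually run.
func handler(ctx context.Context, evt Event) (string, error) {
	return fmt.Sprintf("Hello, %s!", evt.Name), nil
}

func main() {
	lambda.Start(handler) // hands the handler to the Lambda runtime
}
```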

Internal Execution Model

  • Developers upload function code to a handler store (and associate it with a URL)
  • Events trigger functions through RPC (to the URL)
  • Load balancers handle RPC requests by starting handlers on workers
  • Handlers are sandboxed in containers

Other providers: Azure, GCP, …

Limitations of Today’s Serverless Offerings

  • Difficult and slow to manage states
  • No easy or fast way to communicate across functions
  • Functions can only use limited resources
  • No control over function placement or locality
  • Billing model does not fit all needs

[Virtualization] Kubernetes and gVisor

Kubernetes

  • We need more than just packing and isolation: Scheduling, Lifecycle and health, Discovery, Monitoring, Auth{n,z}, Aggregates, Scaling, …
  • Everything at Google runs in containers: Gmail, Web Search, Maps, …
  • Open Source Containers: Kubernetes
    • Container orchestration
    • Builds on Docker containers
    • Multiple cloud and bare-metal environments
    • Supports existing OSS apps
    • Inspired and informed by Google’s experiences and internal systems
    • 100% open source, written in Go
    • Lets users manage applications, not machines

Primary concepts

  • Container: A sealed application package (Docker)
  • Pod: A small group of tightly coupled Containers
  • Labels: Identifying metadata attached to objects
  • Selector: A query against labels, producing a set result
  • Controller: A reconciliation loop that drives current state towards desired state
  • Service: A set of pods that work together

Pod

  • A Kubernetes abstraction that represents a group of one or more application containers, plus some shared resources for those containers (a sketch of a Pod object follows this list):
    • Shared storage, as Volumes
    • Networking, as a unique cluster IP address
    • Information about how to run each container, such as the container image version or specific ports to use
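
A minimal sketch of a Pod object expressed with the Kubernetes Go API types (the same object is normally written as YAML and submitted to the kube-apiserver); the names, labels, and image below are illustrative assumptions.

```go
// A minimal Pod object sketch using the Kubernetes Go API types.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:   "web",                           // object name
			Labels: map[string]string{"app": "web"}, // queryable identifying metadata
		},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "nginx",
				Image: "nginx:1.25",                                // container image version
				Ports: []corev1.ContainerPort{{ContainerPort: 80}}, // specific port to use
			}},
			// Shared storage for all containers in the pod, as a Volume.
			Volumes: []corev1.Volume{{
				Name:         "shared-data",
				VolumeSource: corev1.VolumeSource{EmptyDir: &corev1.EmptyDirVolumeSource{}},
			}},
		},
	}
	fmt.Println(pod.Name, pod.Labels, len(pod.Spec.Containers))
}
```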

Node

  • A node is a worker machine (either VM or physical machine)
  • One pod runs on one node, one node can run multiple pods
  • Nodes managed by control plane

Persistent Volumes

  • A higher-level abstraction – insulation from any one cloud environment
  • Admins provision them, users claim them
  • Independent lifetime and fate
  • Can be handed off between pods and lives until the user is done with it
  • Dynamically “scheduled” and managed, like nodes and pods

Labels

  • Arbitrary metadata
  • Attached to any API object
  • Generally represent identity
  • Queryable by selectors
  • The only grouping mechanism
  • Used to determine which objects to apply an operation to

Pod lifecycle

  • Once scheduled to a node, pods do not move
  • Pods can be observed pending, running, succeeded, or failed
  • Pods are not rescheduled by the scheduler or apiserver
  • Apps should consider these rules

Internals

kube-apiserver

  • Provides a forward facing REST interface into the Kubernetes control plane and datastore
  • All clients and other applications interact with Kubernetes strictly through the API server
  • Acts as the gatekeeper to the cluster by handling authentication and authorization, request validation, mutation, and admission control in addition to being the front-end to the backing datastore

kube-controller-manager

  • Monitors the cluster state via the apiserver and steers the cluster towards the desired state

kube-scheduler

  • Component on the master that watches newly created pods that have no node assigned, and selects a node for them to run on
  • Factors taken into account for scheduling decisions include individual and collective resource requirements, hardware/software/policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference and deadlines

cloud-controller-manager

  • Node Controller: For checking with the cloud provider to determine whether a node has been deleted in the cloud after it stops responding
  • Route Controller: For setting up routes in the underlying cloud infrastructure
  • Service Controller: For creating, updating, and deleting cloud provider load balancers
  • Volume Controller: For creating, attaching, and mounting volumes, and interacting with the cloud provider to orchestrate volumes

etcd

  • etcd: an atomic key-value store that uses Raft consensus
  • Backing store for all control plane metadata
  • Provides a strong, consistent and highly available key-value store for persisting cluster state
  • Stores objects and config information

Node Components

kubelet

  • An agent that runs on each node in the cluster. It makes sure that containers are running in a pod
  • The kubelet takes a set of PodSpecs that are provided through various mechanisms and ensures that the containers described in those PodSpecs are running and healthy

kube-proxy

  • Manages the network rules on each node
  • Performs connection forwarding or load balancing for Kubernetes cluster services

gVisor

“Containers do not contain” – Dan Walsh

  • Still sharing the same kernel
  • Share same device drivers
  • Linux kernel represents a large attack surface
  • cgroup accounting may not be accurate

Are System Calls Secure?

  • The interface between containers and OS is system calls
  • Linux x86_64 has 319 64-bit syscalls
  • 2046 CVEs since 1999

Why can VMs be More Secure?

  • Virtual machines
    • Independent guest kernels
    • Virtual hardware interface: clear privilege separation and state encapsulation
    • But the virtualized hardware interface is inflexible, and VMs are heavyweight with a large memory footprint

Sandboxing

  • Rule-based sandboxing: reduce the attack surface by restricting what applications can access
    • e.g., AppArmor, SELinux, seccomp-bpf
    • Rules can be fragile (may not properly capture threats) and can’t prevent side-channel attacks

gVisor

  • Sandboxes untrusted applications
  • Implements Linux system API in user space
  • Secure by default
  • Written in Go, a memory/type-safe language

gVisor Architecture

  • Two separate processes (communicating through IPC):
    • Sentry: emulates Linux system calls in user space
    • Gofer: file access
  • Most exploited syscalls: socket and open
    • Even if the Sentry is compromised, it still can’t access files or open ports
  • Network is handled by user-mode network stack in Sentry

Trapping System Calls

  • Two modes supported for trapping guest system calls (a minimal ptrace sketch follows this list)
  • ptrace
  • KVM
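
As a rough illustration of the ptrace mode, the sketch below (Linux/amd64 only) traces a child process and prints each intercepted system call number. This is only the trapping half of the story; gVisor’s Sentry goes much further and actually emulates the calls.

```go
// A minimal ptrace-based syscall interception sketch (Linux/amd64).
package main

import (
	"fmt"
	"os/exec"
	"runtime"
	"syscall"
)

func main() {
	// ptrace requires all requests to come from the tracer's OS thread.
	runtime.LockOSThread()

	cmd := exec.Command("/bin/true")
	cmd.SysProcAttr = &syscall.SysProcAttr{Ptrace: true} // child stops right after exec
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	pid := cmd.Process.Pid

	var status syscall.WaitStatus
	syscall.Wait4(pid, &status, 0, nil) // wait for the initial stop

	var regs syscall.PtraceRegs
	for {
		// Resume until the next syscall entry or exit, then inspect registers.
		if err := syscall.PtraceSyscall(pid, 0); err != nil {
			break
		}
		if _, err := syscall.Wait4(pid, &status, 0, nil); err != nil || status.Exited() {
			break
		}
		if err := syscall.PtraceGetRegs(pid, &regs); err != nil {
			break
		}
		fmt.Println("intercepted syscall number:", regs.Orig_rax)
	}
}
```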

Reference

Kubernetes overview (in Korean): https://kubernetes.io/ko/docs/concepts/overview/what-is-kubernetes/

[Virtualization] Container

Are VMs fit for (All) Today’s (Cloud) usages?

  • Performance overhead of indirections (guest OS and hypervisor)
  • Large memory footprint
  • Slow startup time
  • License and maintenance cost of guest OS
  • Do we really need to virtualize hardware and a full OS?
  • What about DevOps?

Why does it work? Separation of concerns

Why people care?

Developers: Build once… run anywhere (dependencies, packages, versions, automation, w/o overhead like VMs)
Administrators: Configure once… run anything (lifecycle efficiency, remove inconsistencies, segregation of duties, addresses the cost of VMs)

Linux Containers

  • Run everywhere
    • Regardless of kernel version
    • Regardless of host distro
    • Physical or virtual, cloud or not
    • Container and host architecture must match…
  • Run anything
    • If it can run on the host, it can run in the container
    • If it can run on a Linux kernel, it can run in a container

At High-Level: It looks like a VM
At Low-Level: OS-Level Virtualization

Using Namespaces to separate “Views” of Users

  • Namespace: naming domain for various resources
    • User IDs (UIDs)
    • Process IDs (PIDs)
    • File paths (mnt)
    • Network sockets
    • Pipe names
  • Namespaces are isolated by the kernel (a minimal sketch follows this list)
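
A minimal sketch (Linux only, typically requires root) of asking the kernel for fresh namespaces when launching a process; the choice of /bin/sh and the specific flags are illustrative.

```go
// Start a shell in fresh UTS, PID, and mount namespaces, so hostname,
// process IDs, and mounts become private "views" for the child.
package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("/bin/sh")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		// Each CLONE_NEW* flag asks the kernel for a new namespace of that kind.
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```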

Isolating resources with cgroups

  • Linux Control Groups (cgroups): collection of Linux processes
    • Limits resource usages at group level (e.g., memory, CPU, device)
    • Fair sharing of resources
    • Track resource utilization (e.g., could be used for billing/management)
    • Control processes (e.g., pause/resume, checkpoint/restore); a small usage sketch follows this list
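
A minimal sketch of driving cgroups directly through the filesystem interface, assuming cgroup v2 is mounted at /sys/fs/cgroup with the memory controller enabled and root privileges; the group name and limit are made up.

```go
// Create a cgroup, cap its memory, and move the current process into it.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	group := "/sys/fs/cgroup/demo" // hypothetical group name
	if err := os.MkdirAll(group, 0o755); err != nil {
		panic(err)
	}
	// Limit the group to 256 MiB of memory; the kernel enforces it for all members.
	if err := os.WriteFile(filepath.Join(group, "memory.max"), []byte("268435456"), 0o644); err != nil {
		panic(err)
	}
	// Move this process into the group by writing its PID to cgroup.procs.
	pid := fmt.Sprintf("%d", os.Getpid())
	if err := os.WriteFile(filepath.Join(group, "cgroup.procs"), []byte(pid), 0o644); err != nil {
		panic(err)
	}
	fmt.Println("process", pid, "now runs under", group)
}
```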

Efficiency: almost no overhead

  • Processes are isolated, but run straight on the host
  • CPU performance = native performance
  • Memory performance = a few % shaved off for (optional) accounting
  • Network performance = small overhead; can be optimized to zero overhead

Docker

  • Docker Inc
    • Founded as dotCloud, Inc. in 2010 by Solomon Hykes (renamed to Docker Inc. in 2013)
    • Estimated to be valued at over $1 billion (101-250 employees)
  • Docker the software
    • A container engine written in Go (based on linux container)
  • Docker community
    • Now 1851 contributors, 16.2k forks of docker engine on GitHub (called Moby)

Why are Docker Containers Lightweight?

Docker Engine

  • daemon: REST API (receiving instructions) and other features
  • containerd: Execution logic (e.g., start, stop, pause, unpause, delete containers)
  • runc: A lightweight runtime CLI

Docker Images

  • Not a VHD, not a file system
  • Uses a union file system
  • Composed of read-only layers
  • Layers do not have state
  • Basically a tar file
  • Layers form a hierarchy (arbitrary depth)

Docker Image Registry

  • Registry containing docker images
    • Local registry on the same host
    • Docker Hub Registry: Globally shared
    • Private registry on docker.com

Docker Swarm

  • Docker Swarm: A group of nodes collaborating over a network
  • Two modes for Docker hosts
    • Single Engine Mode: Not participating in a Swarm
    • Swarm Mode: Participating in a Swarm
  • Each swarm has a few managers (one being leader) that dispatch tasks to workers. Managers are also workers (i.e., execute tasks)

Security Implications of Containers

  • Unlike VMs whose interface is hardware instructions, containers’ interface is OS system calls
  • More difficult to protect syscalls
    • Involves a large amount of code in the OS
    • And there are many syscalls

Reference

Container & Docker: https://tech.osci.kr/docker/2018/09/10/45749387/

[Virtualization] Virtualizing Memory

TLB Miss Flow with Software-Managed TLB

  • Without virtualization: extract VPN from VA, look up the VPN-PFN mapping…
  • With virtualization: extract VPN from VA, look up the VPN-PFN mapping (handled by the VMM so that the installed mapping is VPN-MFN)…

Difficulty in Virtualizing Hardware-Managed TLB

  • Hardware-managed TLB
    – Hardware does page table walk on each TLB miss and fills TLB with the found PTE
  • Hypervisor doesn’t have a chance to intercept TLB misses
  • Solution-1: shadow paging
  • Solution-2: direct paging (para-virtualization)
  • Solution-3: new hardware

Shadow Paging

  • VMM intercepts the guest OS setting its (virtual) CR3
  • VMM iterates over the guest page table and constructs a corresponding shadow page table
  • In the shadow PT, every guest physical address is translated into a host physical address (machine address)
  • Finally, the VMM sets the real CR3 to point to the shadow page table (a small sketch of this composition follows)
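
A small sketch of the composition a shadow page table represents: the guest’s VPN->PPN table combined with the VMM’s per-VM pmap (PPN->MPN) yields the VPN->MFN entries the hardware actually walks. The map-based representation is purely illustrative.

```go
// Compose guest page table and pmap into a shadow page table.
package main

import "fmt"

type guestPT map[uint64]uint64  // VPN -> PPN, maintained by the guest OS
type pmap map[uint64]uint64     // PPN -> MPN, maintained by the VMM (per VM)
type shadowPT map[uint64]uint64 // VPN -> MPN, what the real CR3 points to

func buildShadow(gpt guestPT, pm pmap) shadowPT {
	spt := make(shadowPT)
	for vpn, ppn := range gpt {
		if mpn, ok := pm[ppn]; ok {
			spt[vpn] = mpn // compose the two translations
		}
		// If the PPN has no machine page yet, leave the entry absent so the
		// first access faults into the VMM, which then fills it in.
	}
	return spt
}

func main() {
	gpt := guestPT{0x10: 0x2, 0x11: 0x3}
	pm := pmap{0x2: 0x80, 0x3: 0x81}
	fmt.Println(buildShadow(gpt, pm)) // map[16:128 17:129]
}
```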

The guest can’t be allowed access to the hardware PT because then it would essentially have control of the machine. So the hypervisor keeps the “real” mappings (guest virtual VPN -> host physical MFN) in the hardware when the relevant guest is executing, and keeps a representation of the page tables that the guest thinks it’s using “in the shadows.”
This avoids the VPN -> PFN translation step. As far as page faults go, nothing changes from the hardware’s point of view (remember, the hypervisor makes it so the page tables used by the hardware contain VPN->MFN mappings), a page fault will simply generate an exception and redirect to the appropriate exception handler. However, when a page fault occurs while a VM is running, this exception can be “forwarded” to the hypervisor, which can then handle it appropriately.

Question

  • Assume that:
    – There are 10 VMs running on a machine
    – Each VM contains 10 applications
  • How many shadow page tables in total? -> 110?
    – Shadow page tables are per application
    – Guest page tables are per application
    – pmaps are per VM

What if Guest OS Modifies Its Page Table?

  • Should not allow it to happen directly
    – Since CR3 is not pointing to the shadow page table
    – Need to synchronize the shadow page table with guest page table
  • VMM needs to intercept when the guest OS modifies its page table, and update the shadow page table accordingly
    1. Mark the guest table pages as read-only (in the shadow page table)
    2. If guest OS tries to modify its page tables, it triggers page fault
    3. VMM handles the page fault by updating shadow page table

Dealing with Page Faults

  • When page fault occurs, traps to VMM
  • If present bit is 0 in the guest page table entry, guest OS needs to handle the fault
    – Guest OS loads the page from virtual disk to guest physical memory and sets the present bit to 1
    – Guest OS returns from page fault, which traps into VMM again
    – VMM sees that present is 1 in guest PTE and creates entry in shadow page table
    – VMM returns from the original page fault
  • If present is 1: guest OS thinks page is present (but VMM may have swapped it out), VMM handles transparently
    – VMM locates the corresponding physical page, loads it in memory if needed
    – VMM creates entry in shadow page table
    – VMM returns from the original page fault

What if a Guest App Accesses its Kernel Memory?

  • How do we selectively allow/deny access to kernel-only pages?
  • One solution: split a shadow page table into two tables
    – Two shadow page tables, one for user, one for kernel
    – When the guest OS switches to a guest application, the VMM switches the shadow page table as well, and vice versa

What about Memory for Translation Cache (BT)?

  • Translation cache intermingles guest and monitor memory accesses
    – Need to distinguish these accesses
    – Monitor accesses have full privileges
    – Guest accesses have lesser privileges
  • On x86 can use segmentation
    – Monitor lives in high memory
    – Guest segments truncated to allow no access to monitor
    – Binary translator uses guest segments for guest accesses and monitor segments for monitor accesses

Pros and Cons of Shadow Paging

  • Pros
    – When shadow PT is established, memory accesses are very fast
  • Cons
    – Maintaining consistency between guest PTs and shadow PTs involves VMM traps, which can be costly
    – TLB flush on every “world switch”
    – Memory space overhead to maintain pmap

Hardware-Assisted Memory Virtualization

  • Hardware support for memory virtualization
    – Intel EPT (Extended Page Table) and AMD NPT (Nested Page Table)
    – EPT: a per VM table translating PPN -> MPN, referenced by EPT base pointer
    – EPT controlled by the hypervisor, guest page table (GPT) controlled by guest OS (both exposed to hardware)
    – Hardware directly walks GPT + EPT (for each PPN access during GPT walk, needs to walk the EPT to determine MPN)
    – No VM exits due to page faults, INVLPG, or CR3 accesses

Pros and Cons of EPT

  • Pros
    – Simplified VMM design (all handled by hardware)
    – Guest PT changes do not trap, minimize VM exits
    – Lower memory space overhead (no need for pmap in memory)
  • Cons
    – TLB miss is costly: can involve many memory accesses to finish the walk! (a worked count follows this list)
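
A worked count of that walk cost, under the usual assumption of 4-level guest page tables and a 4-level EPT: each guest level is a guest-physical access that itself needs a full EPT walk, plus one final EPT walk for the data page.

```go
// Nested (2-D) page walk cost with hardware-assisted paging.
package main

import "fmt"

func main() {
	g, h := 4, 4                // guest page-table levels and EPT levels (x86-64)
	accesses := (g+1)*(h+1) - 1 // up to 24 memory accesses per TLB miss
	fmt.Println(accesses, "memory accesses, vs. 4 for native 4-level paging")
}
```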

Reclaiming Memory

  • ESX (and other hypervisors) allow overcommitment of memory
    – Total memory size of all VMs can exceed actual machine memory size
    – ESX must have some way to reclaim memory from VMs (and swap to disk)
  • Traditional: add transparent swap layer
    – Requires “meta-level” decisions: which page from which VM to swap
    – Best data to guide decision known only by guest OS
    – Guest and meta-level policies may clash, resulting in double paging
  • Alternative: implicit cooperation
    – Coax guest OS into doing its own page replacement
    – Avoid meta-level policy decisions

Ballooning Details

  • Guest drivers
    – Inflate: allocate pinned PPNs; backing MPNs reclaimed
    – Use standard Windows/Linux/BSD kernel APIs
  • Performance benchmark
    – Linux VM, memory-intensive dbench workload
    – Compares 256MB with balloon sizes 32-128MB vs static VMs
    – Overhead 1.4%-4.4%

Memory Sharing

  • Motivation
    – Multiple VMs running same OS, apps
    – Collapse redundant copies of code, data, zeros
  • Transparent page sharing
    – Map multiple PPNs to single MPN (copy-on-write)
    – Pioneered by Disco, but required guest OS hooks
  • New twist: content-based sharing
    – General-purpose, no guest OS changes
    – Background activity saves memory over time (a sketch of content-based sharing follows this list)
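
A toy sketch of the content-based sharing idea: hash page contents and collapse identical pages copy-on-write. Real ESX additionally verifies full page contents on a hash hit and write-protects shared pages; those details are omitted here, and the data structures are illustrative.

```go
// Collapse machine pages with identical contents onto one copy-on-write page.
package main

import (
	"crypto/sha256"
	"fmt"
)

type MPN int

var (
	pageData  = map[MPN][]byte{}   // simulated machine memory contents
	hashToMPN = map[[32]byte]MPN{} // content hash -> canonical shared page
	refCount  = map[MPN]int{}      // how many guest PPNs map to each MPN
)

// sharePage returns the MPN the caller should map (copy-on-write).
func sharePage(candidate MPN) MPN {
	h := sha256.Sum256(pageData[candidate])
	if canonical, ok := hashToMPN[h]; ok && canonical != candidate {
		refCount[canonical]++
		delete(pageData, candidate) // reclaim the duplicate machine page
		return canonical
	}
	hashToMPN[h] = candidate
	refCount[candidate] = 1
	return candidate
}

func main() {
	pageData[1] = make([]byte, 4096) // two all-zero pages
	pageData[2] = make([]byte, 4096)
	fmt.Println(sharePage(1), sharePage(2)) // both end up sharing MPN 1
}
```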

Memory Allocation

  • Min size
    – Guaranteed, even when overcommitted
    – Enforced by admission control
  • Max size
    – Amount of “physical” memory seen by guest OS
    – Allocated when undercommitted
  • Shares
    – Specify relative importance
    – Proportional-share fairness

Allocation Policy

  • Traditional approach
    – Optimize aggregate system-wide metric
    – Problem: no QoS guarantees, VM importance varies
  • Pure share-based approach
    – Revoke from VM with min shares-per-page ratio
    – Problem: ignores usage, unproductive hoarding
  • Desired behavior
    – VM gets full share when actively using memory
    – VM may lose pages when working set shrinks

Reclaiming Idle Memory

  • Tax on idle memory
    – Charge more for idle page than active page
    – Idle-adjusted shares-per-page ratio (sketched after this list)
  • Tax rate
    – Explicit administrative parameter
    – 0% ~ plutocracy … 100% ~ socialism
  • High default rate
    – Reclaim most idle memory
    – Some buffer against rapid working-set increases
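
A sketch of how an idle-adjusted shares-per-page ratio can penalize idle memory, in the spirit of the ESX idle-memory tax; the specific formula, parameter names, and numbers below are assumptions for illustration.

```go
// Idle pages are charged more (factor k), so a VM hoarding idle memory
// looks "richer per page" and is reclaimed from first.
package main

import "fmt"

// adjustedRatio returns shares-per-page where idle pages cost k = 1/(1-tax)
// times as much as active pages; f is the fraction of pages actively used.
func adjustedRatio(shares, pages, f, tax float64) float64 {
	k := 1 / (1 - tax)
	return shares / (pages * (f + k*(1-f)))
}

func main() {
	// Two VMs with equal shares and allocations; VM B leaves 80% of its memory idle.
	fmt.Printf("A (90%% active): %.4f\n", adjustedRatio(1000, 256, 0.9, 0.75))
	fmt.Printf("B (20%% active): %.4f\n", adjustedRatio(1000, 256, 0.2, 0.75))
	// B's lower ratio means B's pages are reclaimed before A's.
}
```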

Dynamic Reallocation

  • Reallocation events
  • Enforcing target allocations
    – Ballooning: common-case optimization
    – Swapping: dependable fallback, try sharing first
  • Reclamation states
    – High: background sharing
    – Soft: mostly balloon
    – Hard: mostly swap
    – Low: swap and block VMs above target

Conclusion

  • Software and hardware solutions for memory virtualization both have pros and cons
  • More things to take care of besides the basic mechanism of memory virtualization
    – Allocation, sharing, overcommitment and reclamation

[Virtualization] Virtualizing CPU (x86)

x86 Difficulties

  • Not all sensitive instructions are privileged with x86, i.e., non-virtualizable processor
  • These instructions do not trap and behave differently in kernel and user mode

Possible Solutions

  • Emulate: interpret each instruction, super slow (e.g., Virtual PC on Mac)
  • Binary translation: rewrite non-virtualizable instructions (e.g., VMware)
  • Para-virtualization: modify guest OS to avoid non-virtualizable instructions (e.g., Xen)
  • Change hardware: add new CPU mode, extend page table, and other hardware assistance (e.g., Intel VT-x, EPT, VT-d, AMD-V)

Full Emulation / Hosted Interpretation

  • VMM implements the complete hardware architecture in software
  • VMM steps through the VM’s instructions and updates emulated hardware as needed
  • Can handle all types of instructions, but super slow

Trap-and-Emulate

Basic Idea of Binary Translation

  • Based on input guest binary, compile (translate) instructions in a cache and run them directly
  • Challenges:
    – Protection of the cache
    – Correctness of direct memory addressing
    – Handling relative memory addressing (e.g., jumps)
    – Handling sensitive instructions

VMware’s Dynamic Binary Translation

  • Binary: input is binary x86 code
  • Dynamic: translation happens at runtime
  • On demand: code is translated only when it is about to execute
  • System level: rules set by x86 ISA, not higher-level ABIs
  • Subsetting: output a safe subset of input full x86 instruction set
  • Adaptive: translated code is adjusted according to guest behavior changes

Translation Unit

  • TU: up to 12 instructions or a “terminating” instruction, whichever comes first (roughly a basic block)
  • Why TU as the unit not individual instruction? (overhead)
  • TU -> Compiled Code Fragment (CCF)
  • CCF stored in translation cache (TC)
  • At the end of each CCF, call into the translator (implemented by the VMM) to decide and translate the next TU (more optimization soon; a toy sketch of this loop follows this list)
    – If the destination code is already in TC, then directly jumps to it
    – Otherwise, compiles the next CCF into TC
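
A toy sketch of this translate-and-cache loop. It is not real x86: the instruction names, the "sensitive" set, and the "execute" step are stand-ins, meant only to show TUs being compiled into CCFs, cached in the TC, and chained.

```go
// Translate-on-demand loop: TU -> CCF, cached in the translation cache.
package main

import "fmt"

type TU struct{ insts []string }           // translation unit (~ a basic block)
type CCF struct{ code []string; next int } // compiled code fragment + successor PC

var tc = map[int]CCF{} // translation cache: guest PC -> CCF

func sensitive(in string) bool { return in == "popf" || in == "cli" }

func translate(guest map[int]TU, pc int) CCF {
	tu := guest[pc]
	out := []string{}
	for _, in := range tu.insts {
		if sensitive(in) {
			out = append(out, "call emulate_"+in) // non-IDENT translation
		} else {
			out = append(out, in) // IDENT: copied unchanged
		}
	}
	return CCF{code: out, next: pc + len(tu.insts)}
}

func run(guest map[int]TU, pc, steps int) {
	for i := 0; i < steps; i++ {
		ccf, hit := tc[pc]
		if !hit {
			ccf = translate(guest, pc) // compile the next TU on demand...
			tc[pc] = ccf               // ...and remember it in the TC
		}
		fmt.Println("exec:", ccf.code) // a real VMM jumps into the TC here
		pc = ccf.next
	}
}

func main() {
	guest := map[int]TU{
		0: {[]string{"mov", "add", "popf"}},
		3: {[]string{"cmp", "jne"}},
		5: {[]string{"ret"}},
	}
	run(guest, 0, 3)
}
```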

Architecture of VMware’s Binary Translation

IDENT/Non-IDENT Translation

  • Most instructions can be translated IDENT (i.e., copied unchanged), except for
  • PC-relative address
  • Direct control flow
  • Indirect control flow
  • Sensitive instructions
    – If already traps, then can be handled when it traps (more optimization soon to be discussed)
    – Otherwise, replace it with a call to the emulation function

Adaptive Binary Translation

  • Binary translation can outperform classical virtualization by avoiding traps
    – rdtsc on Pentium 4: trap-and-emulate 2030 cycles, callout-and-emulate 1254 cycles, in-TC emulation 216 cycles
  • What about sensitive instructions that are not privileged?
    – “Innocent until proven guilty”
    – Start in the innocent state and detect instructions that trap frequently
    – Retranslate non-IDENT to avoid the trap
    – Patch the original IDENT translation with a forwarding jump to the new translation

Hardware-Assisted CPU Virtualization (Intel VT-x)

  • Two new modes of execution (orthogonal to protection rings)
    – VMX root mode: same as x86 without VT-x
    – VMX non-root mode: runs VM, sensitive instructions cause transition to root mode, even in Ring 0
  • New hardware structure: VMCS (virtual machine control structure)
    – One VMCS for one virtual processor
    – Configured by VMM to determine which sensitive instructions cause VM exit
    – Specifies guest OS state

Example: Guest syscall with Hardware Virtualization

  • VMM fills the VMCS exception table for the guest OS, sets a bit in the VMCS so syscall exceptions do not cause VM exits, and executes VM entry
  • A guest application invokes a syscall; it does not trap to the VMM but is delivered directly to the guest OS handler through the exception table in the VMCS

Conclusion

  • Virtualizing CPU is a non-trivial task, esp. for non-virtualizable architectures like x86
  • Software binary translation is a neat (but very tricky) way to virtualize x86 and still meet Popek and Goldberg’s virtualization principles
  • Hardware vendors keep adding more virtualization support, which makes life a lot easier
  • Software and hardware techniques both have pros and cons

[Virtualization] Background and Virtualization Basics

Typical OS Structure

OS Functionalities vs Virtualized Functionalities

OS Functionalities
– Process management
– Virtual memory system
– File and storage system
– Networking
– Other I/O systems
– Command-interpreter system

Virtualized Functionalities
– Virtualize sensitive instructions
– Virtualized physical memory
– Virtual disk
– Virtual network interface
– Other virtualized I/O systems
– Command-interpreter system

Dual-Mode Operation

  • OS manages shared resources
  • OS protects programs from other programs (OS needs to be “privileged”)
  • Dual-mode operation of hardware
    – Kernel mode – can run privileged instructions
    – User mode – can only run non-privileged instructions

Different OS Structures

Transition between User/Kernel Modes

Interrupt

  • A mechanism for coordination between concurrently operating units of a computer system (e.g. CPU and I/O devices) to respond to specific conditions within a computer
  • Results in transfer of control (to interrupt handler in the OS), forced by hardware
  • Hardware interrupts
    – I/O devices: NIC, keyboard, etc.
    – Timer
  • Software interrupts
    – Exception: a software error (e.g., divided by zero)
    – System call

Handling Interrupts

  • Incoming interrupts are disabled (at this and lower priority levels) while the interrupt is being processed to prevent a lost interrupt
  • Interrupt architecture must save the address of the interrupted instruction
  • Interrupt transfers control to the interrupt service routine
    – generally, through the interrupt vector, which contains the addresses of all the service routines
  • If interrupt routine modifies process state (register values)
    – save the current state of the CPU (registers and the program counter) on the system stack
    – restore state before returning
  • Interrupts are re-enabled after servicing current interrupt
  • Resume the interrupted instruction

Interaction between Different Layers

Design Space (Level vs. ISA)

Type 1 and Type 2 Hypervisor (VMM)

Virtualization Principles

Popek and Goldberg’s virtualization principles in 1974:

  • Fidelity: Software on the VMM executes identically to its execution on hardware, barring timing effects
  • Performance: An overwhelming majority of guest instructions are executed by the hardware without the intervention of the VMM
  • Safety: The VMM manages all hardware resources

Possible implementation: Full Emulation / Hosted Interpretation

  • VMM implements the complete hardware architecture in software
  • VMM steps through the VM’s instructions and updates emulated hardware as needed
  • Pros:
    – Easy to handle all types of instructions (can enforce policy when doing so)
    – Provides complete isolation (no guest instructions runs directly on hardware)
    – Can debug low-level code in the guest
  • Cons:
    – Emulating a modern processor is difficult
    – Violated performance requirement (VERY SLOW)

Protection Rings

  • More privileged rings can access memory of less privileged ones
  • Calling across rings can only happen with hardware enforcement
  • Only Ring 0 can execute privileged instructions (CENTER)
  • Rings 1, 2, and 3 trap when executing privileged instructions
  • Usually, the OS executes in Ring 0 and applications execute in Ring 3

Improving Performance over Full Emulation

  • Idea: execute most guest instructions natively on hardware (assuming guest OS runs on the same architecture as real hardware)
  • Applications run in ring 3
  • Cannot allow guest OS to run sensitive instructions directly!
  • Guest OS runs in ring 1
  • When guest OS executes a privileged instruction, will trap into VMM (IN RING 0)
  • When guest applications generates a software interrupt, will trap into VMM (IN RING 0)
  • Goldberg (1974) two classes of instructions
    – privileged instructions: those that trap when in user mode
    – sensitive instructions: those that modify or depend on hardware configuration

Trap-and-Emulate

  • Hand off sensitive operations to the hypervisor
  • VMM emulates the effect of these operations on virtual hardware provided to the guest OS
    – VMM controls how the VM interacts with physical hardware
    – VMM fools the guest OS into thinking that it runs at the highest privilege level
  • Performance implications
    – Almost no overhead for non-sensitive instructions
    – Large overhead for sensitive instructions

System Calls with Virtualization

x86 Difficulties

  • Popek and Goldberg’s Theorem (1974): A machine can be virtualized (using trap-and-emulate) if every sensitive instruction is privileged
  • Not all sensitive instructions are privileged with x86
  • These instructions do not trap and behave differently in kernel and user mode
  • Example: popf (in user mode it silently ignores attempts to change the interrupt flag instead of trapping)

Possible Solutions

  • Emulate: interpret each instruction, super slow (e.g., Virtual PC on Mac)
  • Binary translation: rewrite non-virtualizable instructions (e.g., VMware)
  • Para-virtualization: modify guest OS to avoid non-virtualizable instructions (e.g., Xen)
  • Change hardware: add new CPU mode, extend page table, and other hardware assistance (e.g., Intel VT-x, EPT, VT-d, AMD-V)

Regular Virtual Memory System

TODO

Software-controlled TLB

TODO

Hardware-controlled TLB

TODO

Virtualizing Memory

TODO

I/O Virtualization

TODO

[Virtualization] Introduction

Why study virtualization?

  • Almost all cloud applications run in the virtualization environment
  • Most IT infrastructures run in the cloud or on-prem virtualization environment
  • Understanding virtualization is key to building cloud infrastructures
  • Understanding virtualization will help application design

Operating Systems

  • A piece of software that manages and virtualizes hardware for applications
    – An indirection layer between applications and hardware
    – Provides a high-level interface to applications
    – While interacting with hardware devices through low-level interfaces
    – Runs privileged instructions to interact with hardware devices
  • Applications
    – Can only execute unprivileged instructions
    – Perform system calls or faults to “trap” into OS
    – OS protects applications from each other (to some extent) (e.g., separate address spaces)

Virtualization

  • Adding another level of indirection to run OSes on an abstraction of hardware
  • Virtual Machine (Guest OS)
    – OS that runs on virtualized hardware resources
    – Managed by another piece of software (the VMM/hypervisor)
  • Virtual Machine Monitor (Hypervisor)
    – The software that creates and manages the execution of virtual machines
    – Runs on bare-metal hardware

History

Mainframes and IBM

  • Before we have datacenters or PCs, there were giant metal frames
  • Support computational and I/O intensive commercial/scientific workloads
  • Expensive (the IBM 704 (1954) cost $250K to millions)
  • “IBM and the seven dwarfs” – their heyday was the late ’50s through ’70s

Issues with Early Mainframes

  • Different generations were not architecturally compatible
  • Batch-oriented (against interactive)
  • Meanwhile, ideas started to appear towards a time-sharing OS
  • The computer was becoming a multiplexed tool for a community of users, instead of being a batch tool for wizard programmers

IBM’s Response

  • IBM bet the company on the System/360 hardware family [1964]
    – S/360 was the first to clearly distinguish architecture and implementation
    – Its architecture was virtualizable
  • The CP/CMS system software [1968]
    – CP: a “control program” that created and managed virtual S/360 machines
    – CMS: the “Cambridge monitor system” — a lightweight, single-user OS
    – With CP/CMS, can run several different OSs concurrently on the same HW
  • IBM CP/CMS was the first virtualization system. Main purpose: letting multiple users share a mainframe

IBM’s Mainframe Product Line

  • System/360 (1964-1970)
    – Support virtualization via CP/CMS, channel I/O, virtual memory, …
  • System/370 (1970-1988)
    – Reimplementation of CP/CMS as VM/370
  • System/390 (1990-2000)
  • zSeries (2000-present)
  • Huge moneymaker for IBM, and many businesses still depend on these!

PCs and Multi-User OSes

  • 1976: Steve Jobs and Steve Wozniak start Apple Computers and roll out the Apple I, the first computer with a single-circuit board
  • 1981: The first IBM personal computer, code-named “Acorn,” is introduced. It uses Microsoft’s MS-DOS
  • 1983: Apple’s Lisa is the first personal computer with a GUI
  • 1985: Microsoft announces Windows
  • The PC market (1980-90s): ship hundreds of millions of units, not hundreds of units
  • Cluster computing (1990s): build a cheap mainframe out of a cluster of PCs

Multiprocessor and Stanford FLASH

  • Development of multiprocessor hardware boomed (1990s)
  • Stanford FLASH Multiprocessor
    – A multiprocessor that integrates global cache coherence and message passing
  • But system software lagged behind
  • Commodity OSes do not scale and cannot isolate/contain faults

Stanford Disco and VMWare

  • Stanford Disco project (SOSP’97 Mendel Rosenblum et al.)
    – Extend modern OS to run efficiently on shared memory multiprocessors
    – A VMM built to run multiple copies of Silicon Graphics IRIX OS on FLASH
  • Mendel Rosenblum, Diane Greene, and others co-founded VMWare in 1998
    – Brought virtualization to PCs. Main purpose: run different OSes side by side on the same (x86) machine
    – Initial market was software developers for testing software in multiple OSes
    – Acquired by EMC (2003), which later merged with DELL (2016)

Server Consolidation

  • Datacenters often run many services (e.g., search, mail server, database)
    – Easier to manage by running one service per machine
    – Leads to low resource utilization
  • Virtualization can “consolidate” servers by hosting many VMs per machine, each running one service
    – Higher resource utilization while still delivering manageability

The Cloud Era

  • The cloud revolution is what really made virtualization take off
  • Instead of renting physical machines, rent VMs
    – Better consolidation and resource utilization
    – Better portability and manageability
    – Easy to deploy and maintain software
    – However, raises certain security and QoS concerns
  • Many instance types, some with specialized hardware; all well maintained and patched
    – AWS: 241 instance types in 30 families (as of Dec 2019)

The Virtuous Cycle for Cloud Providers

  • More customers utilize more resources
  • Greater utilization of resources requires more infrastructures
  • Buying more infrastructure in volume leads to lower unit costs
  • Lower unit costs allow for lower customer prices
  • Lower prices attract more customers

Container

  • VMs run a complete OS on emulated hardware
    – Too heavy-weighted and unnecessary for many cloud usages
    – Need to maintain OS versions, libraries, and make sure applications are compatible
  • Containers (e.g., Docker, LXC)
    – Run multiple isolated user-space applications on the host OS
    – Much more lightweight: better runtime performance, less memory, faster startup
    – Easier to deploy and maintain applications
    – But doesn’t provide as strong security boundaries as VMs

Managing Containers

  • Need a way to manage a cluster of containers
    – Handle failure, scheduling, monitoring, authentication, etc.
  • Kubernetes: the most popular container orchestration today
  • Cloud providers also offer various container orchestration services
    – e.g., AWS ECS, EKS

Serverless Computing

  • VMs and containers in cloud still need to be “managed”
  • Is there a way to just write software and let the cloud do all the rest?
  • Serverless computing (mainly in the form of Function as a Service)
    – Autoscaled and billed by request load
    – No need to manage “server cluster” or handle failure
    – A lot less control and customization (e.g., fixed CPU/memory/memory ratio, no direct communication across functions, no easy way to maintain states)

Summary of Virtualization History

  • Invented by IBM in 1960s for sharing expensive mainframes
  • Popular research ideas in 1960s and 1970s
  • Interest died as the adoption of cheap PCs and multi-user OSes surged in 1980s
  • A (somewhat accidental) research idea got transferred to VMWare
  • Real adoption happened with the growth of cloud computing
  • New forms of virtualization: container and serverless, in the modern cloud era