We need more than just packing and isolation: Scheduling, Lifecycle and health, Discovery, Monitoring, Auth{n,z}, Aggregates, Scaling, …
Everything at Google runs in containers: Gmail, Web Search, Maps, …
Open Source Containers: Kubernetes
Container orchestration
Builds on Docker containers
Multiple cloud and bare-metal environments
Supports existing OSS apps
Inspired and informed by Google's experiences and internal systems
100% open source, written in Go
Lets users manage applications, not machines
Primary concepts
Container: A sealed application package (Docker)
Pod: A small group of tightly coupled Containers
Labels: Identifying metadata attached to objects
Selector: A query against labels, producing a set result
Controller: A reconciliation loop that drives current state towards desired state (see the sketch after this list)
Service: A set of pods that work together
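The controller concept is just a loop that compares desired state with observed state and acts to close the gap. A minimal Go sketch of that pattern (the pod map and replica count are illustrative stand-ins, not Kubernetes API objects):

```go
package main

import (
	"fmt"
	"time"
)

// reconcile drives the observed number of pod replicas toward the desired count.
// This is a toy model of the controller pattern, not the Kubernetes controller API.
func reconcile(desired int, running map[string]bool) {
	if len(running) < desired {
		// Too few replicas: create the missing ones.
		for i := len(running); i < desired; i++ {
			name := fmt.Sprintf("pod-%d", i)
			running[name] = true
			fmt.Println("created", name)
		}
	} else if len(running) > desired {
		// Too many replicas: delete the surplus.
		for name := range running {
			if len(running) == desired {
				break
			}
			delete(running, name)
			fmt.Println("deleted", name)
		}
	}
}

func main() {
	running := map[string]bool{}
	for i := 0; i < 3; i++ {
		reconcile(2, running) // desired state: 2 replicas
		time.Sleep(100 * time.Millisecond)
	}
}
```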
Pod
a Kubernetes abstraction that represents a group of one or more application containers, and some shared resources for those containers
Shared storage, as Volumes
Networking, as a unique cluster IP address
Information about how to run each container, such as the container image version or specific ports to use
Node
A node is a worker machine (either VM or physical machine)
One pod runs on one node, one node can run multiple pods
Nodes managed by control plane
Persistent Volumes
A higher-level abstraction – insulation from any one cloud environment
Admins provision them, users claim them
Independent lifetime and fate
Can be handed off between pods and lives until the user is done with it
Dynamically “scheduled” and managed, like nodes and pods
Labels
Arbitrary metadata
Attached to any API object
Generally represent identity
Queryable by selectors
The only grouping mechanism
Used to determine which objects an operation applies to (see the sketch below)
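Selector matching is plain key/value equality over label maps. A minimal Go sketch (object names and labels are made up; real Kubernetes selectors also support set-based operators):

```go
package main

import "fmt"

// matches reports whether an object's labels satisfy an equality-based selector.
// Both are plain maps here, which is enough to show the "query -> set result" idea.
func matches(labels, selector map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	pods := map[string]map[string]string{
		"web-1": {"app": "web", "tier": "frontend"},
		"web-2": {"app": "web", "tier": "frontend"},
		"db-1":  {"app": "db", "tier": "backend"},
	}
	selector := map[string]string{"app": "web"}

	// Produce the set result of the query, as a selector would.
	for name, labels := range pods {
		if matches(labels, selector) {
			fmt.Println("selected:", name)
		}
	}
}
```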
Pod lifecycle
Once scheduled to a node, pods do not move
Pods can be observed pending, running, succeeded, or failed
Pods are not rescheduled by the scheduler or apiserver
Apps should consider these rules
Internals
kube-apiserver
Provides a forward facing REST interface into the Kubernetes control plane and datastore
All clients and other applications interact with Kubernetes strictly through the API server
Acts as the gatekeeper to the cluster by handling authentication and authorization, request validation, mutation, and admission control in addition to being the front-end to the backing datastore
kube-controller-manager
Monitors the cluster state via the apiserver and steers the cluster towards the desired state
kube-scheduler
Component on the master that watches newly created pods that have no node assigned, and selects a node for them to run on
Factors taken into account for scheduling decisions include individual and collective resource requirements, hardware/software/policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference and deadlines
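Conceptually the scheduler filters out infeasible nodes and then scores the remaining ones. A toy Go sketch of that two-phase shape (node names, resource fields, and the scoring rule are invented for illustration; real kube-scheduler plugins weigh many more factors):

```go
package main

import "fmt"

// node is a toy view of a worker machine's free resources.
type node struct {
	name             string
	freeCPU, freeMem int
}

// pickNode filters out nodes that cannot fit the pod, then scores the rest
// by remaining free memory, returning the best candidate if one exists.
func pickNode(nodes []node, cpu, mem int) (string, bool) {
	best, bestScore := "", -1
	for _, n := range nodes {
		if n.freeCPU < cpu || n.freeMem < mem { // filter phase
			continue
		}
		if score := n.freeMem - mem; score > bestScore { // score phase
			best, bestScore = n.name, score
		}
	}
	return best, bestScore >= 0
}

func main() {
	nodes := []node{{"node-a", 2, 4096}, {"node-b", 8, 16384}}
	if n, ok := pickNode(nodes, 4, 2048); ok {
		fmt.Println("scheduled on", n)
	} else {
		fmt.Println("pod stays pending: no feasible node")
	}
}
```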
cloud-controller-manager
Node Controller: For checking with the cloud provider to determine whether a node has been deleted in the cloud after it stops responding
Route Controller: For setting up routes in the underlying cloud infrastructure
Service Controller: For creating, updating, and deleting cloud provider load balancers
Volume Controller: For creating, attaching, and mounting volumes, and interacting with the cloud provider to orchestrate volumes
etcd
etcd: an atomic key-value store that uses Raft consensus
Backing store for all control plane metadata
Provides a strongly consistent and highly available key-value store for persisting cluster state
Stores objects and config information
Node Components
kubelet
An agent that runs on each node in the cluster. It makes sure that containers are running in a pod
The kubelet takes a set of PodSpecs that are provided through various mechanisms and ensures that the containers described in those PodSpecs are running and healthy
kube-proxy
Manages the network rules on each node
Performs connection forwarding or load balancing for Kubernetes cluster services
…
gVisor
“Containers do not contain” – Dan Walsh
Still sharing the same kernel
Share same device drivers
Linux kernel represents a large attack surface
cgroup accounting may not be accurate
Are System Calls Secure?
The interface between containers and OS is system calls
Linux x86_64 has 319 64-bit syscalls
2046 CVEs since 1999
Why can VMs be More Secure?
Virtual machines
Independent guest kernels
Virtual hardware interface: clear privilege separation and state encapsulation
But the virtualized hardware interface is inflexible, and VMs are heavyweight with a large memory footprint
Sandboxing
Rule-based sandboxing: reduce the attack surface by restricting what applications can access
e.g., AppArmor, SELinux, seccomp-bpf
Rules can be fragile (may not properly capture threats) and can't prevent side-channel attacks
gVisor
Sandboxes untrusted applications
Implements Linux system API in user space
Secure by default
Written in Go, a memory/type-safe language
gVisor Architecture
Two separate processes (communicating through IPC)
Sentry: emulates Linux system calls in user space
Gofer: file access
Most exploited syscalls: socket and open
Even if the Sentry is compromised, it still can't access files or open ports
Network is handled by user-mode network stack in Sentry
Performance overhead of indirections (guest OS and hypervisor)
Large memory footprint
Slow startup time
License and maintenance cost of guest OS
Do we really need to virtualize hardware and a full OS?
What about DevOps?
Why does it work? Separation of concerns
Why do people care?
Developers: Build once… run anywhere (dependencies, packages, versions, automation, without the overhead of VMs)
Administrators: Configure once… run anything (lifecycle efficiency, remove inconsistencies, segregation of duties, addresses the cost of VMs)
Linux Containers
Run everywhere
Regardless of kernel version
Regardless of host distro
Physical or virtual, cloud or not
Container and host architecture must match…
Run anything
If it can run on the host, it can run in the container
If it can run on a Linux kernel, it can run in the container
At a high level: it looks like a VM
At a low level: OS-level virtualization
Using Namespaces to separate “Views” of Users
Namespace: naming domain for various resources
User IDs (UIDs)
Process IDs (PIDs)
File paths (mnt)
Network sockets
Pipe names
Namespaces are isolated by kernel
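A hedged, Linux-only Go sketch of how a runtime asks the kernel for new namespaces when launching a process (requires root or equivalent capabilities; this shows the raw clone-flags mechanism, not Docker's actual code):

```go
// Linux-only sketch: run a shell in fresh UTS, PID, and mount namespaces.
package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("/bin/sh")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	// Ask the kernel for new namespaces when cloning the child:
	// the shell gets its own hostname view, PID numbering, and mount table.
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```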
Isolating resources with cgroups
Linux Control Groups (cgroups): collection of Linux processes
Limits resource usage at the group level (e.g., memory, CPU, devices); see the sketch below
Fair sharing of resources
Track resource utilization (e.g., could be used for billing/management)
Control processes (e.g., pause/resume, checkpoint/restore)
Efficiency: almost no overhead
Processes are isolated, but run straight on the host
CPU performance = native performance
Memory performance = a few % shaved off for (optional) accounting
Network performance = small overhead; can be optimized to zero overhead
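A hedged Go sketch of exercising the cgroup mechanism directly, assuming cgroup v2 mounted at /sys/fs/cgroup and root privileges (the group name `demo` and the 64 MiB cap are arbitrary choices for illustration):

```go
// Linux-only sketch: create a cgroup, cap its memory, and join it.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	group := "/sys/fs/cgroup/demo" // hypothetical group name
	if err := os.MkdirAll(group, 0o755); err != nil {
		panic(err)
	}
	// Limit the group to 64 MiB of memory (cgroup v2 "memory.max").
	if err := os.WriteFile(filepath.Join(group, "memory.max"), []byte("67108864"), 0o644); err != nil {
		panic(err)
	}
	// Move this process into the group; the kernel now accounts for and limits it.
	pid := []byte(fmt.Sprintf("%d", os.Getpid()))
	if err := os.WriteFile(filepath.Join(group, "cgroup.procs"), pid, 0o644); err != nil {
		panic(err)
	}
	fmt.Println("running under", group)
}
```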
Docker
Docker Inc
Founded as dotCloud, Inc. in 2010 by Solomon Hykes (renamed to Docker Inc. in 2013)
Estimated to be valued at over $1 billion (101-250 employees)
Docker the software
A container engine written in Go (based on Linux containers)
Docker community
Now 1851 contributors, 16.2k forks of docker engine on GitHub (called Moby)
Why are Docker Containers Lightweight?
Docker Engine
daemon: REST API (receiving instructions) and other features
Without virtualization: Extract VPN from VA, VPN-PFN mapping…
With virtualization: Extract VPN from VA, VPN-PFN mapping (handled by VMM to be VPN-MFN mapping)…
Difficulty in Virtualizing Hardware-Managed TLB
Hardware-managed TLB – Hardware does page table walk on each TLB miss and fills TLB with the found PTE
Hypervisor doesn't have a chance to intercept on TLB misses
Solution-1: shadow paging
Solution-2: direct paging (para-virtualization)
Solution-3: new hardware
Shadow Paging
VMM intercepts the guest OS setting its (virtual) CR3
VMM iterates over the guest page table, constructs a corresponding shadow page table
In shadow PT, every guest physical address is translated into host physical address (machine address)
Finally, VMM sets the real CR3 to point to the shadow page table
The guest can't be allowed access to the hardware PT because then it would essentially have control of the machine. So the hypervisor keeps the "real" mappings (guest virtual VPN -> host physical MFN) in the hardware when the relevant guest is executing, and keeps a representation of the page tables that the guest thinks it's using "in the shadows." This avoids the VPN -> PFN translation step. As far as page faults go, nothing changes from the hardware's point of view (remember, the hypervisor makes it so the page tables used by the hardware contain VPN->MFN mappings), a page fault will simply generate an exception and redirect to the appropriate exception handler. However, when a page fault occurs while a VM is running, this exception can be "forwarded" to the hypervisor, which can then handle it appropriately.
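To make the two-level mapping concrete, here is a toy Go model of building a shadow page table from a guest page table and the VMM's pmap (flat integer tables with invented frame numbers; real page tables are multi-level radix trees with permission bits):

```go
package main

import "fmt"

// A toy model of shadow paging with single-level page tables.
// guestPT: VPN -> PPN (maintained by the guest OS)
// pmap:    PPN -> MPN (maintained by the VMM)
// shadow:  VPN -> MPN (what the hardware actually walks)
func buildShadow(guestPT, pmap map[int]int) map[int]int {
	shadow := make(map[int]int)
	for vpn, ppn := range guestPT {
		if mpn, ok := pmap[ppn]; ok {
			shadow[vpn] = mpn // collapse the two translations into one
		}
		// PPNs the VMM has not backed yet are left unmapped; touching them
		// faults into the VMM, which can then allocate an MPN.
	}
	return shadow
}

func main() {
	guestPT := map[int]int{0: 7, 1: 3} // guest thinks VPN 0 -> PPN 7, VPN 1 -> PPN 3
	pmap := map[int]int{7: 42, 3: 19}  // VMM placed PPN 7 in MPN 42, PPN 3 in MPN 19
	fmt.Println(buildShadow(guestPT, pmap)) // map[0:42 1:19]
}
```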
Question
Assume that: – There are 10 VMs running on a machine – Each VM contains 10 applications
How many shadow page tables in total? -> 110? – Shadow page tables are per application – Guest page tables are per application – pmaps are per VM
What if Guest OS Modifies Its Page Table?
Should not allow it to happen directly – Since the real CR3 points to the shadow page table, not the guest page table, guest updates alone would not take effect – Need to synchronize the shadow page table with the guest page table
VMM needs to intercept when the guest OS modifies its page table, and update the shadow page table accordingly:
1. Mark the guest page table pages as read-only (in the shadow page table)
2. If the guest OS tries to modify its page tables, it triggers a page fault
3. VMM handles the page fault by updating the shadow page table
Dealing with Page Faults
When page fault occurs, traps to VMM
If the present bit is 0 in the guest page table entry, the guest OS needs to handle the fault – Guest OS loads the page from virtual disk into guest physical memory and sets the present bit to 1 – Guest OS returns from the page fault, which traps into the VMM again – VMM sees that present is 1 in the guest PTE and creates an entry in the shadow page table – VMM returns from the original page fault
If present is 1: guest OS thinks page is present (but VMM may have swapped it out), VMM handles transparently – VMM locates the corresponding physical page, loads it in memory if needed – VMM creates entry in shadow page table – VMM returns from the original page fault
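The same toy model can show the fault dispatch described above: forward the fault to the guest when its PTE is not present, otherwise materialize the missing shadow entry (a conceptual sketch with invented frame numbers, not real VMM code):

```go
package main

import "fmt"

// guestPTE is a toy guest page table entry.
type guestPTE struct {
	present bool
	ppn     int
}

// handleShadowFault mirrors the two cases above: either the guest has no
// mapping (forward the fault to the guest OS), or the mapping exists and the
// VMM only needs to materialize it in the shadow page table.
func handleShadowFault(vpn int, guestPT map[int]guestPTE, pmap, shadow map[int]int) {
	pte, ok := guestPT[vpn]
	if !ok || !pte.present {
		fmt.Println("forward fault to guest OS; it will page in and retry")
		return
	}
	mpn, ok := pmap[pte.ppn]
	if !ok {
		mpn = len(pmap) + 100 // pretend to allocate/swap in a machine frame
		pmap[pte.ppn] = mpn
	}
	shadow[vpn] = mpn // create the missing shadow entry and resume the guest
	fmt.Printf("shadow PT: VPN %d -> MPN %d\n", vpn, mpn)
}

func main() {
	guestPT := map[int]guestPTE{4: {present: true, ppn: 9}}
	pmap := map[int]int{}
	shadow := map[int]int{}
	handleShadowFault(4, guestPT, pmap, shadow) // present in guest, absent in shadow
	handleShadowFault(5, guestPT, pmap, shadow) // not present: guest OS must handle
}
```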
What if a Guest App Accesses its Kernel Memory?
How do we selectively allow/deny access to kernel-only pages?
One solution: split a shadow page table into two tables – Two shadow page tables, one for user, one for kernel – When guest OS switches to guest applications, VMM will switch the shadow page table as well, vice versa
What about Memory for Translation Cache (BT)?
Translation cache intermingles guest and monitor memory accesses – Need to distinguish these accesses – Monitor accesses have full privileges – Guest accesses have lesser privileges
On x86 can use segmentation – Monitor lives in high memory – Guest segments truncated to allow no access to monitor – Binary translator uses guest segments for guest accesses and monitor segments for monitor accesses
Pros and Cons of Shadow Paging
Pros – When shadow PT is established, memory accesses are very fast
Cons – Maintaining consistency between guest PTs and shadow PTs involves VMM traps, which can be costly – TLB flush on every "world switch" – Memory space overhead to maintain pmap
Hardware-Assisted Memory Virtualization
Hardware support for memory virtualization – Intel EPT (Extended Page Table) and AMD NPT (Nested Page Table) – EPT: a per VM table translating PPN -> MPN, referenced by EPT base pointer – EPT controlled by the hypervisor, guest page table (GPT) controlled by guest OS (both exposed to hardware) – Hardware directly walks GPT + EPT (for each PPN access during GPT walk, needs to walk the EPT to determine MPN) – No VM exits due to page faults, INVLPG, or CR3 accesses
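A toy Go sketch of the two-dimensional walk: the guest page table maps VPN to PPN, and every PPN is translated through the EPT to an MPN (flat maps stand in for the nested radix-tree walks the hardware actually performs):

```go
package main

import "fmt"

// translate models the nested walk: the guest page table (guest-controlled)
// is walked first, and each guest-physical frame it yields is translated
// through the EPT (hypervisor-controlled). In real hardware both are
// multi-level, which is why a TLB miss can be so expensive.
func translate(vpn int, gpt, ept map[int]int) (int, bool) {
	ppn, ok := gpt[vpn] // guest page table walk
	if !ok {
		return 0, false // guest page fault, delivered to the guest OS
	}
	mpn, ok := ept[ppn] // EPT walk
	if !ok {
		return 0, false // EPT violation, handled by the hypervisor
	}
	return mpn, true
}

func main() {
	gpt := map[int]int{1: 5}
	ept := map[int]int{5: 77}
	if mpn, ok := translate(1, gpt, ept); ok {
		fmt.Println("VPN 1 -> MPN", mpn)
	}
}
```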
Pros and Cons of EPT
Pros – Simplified VMM design (all handled by hardware) – Guest PT changes do not trap, minimize VM exits – Lower memory space overhead (no need for pmap in memory)
Cons – TLB miss is costly: can involve many memory accesses to finish the walk!
Reclaiming Memory
ESX (and other hypervisors) allow overcommitment of memory – Total memory size of all VMs can exceed actual machine memory size – ESX must have some way to reclaim memory from VMs (and swap to disk)
Traditional: add transparent swap layer – Requires “meta-level” decisions: which page from which VM to swap – Best data to guide decision known only by guest OS – Guest and meta-level policies may clash, resulting in double paging
Alternative: implicit cooperation – Coax guest OS into doing its own page replacement – Avoid meta-level policy decisions
Ballooning Details
Guest drivers – Inflate: allocate pinned PPNs; backing MPNs reclaimed – Use standard Windows/Linux/BSD kernel APIs
Performance benchmark – Linux VM, memory-intensive dbench workload – Compares 256MB with balloon sizes 32-128MB vs static VMs – Overhead 1.4%-4.4%
Memory Sharing
Motivation – Multiple VMs running same OS, apps – Collapse redundant copies of code, data, zeros
Transparent page sharing – Map multiple PPNs to single MPN (copy-on-write) – Pioneered by Disco, but required guest OS hooks
New twist: content-based sharing – General-purpose, no guest OS changes – Background activity saves memory over time
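A toy Go sketch of content-based sharing: hash each page's contents and let identical pages share one frame (real systems must recheck contents on a hash hit and break sharing copy-on-write when a shared page is written):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// sharePages groups pages by content hash; each group would map to a single
// machine frame marked copy-on-write.
func sharePages(pages [][]byte) map[[32]byte][]int {
	shared := make(map[[32]byte][]int) // content hash -> indices of pages sharing one MPN
	for i, p := range pages {
		h := sha256.Sum256(p)
		shared[h] = append(shared[h], i)
	}
	return shared
}

func main() {
	zero := make([]byte, 4096)             // zero pages are the most common duplicates
	other := append(make([]byte, 4095), 1) // a page that differs in its last byte
	for h, idxs := range sharePages([][]byte{zero, zero, other}) {
		fmt.Printf("%x...: pages %v share one frame\n", h[:4], idxs)
	}
}
```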
Memory Allocation
Min size – Guaranteed, even when overcommitted – Enforced by admission control
Max size – Amount of “physical” memory seen by guest OS – Allocated when undercommitted
Para-virtualization: modify guest OS to avoid non-virtualizable instructions (e.g., Xen)
Change hardware: add new CPU mode, extend page table, and other hardware assistance (e.g., Intel VT-x, EPT, VT-d, AMD-V)
Full Emulation / Hosted Interpretation
VMM implements the complete hardware architecture in software
VMM steps through VM’s instructions and update emulated hardware as needed
Can handle all types of instructions, but super slow
Trap-and-Emulate
Basic Idea of Binary Translation
Based on input guest binary, compile (translate) instructions in a cache and run them directly
Challenges: – Protection of the cache – Correctness of direct memory addressing – Handling relative memory addressing (e.g., jumps) – Handling sensitive instructions
VMware’s Dynamic Binary Translation
Binary: input is binary x86 code
Dynamic: translation happens at runtime
On demand: code is translated only when it is about to execute
System level: rules set by x86 ISA, not higher-level ABIs
Subsetting: output a safe subset of input full x86 instruction set
Adaptive: translated code is adjusted according to guest behavior changes
Translation Unit
TU: up to 12 instructions or a "terminating" instruction (essentially a basic block)
Why use a TU as the unit rather than individual instructions? (amortizes translation overhead)
TU -> Compiled Code Fragment (CCF)
CCF stored in translation cache (TC)
At the end of each CCF, call into translator (implemented by the VMM) to decide and translate the next TU (more optimization soon) – If the destination code is already in TC, then directly jumps to it – Otherwise, compiles the next CCF into TC
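A toy Go sketch of this translate-and-cache loop: look up the next TU's compiled code fragment in the translation cache, translate on a miss, and jump directly to it on later visits (strings stand in for emitted machine code; the PC arithmetic is invented for illustration):

```go
package main

import "fmt"

// tc is the translation cache: guest PC of a TU -> its compiled code fragment (CCF).
type tc map[int]string

// run models execution out of the TC: a miss calls out to the translator,
// a hit jumps straight to previously compiled code.
func run(startPC int, guest map[int]string, cache tc) {
	for pc, steps := startPC, 0; steps < 4; steps++ {
		ccf, hit := cache[pc]
		if !hit {
			// Callout to the translator: compile the next TU and remember it.
			ccf = "translated{" + guest[pc] + "}"
			cache[pc] = ccf
			fmt.Printf("miss at %d: compiled %s\n", pc, ccf)
		} else {
			fmt.Printf("hit at %d: jump directly to %s\n", pc, ccf)
		}
		pc = (pc + 1) % 2 // pretend each TU ends with a jump to the other TU
	}
}

func main() {
	guest := map[int]string{0: "TU-A", 1: "TU-B"}
	run(0, guest, tc{}) // first pass misses, later passes chain through the TC
}
```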
Architecture of VMware’s Binary Translation
IDENT/Non-IDENT Translation
Most instructions can be translated IDENT (identically, i.e., copied unchanged), except for:
PC-relative address
Direct control flow
Indirect control flow
Sensitive instructions – If already traps, then can be handled when it traps (more optimization soon to be discussed) – Otherwise, replace it with a call to the emulation function
Adaptive Binary Translation
Binary translation can outperform classical virtualization by avoiding traps – rdtsc on Pentium 4: trap-and-emulate 2030 cycles, callout-and-emulate 1254 cycles, in-TC emulation 216 cycles
What about sensitive instructions that are not privileged? – "Innocent until proven guilty" – Start in the innocent state and detect instructions that trap frequently – Retranslate them non-IDENT to avoid the trap – Patch the original IDENT translation with a forwarding jump to the new translation
Hardware-Assisted CPU Virtualization (Intel VT-x)
Two new modes of execution (orthogonal to protection rings) – VMX root mode: same as x86 without VT-x – VMX non-root mode: runs VM, sensitive instructions cause transition to root mode, even in Ring 0
New hardware structure: VMCS (virtual machine control structure) – One VMCS for one virtual processor – Configured by VMM to determine which sensitive instructions cause VM exit – Specifies guest OS state
Example: Guest syscall with Hardware Virtualization
VMM fills in the VMCS exception table for the guest OS, sets the VMCS bit so that syscall exceptions do not cause a VM exit, and then executes a VM entry
Guest application invokes a syscall; it does not cause a VM exit, but is delivered directly to the guest OS handler through the exception table configured via the VMCS
Conclusion
Virtualizing CPU is a non-trivial task, esp. for non-virtualizable architectures like x86
Software binary translation is a neat (but very tricky) way to virtualize x86 and still meet Popek and Goldberg’s virtualization principles
Hardware vendors keep adding more virtualization support, which makes life a lot easier
Software and hardware techniques both have pros and cons
– Virtual network interface – Other virtualized I/O systems – Command-interpreter system
Dual-Mode Operation
OS manages shared resources
OS protects programs from other programs (OS needs to be “privileged”)
Dual-mode operation of hardware – Kernel mode – can run privileged instructions – User mode – can only run non-privileged instructions
Different OS Structures
Transition between User/Kernel Modes
Interrupt
A mechanism for coordination between concurrently operating units of a computer system (e.g. CPU and I/O devices) to respond to specific conditions within a computer
Results in transfer of control (to interrupt handler in the OS), forced by hardware
Hardware interrupts – I/O devices: NIC, keyboard, etc. – Timer
Software interrupts – Exception: a software error (e.g., divided by zero) – System call
Handling Interrupts
Incoming interrupts are disabled (at this and lower priority levels) while the interrupt is being processed to prevent a lost interrupt
Interrupt architecture must save the address of the interrupted instruction
Interrupt transfers control to the interrupt service routine – generally, through the interrupt vector, which contains the addresses of all the service routines
If interrupt routine modifies process state (register values) – save the current state of the CPU (registers and the program counter) on the system stack – restore state before returning
Interrupts are re-enabled after servicing current interrupt
Resume the interrupted instruction
Interaction between Different Layers
Design Space (Level vs. ISA)
Type 1 and Type 2 Hypervisor (VMM)
Virtualization Principles
Popek and Goldberg’s virtualization principles in 1974:
Fidelity: Software on the VMM executes identically to its execution on hardware, barring timing effects
Performance: An overwhelming majority of guest instructions are executed by the hardware without the intervention of the VMM
Safety: The VMM manages all hardware resources
Possible implementation: Full Emulation / Hosted Interpretation
VMM implements the complete hardware architecture in software
VMM steps through VM’s instructions and update emulated hardware as needed
Pros: – Easy to handle all types of instructions (can enforce policy when doing so) – Provides complete isolation (no guest instructions runs directly on hardware) – Can debug low-level code in the guest
Cons: – Emulating a modern processor is difficult – Violates the performance requirement (VERY SLOW)
Protection Rings
More privileged rings can access memory of less privileged ones
Calling across rings can only happen with hardware enforcement
Only Ring 0 (the innermost ring) can execute privileged instructions
Rings 1, 2, and 3 trap when executing privileged instructions
Usually, the OS executes in Ring 0 and applications execute in Ring 3
Improving Performance over Full Emulation
Idea: execute most guest instructions natively on hardware (assuming guest OS runs on the same architecture as real hardware)
Applications run in ring 3
Cannot allow guest OS to run sensitive instructions directly!
Guest OS runs in ring 1
When the guest OS executes a privileged instruction, it will trap into the VMM (in Ring 0)
When a guest application generates a software interrupt, it will trap into the VMM (in Ring 0)
Popek and Goldberg (1974) define two classes of instructions – privileged instructions: those that trap when executed in user mode – sensitive instructions: those that modify or depend on hardware configuration
Trap-and-Emulate
Hand off sensitive operations to the hypervisor
VMM emulates the effect of these operations on virtual hardware provided to the guest OS – VMM controls how the VM interacts with physical hardware – VMM fools the guest OS into thinking that it runs at the highest privilege level
Performance implications – Almost no overhead for non-sensitive instructions – Large overhead for sensitive instructions
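A toy Go sketch of the trap-and-emulate split: ordinary instructions run natively, while sensitive ones (here `cli`/`sti`) trap and are emulated against per-vCPU virtual state (the instruction strings and the single flag are illustrative, not a real VMM):

```go
package main

import "fmt"

// vcpu holds the virtual state the guest believes it controls.
type vcpu struct {
	interruptsOn bool // virtual interrupt-enable flag
}

// execute dispatches one instruction: sensitive ones are emulated by the
// "VMM" against virtual state, everything else runs natively (the cheap case).
func (v *vcpu) execute(instr string) {
	switch instr {
	case "cli": // sensitive: would disable interrupts for the whole machine
		v.interruptsOn = false
		fmt.Println("trap: VMM emulated cli on the vCPU")
	case "sti":
		v.interruptsOn = true
		fmt.Println("trap: VMM emulated sti on the vCPU")
	default:
		fmt.Println("ran natively:", instr)
	}
}

func main() {
	v := &vcpu{interruptsOn: true}
	for _, instr := range []string{"add", "cli", "mov", "sti"} {
		v.execute(instr)
	}
	fmt.Println("virtual interrupt flag:", v.interruptsOn)
}
```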
System Calls with Virtualization
x86 Difficulties
Popek and Goldberg’s Theorem (1974): A machine can be virtualized (using trap-and-emulate) if every sensitive instruction is privileged
Not all sensitive instructions are privileged with x86
These instructions do not trap and behave differently in kernel and user mode
Example: popf
Possible Solutions
Emulate: interpret each instruction, super slow (e.g., Virtual PC on Mac)
Almost all cloud applications run in the virtualization environment
Most IT infrastructures run in the cloud or on-prem virtualization environment
Understanding virtualization is key to building cloud infrastructures
Understanding virtualization will help application design
Operating Systems
A piece of software that manages and virtualizes hardware for applications – An indirection layer between applications and hardware – Provides a high-level interface to applications – While interacting with hardware devices through low-level interfaces – Runs privileged instructions to interact with hardware devices
Applications – Can only execute unprivileged instructions – Perform system calls or faults to "trap" into the OS – OS protects applications from each other (to some extent) (e.g., address spaces)
Virtualization
Adding another level of indirection to run OSes on an abstraction of hardware
Virtual Machine (Guest OS) – OS that runs on virtualized hardware resources – Managed by another piece of software (the VMM/hypervisor)
Virtual Machine Monitor (Hypervisor) – The software that creates and manages the execution of virtual machines – Runs on bare-metal hardware
History
Mainframes and IBM
Before we have datacenters or PCs, there were giant metal frames
Support computational and I/O intensive commercial/scientific workloads
Expensive (the IBM 704 (1954) cost $250K to millions)
Different generations were not architecturally compatible
Batch-oriented (as opposed to interactive)
Meanwhile, ideas started to appear towards a time-sharing OS
The computer was becoming a multiplexed tool for a community of users, instead of being a batch tool for wizard programmers
IBM’s Response
IBM bet the company on the System/360 hardware family [1964] – S/360 was the first to clearly distinguish architecture and implementation – Its architecture was virtualizable
The CP/CMS system software [1968] – CP: a “control program” that created and managed virtual S/360 machines – CMS: the “Cambridge monitor system” — a lightweight, single-user OS – With CP/CMS, can run several different OSs concurrently on the same HW
IBM CP/CMS was the first virtualization system. Main purpose: let multiple users share a mainframe
IBM’s Mainframe Product Line
System/360 (1964-1970) – Support virtualization via CP/CMS, channel I/O, virtual memory, …
System/370 (1970-1988) – Reimplementation of CP/CMS as VM/370
System/390 (1990-2000)
zSeries (2000-present)
Huge moneymaker for IBM, and many businesses still depend on these!
PCs and Multi-User OSes
1976: Steve Jobs and Steve Wozniak start Apple Computer and roll out the Apple I, the first computer with a single circuit board
1981: The first IBM personal computer, code-named “Acorn,” is introduced. It uses Microsoft’s MS-DOS
1983: Apple’s Lisa is the first personal computer with a GUI
1985: Microsoft announces Windows
The PC market (1980-90s): ship hundreds of millions of units, not hundreds of units
Cluster computing (1990s): build a cheap mainframe out of a cluster of PCs
Multiprocessor and Stanford FLASH
Development of multiprocessor hardware boomed (1990s)
Stanford FLASH Multiprocessor – A multiprocessor that integrates global cache coherence and message passing
But system software lagged behind
Commodity OSes do not scale and cannot isolate/contain faults
Stanford Disco and VMWare
Stanford Disco project (SOSP’97 Mendel Rosenblum et al.) – Extend modern OS to run efficiently on shared memory multiprocessors – A VMM built to run multiple copies of Silicon Graphics IRIX OS on FLASH
Mendel Rosenblum, Diane Greene, and others co-founded VMWare in 1998 – Brought virtualization to PCs. Main purpose: run different OSes on different architectures – Initial market was software developers for testing software in multiple OSes – Acquired by EMC (2003), which later merged with DELL (2016)
Server Consolidation
Datacenters often run many services (e.g., search, mail server, database) – Easier to manage by running one service per machine – Leads to low resource utilization
Virtualization can “consolidate” servers by hosting many VMs per machine, each running one service – Higher resource utilization while still delivering manageability
The Cloud Era
The cloud revolution is what really made virtualization take off
Instead of renting physical machines, rent VMs – Better consolidation and resource utilization – Better portability and manageability – Easy to deploy and maintain software – However, raise certain security and QoS concerns
Many instance types, some with specialized hardware; all well maintained and patched – AWS: 241 instance types in 30 families (as of Dec 2019)
The Virtuous Cycle for Cloud Providers
More customers utilize more resources
Greater utilization of resources requires more infrastructure
Buying more infrastructure in volume leads to lower unit costs
Lower unit costs allow for lower customer prices
Lower prices attract more customers
Container
VMs run a complete OS on emulated hardware – Too heavyweight and unnecessary for many cloud usages – Need to maintain OS versions, libraries, and make sure applications are compatible
Containers (e.g., Docker, LXC) – Run multiple isolated user-space applications on the host OS – Much more lightweight: better runtime performance, less memory, faster startup – Easier to deploy and maintain applications – But don't provide security boundaries as strong as VMs
Managing Containers
Need a way to manage a cluster of containers – Handle failure, scheduling, monitoring, authentication, etc.
Kubernetes: the most popular container orchestration today
Cloud providers also offer various container orchestration services – e.g., AWS ECS, EKS
Serverless Computing
VMs and containers in cloud still need to be “managed”
Is there a way to just write software and let the cloud do all the rest?
Serverless computing (mainly in the form of Function as a Service) – Autoscaled and billed by request load – No need to manage a "server cluster" or handle failures – A lot less control and customization (e.g., fixed CPU/memory ratio, no direct communication across functions, no easy way to maintain state)
Summary of Virtualization History
Invented by IBM in 1960s for sharing expensive mainframes
Popular research ideas in 1960s and 1970s
Interest died as the adoption of cheap PCs and multi-user OSes surged in 1980s
A (somewhat accidental) research idea got transferred to VMWare
Real adoption happened with the growth of cloud computing
New forms of virtualization: container and serverless, in the modern cloud era