Kubernetes
- We need more than just packing and isolation: scheduling, lifecycle and health, discovery, monitoring, authentication and authorization, aggregates, scaling, …
- Everything at Google runs in containers: Gmail, Web Search, Maps, …
- Open Source Containers: Kubernetes
- Container orchestration
- Builds on Docker containers
- Multiple cloud and bare-metal environments
- Supports existing OSS apps
- Inspired and informed by Google’s experiences and internal systems
- 100% open source, written in Go
- Lets users manage applications, not machines

Primary concepts
- Container: A sealed application package (Docker)
- Pod: A small group of tightly coupled Containers
- Labels: Identifying metadata attached to objects
- Selector: A query against labels, producing a set result
- Controller: A reconciliation loop that drives current state towards desired state (sketched after this list)
- Service: A set of pods that work together
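
A minimal sketch of that reconciliation pattern in Go; Observe, Desired, CreateReplica, and DeleteReplica are illustrative stand-ins, not real Kubernetes APIs:

    package main

    import "time"

    type State struct{ Replicas int }

    func Observe() State { return State{Replicas: 2} } // query actual cluster state
    func Desired() State { return State{Replicas: 3} } // read the declared spec
    func CreateReplica() { /* start one more pod */ }
    func DeleteReplica() { /* stop one pod */ }

    func main() {
        for {
            cur, want := Observe(), Desired()
            switch {
            case cur.Replicas < want.Replicas:
                CreateReplica()
            case cur.Replicas > want.Replicas:
                DeleteReplica()
            }
            time.Sleep(time.Second) // real controllers use watches, not polling
        }
    }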
Pod

- A Kubernetes abstraction that represents a group of one or more application containers, plus shared resources for those containers:
- Shared storage, as Volumes
- Networking, as a unique cluster IP address
- Information about how to run each container, such as the container image version or specific ports to use
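
As a sketch, such a pod object built with the typed Go client libraries (k8s.io/api, k8s.io/apimachinery); the name, labels, image version, and port are placeholder values:

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    func main() {
        pod := corev1.Pod{
            ObjectMeta: metav1.ObjectMeta{
                Name:   "web",                           // placeholder name
                Labels: map[string]string{"app": "web"}, // identifying metadata
            },
            Spec: corev1.PodSpec{
                Containers: []corev1.Container{{
                    Name:  "nginx",
                    Image: "nginx:1.25", // pinned container image version
                    Ports: []corev1.ContainerPort{{ContainerPort: 80}},
                }},
                // Shared storage, visible to every container in the pod.
                Volumes: []corev1.Volume{{
                    Name:         "cache",
                    VolumeSource: corev1.VolumeSource{EmptyDir: &corev1.EmptyDirVolumeSource{}},
                }},
            },
        }
        fmt.Println(pod.Name)
    }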

Node

- A node is a worker machine (either VM or physical machine)
- One pod runs on one node; one node can run multiple pods
- Nodes are managed by the control plane
Persistent Volumes
- A higher-level storage abstraction that insulates users from any one cloud environment
- Admins provision them; users claim them
- Independent lifetime and fate
- Can be handed off between pods and lives until the user is done with it
- Dynamically “scheduled” and managed, like nodes and pods
Labels
- Arbitrary metadata
- Attached to any API object
- Generally represent identity
- Queryable by selectors
- The only grouping mechanism
- Used to determine which objects to apply an operation to
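
A sketch of a selector query using client-go, assuming a standard kubeconfig at the default location; the namespace and label values are examples:

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Load ~/.kube/config (assumed default location).
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        cs, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }
        // A selector is a query against labels, producing a set result.
        pods, err := cs.CoreV1().Pods("default").List(context.TODO(),
            metav1.ListOptions{LabelSelector: "app=web,tier=frontend"})
        if err != nil {
            panic(err)
        }
        for _, p := range pods.Items {
            fmt.Println(p.Name)
        }
    }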
Pod lifecycle
- Once scheduled to a node, pods do not move
- Pods can be observed in the Pending, Running, Succeeded, or Failed phase
- Pods are not rescheduled by the scheduler or apiserver
- Applications should be designed with these rules in mind
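
The four observable phases correspond to the corev1.PodPhase constants; a small sketch, where the descriptions are paraphrases of the rules above:

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
    )

    // describe maps an observed pod phase to a note on what it implies.
    func describe(p corev1.PodPhase) string {
        switch p {
        case corev1.PodPending:
            return "accepted, waiting to be scheduled or to pull images"
        case corev1.PodRunning:
            return "bound to a node, at least one container running"
        case corev1.PodSucceeded:
            return "all containers exited successfully; terminal"
        case corev1.PodFailed:
            return "a container failed; terminal, not rescheduled in place"
        }
        return "unknown"
    }

    func main() { fmt.Println(describe(corev1.PodRunning)) }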
Internals
kube-apiserver
- Provides the front-facing REST interface into the Kubernetes control plane and datastore
- All clients and other applications interact with Kubernetes strictly through the API server
- Acts as the gatekeeper to the cluster: it handles authentication and authorization, request validation, mutation, and admission control, and serves as the front end to the backing datastore
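
As an illustration, the API is plain REST; with `kubectl proxy` running (it exposes the apiserver on localhost:8001 by default and handles authentication), a pod listing can be fetched with nothing but the standard library:

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    func main() {
        // The path below follows the standard REST layout for core/v1 objects.
        resp, err := http.Get("http://127.0.0.1:8001/api/v1/namespaces/default/pods")
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        body, err := io.ReadAll(resp.Body)
        if err != nil {
            panic(err)
        }
        fmt.Println(string(body)) // a JSON PodList object
    }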
kube-controller-manager
- Monitors the cluster state via the apiserver and steers the cluster towards the desired state
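
A sketch of that monitoring side: controllers observe state changes through watches on the apiserver (kubeconfig loading as in the earlier listing):

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        cs, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }
        // Stream pod events across all namespaces ("" = every namespace).
        w, err := cs.CoreV1().Pods("").Watch(context.TODO(), metav1.ListOptions{})
        if err != nil {
            panic(err)
        }
        for ev := range w.ResultChan() {
            fmt.Println(ev.Type) // ADDED / MODIFIED / DELETED
        }
    }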
kube-scheduler
- Control plane component that watches for newly created pods with no node assigned, and selects a node for them to run on
- Factors taken into account include individual and collective resource requirements; hardware, software, and policy constraints; affinity and anti-affinity specifications; data locality; inter-workload interference; and deadlines
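
A toy filter-then-score sketch of such a decision; the types and the scoring function are ours, not the real kube-scheduler framework:

    package main

    import "fmt"

    type Node struct {
        Name    string
        FreeCPU int // millicores
        FreeMem int // MiB
    }

    type Pod struct{ CPU, Mem int }

    // schedule filters out nodes that cannot fit the pod, then scores
    // the remainder, preferring the node with the most headroom.
    func schedule(p Pod, nodes []Node) (string, bool) {
        best, bestScore := "", -1
        for _, n := range nodes {
            if n.FreeCPU < p.CPU || n.FreeMem < p.Mem {
                continue // filtered: resource constraints not satisfied
            }
            score := (n.FreeCPU - p.CPU) + (n.FreeMem - p.Mem)
            if score > bestScore {
                best, bestScore = n.Name, score
            }
        }
        return best, bestScore >= 0
    }

    func main() {
        nodes := []Node{{"n1", 500, 1024}, {"n2", 2000, 4096}}
        fmt.Println(schedule(Pod{CPU: 1000, Mem: 512}, nodes)) // n2 true
    }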
cloud-controller-manager
- Node Controller: For checking the cloud provider to determine whether a node has been deleted in the cloud after it stops responding
- Route Controller: For setting up routes in the underlying cloud infrastructure
- Service Controller: For creating, updating, and deleting cloud provider load balancers
- Volume Controller: For creating, attaching, and mounting volumes, and interacting with the cloud provider to orchestrate volumes
etcd
- etcd: an atomic key-value store that uses Raft consensus
- Backing store for all control plane metadata
- Provides a strongly consistent, highly available key-value store for persisting cluster state
- Stores objects and configuration information
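
A minimal put/get sketch using the official etcd v3 Go client (go.etcd.io/etcd/client/v3); the endpoint is the default local address, and the /registry prefix mirrors how Kubernetes lays out its keys:

    package main

    import (
        "context"
        "fmt"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    func main() {
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"127.0.0.1:2379"}, // default local etcd
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            panic(err)
        }
        defer cli.Close()

        ctx := context.TODO()
        if _, err := cli.Put(ctx, "/registry/demo", "hello"); err != nil {
            panic(err)
        }
        resp, err := cli.Get(ctx, "/registry/demo")
        if err != nil {
            panic(err)
        }
        for _, kv := range resp.Kvs {
            fmt.Printf("%s = %s\n", kv.Key, kv.Value)
        }
    }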
Node Components
kubelet
- An agent that runs on each node in the cluster. It makes sure that containers are running in a pod
- The kubelet takes a set of PodSpecs that are provided through various mechanisms and ensures that the containers described in those PodSpecs are running and healthy
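
A sketch of a liveness-style HTTP check in the spirit of the probes the kubelet runs against containers; the URL and timeout are placeholders:

    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    // healthy treats any 2xx answer within the timeout as success,
    // the way an HTTP liveness probe does.
    func healthy(url string) bool {
        client := http.Client{Timeout: 2 * time.Second}
        resp, err := client.Get(url)
        if err != nil {
            return false // no answer counts as unhealthy
        }
        defer resp.Body.Close()
        return resp.StatusCode >= 200 && resp.StatusCode < 300
    }

    func main() {
        fmt.Println(healthy("http://127.0.0.1:8080/healthz"))
    }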
kube-proxy
- Manages the network rules on each node
- Performs connection forwarding or load balancing for Kubernetes cluster services
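
A toy user-space round-robin TCP forwarder conveying the idea; the real kube-proxy typically programs iptables or IPVS rules rather than proxying in user space, and the addresses below are placeholders:

    package main

    import (
        "io"
        "net"
    )

    func main() {
        backends := []string{"10.0.0.2:8080", "10.0.0.3:8080"} // placeholder pod IPs
        ln, err := net.Listen("tcp", ":9000")                  // placeholder service port
        if err != nil {
            panic(err)
        }
        next := 0
        for {
            conn, err := ln.Accept()
            if err != nil {
                continue
            }
            backend := backends[next%len(backends)] // round-robin choice
            next++
            go func(c net.Conn, addr string) {
                defer c.Close()
                b, err := net.Dial("tcp", addr)
                if err != nil {
                    return
                }
                defer b.Close()
                go io.Copy(b, c) // client -> backend
                io.Copy(c, b)    // backend -> client
            }(conn, backend)
        }
    }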
…
gVisor
“Containers do not contain” – Dan Walsh
- Containers still share the same kernel
- They share the same device drivers
- Linux kernel represents a large attack surface
- cgroup accounting may not be accurate
Are System Calls Secure?
- The interface between containers and OS is system calls
- Linux x86_64 has 319 64-bit syscalls
- 2046 CVEs since 1999
Why can VMs be More Secure?
- Virtual machines
- Independent guest kernels
- Virtual hardware interface: clear privilege separation and state encapsulation
- But the virtualized hardware interface is inflexible, and VMs are heavyweight with a large memory footprint
Sandboxing
- Rule-based sandboxing: reduce the attack surface by restricting what applications can access
- e.g., AppArmor, SELinux, seccomp-bpf
- Rules can be fragile (they may not properly capture threats) and cannot prevent side-channel attacks
gVisor
- Sandboxes untrusted applications
- Implements Linux system API in user space
- Secure by default
- Written in Go, a memory/type-safe language
gVisor Architecture
- Two separate processes, communicating through IPC:
- Sentry: emulates Linux system calls in user space
- Gofer: handles file access
- The most-exploited syscalls: socket and open
- Even if the Sentry is compromised, it still cannot access files or open ports
- Networking is handled by a user-mode network stack in the Sentry
Trapping System Calls
- Two modes are supported:
- ptrace
- KVM
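
A minimal ptrace-based syscall tracer for Linux/amd64, sketching the trapping mechanism the ptrace platform builds on; it runs /bin/true and prints each syscall number the child makes:

    package main

    import (
        "fmt"
        "os"
        "os/exec"
        "runtime"
        "syscall"
    )

    func main() {
        runtime.LockOSThread() // all ptrace calls must come from one OS thread

        cmd := exec.Command("/bin/true")
        cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
        cmd.SysProcAttr = &syscall.SysProcAttr{Ptrace: true}
        if err := cmd.Start(); err != nil {
            panic(err)
        }
        pid := cmd.Process.Pid

        var ws syscall.WaitStatus
        syscall.Wait4(pid, &ws, 0, nil) // child is stopped at exec

        var regs syscall.PtraceRegs
        for {
            // Resume the child until the next syscall entry or exit stop.
            if err := syscall.PtraceSyscall(pid, 0); err != nil {
                break
            }
            if _, err := syscall.Wait4(pid, &ws, 0, nil); err != nil || ws.Exited() {
                break
            }
            if err := syscall.PtraceGetRegs(pid, &regs); err != nil {
                break
            }
            // Each syscall stops twice (entry and exit); a sandbox like the
            // Sentry would emulate the call here instead of letting it through.
            fmt.Println("syscall:", regs.Orig_rax)
        }
    }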
…