service
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app.kubernetes.io/name: MyApp
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376
The Service targets TCP port 9376 on any Pod with the app.kubernetes.io/name: MyApp label.
Kubernetes assigns this Service an IP address (the cluster IP), which is used by the virtual IP address mechanism.
kube-proxy
Every [node] in a Kubernetes [cluster] runs a [kube-proxy].
The kube-proxy component is responsible for implementing a virtual IP mechanism for [Services] of type other than [ExternalName]. Each instance of kube-proxy watches the Kubernetes [control plane] for the addition and removal of Service and EndpointSlice [objects]. For each Service, kube-proxy calls appropriate APIs (depending on the kube-proxy mode) to configure the node to capture traffic to the Service's clusterIP and port, and redirect that traffic to one of the Service's endpoints (usually a Pod).
iptables proxy mode
- kube-proxy uses iptables rules to redirect packets to a randomly selected backend pod
- kube-proxy uses the filter and nat tables of iptables
iptables command in Linux
iptables is a command line interface used to set up and maintain tables for the Netfilter firewall for IPv4, included in the Linux kernel. The firewall matches packets with rules defined in these tables and then takes the specified action on a possible match.
- A table is a named set of chains.
- A chain is a collection of rules.
- A rule is a condition used to match packets.
- A target is the action taken when a rule matches. Examples of targets are ACCEPT, DROP, and QUEUE.
- A policy is the default action taken by a built-in chain when no rule matches, and can be ACCEPT or DROP.
ingress
An API object that manages external access to the services in a cluster, typically HTTP.
Ingress may provide load balancing, SSL termination and name-based virtual hosting.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: minimal-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: nginx-example
  rules:
  - http:
      paths:
      - path: /testpath
        pathType: Prefix
        backend:
          service:
            name: test
            port:
              number: 80
Conceptually similar to an nginx reverse proxy.
Service registration and discovery
The controller for that Service (EndpointSlice controller) continuously scans for Pods that match its selector, and then makes any necessary updates to the set of EndpointSlices for the Service.
Discovering services
Environment variables
When a Pod is run on a Node, the kubelet adds a set of environment variables for each active Service. It adds {SVCNAME}_SERVICE_HOST and {SVCNAME}_SERVICE_PORT variables, where the Service name is upper-cased and dashes are converted to underscores.
Drawback: a Pod only sees the Services that existed when it started; it cannot discover Services created later through environment variables.
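For example, assuming a Service named redis-primary exposing port 6379 (both the name and the IP below are hypothetical, for illustration only), a Pod started afterwards would see variables like:

```
REDIS_PRIMARY_SERVICE_HOST=10.0.0.11
REDIS_PRIMARY_SERVICE_PORT=6379
```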
DNS
You can (and almost always should) set up a DNS service for your Kubernetes cluster using an [add-on].
A cluster-aware DNS server, such as CoreDNS, watches the Kubernetes API for new Services and creates a set of DNS records for each one. If DNS has been enabled throughout your cluster then all Pods should automatically be able to resolve Services by their DNS name.
kube-dns service
Before Kubernetes version 1.11, the Kubernetes DNS service was based on kube-dns. Version 1.11 introduced CoreDNS to address some security and stability concerns with kube-dns.
- A service named kube-dns and one or more pods are created.
- The kube-dns service watches for service and endpoint events from the Kubernetes API and updates its DNS records as needed. These events are triggered when you create, update, or delete Kubernetes services and their associated pods.
- kubelet sets each new pod's /etc/resolv.conf nameserver option to the cluster IP of the kube-dns service (on Linux, DNS resolver configuration lives in /etc/resolv.conf), with appropriate search options to allow shorter hostnames to be used.
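A pod's /etc/resolv.conf might then look something like the following sketch (the nameserver IP is the cluster IP of the DNS service; all values here are illustrative, not from the source):

```
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
```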
kube-proxy
kube-dns and kube-proxy work together:
- kube-dns resolves the service name to the ClusterIP address.
- kube-proxy then uses iptables rules to redirect packets to a randomly selected backend pod.
Health checks
Liveness probe (livenessProbe)
Checked continuously; on failure, the container is restarted.
- exec: success if the command exits with code 0
- httpGet: success if the status code is at least 200 and less than 400
- tcpSocket: success if the port can be opened (e.g. for redis)
- grpc: (alpha; rarely used)
Readiness probe (readinessProbe)
An application may need to load large amounts of data or configuration files at startup, or wait for external services after starting.
Checked continuously; a failure does not restart the container and checks continue at the configured interval, but the Pod stops receiving traffic through Kubernetes Services.
Startup probe (startupProbe)
Some legacy applications require a long initialization time on first start.
Set failureThreshold * periodSeconds to a span long enough to cover the worst-case startup time.
ports:
- name: liveness-port
  containerPort: 8080
livenessProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 1
  periodSeconds: 10
startupProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 30
  periodSeconds: 10
Thanks to the startup probe, the application has up to 5 minutes (30 * 10 = 300s) to finish starting. Once the startup probe has succeeded once, the liveness probe takes over, providing a fast response to container deadlocks. If the startup probe never succeeds, the container is killed after 300 seconds and handled according to its restartPolicy.
k8s
Control Plane Components
kube-apiserver
exposes the Kubernetes API. The API server is the front end for the Kubernetes control plane.
etcd
Consistent and highly-available key value store used as Kubernetes' backing store for all cluster data.
kube-scheduler
watches for newly created [Pods] and selects a node for them to run on.
kube-controller-manager
Control plane component that runs [controller] processes.
Logically, each [controller] is a separate process, but to reduce complexity, they are all compiled into a single binary and run in a single process.
Some of the controllers:
- Node controller: Responsible for noticing and responding when nodes go down.
- Job controller: Watches for Job objects that represent one-off tasks, then creates Pods to run those tasks to completion.
- EndpointSlice controller: Populates EndpointSlice objects (to provide a link between Services and Pods).
- ServiceAccount controller: Creates default ServiceAccounts for new namespaces.
cloud-controller-manager
The cloud controller manager lets you link your cluster into your cloud provider's API, and separates out the components that interact with that cloud platform from components that only interact with your cluster.
The cloud-controller-manager only runs controllers that are specific to your cloud provider.
Node Components
A node may be a virtual or physical machine, depending on the cluster. Each node is managed by the [control plane] and contains the services necessary to run [Pods].
kubelet
An agent that runs on each [node] in the cluster. It makes sure that [containers] are running in a [Pod].
The [kubelet] takes a set of PodSpecs that are provided through various mechanisms and ensures that the containers described in those PodSpecs are running and healthy. The kubelet doesn't manage containers which were not created by Kubernetes.
API server to kubelet
The connections from the API server to the kubelet are used for:
- Fetching logs for pods.
- Attaching (usually through kubectl) to running pods.
- Providing the kubelet's port-forwarding functionality.
kube-proxy
kube-proxy is a network proxy that runs on each [node] in your cluster, implementing part of the Kubernetes [Service] concept.
Container runtime
pod
Pods are the smallest deployable units of computing that you can create and manage in Kubernetes.
A Pod (as in a pod of whales or pea pod) is a group of one or more [containers], with shared storage and network resources, and a specification for how to run the containers. A Pod's contents are always co-located and co-scheduled, and run in a shared context.
The shared context of a Pod is a set of Linux namespaces, cgroups, and potentially other facets of isolation - the same things that isolate a container. Within a Pod's context, the individual applications may have further sub-isolations applied.
A Pod is similar to a set of containers with shared namespaces and shared filesystem volumes.
Common kubectl commands
kubectl get: fetch resource information, such as the status of Pods, Services, or Deployments.
- kubectl get pods // list pods
- kubectl get pods -o wide // also show the Pod IP and node
- kubectl get services
- kubectl get rc // rc is short for replicationcontrollers
kubectl create: create resources such as Pods, Services, or Deployments.
kubectl delete: delete resources.
kubectl apply: apply a configuration file, such as a Deployment manifest.
kubectl describe: show detailed configuration and status information for a resource.
- kubectl describe pod xxx
kubectl scale: change the number of replicas.
- kubectl scale rc kubia --replicas=3 // scale up
UNDERSTANDING HOW THE DESCRIPTION RESULTS IN A RUNNING CONTAINER
When the API server processes your app's description, the Scheduler schedules the specified groups of containers onto the available worker nodes based on the computational resources required by each group and the unallocated resources on each node at that moment. The Kubelet on those nodes then instructs the Container Runtime (Docker, for example) to pull the required container images and run the containers.
Accessing your web application
With your pod running, how do you access it? We mentioned that each pod gets its own IP address, but this address is internal to the cluster and isn't accessible from outside of it. To make the pod accessible from the outside, you'll expose it through a Service object. You'll create a special service of type LoadBalancer, because if you create a regular service (a ClusterIP service), it would, like the pod, only be accessible from inside the cluster. By creating a LoadBalancer-type service, an external load balancer will be created and you can connect to the pod through the load balancer's public IP.
To create the service, you'll tell Kubernetes to expose the ReplicationController you created earlier: $ kubectl expose rc kubia --type=LoadBalancer --name kubia-http
service "kubia-http" exposed
rc is the abbreviation of replicationcontroller
WHY YOU NEED A SERVICE
The third component of your system is the kubia-http service. To understand why you need services, you need to learn a key detail about pods: they're ephemeral. A pod may disappear at any time because the node it's running on has failed, because someone deleted the pod, or because the pod was evicted from an otherwise healthy node. When any of those occurs, a missing pod is replaced with a new one by the ReplicationController, as described previously. This new pod gets a different IP address from the pod it's replacing.

This is where services come in: they solve the problem of ever-changing pod IP addresses, as well as exposing multiple pods at a single constant IP and port pair. When a service is created, it gets a static IP, which never changes during the lifetime of the service. Instead of connecting to pods directly, clients should connect to the service through its constant IP address. The service makes sure one of the pods receives the connection, regardless of where the pod is currently running (and what its IP address is).

Services represent a static location for a group of one or more pods that all provide the same service. Requests coming to the IP and port of the service will be forwarded to the IP and port of one of the pods belonging to the service at that moment.
why multiple containers are better than one container running multiple processes
Containers are designed to run only a single process per container (unless the process itself spawns child processes). If you run multiple unrelated processes in a single container, it is your responsibility to keep all those processes running, manage their logs, and so on. For example,you'd have to include a mechanism for automatically restarting individual processes if they crash. Also, all those processes would log to the same standard output, so you'd have a hard time figuring out what process logged what.
partial isolation between containers of the same pod
Kubernetes achieves this by configuring Docker to have all containers of a pod share the same set of Linux namespaces instead of each container having its own set.
But when it comes to the filesystem, things are a little different. Because most of the container's filesystem comes from the container image, by default, the filesystem of each container is fully isolated from the others. However, it's possible to have them share file directories using a Kubernetes concept called a Volume, which we'll talk about in chapter 6.
Containers in a pod run in the same Network namespace, so they share the same IP address and port space. Processes running in containers of the same pod need to take care not to bind to the same port numbers or they'll run into port conflicts.
flat inter-pod network
All pods in a Kubernetes cluster reside in a single flat, shared network-address space (shown in figure 3.2), which means every pod can access every other pod at the other pod's IP address. No NAT (Network Address Translation) gateways exist between them. When two pods send network packets between each other, they'll each see the actual IP address of the other as the source IP in the packet.
like computers on a local area network(LAN)
splitting multi-tier apps into multiple pods
when to use multiple containers in a pod
The main reason to put multiple containers into a single pod is when the application consists of one main process and one or more complementary processes, as shown in figure 3.3.
Examples of sidecar containers include log rotators and collectors, data processors,communication adapters, and others.
a basic pod manifest: xxx.yaml
You can also use kubectl explain pods to see the available fields.
$ kubectl create -f xxx.yaml // create a pod from a YAML file
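A minimal pod manifest might look like the following sketch (the luksa/kubia image comes from the book's examples; substitute your own image and names):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kubia-manual
spec:
  containers:
  - name: kubia           # container name, used e.g. with kubectl logs -c
    image: luksa/kubia    # image to run (example from the book)
    ports:
    - containerPort: 8080 # port the app listens on (informational)
```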
view logs
Containerized applications usually log to the standard output and standard error stream instead of writing their logs to files.
The container runtime (Docker in your case) redirects those streams to files and allows you to get the container's log by running
$ docker logs <container id>
$ kubectl logs xxx // view logs; use -c to specify the container
port-forward
When you want to talk to a specific pod without going through a service (for debugging or other reasons), Kubernetes allows you to configure port forwarding to the pod. This is done through the kubectl port-forward command. The following command will forward your machine's local port 8888 to port 8080 of your kubia-manual pod:
$ kubectl port-forward xxx 8888:8080
labels
Labels are a simple, yet incredibly powerful, Kubernetes feature for organizing not only pods, but all other Kubernetes resources. A label is an arbitrary key-value pair you attach to a resource, which is then utilized when selecting resources using label selectors.
$ kubectl get po -l creation_method=manual // -l: label selector
use labels and selectors to constrain pod scheduling
When your hardware infrastructure isn't homogeneous. If part of your worker nodes have spinning hard drives, whereas others have SSDs, you may want to schedule certain pods to one group of nodes and the rest to the other. Another example is when you need to schedule pods performing intensive GPU-based computation only to nodes that provide the required GPU acceleration.
If you want to control where a pod should be scheduled, instead of specifying an exact node, you should describe the node requirements and then let Kubernetes select a node that matches those requirements. This can be done through node labels and node label selectors.
kubectl label node xxx gpu=true
kubectl get nodes -l gpu=true
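A pod that should only be scheduled to GPU nodes can declare the requirement through nodeSelector, matching the label set above (sketch; pod and image names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kubia-gpu
spec:
  nodeSelector:
    gpu: "true"      # only schedule to nodes labeled gpu=true
  containers:
  - name: kubia
    image: luksa/kubia
```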
namespace
kubernetes namespaces provide a scope for objects names.
Using multiple namespaces allows you to split complex systems with numerous components into smaller distinct groups. They can also be used for separating resources in a multi-tenant environment, splitting up resources into production, development, and QA environments, or in any other way you may need. While most types of resources are namespaced, a few aren't. One of them is the Node resource, which is global and not tied to a single namespace.
kubectl get ns
You can also create namespaces yourself:
kubectl create namespace xxx
kubectl create -f xxx.yaml -n my-namespace // create a resource in a specific namespace
delete pod
As soon as you delete a pod created by the ReplicationController, it immediately creates a new one. To delete the pod, you also need to delete the ReplicationController.
health check
As soon as a pod is scheduled to a node, the Kubelet on that node will run its containers and, from then on, keep them running as long as the pod exists. If the container's main process crashes, the Kubelet will restart the container. If your application has a bug that causes it to crash every once in a while, Kubernetes will restart it automatically, so even without doing anything special in the app itself, running the app in Kubernetes automatically gives it the ability to heal itself.
But sometimes apps stop working without their process crashing:
- OOM errors
- infinite loops or deadlocks
liveness probes
- HTTP GET probe: if the probe receives a response, and the response code doesn't represent an error (in other words, if the HTTP response code is 2xx or 3xx), the probe is considered successful.
- TCP socket probe
- exec probe
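An exec liveness probe might be declared like this (sketch; the probed file path is hypothetical, chosen only to illustrate the exit-code-0 convention):

```yaml
livenessProbe:
  exec:
    command:            # probe succeeds if this command exits with code 0
    - cat
    - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 10
```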
Obtaining the application log of a crashed container
kubectl logs mypod --previous
ReplicationControllers
A ReplicationController constantly monitors the list of running pods and makes sure the actual number of pods of a “type" always matches the desired number.
reconciliation loop
three essential parts
- A label selector, which determines what pods are in the ReplicationController's scope
- A replica count, which specifies the desired number of pods that should be running
- A pod template, which is used when creating new pod replicas
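The three parts above can be sketched in a ReplicationController manifest like this (names and image borrowed from the book's kubia example):

```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: kubia
spec:
  replicas: 3          # desired number of pods
  selector:            # which pods this RC manages
    app: kubia
  template:            # template used to create new replicas
    metadata:
      labels:
        app: kubia     # must match the selector above
    spec:
      containers:
      - name: kubia
        image: luksa/kubia
        ports:
        - containerPort: 8080
```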
kubectl describe rc xxx // displaying details of a ReplicationController
kubectl scale rc xxx --replicas=10
or
kubectl edit rc xxx
ReplicaSets
ReplicaSet has more expressive pod selectors
kubectl get rs
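For example, a ReplicaSet selector can use matchExpressions, which a plain ReplicationController's simple key-value selector can't (sketch; label key and values are illustrative):

```yaml
selector:
  matchExpressions:
  - key: app           # the label key to test
    operator: In       # also: NotIn, Exists, DoesNotExist
    values:
    - kubia
```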
running exactly one pod on each node with DaemonSets
Certain cases exist when you want a pod to run on each and every node in the cluster (and each node needs to run exactly one instance of the pod, as shown in figure 4.8). Those cases include infrastructure-related pods that perform system-level operations. For example, you'll want to run a log collector and a resource monitor on every node. Another good example is Kubernetes' own kube-proxy process, which needs to run on all nodes to make services work.
DaemonSets run only a single pod replica on each node, whereas ReplicaSets scatter them around the whole cluster randomly.
If a node goes down, the DaemonSet doesn't cause the pod to be created elsewhere. But when a new node is added to the cluster, the DaemonSet immediately deploys a new pod instance to it. It also does the same if someone inadvertently deletes one of the pods, leaving the node without the DaemonSet's pod. Like a ReplicaSet, a DaemonSet creates the pod from the pod template configured in it.
specify the nodeSelector property in the pod template to run on certain nodes only
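A DaemonSet restricted to SSD nodes via nodeSelector might look like this sketch (modeled on the book's ssd-monitor example; labels and image are illustrative):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ssd-monitor
spec:
  selector:
    matchLabels:
      app: ssd-monitor
  template:
    metadata:
      labels:
        app: ssd-monitor
    spec:
      nodeSelector:
        disk: ssd        # only deploy to nodes labeled disk=ssd
      containers:
      - name: main
        image: luksa/ssd-monitor
```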
job resource (a single completable task)
kubectl get jobs
kubectl get po -a // -a is required to list completed pods
The reason the pod isn't deleted when it completes is to allow you to examine its logs;
kubectl logs xxx
The pod will be deleted when you delete it or the Job that created it.
running multiple pod instances in a job
setting the completion and the parallelism properties in the Job spec.
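For example, a Job that runs five completions, two pods at a time (sketch; image name is illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: multi-completion-batch-job
spec:
  completions: 5       # run five pods to successful completion in total
  parallelism: 2       # up to two pods may run at the same time
  template:
    spec:
      restartPolicy: OnFailure   # Jobs can't use the default Always policy
      containers:
      - name: main
        image: luksa/batch-job
```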
kubectl scale job multi-completion-batch-job --replicas 3 //scale a job
CronJob
Job resources will be created from the Cronjob resource at approximately the scheduled time. The job then creates the pods.
In the example in listing 4.15, one of the times the job is supposed to run is 10:30:00. If it doesn't start by 10:30:15 for whatever reason, the job will not run and will be shown as Failed.
In normal circumstances, a CronJob always creates only a single Job for each execution configured in the schedule, but it may happen that two Jobs are created at the same time, or none at all. To combat the first problem, your jobs should be idempotent (running them multiple times instead of once shouldn't lead to unwanted results). For the second problem, make sure that the next job run performs any work that should have been done by the previous (missed) run.
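A CronJob with a 15-second starting deadline, as described above, might be sketched like this (the book's original listing used an older API version; batch/v1 is assumed here, and the image name is illustrative):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: batch-job-every-fifteen-minutes
spec:
  schedule: "0,15,30,45 * * * *"   # run at 0, 15, 30 and 45 minutes past each hour
  startingDeadlineSeconds: 15      # the Job must start within 15s of the scheduled time
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: main
            image: luksa/batch-job
```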
services
create a service through kubectl expose
kubectl get svc
kubectl exec kubia-7nog1 -- curl -s http://10.111.249.153 // note the -- separator before the command
session affinity
This makes the service proxy redirect all requests originating from the same client IP to the same pod.
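Session affinity is enabled with a single field in the service spec (sketch):

```yaml
spec:
  sessionAffinity: ClientIP   # default is None; ClientIP pins a client to one pod
```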
specify multiple ports in a service definition
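For example (sketch; port numbers are illustrative, and each port must be named when there is more than one):

```yaml
spec:
  ports:
  - name: http          # names are required with multiple ports
    port: 80            # port the service is available on
    targetPort: 8080    # container port the traffic is forwarded to
  - name: https
    port: 443
    targetPort: 8443
```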
discovering services
- through environment variables
  - When a pod is started, Kubernetes initializes a set of environment variables pointing to each service that exists at that moment. If you create the service before creating the client pods, processes in those pods can get the IP address and port of the service by inspecting their environment variables.
- through DNS
  - Remember in chapter 3 when you listed pods in the kube-system namespace? One of the pods was called kube-dns. The kube-system namespace also includes a corresponding service with the same name.
  - As the name suggests, the pod runs a DNS server, which all other pods running in the cluster are automatically configured to use (Kubernetes does that by modifying each container's /etc/resolv.conf file). Any DNS query performed by a process running in a pod will be handled by Kubernetes' own DNS server, which knows all the services running in your system.
  - Each service gets a DNS entry in the internal DNS server, and client pods that know the name of the service can access it through its fully qualified domain name (FQDN) instead of resorting to environment variables.
fully qualified domain name (FQDN)
backend-database.default.svc.cluster.local
backend-database corresponds to the service name, default stands for the namespace the service is defined in, and svc.cluster.local is a configurable cluster domain suffix used in all cluster-local service names.
why you can't ping a service IP
Curl-ing the service works, but pinging it doesn't. That's because the service's cluster IP is a virtual IP, and only has meaning when combined with the service port. We'll explain what that means and how services work in chapter 11. I wanted to mention it here because it's the first thing users do when they try to debug a broken service, and it catches most of them off guard.
connecting to services living outside the cluster
service endpoints
An Endpoints resource (yes, plural) is a list of IP addresses and ports exposing a service. The Endpoints resource is like any other Kubernetes resource, so you can display its basic info with kubectl get.
exposing services to external clients
- setting the service type to NodePort
- setting the service type to LoadBalancer
- Creating an Ingress resource
When an external client connects to a service through the node port (this also includes cases when it goes through the load balancer first), the randomly chosen pod may or may not be running on the same node that received the connection. In that case an additional network hop is required to reach the pod, which isn't always desirable.
setting the externalTrafficPolicy field to Local in the service's spec section:
spec:
externalTrafficPolicy: Local
If a service definition includes this setting and an external connection is opened through the service's node port, the service proxy will choose a locally running pod. If no local pods exist, the connection will hang (it won't be forwarded to a random global pod, the way connections are when not using this setting). You therefore need to ensure the load balancer forwards connections only to nodes that have at least one such pod.
Using this setting also has other drawbacks. Normally, connections are spread evenly across all the pods, but when using it, that's no longer the case.
the non-preservation of the client's IP
Usually, when clients inside the cluster connect to a service, the pods backing the service can obtain the client's IP address. But when the connection is received through a node port, the packets’ source IP is changed, because Source Network Address Translation (SNAT) is performed on the packets.
The backing pod can't see the actual client's IP, which may be a problem for some applications that need to know the client's IP. In the case of a web server, for example, this means the access log won't show the browser's IP.
The Local external traffic policy described in the previous section affects the preservation of the client's IP, because there’s no additional hop between the node receiving the connection and the node hosting the target pod (SNAT isn't performed).
Ingress
Why Ingresses are needed
One important reason is that each LoadBalancer service requires its own load balancer with its own public IP address, whereas an Ingress only requires one, even when providing access to dozens of services. When a client sends an HTTP request to the Ingress, the host and path in the request determine which service the request is forwarded to, as shown in figure 5.9.
Ingresses operate at the application layer of the network stack (HTTP) and can provide features such as cookie-based session affinity and the like, which services can't.
To make Ingress resources work, an Ingress controller needs to be running in the cluster.
This defines an Ingress with a single rule, which makes sure all HTTP requests received by the Ingress controller, in which the host kubia.example.com is requested, will be sent to the kubia-nodeport service on port 80.
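Such a rule might be expressed like this (sketch using the networking.k8s.io/v1 API shown earlier; the book's original listing predates that API version):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: kubia
spec:
  rules:
  - host: kubia.example.com   # requests for this host...
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: kubia-nodeport   # ...are routed to this service
            port:
              number: 80
```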
obtain the ip address of the ingress
kubectl get ingresses
how Ingresses work
the Ingress controller didn't forward the request to the service. It only used it to select a pod.
tls
When a client opens a TLS connection to an Ingress controller, the controller terminates the TLS connection. The communication between the client and the controller is encrypted, whereas the communication between the controller and the backend pod isn't. The application running in the pod doesn't need to support TLS. For example, if the pod runs a web server, it can accept only HTTP traffic and let the Ingress controller take care of everything related to TLS. To enable the controller to do that, you need to attach a certificate and a private key to the Ingress. The two need to be stored in a Kubernetes resource called a Secret, which is then referenced in the Ingress manifest.
1. create the private key and certificate
2. create the Secret from the two files
$ kubectl create secret tls tls-secret --cert=tls.cert --key=tls.key
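The Secret is then referenced from the Ingress manifest's tls section, for example (sketch; the host name is illustrative):

```yaml
spec:
  tls:
  - hosts:
    - kubia.example.com     # TLS is terminated for this host
    secretName: tls-secret  # the Secret created above, holding cert and key
```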
discovering individual pods
Kubernetes allows clients to discover pod IPs through DNS lookups. Usually, when you perform a DNS lookup for a service, the DNS server returns a single IP: the service's cluster IP. But if you tell Kubernetes you don't need a cluster IP for your service (you do this by setting the clusterIP field to None in the service specification), the DNS server will return the pod IPs instead of the single service IP.
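A headless service is just a normal Service with clusterIP set to None (sketch; names and ports are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kubia-headless
spec:
  clusterIP: None    # headless: DNS returns pod IPs instead of a cluster IP
  selector:
    app: kubia
  ports:
  - port: 80
    targetPort: 8080
```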
volumes
Kubernetes volumes are a component of a pod and are thus defined in the pod's specification-much like containers. They aren't a standalone Kubernetes object and cannot be created or deleted on their own. A volume is available to all containers in the pod, but it must be mounted in each container that needs to access it. In each container, you can mount the volume in any location of its filesystem.
Using volumes to share data between containers
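A common pattern is an emptyDir volume mounted into two containers of the same pod, one writing and one reading (sketch modeled on the book's fortune example; image names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fortune
spec:
  containers:
  - name: html-generator          # writes HTML into the shared volume
    image: luksa/fortune
    volumeMounts:
    - name: html
      mountPath: /var/htdocs
  - name: web-server              # serves the same files read-only
    image: nginx:alpine
    volumeMounts:
    - name: html
      mountPath: /usr/share/nginx/html
      readOnly: true
  volumes:
  - name: html
    emptyDir: {}                  # empty directory created when the pod starts
```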
ConfigMaps and Secrets
The Kubernetes resource for storing configuration data is called a ConfigMap.
Regardless of whether you're using a ConfigMap to store configuration data or not, you can configure your apps by
- Passing command-line arguments to containers
- Setting custom environment variables for each container
- Mounting configuration files into containers through a special type of volume
In a Dockerfile, two instructions define the two parts:
- ENTRYPOINT defines the executable invoked when the container is started.
- CMD specifies the arguments that get passed to the ENTRYPOINT.
Both instructions support two forms:
- shell form: for example, ENTRYPOINT node app.js
- exec form: for example, ENTRYPOINT ["node", "app.js"]
Kubernetes allows you to specify a custom list of environment variables for each container of a pod.
Having values effectively hardcoded in the pod definition means you need to have separate pod definitions for your production and your development pods.
Decoupling configuration with a ConfigMap
Passing all entries of a ConfigMap as environment variables at once
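All entries can be injected at once with envFrom (sketch; the ConfigMap and image names are hypothetical):

```yaml
spec:
  containers:
  - name: main
    image: my-app
    envFrom:
    - configMapRef:
        name: my-config   # every key in this ConfigMap becomes an env variable
```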
Passing a ConfigMap entry as a command-line argument
You defined the environment variable exactly as you did before, but then you used the$(ENV_VARIABLE_NAME) syntax to have Kubernetes inject the value of the variable into the argument.
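For example (sketch; the names follow the book's fortune example):

```yaml
spec:
  containers:
  - name: html-generator
    image: luksa/fortune:args
    env:
    - name: INTERVAL
      valueFrom:
        configMapKeyRef:
          name: fortune-config   # the ConfigMap to read from
          key: sleep-interval    # the entry within the ConfigMap
    args: ["$(INTERVAL)"]        # the env variable injected into the argument
```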
Using a configMap volume to expose ConfigMap entries as files
A configMap volume will expose each entry of the ConfigMap as a file. The process running in the container can obtain the entry's value by reading the contents of the file.
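Defining and mounting such a volume might look like this (sketch; volume, ConfigMap, and image names are illustrative):

```yaml
spec:
  containers:
  - name: web-server
    image: nginx:alpine
    volumeMounts:
    - name: config
      mountPath: /etc/nginx/conf.d   # each ConfigMap entry appears here as a file
      readOnly: true
  volumes:
  - name: config
    configMap:
      name: fortune-config
```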
Using Secrets to pass sensitive data to containers
Secrets are much like ConfigMaps; they're also maps that hold key-value pairs, and they can be used the same way as a ConfigMap. You can
- Pass Secret entries to the container as environment variables
- Expose Secret entries as files in a volume
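For example, exposing a Secret entry as an environment variable (sketch; the Secret and key names are hypothetical):

```yaml
env:
- name: DB_PASSWORD
  valueFrom:
    secretKeyRef:
      name: db-secret    # the Secret to read from
      key: password      # the entry within the Secret
```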
Kubernetes helps keep your Secrets safe by making sure each Secret is only distributed to the nodes that run the pods that need access to the Secret. Also, on the nodes themselves, Secrets are always stored in memory and never written to physical storage.
The contents of a Secret's entries are shown as Base64-encoded strings, whereas those of a ConfigMap are shown in clear text.
When you expose the Secret to a container through a secret volume, the value of the Secret entry is decoded and written to the file in its actual form (regardless of whether it's plain text or binary). The same is also true when exposing the Secret entry through an environment variable. In both cases, the app doesn't need to decode it, but can read the file's contents or look up the environment variable value and use it directly.
StatefulSets: deploying replicated stateful applications
When a stateful pod instance dies (or the node it's running on fails), the pod instance needs to be resurrected on another node, but the new instance needs to get the same name, network identity, and state as the one it's replacing. This is what happens when the pods are managed through a StatefulSet.
Unlike pods created by ReplicaSets, pods created by a StatefulSet aren't exact replicas of each other. Each can have its own set of volumes, in other words storage (and thus persistent state), which differentiates it from its peers. These pods also have a predictable (and stable) identity instead of each new pod instance getting a completely random one.
Each pod created by a StatefulSet is assigned an ordinal index (zero-based), which is then used to derive the pod's name and hostname, and to attach stable storage to the pod. The names of the pods are thus predictable, because each pod's name is derived from the StatefulSet's name and the ordinal index of the instance. Rather than the pods having random names, they're nicely organized, as shown in the next figure.
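A minimal StatefulSet sketch (governing service and image names are illustrative); with two replicas, the pods it creates will be named kubia-0 and kubia-1:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kubia
spec:
  serviceName: kubia   # the (usually headless) governing service
  replicas: 2
  selector:
    matchLabels:
      app: kubia
  template:
    metadata:
      labels:
        app: kubia
    spec:
      containers:
      - name: kubia
        image: luksa/kubia-pet
        ports:
        - containerPort: 8080
```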
the replacement pod gets the same name and hostname as the pod that has disappeared
Scaling the StatefulSet creates a new pod instance with the next unused ordinal index
Because certain stateful applications don't handle rapid scale-downs nicely, StatefulSets scale down only one pod instance at a time.
For this exact reason, StatefulSets also never permit scale-down operations if any of the instances are unhealthy. If an instance is unhealthy, and you scale down by one at the same time, you've effectively lost two cluster members at once.
architecture
- The Kubernetes Control Plane
- The etcd distributed persistent storage
- The API server
- The Scheduler
- The Controller Manager
- The (worker) nodes
- The Kubelet
- The Kubernetes Service Proxy(kube-proxy)
- The Container Runtime (Docker, rkt, or others)
- Add-on Components
  - The Kubernetes DNS server
  - An Ingress controller
  - Heapster
  - The Container Network Interface network plugin
Kubernetes system components communicate only with the APl server. They don't talk to each other directly.
The API server stores the complete JSON representation of a resource in etcd.
how the API server notifies clients of resource changes
The API server doesn't even tell these controllers what to do. All it does is enable those controllers and other components to observe changes to deployed resources. A Control Plane component can request to be notified when a resource is created, modified, or deleted.
scheduler
The Scheduler waits for newly created pods through the API server's watch mechanism and assigns a node to each new pod that doesn't already have one set.
The Scheduler doesn't instruct the selected node (or the Kubelet running on that node) to run the pod. All the Scheduler does is update the pod definition through the API server. The API server then notifies the Kubelet (again, through the watch mechanism described previously) that the pod has been scheduled. As soon as the Kubelet on the target node sees the pod has been scheduled to its node, it creates and runs the pod's containers.
DEFAULT SCHEDULING ALGORITHM
- Filtering the list of all nodes to obtain a list of acceptable nodes the pod can be scheduled to.
- Prioritizing the acceptable nodes and choosing the best one. If multiple nodes have the highest score, round-robin is used to ensure pods are deployed across all of them evenly.
Instead of running a single Scheduler in the cluster, you can run multiple Schedulers. Then, for each pod, you specify the Scheduler that should schedule this particular pod by setting the schedulerName property in the pod spec.
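For example, a pod can opt into an alternate scheduler through the schedulerName field (the my-scheduler name and nginx image below are illustrative assumptions, not from the original text):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled-pod
spec:
  schedulerName: my-scheduler   # only the scheduler registered under this name picks up the pod
  containers:
  - name: main
    image: nginx                # placeholder image
```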
controllers running in the Controller Manager
make sure the actual state of the system converges toward the desired state
- Replication Manager (a controller for ReplicationController resources)
- ReplicaSet, DaemonSet, and Job controllers
- Deployment controller
- StatefulSet controller
- Node controller
- Service controller
- Endpoints controller
- Namespace controller
- PersistentVolume controller
- Others
In general, controllers run a reconciliation loop, which reconciles the actual state with the desired state (specified in the resource’s spec section) and writes the new actual state to the resource's status section. Controllers use the watch mechanism to be notified of changes, but because using watches doesn't guarantee the controller won't miss an event, they also perform a re-list operation periodically to make sure they haven't missed anything.
replication manager
The controller doesn't poll the pods in every iteration, but is instead notified by the watch mechanism of each change that may affect the desired replica count or the number of matched pods (see figure 11.6). Any such changes trigger the controller to recheck the desired vs. actual replica count and act accordingly.
When too few pod instances are running, the ReplicationController runs additional instances. But it doesn't actually run them itself. It creates new Pod manifests, posts them to the API server, and lets the Scheduler and the Kubelet do their job of scheduling and running the pod.
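The Pod manifests the controller posts come from the ReplicationController's own pod template, as in this sketch (names and image are illustrative):

```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: kubia-rc
spec:
  replicas: 3               # desired count the controller reconciles toward
  selector:
    app: kubia
  template:                 # manifest used for any new Pods the controller creates
    metadata:
      labels:
        app: kubia
    spec:
      containers:
      - name: kubia
        image: luksa/kubia  # placeholder image
```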
endpoint controller
The controller watches both Services and Pods. When Services are added or updated, or Pods are added, updated, or deleted, it selects the Pods matching the Service's pod selector and adds their IPs and ports to the Endpoints resource. Remember, the Endpoints object is a standalone object, so the controller creates it if necessary; likewise, it also deletes the Endpoints object when the Service is deleted.
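An Endpoints object maintained by the controller for the my-service Service shown earlier might look roughly like this (the pod IPs are illustrative):

```yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: my-service        # must match the Service's name
subsets:
- addresses:              # IPs of the pods matching the Service's selector
  - ip: 10.1.1.1
  - ip: 10.1.2.1
  ports:
  - port: 9376            # the Service's targetPort
    protocol: TCP
```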
PERSISTENTVOLUME CONTROLLER
When a new PersistentVolumeClaim appears, the controller finds the best match for the claim by selecting the smallest PersistentVolume with an access mode matching the one requested in the claim and a declared capacity at or above the capacity requested in the claim. It does this by keeping an ordered list of PersistentVolumes for each access mode, sorted by ascending capacity, and returning the first matching volume from the list.
Then, when the user deletes the PersistentVolumeClaim, the volume is unbound and reclaimed according to the volume's reclaim policy (left as is, deleted, or emptied).
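As a sketch, here is a claim the controller would match against its ordered PersistentVolume list, alongside a PV whose persistentVolumeReclaimPolicy decides what happens after the claim is deleted (names, sizes, and the hostPath backing store are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-claim
spec:
  accessModes:
  - ReadWriteOnce           # only PVs with a matching access mode are considered
  resources:
    requests:
      storage: 1Gi          # the smallest PV with capacity >= 1Gi wins
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-a
spec:
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain   # Retain (leave as is), Delete, or Recycle (empty)
  hostPath:
    path: /tmp/pv-a         # placeholder backing store
```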
kubelet
The Kubelet is the component responsible for everything running on a worker node. Its initial job is to register the node it's running on by creating a Node resource in the API server. Then it needs to continuously monitor the API server for Pods that have been scheduled to the node, and start the pod's containers. It does this by telling the configured container runtime to run a container from a specific container image. The Kubelet then constantly reports their status, events, and resource consumption to the API server.
The Kubelet is also the component that runs the container liveness probes, restarting containers when the probes fail. Lastly, it terminates containers when their Pod is deleted from the API server and notifies the server that the pod has terminated.
kube-proxy
The kube-proxy got its name because it was an actual proxy, but the current, much better performing implementation only uses iptables rules to redirect packets to a randomly selected backend pod without passing them through an actual proxy server. This mode is called the iptables proxy mode and is shown in figure 11.10.
how controllers cooperate
what a running pod is
This pause container is the container that holds all the containers of a pod together. Remember how all containers of a pod share the same network and other Linux namespaces? The pause container is an infrastructure container whose sole purpose is to hold all these namespaces. All other user-defined containers of the pod then use the namespaces of the pod infrastructure container (see figure 11.13).
network
Figure 11.16 shows that to enable communication between pods across two nodes with plain layer 3 networking, the node’s physical network interface needs to be connected to the bridge as well. Routing tables on node A need to be configured so all packets destined for 10.1.2.0/24 are routed to node B, whereas node B's routing tables need to be configured so packets sent to 10.1.1.0/24 are routed to node A.
With this type of setup, when a packet is sent by a container on one of the nodesto a container on the other node, the packet first goes through the veth pair, then through the bridge to the node's physical adapter, then over the wire to the other node's physical adapter, through the other node’s bridge, and finally through the veth pair of the destination container.
This works only when nodes are connected to the same network switch, without any routers in between; otherwise those routers would drop the packets because they refer to pod IPs, which are private. Sure, the routers in between could be configured to route packets between the nodes, but this becomes increasingly difficult and error-prone as the number of routers between the nodes increases. Because of this, it's easier to use a Software Defined Network (SDN), which makes the nodes appear as though they're connected to the same network switch, regardless of the actual underlying network topology, no matter how complex it is. Packets sent from the pod are encapsulated and sent over the network to the node running the other pod, where they are de-encapsulated and delivered to the pod in their original form.
Containers within the same pod share the pod's IP address.
kube-proxy
Everything related to Services is handled by the kube-proxy process running on each node. Initially, the kube-proxy was an actual proxy waiting for connections and, for each incoming connection, opening a new connection to one of the pods. This was called the userspace proxy mode. Later, a better-performing iptables proxy mode replaced it. This is now the default.
An Endpoints object holds the IP/port pairs of all the pods that back the service (an IP/port pair can also point to something other than a pod). That's why the kube-proxy must also watch all Endpoints objects.
What happens to a packet when it's sent by the client pod (pod A in the figure)?
The packet's destination is initially set to the IP and port of the Service (in the example, the Service is at 172.30.0.1:80). Before being sent to the network, the packet is first handled by node A's kernel according to the iptables rules set up on the node.
The kernel checks if the packet matches any of those iptables rules. One of them says that if any packet has the destination IP equal to 172.30.0.1 and destination port equal to 80, the packet's destination IP and port should be replaced with the IP and port of a randomly selected pod.
The packet in the example matches that rule and so its destination IP/port is changed. In the example, pod B2 was randomly selected, so the packet's destination IP is changed to 10.1.2.1 (pod B2's IP) and the port to 8080 (the target port specified in the Service spec). From here on, it's exactly as if the client pod had sent the packet to pod B directly instead of through the service.
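The Service from this example would declare the 80 → 8080 mapping that the iptables DNAT rule implements; the cluster IP (172.30.0.1) is assigned by Kubernetes rather than set in the manifest (the name and selector below are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: service-b
spec:
  selector:
    app: b                  # matches backing pods B1, B2, ...
  ports:
  - protocol: TCP
    port: 80                # port on the Service's cluster IP (e.g. 172.30.0.1:80)
    targetPort: 8080        # port the packet's destination is rewritten to
```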
LEADER ELECTION MECHANISM USED IN CONTROL PLANE COMPONENTS
The leader election mechanism works purely by creating a resource in the API server.
Take the Scheduler, for example. All instances of the Scheduler try to create (and later update) an Endpoints resource called kube-scheduler.
The control-plane.alpha.kubernetes.io/leader annotation is the important part. As you can see, it contains a field called holderIdentity, which holds the name of the current leader. The first instance that succeeds in putting its name there becomes the leader. Instances race each other to do that, but there's always only one winner.
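The annotation on the kube-scheduler Endpoints resource looks roughly like this (the holder identity and timestamps are illustrative values, not from the original text):

```yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: kube-scheduler
  namespace: kube-system
  annotations:
    # the current leader's identity and lease, raced over by all instances
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"node-a","leaseDurationSeconds":15,"acquireTime":"2016-11-30T12:00:00Z","renewTime":"2016-11-30T12:00:15Z"}'
```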
service account
the API server requires clients to authenticate themselves before they're allowed to perform operations on the server. And you've already seen how pods can authenticate by sending the contents of the file /var/run/secrets/kubernetes.io/serviceaccount/token, which is mounted into each container's filesystem through a secret volume. But what exactly does that file represent? Every pod is associated with a ServiceAccount, which represents the identity of the app running in the pod. The token file holds the ServiceAccount's authentication token. When an app uses this token to connect to the API server, the authentication plugin authenticates the ServiceAccount and passes the ServiceAccount's username back to the API server core. ServiceAccount usernames are formatted like this:
system:serviceaccount:<namespace>:<service account name>
The API server passes this username to the configured authorization plugins, which determine whether the action the app is trying to perform is allowed to be performed by the ServiceAccount.
ServiceAccounts are nothing more than a way for an application running inside a pod to authenticate itself with the API server. As already mentioned, applications do that by passing the ServiceAccount's token in the request.
A default ServiceAccount is automatically created for each namespace.
Each pod is associated with a single ServiceAccount in the pod's namespace, but multiple pods can use the same ServiceAccount.
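A pod opts into a non-default ServiceAccount through the serviceAccountName field in its spec, as in this sketch (the account name foo and the nginx image are illustrative):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: foo
---
apiVersion: v1
kind: Pod
metadata:
  name: sa-pod
spec:
  serviceAccountName: foo   # falls back to "default" when omitted
  containers:
  - name: main
    image: nginx            # placeholder image
```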
QoS (Quality of Service)
- BestEffort (the lowest priority)
  - pods that don't have any requests or limits set at all (in any of their containers)
- Burstable
  - pods that fall between the other two classes (for example, at least one container has a request set, but requests don't equal limits for all resources)
- Guaranteed (the highest)
  - pods whose containers' requests are equal to the limits for all resources
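A sketch of a pod that lands in the Guaranteed class, because every container's requests equal its limits for both CPU and memory (the name, image, and values are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
  - name: main
    image: nginx              # placeholder image
    resources:
      requests:
        cpu: 200m
        memory: 100Mi
      limits:                 # equal to requests for every resource => Guaranteed
        cpu: 200m
        memory: 100Mi
```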
which process gets killed when memory is low
When the system is overcommitted, the QoS classes determine which container gets killed first so the freed resources can be given to higher priority pods. First in line to get killed are pods in the BestEffort class, followed by Burstable pods, and finally Guaranteed pods, which only get killed if system processes need memory.
containers with the same QoS class
Each running process has an OutOfMemory (OOM) score. The system selects the process to kill by comparing OOM scores of all the running processes. When memory needs to be freed, the process with the highest score gets killed.
OOM scores are calculated from two things: the percentage of the available memory the process is consuming and a fixed OOM score adjustment, which is based on the pod's QoS class and the container's requested memory.
Collecting and retrieving actual resource usages
The Kubelet itself contains an agent called cAdvisor, which performs the basic collection of resource consumption data for both individual containers running on the node and the node as a whole. Gathering those statistics centrally for the whole cluster requires you to run an additional component called Heapster.
Heapster runs as a pod on one of the nodes and is exposed through a regular Kubernetes Service, making it accessible at a stable IP address. It collects the data from all the cAdvisors in the cluster and exposes it in a single location.