[Kubernetes] How a ReplicaSet Chooses Which Pods to Delete: A Deep Dive into Scale-Down Priority

showyoui

Last modified: 2025-07-01 09:04:34

License: CC 4.0 BY-SA

Category: Cloud Native. Tags: open source, kubernetes, containers, cloud native

First published: 2025-06-29 18:46:53

Article link: https://blog.csdn.net/showyoui/article/details/149001708

Overview

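When you scale down a Deployment or ReplicaSet, the controller has to choose among the Pods it manages: which one gets deleted? This happens frequently during rollouts and autoscaling. Unlike the "passive" eviction triggered when a node runs short of resources, a controller's "active" scale-down follows its own explicit set of priority rules. This article walks through the decision logic a controller uses to pick the Pods to delete during scale-down.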

The Core Question: How Does a Controller Choose Among Its Own Pods?

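Suppose a Deployment manages 5 identical Pods and needs to scale down to 3. Because all 5 Pods come from the same Pod Template, their PriorityClass and QoS class are normally identical, so the controller cannot use either of them to tell the Pods apart.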

So what does the controller actually sort and select on?

Talk is cheap, show me the code!
Let's read the Kubernetes source code and walk through it step by step:

// ActivePodsWithRanks is a sortable list of pods and a list of corresponding
// ranks which will be considered during sorting.  The two lists must have equal
// length.  After sorting, the pods will be ordered as follows, applying each
// rule in turn until one matches:
//
//  1. If only one of the pods is assigned to a node, the pod that is not
//     assigned comes before the pod that is.
//  2. If the pods' phases differ, a pending pod comes before a pod whose phase
//     is unknown, and a pod whose phase is unknown comes before a running pod.
//  3. If exactly one of the pods is ready, the pod that is not ready comes
//     before the ready pod.
//  4. If controller.kubernetes.io/pod-deletion-cost annotation is set, then
//     the pod with the lower value will come first.
//  5. If the pods' ranks differ, the pod with greater rank comes before the pod
//     with lower rank.
//  6. If both pods are ready but have not been ready for the same amount of
//     time, the pod that has been ready for a shorter amount of time comes
//     before the pod that has been ready for longer.
//  7. If one pod has a container that has restarted more than any container in
//     the other pod, the pod with the container with more restarts comes
//     before the other pod.
//  8. If the pods' creation times differ, the pod that was created more recently
//     comes before the older pod.
//
// In 6 and 8, times are compared in a logarithmic scale. This allows a level
// of randomness among equivalent Pods when sorting. If two pods have the same
// logarithmic rank, they are sorted by UUID to provide a pseudorandom order.
//
// If none of these rules matches, the second pod comes before the first pod.
//
// The intention of this ordering is to put pods that should be preferred for
// deletion first in the list.
type ActivePodsWithRanks struct {
	// Pods is a list of pods.
	Pods []*v1.Pod

	// Rank is a ranking of pods.  This ranking is used during sorting when
	// comparing two pods that are both scheduled, in the same phase, and
	// having the same ready status.
	Rank []int

	// Now is a reference timestamp for doing logarithmic timestamp comparisons.
	// If zero, comparison happens without scaling.
	Now metav1.Time
}

func (s ActivePodsWithRanks) Len() int {
	return len(s.Pods)
}

func (s ActivePodsWithRanks) Swap(i, j int) {
	s.Pods[i], s.Pods[j] = s.Pods[j], s.Pods[i]
	s.Rank[i], s.Rank[j] = s.Rank[j], s.Rank[i]
}

// Less compares two pods with corresponding ranks and returns true if the first
// one should be preferred for deletion.
func (s ActivePodsWithRanks) Less(i, j int) bool {
	// 1. Unassigned < assigned
	// If only one of the pods is unassigned, the unassigned one is smaller
	if s.Pods[i].Spec.NodeName != s.Pods[j].Spec.NodeName && (len(s.Pods[i].Spec.NodeName) == 0 || len(s.Pods[j].Spec.NodeName) == 0) {
		return len(s.Pods[i].Spec.NodeName) == 0
	}
	// 2. PodPending < PodUnknown < PodRunning
	if podPhaseToOrdinal[s.Pods[i].Status.Phase] != podPhaseToOrdinal[s.Pods[j].Status.Phase] {
		return podPhaseToOrdinal[s.Pods[i].Status.Phase] < podPhaseToOrdinal[s.Pods[j].Status.Phase]
	}
	// 3. Not ready < ready
	// If only one of the pods is not ready, the not ready one is smaller
	if podutil.IsPodReady(s.Pods[i]) != podutil.IsPodReady(s.Pods[j]) {
		return !podutil.IsPodReady(s.Pods[i])
	}

	// 4. lower pod-deletion-cost < higher pod-deletion cost
	if utilfeature.DefaultFeatureGate.Enabled(features.PodDeletionCost) {
		pi, _ := helper.GetDeletionCostFromPodAnnotations(s.Pods[i].Annotations)
		pj, _ := helper.GetDeletionCostFromPodAnnotations(s.Pods[j].Annotations)
		if pi != pj {
			return pi < pj
		}
	}

	// 5. Doubled up < not doubled up
	// If one of the two pods is on the same node as one or more additional
	// ready pods that belong to the same replicaset, whichever pod has more
	// colocated ready pods is less
	if s.Rank[i] != s.Rank[j] {
		return s.Rank[i] > s.Rank[j]
	}
	// TODO: take availability into account when we push minReadySeconds information from deployment into pods,
	//       see https://github.com/kubernetes/kubernetes/issues/22065
	// 6. Been ready for empty time < less time < more time
	// If both pods are ready, the latest ready one is smaller
	if podutil.IsPodReady(s.Pods[i]) && podutil.IsPodReady(s.Pods[j]) {
		readyTime1 := podReadyTime(s.Pods[i])
		readyTime2 := podReadyTime(s.Pods[j])
		if !readyTime1.Equal(readyTime2) {
			if !utilfeature.DefaultFeatureGate.Enabled(features.LogarithmicScaleDown) {
				return afterOrZero(readyTime1, readyTime2)
			} else {
				if s.Now.IsZero() || readyTime1.IsZero() || readyTime2.IsZero() {
					return afterOrZero(readyTime1, readyTime2)
				}
				rankDiff := logarithmicRankDiff(*readyTime1, *readyTime2, s.Now)
				if rankDiff == 0 {
					return s.Pods[i].UID < s.Pods[j].UID
				}
				return rankDiff < 0
			}
		}
	}
	// 7. Pods with containers with higher restart counts < lower restart counts
	if res := compareMaxContainerRestarts(s.Pods[i], s.Pods[j]); res != nil {
		return *res
	}
	// 8. Empty creation time pods < newer pods < older pods
	if !s.Pods[i].CreationTimestamp.Equal(&s.Pods[j].CreationTimestamp) {
		if !utilfeature.DefaultFeatureGate.Enabled(features.LogarithmicScaleDown) {
			return afterOrZero(&s.Pods[i].CreationTimestamp, &s.Pods[j].CreationTimestamp)
		} else {
			if s.Now.IsZero() || s.Pods[i].CreationTimestamp.IsZero() || s.Pods[j].CreationTimestamp.IsZero() {
				return afterOrZero(&s.Pods[i].CreationTimestamp, &s.Pods[j].CreationTimestamp)
			}
			rankDiff := logarithmicRankDiff(s.Pods[i].CreationTimestamp, s.Pods[j].CreationTimestamp, s.Now)
			if rankDiff == 0 {
				return s.Pods[i].UID < s.Pods[j].UID
			}
			return rankDiff < 0
		}
	}
	return false
}

ReplicaSet Deletion Priority Ordering

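A Deployment's scale-down is actually carried out by the ReplicaSet it controls. When the ReplicaSet controller picks Pods to delete, it follows a deliberately designed sort order whose goal is to delete the "least valuable" Pods first and so minimize the impact on the service.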

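The comparison itself lives in the ActivePodsWithRanks type shown above, which is a shared controller helper rather than code private to the ReplicaSet controller. The ReplicaSet controller supplies each Pod's rank, sorts its candidate Pods with this comparator, and deletes from the front of the list: the earlier a Pod sorts, the sooner it is deleted.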

According to the ActivePodsWithRanks.Less() implementation, there are 8 decision rules, applied in order until one of them distinguishes the two Pods:

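1. Node assignment: if only one of the two Pods is assigned to a node, the unassigned Pod is deleted before the assigned one.
2. Pod phase: if the phases differ, the order is Pending < Unknown < Running. A Pending Pod is deleted before an Unknown Pod, and an Unknown Pod before a Running Pod.
3. Readiness: if exactly one of the Pods is ready, the Pod that is not ready is deleted before the ready one.
4. Pod deletion cost annotation (controller.kubernetes.io/pod-deletion-cost): this is the one rule you can influence directly. It has been a Beta feature since Kubernetes v1.22 and is only evaluated when the PodDeletionCost feature gate is enabled.
- The controller prefers to delete Pods with a lower cost.
- The annotation value must be a string that parses as an int32.
- A Pod without the annotation has a default cost of 0.
- The cost is only consulted when rules 1 to 3 do not already separate the two Pods, so an unready Pod is still deleted before a ready Pod with a lower cost.
- This lets you steer the controller using application-level knowledge, such as whether a Pod has finished a particular task or is holding only idle connections.
5. Pod rank: if the ranks differ, the Pod with the higher rank is deleted before the Pod with the lower rank. As the comment on this rule in Less() explains ("Doubled up < not doubled up"), the rank reflects colocation:
- A Pod that shares its node with more Pods of the same ReplicaSet gets a higher rank, so Pods on "crowded" nodes are removed first.
- The effect is to spread the remaining replicas more evenly across nodes.
- The rank is computed by the controller; it is not a value that users or applications set.
6. Ready time: if both Pods are ready but became ready at different times, the Pod that has been ready for a shorter time is deleted first. When the LogarithmicScaleDown feature gate is enabled, the times are compared on a logarithmic scale, and Pods that land in the same bucket are ordered by UID, which adds some randomness.
7. Container restart count: if one Pod has a container that has restarted more often than any container in the other Pod, the Pod with more restarts is deleted first, because frequent restarts usually indicate an unstable Pod.
8. Pod creation time: if the creation times differ, the more recently created (newest) Pod is deleted first, again using the logarithmic comparison when LogarithmicScaleDown is enabled. This protects long-running older Pods that may hold important state or warm caches.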

Special Case: StatefulSet

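Unlike a ReplicaSet, a StatefulSet scales down in a very simple way: it deletes Pods strictly in reverse ordinal order. For example, when a 3-replica StatefulSet (ss-0, ss-1, ss-2) is scaled down to 2 replicas, ss-2 is always the Pod that is deleted. This preserves the ordered, stable identity that StatefulSets guarantee.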

Decision Flowchart

[Figure: flowchart of the scale-down decision process]

Key Application: Using `pod-deletion-cost`

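Suppose you run an application in which some Pods are serving live requests while others have gone idle because load has dropped. An external monitoring system can assign a lower deletion cost to a Pod once it becomes idle.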

Example: marking a Pod as cheap to delete

# Give my-pod-xyz a very low deletion cost
kubectl annotate pod my-pod-xyz controller.kubernetes.io/pod-deletion-cost="-100"

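At the next scale-down, this Pod will be preferred for deletion over other Pods that are equal to it in node assignment, phase, and readiness, because its deletion cost is the lowest.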
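
To show how the annotation value turns into a number the comparator can use, here is a minimal sketch that reads the cost from a Pod's annotation map. It is not the real `GetDeletionCostFromPodAnnotations` helper, which may apply stricter validation; `deletionCost` and `deletionCostKey` are names chosen for this example.

```go
package main

import (
	"fmt"
	"strconv"
)

// deletionCostKey is the annotation discussed in this article.
const deletionCostKey = "controller.kubernetes.io/pod-deletion-cost"

// deletionCost returns the cost stored in the annotations.
// A missing annotation yields the default cost of 0.
func deletionCost(annotations map[string]string) (int32, error) {
	value, ok := annotations[deletionCostKey]
	if !ok {
		return 0, nil
	}
	// bitSize 32 makes ParseInt reject values outside the int32 range.
	n, err := strconv.ParseInt(value, 10, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid %s value %q: %w", deletionCostKey, value, err)
	}
	return int32(n), nil
}

func main() {
	idle := map[string]string{deletionCostKey: "-100"}
	busy := map[string]string{} // no annotation set
	bad := map[string]string{deletionCostKey: "cheap"}

	idleCost, _ := deletionCost(idle)
	busyCost, _ := deletionCost(busy)
	_, err := deletionCost(bad)

	fmt.Println(idleCost, busyCost, idleCost < busyCost) // prints -100 0 true
	fmt.Println(err != nil)                              // prints true
}
```

In the Less() listing earlier, the error returned by the helper is discarded, so a Pod with an unparsable value is compared as if its cost were 0.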

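Summary

- Controllers such as Deployment and ReplicaSet do not use PriorityClass or QoS as the main criteria when scaling down.
- The core idea is to delete the "least valuable" Pods first. The full 8-rule order is: node assignment -> Pod phase -> readiness -> deletion cost annotation -> Pod rank -> ready time -> restart count -> creation time.
- The controller.kubernetes.io/pod-deletion-cost annotation is the most direct way to influence scale-down, because it lets you feed application-level state back to Kubernetes. It only takes effect among Pods that are equal on the first three rules.
- The Pod rank is computed by the controller from how many Pods of the same ReplicaSet share a node, so Pods on crowded nodes are removed first and the remaining replicas stay spread out.
- Time comparisons (ready time and creation time) use a logarithmic scale when the LogarithmicScaleDown feature gate is enabled, which introduces some randomness among Pods of similar age.
- When two Pods fall into the same logarithmic bucket, they are ordered by UID, which gives a pseudorandom order. If no rule separates them at all, Less() returns false.
- Understanding this multi-level ordering helps you design autoscaling strategies that control application behavior more precisely and keep the service smooth and stable.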