How Do You Make a Kubernetes Cluster Scale Automatically? Cluster Autoscaler in Depth (Part 2)


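  • random: picks a NodeGroup at random. This is the default when no strategy is specified.
  • most-pods: picks the NodeGroup that can schedule the most Pods. For example, if some Pods are unschedulable because of a nodeSelector, this strategy prefers a NodeGroup that satisfies it, so that as many Pods as possible get scheduled.
  • least-waste: to avoid waste, prefers the NodeGroup with the smallest resource type that still satisfies the Pods' resource requests.
  • price: picks the cheapest NodeGroup according to the pricing model supplied by the CloudProvider.
  • priority: picks according to configured priorities. It is more cumbersome to use because it needs extra configuration; see the documentation (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/expander/priority/readme.md).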
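The difference between two of these strategies can be sketched in a few lines of Go. This is only an illustration, not the Cluster Autoscaler expander code: the `Option` type, its fields, and the functions `mostPods` and `leastWaste` are hypothetical names invented here, and both functions assume a non-empty slice.

```go
package main

import "fmt"

// Option is a hypothetical, simplified scale-up candidate (not the upstream type).
type Option struct {
	NodeGroup   string
	PodsHelped  int     // pending Pods that would become schedulable
	WastedRatio float64 // fraction of the new node's resources left unused (0..1)
}

// mostPods returns the option that makes the most pending Pods schedulable.
// It assumes opts is non-empty.
func mostPods(opts []Option) Option {
	best := opts[0]
	for _, o := range opts[1:] {
		if o.PodsHelped > best.PodsHelped {
			best = o
		}
	}
	return best
}

// leastWaste returns the option that leaves the smallest share of resources idle.
// It assumes opts is non-empty.
func leastWaste(opts []Option) Option {
	best := opts[0]
	for _, o := range opts[1:] {
		if o.WastedRatio < best.WastedRatio {
			best = o
		}
	}
	return best
}

func main() {
	opts := []Option{
		{NodeGroup: "small", PodsHelped: 2, WastedRatio: 0.10},
		{NodeGroup: "large", PodsHelped: 5, WastedRatio: 0.60},
	}
	fmt.Println("most-pods picks:", mostPods(opts).NodeGroup)
	fmt.Println("least-waste picks:", leastWaste(opts).NodeGroup)
}
```

The two strategies can disagree on the same input, which is why the choice of expander matters when NodeGroups differ a lot in size.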
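If needed, Cluster Autoscaler can also balance the number of Nodes across similar NodeGroups, so that no single NodeGroup reaches MaxSize and blocks new Nodes from joining. This is configured with the --balance-similar-node-groups option, which defaults to false.
After these steps, Cluster Autoscaler has worked out how many Nodes to add and to which NodeGroup. It then calls IncreaseSize through the CloudProvider to grow the cloud vendor's scaling group, which completes the scale-up.
My explanation may not be clear enough in places; if anything is unclear, refer to the ScaleUp source code walkthrough below.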
Scale Down
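Scale-down is optional. It is controlled by the --scale-down-enabled option, which defaults to true.
While monitoring Node resources, Cluster Autoscaler marks a Node as unneeded when the Node meets all three of the following conditions:
  • The sum of the CPU and memory requests of all Pods running on the Node is below 50% of the Node's allocatable capacity. The threshold can be changed with the --scale-down-utilization-threshold option.
  • Every Pod on the Node can be rescheduled onto another node.
  • The Node has no annotation that marks it as exempt from scale-down.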
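The three conditions can be expressed as a small predicate. The sketch below is not the upstream implementation: `NodeUsage` and its fields are hypothetical, and it assumes that CPU and memory are compared against the threshold separately, with the higher of the two ratios counting as the node's utilization.

```go
package main

import "fmt"

// NodeUsage is a hypothetical summary of one node (not an upstream type).
type NodeUsage struct {
	CPURequested, CPUAllocatable int64 // millicores
	MemRequested, MemAllocatable int64 // bytes
	AllPodsMovable               bool  // every Pod fits on some other node
	NoScaleDown                  bool  // node carries a "do not scale down" annotation
}

// ratio returns requested/allocatable, treating a node without allocatable
// capacity as fully utilized so that it is never considered idle by mistake.
func ratio(requested, allocatable int64) float64 {
	if allocatable <= 0 {
		return 1
	}
	return float64(requested) / float64(allocatable)
}

// utilization returns the higher of the CPU and memory request ratios
// (an assumption of this sketch).
func utilization(n NodeUsage) float64 {
	cpu := ratio(n.CPURequested, n.CPUAllocatable)
	mem := ratio(n.MemRequested, n.MemAllocatable)
	if cpu > mem {
		return cpu
	}
	return mem
}

// isUnneeded applies the three conditions from the list above.
func isUnneeded(n NodeUsage, threshold float64) bool {
	return utilization(n) < threshold && n.AllPodsMovable && !n.NoScaleDown
}

func main() {
	n := NodeUsage{
		CPURequested: 400, CPUAllocatable: 1000,
		MemRequested: 1 << 30, MemAllocatable: 4 << 30,
		AllPodsMovable: true,
	}
	fmt.Println("unneeded:", isUnneeded(n, 0.5))
}
```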
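If a Node stays marked as unneeded for more than 10 minutes (configurable with the --scale-down-unneeded-time option), Cluster Autoscaler calls DeleteNodes through the CloudProvider to remove it. At most one unneeded Node is deleted at a time, but empty Nodes can be deleted in bulk, up to 10 at a time (configurable with the --max-empty-bulk-delete option).
In practice this is not the only criterion. Other conditions can also block the deletion of a Node, for example the NodeGroup already being at MinSize, or a scale-up having taken place within the last 10 minutes (configurable with the --scale-down-delay-after-add option). See the documentation for details (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-does-scale-down-work).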
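How the timing rules interact is easier to see in code. The following is a rough sketch under stated assumptions, not the real scale-down logic: `ScaleDownConfig`, `Candidate`, and `canDelete` are hypothetical names, and only the three blockers mentioned above (MinSize, delay after scale-up, unneeded time) are modeled.

```go
package main

import (
	"fmt"
	"time"
)

// ScaleDownConfig holds the two durations discussed above (hypothetical type).
type ScaleDownConfig struct {
	UnneededTime  time.Duration // corresponds to --scale-down-unneeded-time
	DelayAfterAdd time.Duration // corresponds to --scale-down-delay-after-add
}

// defaultConfig returns the 10-minute defaults mentioned in the text.
func defaultConfig() ScaleDownConfig {
	return ScaleDownConfig{UnneededTime: 10 * time.Minute, DelayAfterAdd: 10 * time.Minute}
}

// Candidate is a hypothetical view of one unneeded node and its group.
type Candidate struct {
	UnneededSince time.Time
	GroupSize     int
	GroupMinSize  int
}

// canDelete reports whether the node may be removed at time now.
func canDelete(now time.Time, c Candidate, lastScaleUp time.Time, cfg ScaleDownConfig) bool {
	if c.GroupSize <= c.GroupMinSize {
		return false // the group is already at its minimum size
	}
	if now.Sub(lastScaleUp) < cfg.DelayAfterAdd {
		return false // a scale-up happened too recently
	}
	return now.Sub(c.UnneededSince) >= cfg.UnneededTime
}

func main() {
	cfg := defaultConfig()
	now := time.Date(2021, 1, 1, 12, 0, 0, 0, time.UTC)
	c := Candidate{UnneededSince: now.Add(-15 * time.Minute), GroupSize: 3, GroupMinSize: 1}

	fmt.Println(canDelete(now, c, now.Add(-30*time.Minute), cfg)) // no recent scale-up
	fmt.Println(canDelete(now, c, now.Add(-5*time.Minute), cfg))  // scale-up 5 minutes ago
}
```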
Cluster Autoscaler's mechanics are complex, but most of them can be tuned through flags. If you need to, read the documentation closely:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md
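How to Implement a CloudProvider
If you use one of the cloud vendors mentioned above that already has an integration, you only need to name it with the --cloud-provider option. If you want to integrate your own IaaS or need custom business logic, you have to implement the CloudProvider interface and the NodeGroup interface yourself and register the implementation in the builder, so that it can be selected with the --cloud-provider flag.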
Builders are registered in builder_all.go (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/builder/builder_all.go) under cloudprovider/builder. You can also add a build file of your own there and use the +build constraint in the Go file to choose which CloudProvider gets compiled in.
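Conceptually, the builder maps the value of --cloud-provider to a constructor. The sketch below is a standalone simplification, not the code in builder_all.go: the reduced `CloudProvider` interface, the `myCloud` type, the provider name "mycloud", and the `buildCloudProvider` signature are all hypothetical. A provider-specific build file in the real repository would additionally start with a build constraint line; that is left out here so the example compiles on its own.

```go
package main

import "fmt"

// CloudProvider is a stand-in for the real interface, reduced to one method.
type CloudProvider interface {
	Name() string
}

// myCloud is a hypothetical provider for an in-house IaaS.
type myCloud struct{}

func (myCloud) Name() string { return "mycloud" }

// buildCloudProvider dispatches on the provider name, which is roughly what a
// builder does with the value passed via --cloud-provider.
func buildCloudProvider(name string) (CloudProvider, error) {
	switch name {
	case "mycloud":
		return myCloud{}, nil
	}
	return nil, fmt.Errorf("unknown cloud provider %q", name)
}

func main() {
	p, err := buildCloudProvider("mycloud")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("using provider:", p.Name())
}
```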
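The CloudProvider and NodeGroup interfaces are defined in cloud_provider.go (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/cloud_provider.go). Pay particular attention to the Refresh method: it is called at the start of every loop iteration (every 10 seconds by default), which makes it the right place to call the cloud API and refresh NodeGroup state. A common approach is to add a manager that holds this state. If anything is unclear, look at how other CloudProviders are implemented.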
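The manager pattern mentioned above might look like the following. This is a minimal sketch, assuming a hypothetical `fetchFunc` that stands in for the cloud API call and a cache keyed by group name; real providers store richer state.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchFunc is a hypothetical cloud API call: group name -> target size.
type fetchFunc func() (map[string]int, error)

// manager caches NodeGroup state between loop iterations.
type manager struct {
	mu     sync.RWMutex
	fetch  fetchFunc
	groups map[string]int
}

func newManager(f fetchFunc) *manager {
	return &manager{fetch: f, groups: map[string]int{}}
}

// Refresh performs the single remote call per loop iteration.
// On error the previous snapshot is kept.
func (m *manager) Refresh() error {
	groups, err := m.fetch()
	if err != nil {
		return err
	}
	m.mu.Lock()
	m.groups = groups
	m.mu.Unlock()
	return nil
}

// TargetSize is served from the cache, so frequent calls stay cheap.
func (m *manager) TargetSize(name string) (int, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	size, ok := m.groups[name]
	return size, ok
}

func main() {
	calls := 0
	m := newManager(func() (map[string]int, error) {
		calls++
		return map[string]int{"workers": 3}, nil
	})
	if err := m.Refresh(); err != nil {
		fmt.Println("refresh failed:", err)
		return
	}
	for i := 0; i < 3; i++ {
		size, _ := m.TargetSize("workers")
		fmt.Println("target size:", size)
	}
	fmt.Println("cloud API calls:", calls)
}
```

Methods such as NodeGroups and NodeGroupForNode would then read from the same cache instead of calling the cloud vendor each time.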
type CloudProvider interface {
	// Name returns name of the cloud provider.
	Name() string

	// NodeGroups returns all node groups configured for this cloud provider.
	// This method is called several times within one loop, so it should not
	// query the cloud vendor on every call; store the state during Refresh instead.
	NodeGroups() []NodeGroup

	// NodeGroupForNode returns the node group for the given node, nil if the node
	// should not be processed by cluster autoscaler, or non-nil error if such
	// occurred. Must be implemented.
	// Same as above.
	NodeGroupForNode(*apiv1.Node) (NodeGroup, error)

	// Pricing returns pricing model for this cloud provider or error if not available.
	// Implementation optional.
	// Not needed unless you use the price expander.
	Pricing() (PricingModel, errors.AutoscalerError)

	// GetAvailableMachineTypes get all machine types that can be requested from the cloud provider.
	// Implementation optional.
	// Unused; no need to implement.
	GetAvailableMachineTypes() ([]string, error)

	// NewNodeGroup builds a theoretical node group based on the node definition provided. The node group is not automatically
	// created on the cloud provider side. The node group is not returned by NodeGroups() until it is created.
	// Implementation optional.
	// Usually not needed, but you can implement it if you want Cluster Autoscaler to create a default NodeGroup.
	// A better approach is to define the default NodeGroup in the cloud-side scaling group.
	NewNodeGroup(machineType string, labels map[string]string, systemLabels map[string]string,
		taints []apiv1.Taint, extraResources map[string]resource.Quantity) (NodeGroup, error)

	// GetResourceLimiter returns struct containing limits (max, min) for resources (cores, memory etc.).
	// The resource limiter is passed in at build time and normally should not be changed.
	// Unless the cloud side explicitly tells users about the change, altering it will confuse them.
	GetResourceLimiter() (*ResourceLimiter, error)

	// GPULabel returns the label added to nodes with GPU resource.
	// GPU-related: if the cluster uses GPU resources, return the corresponding value.
	// hack: we assume anything which is not cpu/memory to be a gpu.
	GPULabel() string

	// GetAvailableGPUTypes return all available GPU types cloud provider supports.
	// Same as above.
	GetAvailableGPUTypes() map[string]struct{}

	// Cleanup cleans up open resources before the cloud provider is destroyed, i.e. go routines etc.
	// The CloudProvider is initialized only once, at startup; if anything needs
	// cleaning up after each loop, handle it here.
	Cleanup() error

	// Refresh is called before every main loop and can be used to dynamically update cloud provider state.
	// In particular the list of node groups returned by NodeGroups can change as a result of CloudProvider.Refresh().
	// Called from StaticAutoscaler RunOnce.
	Refresh() error
}

// NodeGroup contains configuration info and functions to control a set
// of nodes that have the same capacity and set of labels.
type NodeGroup interface {
	// MaxSize returns maximum size of the node group.
	MaxSize() int

	// MinSize returns minimum size of the node group.
	MinSize() int

	// TargetSize returns the current target size of the node group. It is possible that the
	// number of nodes in Kubernetes is different at the moment but should be equal
	// to Size() once everything stabilizes (new nodes finish startup and registration or
	// removed nodes are deleted completely). Implementation required.
	// Returns the node count of the scaling group, which is not necessarily
	// the same as the node count in Kubernetes.
	TargetSize() (int, error)

	// IncreaseSize increases the size of the node group. To delete a node you need
	// to explicitly name it and use DeleteNode. This function should wait until
	// node group size is updated. Implementation required.
	// The scale-up method: increases the node count of the scaling group.
	IncreaseSize(delta int) error

	// DeleteNodes deletes nodes from this node group. Error is returned either on
	// failure or if the given node doesn't belong to this node group. This function
	// should wait until node group size is updated. Implementation required.
	// The nodes to delete must belong to this node group.
	DeleteNodes([]*apiv1.Node) error

	// DecreaseTargetSize decreases the target size of the node group. This function
	// doesn't permit to delete any existing node and can be used only to reduce the
	// request for new nodes that have not been yet fulfilled. Delta should be negative.
	// It is assumed that cloud provider will not delete the existing nodes when there
	// is an option to just decrease the target. Implementation required.
	// Cluster Autoscaler calls this method to reconcile when the Kubernetes node count
	// and the scaling group's node count have disagreed for a long time.
	DecreaseTargetSize(delta int) error

	// Id returns an unique identifier of the node group.
	Id() string

	// Debug returns a string containing all information regarding this node group.
	Debug() string

	// Nodes returns a list of all nodes that belong to this node group.
	// It is required that Instance objects returned by this method have Id field set.
	// Other fields are optional.
	// This list should include also instances that might have not become a kubernetes node yet.
	// Returns every node in the scaling group, even those that have not yet
	// become Kubernetes nodes.
	Nodes() ([]Instance, error)

	// TemplateNodeInfo returns a schedulernodeinfo.NodeInfo structure of an empty
	// (as if just started) node. This will be used in scale-up simulations to
	// predict what would a new node look like if a node group was expanded. The returned
	// NodeInfo is expected to have a fully populated Node object, with all of the labels,
	// capacity and allocatable information as well as all pods that are started on
	// the node by default, using manifest (most likely only kube-proxy). Implementation optional.
	// Cluster Autoscaler matches node information to node groups when evaluating resources;
	// for an empty node group it uses this method to synthesize the node information.
	TemplateNodeInfo() (*schedulernodeinfo.NodeInfo, error)

	// Exist checks if the node group really exists on the cloud provider side. Allows to tell the
	// theoretical node group from the real one. Implementation required.
	Exist() bool

	// Create creates the node group on the cloud provider side. Implementation optional.
	// Used together with CloudProvider.NewNodeGroup.
	Create() (NodeGroup, error)

	// Delete deletes the node group on the cloud provider side.
	// This will be executed only for autoprovisioned node groups, once their size drops to 0.
	// Implementation optional.
	Delete() error

	// Autoprovisioned returns true if the node group is autoprovisioned. An autoprovisioned group
	// was created by CA and can be deleted when scaled to 0.
	Autoprovisioned() bool
}
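To make the contract of IncreaseSize and DecreaseTargetSize more concrete, here is a minimal in-memory sketch. It is not taken from any real provider: `memGroup` and its fields are hypothetical, no cloud API is called, and the bounds checks reflect one reasonable reading of the interface comments (positive delta within MaxSize for scale-up; negative delta that never drops below the number of existing nodes for target reduction).

```go
package main

import (
	"errors"
	"fmt"
)

// memGroup is a hypothetical in-memory node group used only for illustration.
type memGroup struct {
	min, max   int
	target     int // desired size of the scaling group
	registered int // nodes that already exist and must not be removed here
}

// IncreaseSize grows the target size; delta must be positive and stay within max.
func (g *memGroup) IncreaseSize(delta int) error {
	if delta <= 0 {
		return errors.New("size increase must be positive")
	}
	if g.target+delta > g.max {
		return fmt.Errorf("size increase too large: desired %d, max %d", g.target+delta, g.max)
	}
	g.target += delta
	return nil
}

// DecreaseTargetSize only cancels capacity that has not materialized yet;
// delta must be negative and must not cut into existing nodes.
func (g *memGroup) DecreaseTargetSize(delta int) error {
	if delta >= 0 {
		return errors.New("size decrease must be negative")
	}
	if g.target+delta < g.registered {
		return fmt.Errorf("cannot decrease below %d existing nodes", g.registered)
	}
	g.target += delta
	return nil
}

func main() {
	g := &memGroup{min: 1, max: 5, target: 3, registered: 2}
	fmt.Println(g.IncreaseSize(2), g.target)
	fmt.Println(g.IncreaseSize(1), g.target)
	fmt.Println(g.DecreaseTargetSize(-3), g.target)
	fmt.Println(g.DecreaseTargetSize(-1), g.target)
}
```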

