Scheduling Framework Migration
Objective
The scheduler framework is now completely implemented; the next step is to move the
scheduler core into it. The goal of this doc is to devise a reasonably safe plan for migrating to
the scheduler framework.
Migration Plan
The core of the migration is re-writing predicates and priorities, as well as the scheduler’s
current binding and queue-sorting logic (possibly preemption as well), as framework plugins.
Specifically, predicates and priorities have two major dependencies that we need to take into
account while migrating:
1. The scheduler’s Predicate/Priority Policy configuration allows admins to specify which
predicates/priorities the scheduler should run. Users of this configuration API should
migrate to the Plugins component config API. However, we can’t deprecate the Policy
API until component config is GA, and so we have to support both APIs until we are able
to deprecate the Policy API.
2. Three components import predicates logic: Cluster Autoscaler, Kubelet and DaemonSet
Controller. Those components should migrate to calling the Plugins directly (details of
how this may look are TBD).
Milestone 1
Goal
Create migration paths.
Steps
1. Write equivalent Plugins for currently supported priorities and predicates. Those plugins
should only call the existing predicates/priorities logic, NOT move it.
2. Related to scheduler configuration
a. Add a translation layer that converts predicate/priority scheduler configuration
(whether it came from the AlgorithmProvider or Policy API) to a Plugin config for
ones that have an equivalent Plugin implementation.
b. Until we are able to deprecate the Policy API, we will continue to support
configuring predicates/priorities via either the Policy API or the Plugins API, but
only one of the two can be specified at a time.
c. Create a component config API to allow creating custom Plugins to support the
migration of custom predicates/priorities (e.g., custom
RequestedToCapacityRatio scoring Plugin).
3. Related to the CA, DaemonSet Controller and Kubelet dependencies:
a. Create an interface through which they can call Filter Plugins.
All of this will have to pass integration tests. In addition, we will be able to move
predicates/priorities to the framework incrementally instead of switching all of them at once;
this is facilitated by the configuration translation layer, which allows us to add/remove the
mappings as we wish.
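The translation layer from steps 2a and 2b can be sketched roughly as follows. This is a minimal illustration, not the scheduler's actual code: the `predicateToPlugin` table, the plugin names in it, and the `translate` and `validateSource` helpers are all hypothetical. Predicates without a mapping keep running through the legacy code path, which is what makes the incremental move possible.

```go
package main

import (
	"errors"
	"fmt"
)

// predicateToPlugin is a hypothetical mapping from legacy predicate names to
// the names of the Filter plugins that wrap them. Entries are added (or
// removed) as predicates are migrated; the plugin names are illustrative.
var predicateToPlugin = map[string]string{
	"PodFitsHostPorts":       "HostPortsPlugin",
	"PodToleratesNodeTaints": "TaintTolerationPlugin",
}

// validateSource enforces that only one configuration source is specified.
func validateSource(hasPolicy, hasPlugins bool) error {
	if hasPolicy && hasPlugins {
		return errors.New("specify either a Policy or a Plugins config, not both")
	}
	return nil
}

// translate splits the configured predicates into those that already have a
// plugin equivalent (returned as plugin names) and those that must still run
// through the legacy predicate code path.
func translate(predicates []string, mapping map[string]string) (plugins, legacy []string) {
	for _, p := range predicates {
		if name, ok := mapping[p]; ok {
			plugins = append(plugins, name)
		} else {
			legacy = append(legacy, p)
		}
	}
	return plugins, legacy
}

func main() {
	configured := []string{"PodFitsHostPorts", "NoDiskConflict", "PodToleratesNodeTaints"}
	plugins, legacy := translate(configured, predicateToPlugin)
	fmt.Println("as plugins:", plugins)
	fmt.Println("still legacy:", legacy)
}
```

Adding or removing an entry in the table is the only change needed to switch a single predicate between the two code paths.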
Outcome
1. All predicates/priorities as Plugins.
2. An API in the scheduler component config that allows creating custom Plugins.
3. An interface for CA, DaemonSet Controller and Kubelet to call Filter Plugins.
Milestone 2
Goal
Clean up migration paths.
Steps
1. Move CA, DaemonSet Controller and Kubelet to call the new API instead of calling the
predicates directly.
2. Clean up core scheduler code calling into predicates/priorities (mostly in
core/generic_scheduler.go).
3. Move predicates/priorities code into the Plugins (instead of just calling the
predicates/priorities functions).
a. Start with the simple ones as classified later in the doc.
b. The move should be just copy/paste as much as possible: we shouldn’t try to
optimize the code while moving to avoid introducing new sources of errors.
c. Identify opportunities for optimization, and open issues for them.
4. Deprecate the Policy API and remove the Policy to Plugin config translation layer.
Outcome
1. Scheduler features implemented natively as Plugins.
2. Scheduler configuration is more streamlined, only done via component config.
Milestone 3
Goal
Leverage the power of the framework.
Steps
1. Framework in GA.
2. Optimize the features if we happen to identify opportunities during Milestone 2.
3. Perhaps deprecate generic scheduler in favor of the framework (embed its logic in the
framework implementation)
Predicates as Plugins
The core logic of Predicates will be implemented as Filters.
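As a rough illustration of the Milestone 1 approach (a plugin that only calls the existing predicate logic, without moving it), the sketch below wraps a toy predicate in an adapter. All types here (`Pod`, `Node`, `Status`, `FilterPlugin`, `predicateAdapter`) are simplified, hypothetical stand-ins and not the framework's real signatures; `podFitsHost` is a toy version of the real predicate.

```go
package main

import "fmt"

// Pod and Node are simplified stand-ins for the real API objects.
type Pod struct {
	Name     string
	NodeName string // requested node; empty means "any node"
}

type Node struct{ Name string }

// predicateFn mimics the general shape of a legacy predicate:
// whether the pod fits, failure reasons, and an error.
type predicateFn func(pod *Pod, node *Node) (bool, []string, error)

// podFitsHost is a toy version of the PodFitsHost predicate.
func podFitsHost(pod *Pod, node *Node) (bool, []string, error) {
	if pod.NodeName == "" || pod.NodeName == node.Name {
		return true, nil, nil
	}
	return false, []string{"node did not match the requested hostname"}, nil
}

// Status is a hypothetical, reduced form of a framework status.
type Status struct {
	Success bool
	Reasons []string
}

// FilterPlugin is a hypothetical, reduced Filter extension point.
type FilterPlugin interface {
	Name() string
	Filter(pod *Pod, node *Node) Status
}

// predicateAdapter exposes a legacy predicate as a FilterPlugin by calling
// it and translating the result; the predicate logic itself is not moved.
type predicateAdapter struct {
	name string
	pred predicateFn
}

func (a predicateAdapter) Name() string { return a.name }

func (a predicateAdapter) Filter(pod *Pod, node *Node) Status {
	fits, reasons, err := a.pred(pod, node)
	if err != nil {
		return Status{Success: false, Reasons: []string{err.Error()}}
	}
	return Status{Success: fits, Reasons: reasons}
}

func main() {
	var p FilterPlugin = predicateAdapter{name: "PodFitsHost", pred: podFitsHost}
	st := p.Filter(&Pod{Name: "web", NodeName: "node-a"}, &Node{Name: "node-b"})
	fmt.Println(p.Name(), st.Success, st.Reasons)
}
```

Because the adapter only delegates, Milestone 2 can later replace `pred` with logic that lives inside the plugin without changing its callers.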
Default Predicates
Simple predicates
Those should not require more than the Filter extension point, and include:
● CheckNodeUnschedulablePredicate
● PodFitsResources (GeneralPred)
● PodFitsHost (GeneralPred)
● PodFitsHostPorts (GeneralPred)
● PodMatchNodeSelector (GeneralPred, includes node affinity)
● NoDiskConflict
● PodToleratesNodeTaints
● CSIMaxVolumeLimitChecker
● VolumeBindingChecker
● VolumeZoneChecker
We will also add a common pre-filter plugin named GeneralPreFilter that pre-computes generic
metadata used by different plugins and stores it in PluginContext (which is shared between
plugins). For now, GeneralPreFilter will produce and store the following metadata:
● Pod resource requests as computed by GetResourceRequest
● Pod QoS class as computed by GetPodQOS
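The pre-compute-then-share pattern described above might look like the sketch below. It is an illustration under assumptions: `PluginContext` is reduced to a plain key-value store, the key name and the `generalPreFilter` and `fitsResources` helpers are hypothetical, the resource sum ignores init containers, and the QoS class is left out for brevity.

```go
package main

import "fmt"

// Container and Pod are simplified stand-ins for the real API objects.
type Container struct {
	CPUMilli int64
	MemBytes int64
}

type Pod struct {
	Name       string
	Containers []Container
}

// Resource holds the aggregated request of a pod.
type Resource struct {
	MilliCPU int64
	Memory   int64
}

// PluginContext is a stand-in for the per-cycle store shared between plugins.
type PluginContext struct {
	data map[string]interface{}
}

func NewPluginContext() *PluginContext {
	return &PluginContext{data: map[string]interface{}{}}
}

func (c *PluginContext) Write(key string, v interface{}) { c.data[key] = v }

func (c *PluginContext) Read(key string) (interface{}, bool) {
	v, ok := c.data[key]
	return v, ok
}

// resourceRequestKey is a hypothetical key under which the request is stored.
const resourceRequestKey = "GeneralPreFilter/ResourceRequest"

// generalPreFilter sums the container requests once per scheduling cycle and
// stores the result so that Filter plugins do not recompute it per node.
func generalPreFilter(pc *PluginContext, pod *Pod) {
	var total Resource
	for _, c := range pod.Containers {
		total.MilliCPU += c.CPUMilli
		total.Memory += c.MemBytes
	}
	pc.Write(resourceRequestKey, total)
}

// fitsResources is a Filter-style consumer of the precomputed request.
func fitsResources(pc *PluginContext, freeCPU, freeMem int64) (bool, error) {
	v, ok := pc.Read(resourceRequestKey)
	if !ok {
		return false, fmt.Errorf("%s not found; did the pre-filter run?", resourceRequestKey)
	}
	req, ok := v.(Resource)
	if !ok {
		return false, fmt.Errorf("unexpected type stored under %s", resourceRequestKey)
	}
	return req.MilliCPU <= freeCPU && req.Memory <= freeMem, nil
}

func main() {
	pc := NewPluginContext()
	pod := &Pod{Name: "web", Containers: []Container{
		{CPUMilli: 250, MemBytes: 64 << 20},
		{CPUMilli: 500, MemBytes: 128 << 20},
	}}
	generalPreFilter(pc, pod)
	ok, err := fitsResources(pc, 1000, 256<<20)
	fmt.Println(ok, err)
}
```

The consumer returns an error rather than silently recomputing when the key is missing, which makes a misconfigured plugin set visible early.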
Complex predicates
Those will require a dedicated pre-filter and have potential for optimizations (or for shooting
ourselves in the foot :)). They include:
● EvenPodsSpreadPredicate
● InterPodAffinityMatches
Custom Predicates
Policy allows creating two types of custom predicates:
CheckNodeLabelPresence
This custom predicate allows configuring the scheduler with a set of {label string, exist
boolean} tuples that the scheduler uses to filter a node based on whether all of those labels
exist/not-exist, regardless of their value.
In a sense, this custom predicate offers some sort of a default node affinity. It is not possible,
however, to map it directly to the canonical nodeAffinity API since we don’t really have a “node
anti-affinity” API based on labels (we have node taints, but they don’t map well to what this
custom predicate offers).
I think it is reasonable to migrate this predicate as a custom Plugin that can be configured with
the set of desired {label string, exist boolean} tuples using PluginConfig.Args.
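A sketch of how such a plugin could consume its arguments is shown below. The JSON schema (`requirements`, `label`, `exist`) and the helper names are assumptions made for illustration; the source does not define the shape of the arguments passed through PluginConfig.Args.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// labelRequirement is one {label string, exist boolean} tuple.
type labelRequirement struct {
	Label string `json:"label"`
	Exist bool   `json:"exist"`
}

// nodeLabelArgs is a hypothetical schema for the plugin's arguments.
type nodeLabelArgs struct {
	Requirements []labelRequirement `json:"requirements"`
}

// parseArgs decodes the raw plugin arguments.
func parseArgs(raw []byte) (nodeLabelArgs, error) {
	var args nodeLabelArgs
	err := json.Unmarshal(raw, &args)
	return args, err
}

// nodePasses reports whether every requirement holds for the node.
// Only the presence of a label matters; its value is ignored.
func nodePasses(nodeLabels map[string]string, args nodeLabelArgs) bool {
	for _, r := range args.Requirements {
		_, present := nodeLabels[r.Label]
		if present != r.Exist {
			return false
		}
	}
	return true
}

func main() {
	raw := []byte(`{"requirements":[{"label":"zone","exist":true},{"label":"retiring","exist":false}]}`)
	args, err := parseArgs(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(nodePasses(map[string]string{"zone": "us-east1-b"}, args))
	fmt.Println(nodePasses(map[string]string{"zone": "us-east1-b", "retiring": ""}, args))
}
```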
CheckServiceAffinity
This custom predicate allows configuring the scheduler with a set of labels that will be used to
represent the equivalent of “a default inter-pod affinity constraint” over pods of a Service.
Basically the set of node labels specified is equivalent to “topologyKey”.
This custom predicate can be implemented in terms of podAffinity, so instead of migrating it as a
separate Plugin, we could try to provide a new API in ComponentConfig to officially represent a
default podAffinity, and define better semantics as to how that works for Pods that explicitly have
podAffinity set.
First, once CSI is GA, MaxPDVolumeCountChecker and its variants can be deprecated. This
includes:
● MaxEBSVolumeCountPred
● MaxGCEPDVolumeCountPred
● MaxAzureDiskVolumeCountPred
● MaxCinderVolumeCountPred
Second, the TaintBasedEvictions feature is in Beta; it will graduate to GA in 1.17 and should
allow us to deprecate the following predicates:
● CheckNodeMemoryPressurePredicate
● CheckNodePIDPressurePredicate
● CheckNodeDiskPressurePredicate
● CheckNodeConditionPredicate
Priorities as Plugins
Priorities will be implemented as Score plugins (with potentially other extension points to
precompute data).
Default Priorities
Simple Priorities
● NodePreferAvoidPods
● TaintToleration
● ImageLocality
● NodeAffinity
We could add a single PreFilter plugin that produces metadata shared between all those plugins
and stored in PluginContext. This is equivalent to PriorityMetadata.
Complex Priorities
● InterPodAffinityPriority
● Resource utilization-based variants (complex just because multiple plugins will
potentially share a good amount of code)
○ BalancedResourcePriority
○ LeastResourcePriority
○ MostRequestedPriority (not a defaultScheduler priority)
○ RequestedToCapacity (not a defaultScheduler priority)
● EvenPodsSpread (flag gated)
Custom Priorities
NodeLabelPriority
Implemented by CalculateNodeLabelPriorityMap. This is a “soft” version of the
CheckNodeLabelPresence custom predicate. As with CheckNodeLabelPresence, we will
likely need to provide a custom Plugin that can be configured via PluginConfig.Args to continue
to support this feature.
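To illustrate what “soft” means here, the sketch below scores a node instead of filtering it. The all-or-nothing scoring rule, the `maxScore` value of 10, and the `labelPreference` type are assumptions made for this example, not details taken from this doc.

```go
package main

import "fmt"

// maxScore is an assumed upper bound for a node score.
const maxScore = 10

// labelPreference is a hypothetical argument type: a label and whether its
// presence (true) or absence (false) is preferred.
type labelPreference struct {
	Label    string
	Presence bool
}

// scoreNode gives the full score to nodes that match the preference and zero
// to the rest; unlike a Filter, it never removes a node from consideration.
func scoreNode(nodeLabels map[string]string, pref labelPreference) int {
	_, exists := nodeLabels[pref.Label]
	if exists == pref.Presence {
		return maxScore
	}
	return 0
}

func main() {
	pref := labelPreference{Label: "ssd", Presence: true}
	fmt.Println(scoreNode(map[string]string{"ssd": "true"}, pref))
	fmt.Println(scoreNode(map[string]string{}, pref))
}
```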
ServiceAntiAffinityPriority
This basically builds the equivalent of a default soft anti-affinity for the pods of a Service. If
we can rely on pod anti-affinity to implement this priority, then we may not need to migrate it.
RequestedToCapacityRatio
We will need to provide a Plugin for this one, and the arguments will be provided via
PluginConfig.Args. The Plugin implementing this one will likely be shared with
LeastResourcePriority and BalancedResourcePriority.
Since before ComponentConfig existed, the scheduler has allowed admins to specify a Policy to
configure the scheduler using two flags: one that loads Policy from a file, the other from a
ConfigMap. After we introduced ComponentConfig to the scheduler, those flags were deprecated
and a Policy file/ConfigMap can now be specified through a ComponentConfig parameter. In
this context, deprecating Policy has two benefits:
First, it simplifies configuring the scheduler. ComponentConfig itself is loaded from a file whose
path is provided to the scheduler via a flag named “config”, which means the scheduler
configuration is practically split between two files, ComponentConfig and Policy. Consolidating
all configuration options in a single configuration file simplifies configuring the scheduler, and to
do that we need to roll all options currently specified in Policy into the scheduler’s
ComponentConfig.
Second, it simplifies the migration to the scheduler framework. Currently the main use of Policy
is to configure the scheduler with the set of predicates and priorities to run in each scheduling
cycle. Predicates and priorities will be re-written as framework plugins, which can already be
configured via the scheduler’s ComponentConfig. By deprecating Policy config we remove the
need for a permanent translation layer between the Policy API, which allows specifying
predicates/priorities, and framework Plugins.