The node must preserve node stability when available compute resources are low. This is especially important when dealing with incompressible resources such as memory or disk. If either resource is exhausted, the node becomes unstable.
If swap memory is not disabled, the node cannot recognize that it is under MemoryPressure. To take advantage of memory-based evictions, operators must disable swap.
Using eviction policies, a node can proactively monitor for and protect against total starvation of a compute resource.
In cases where a node is running low on available resources, it can proactively fail one or more pods in order to reclaim the starved resource using an eviction policy. When the node fails a pod, it terminates all containers in the pod, and the PodPhase is transitioned to Failed.
Platform administrators can configure eviction settings within the node-config.yaml file.
The node can be configured to trigger eviction decisions on the signals described in the table below. The Description column defines the value of each signal in terms of the node summary API.
To view the signals:
curl <certificate details> \
  https://<master>/api/v1/nodes/<node>/proxy/stats/summary
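Eviction Signal | Description |
---|---|
memory.available | memory.available = node.status.capacity[memory] - node.stats.memory.workingSet |
nodefs.available | nodefs.available = node.stats.fs.available |
nodefs.inodesFree | nodefs.inodesFree = node.stats.fs.inodesFree |
imagefs.available | imagefs.available = node.stats.runtime.imagefs.available |
imagefs.inodesFree | imagefs.inodesFree = node.stats.runtime.imagefs.inodesFree |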
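As a rough illustration of how the signals relate to the node summary API, the sketch below derives them from a parsed summary response. The JSON field names (workingSetBytes, availableBytes, inodesFree, imageFs) and the five signal names are assumptions based on the upstream Kubernetes kubelet, not something this page confirms; the function name eviction_signals and the sample values are hypothetical.

```python
def eviction_signals(summary: dict, memory_capacity_bytes: int) -> dict:
    """Derive eviction signal values from a parsed stats/summary response.

    The field names used here are assumed from the upstream kubelet
    summary API and may differ between releases.
    """
    node = summary["node"]
    image_fs = node["runtime"]["imageFs"]
    return {
        # Available memory = capacity minus the working set.
        "memory.available": memory_capacity_bytes - node["memory"]["workingSetBytes"],
        "nodefs.available": node["fs"]["availableBytes"],
        "nodefs.inodesFree": node["fs"]["inodesFree"],
        "imagefs.available": image_fs["availableBytes"],
        "imagefs.inodesFree": image_fs["inodesFree"],
    }


# Illustrative, hand-written values; not output from a real node.
sample_summary = {
    "node": {
        "memory": {"workingSetBytes": 6 * 2**30},
        "fs": {"availableBytes": 20 * 2**30, "inodesFree": 500000},
        "runtime": {"imageFs": {"availableBytes": 8 * 2**30, "inodesFree": 300000}},
    }
}
signals = eviction_signals(sample_summary, memory_capacity_bytes=10 * 2**30)
```

In practice the summary dictionary would come from parsing the JSON returned by the curl command above.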
The node supports two file system partitions when detecting disk pressure.
The nodefs file system that the node uses for local disk volumes, daemon logs, and so on (for example, the file system that provides /).
The imagefs file system that the container runtime uses for storing images and individual container writable layers.
The node auto-discovers these file systems using cAdvisor.
If you store volumes and logs in a dedicated file system, the node will not monitor that file system at this time.
As of OpenShift Origin 3.4, the node supports the ability to trigger eviction decisions based on disk pressure. Operators must opt in to enable disk-based evictions. Prior to evicting pods due to disk pressure, the node will also perform container and image garbage collection. In future releases, garbage collection will be deprecated in favor of a configuration based purely on disk eviction.
You can configure a node to specify eviction thresholds, which trigger the node to reclaim resources.
Eviction thresholds can be soft, for when you allow a grace period before reclaiming resources, and hard, for when the node takes immediate action when a threshold is met.
Thresholds are configured in the following form:
<eviction_signal><operator><quantity>
Valid eviction-signal tokens are those defined in the eviction signals table above.
Valid operator tokens are <.
Valid quantity tokens must match the quantity representation used by Kubernetes.
An eviction threshold can be expressed as a percentage if it ends with the % token.
For example, if an operator has a node with 10Gi of memory, and that operator wants to induce eviction if available memory falls below 1Gi, an eviction threshold for memory can be specified as either of the following:
memory.available<1Gi
memory.available<10%
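The threshold form can be made concrete with a small parser. This is a minimal sketch, not the node's actual implementation: the names Threshold and parse_threshold are hypothetical, and only a subset of Kubernetes quantity suffixes is handled.

```python
import re
from dataclasses import dataclass

# Subset of Kubernetes quantity suffixes (assumption: enough for illustration).
_SUFFIXES = {
    "": 1,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}

# <eviction_signal><operator><quantity>, where the only operator is "<".
_THRESHOLD_RE = re.compile(r"^([A-Za-z.]+)(<)(\d+(?:\.\d+)?)(%|[A-Za-z]*)$")


@dataclass
class Threshold:
    signal: str
    operator: str
    value: float  # the number as written
    unit: str     # "%" or a quantity suffix

    def resolve(self, capacity: float) -> float:
        """Return the threshold in base units for the given capacity."""
        if self.unit == "%":
            return capacity * self.value / 100.0
        return self.value * _SUFFIXES[self.unit]


def parse_threshold(text: str) -> Threshold:
    match = _THRESHOLD_RE.match(text.strip())
    if match is None:
        raise ValueError(f"invalid eviction threshold: {text!r}")
    signal, operator, number, unit = match.groups()
    if unit != "%" and unit not in _SUFFIXES:
        raise ValueError(f"unknown quantity suffix: {unit!r}")
    return Threshold(signal, operator, float(number), unit)


capacity = 10 * 2**30  # the 10Gi node from the example above
absolute = parse_threshold("memory.available<1Gi").resolve(capacity)
relative = parse_threshold("memory.available<10%").resolve(capacity)
```

On the 10Gi node, both forms resolve to the same number of bytes, which is why either can be used.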
A soft eviction threshold pairs an eviction threshold with a required administrator-specified grace period. The node does not reclaim resources associated with the eviction signal until that grace period is exceeded. If no grace period is provided, the node errors on startup.
In addition, if a soft eviction threshold is met, an operator can specify a maximum allowed pod termination grace period to use when evicting pods from the node. If specified, the node uses the lesser of pod.Spec.TerminationGracePeriodSeconds and the maximum-allowed grace period. If not specified, the node kills pods immediately with no graceful termination.
To configure soft eviction thresholds, the following flags are supported:
eviction-soft: a set of eviction thresholds (for example, memory.available<1.5Gi) that, if met over a corresponding grace period, triggers a pod eviction.
eviction-soft-grace-period: a set of eviction grace periods (for example, memory.available=1m30s) that correspond to how long a soft eviction threshold must hold before triggering a pod eviction.
eviction-max-pod-grace-period: the maximum-allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
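The interplay of the three soft-eviction settings can be sketched as follows. This is an illustrative model under stated assumptions, not the node's code: the class SoftThreshold and the function termination_grace_period are hypothetical names, and time is passed in explicitly as seconds.

```python
from typing import Optional


class SoftThreshold:
    """Tracks how long a soft eviction threshold has been continuously met."""

    def __init__(self, limit: float, grace_period_s: float):
        self.limit = limit                    # e.g. 1.5Gi in bytes
        self.grace_period_s = grace_period_s  # e.g. 90 for 1m30s
        self._met_since: Optional[float] = None

    def should_evict(self, observed: float, now: float) -> bool:
        """Return True once the signal has stayed below the limit
        for longer than the grace period."""
        if observed >= self.limit:
            # Threshold no longer met: reset the timer.
            self._met_since = None
            return False
        if self._met_since is None:
            self._met_since = now
        return now - self._met_since > self.grace_period_s


def termination_grace_period(pod_grace_s: int, max_pod_grace_s: Optional[int]) -> int:
    """Lesser of the pod's own grace period and the configured maximum.

    With no maximum configured, pods are killed with no graceful termination.
    """
    if max_pod_grace_s is None:
        return 0
    return min(pod_grace_s, max_pod_grace_s)
```

Resetting the timer whenever the signal recovers reflects the requirement that the threshold must hold for the whole grace period before an eviction is triggered.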
A hard eviction threshold has no grace period and, if observed, the node takes immediate action to reclaim the associated starved resource. If a hard eviction threshold is met, the node kills the pod immediately with no graceful termination.
To configure hard eviction thresholds, the following flag is supported:
eviction-hard: a set of eviction thresholds (for example, memory.available<1Gi) that, if met, triggers a pod eviction.
If a node is oscillating above and below a soft eviction threshold, but not exceeding its associated grace period, the corresponding node condition oscillates between true and false, which can confuse the scheduler.
To protect against this oscillation, set the following flag to control how long the node must wait before transitioning out of a pressure condition:
eviction-pressure-transition-period: the duration that the node has to wait before transitioning out of an eviction pressure condition.
Before toggling the condition back to false, the node ensures that it has not observed a met eviction threshold for the specified pressure condition for the period specified.
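The damping effect of the transition period can be modeled in a few lines. This is a simplified sketch, assuming one observation per housekeeping pass; the class name PressureCondition and its update method are hypothetical.

```python
from typing import Optional


class PressureCondition:
    """Node condition that clears only after a quiet transition period."""

    def __init__(self, transition_period_s: float):
        self.transition_period_s = transition_period_s
        self.active = False
        self._last_met: Optional[float] = None

    def update(self, threshold_met: bool, now: float) -> bool:
        """Record one observation and return the reported condition."""
        if threshold_met:
            # Any met threshold sets the condition and restarts the clock.
            self._last_met = now
            self.active = True
        elif self.active and now - self._last_met >= self.transition_period_s:
            # No met threshold seen for the whole period: clear the condition.
            self.active = False
        return self.active
```

A signal that dips below the threshold every minute or so therefore keeps the condition continuously true, rather than flipping it on every observation.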
The node evaluates and monitors eviction thresholds every 10 seconds, and this value cannot be modified. This is the housekeeping interval.
The node can map one or more eviction signals to a corresponding node condition.
If an eviction threshold is met, independent of its associated grace period, the node reports a condition indicating that the node is under pressure.
The following node conditions correspond to the specified eviction signals.
Node Condition | Eviction Signal | Description |
---|---|---|
MemoryPressure | memory.available | Available memory on the node has satisfied an eviction threshold. |
DiskPressure | nodefs.available, nodefs.inodesFree, imagefs.available, or imagefs.inodesFree | Available disk space and inodes on either the node’s root file system or image file system have satisfied an eviction threshold. |
While a pressure condition is set, the node continues to report node status updates at the frequency specified by the node-status-update-frequency argument, which defaults to 10s.
If an eviction criterion is satisfied, the node initiates the process of reclaiming the pressured resource until it observes that the signal no longer meets its defined threshold. During this time, the node does not support scheduling any new pods.
The node attempts to reclaim node-level resources prior to evicting end-user pods. If disk pressure is observed, the node reclaims node-level resources differently if the machine has a dedicated imagefs configured for the container runtime.
If an eviction threshold is met and the grace period has passed, the node initiates the process of evicting pods until it observes that the signal no longer meets its defined threshold.
The node ranks pods for eviction by their quality of service, and, among those with the same quality of service, by the consumption of the starved compute resource relative to the pod’s scheduling request.
BestEffort: pods that consume the most of the starved resource are failed first.
Burstable: pods that consume the most of the starved resource relative to their request for that resource are failed first. If no pod has exceeded its request, the strategy targets the largest consumer of the starved resource.
Guaranteed: pods that consume the most of the starved resource relative to their request are failed first. If no pod has exceeded its request, the strategy targets the largest consumer of the starved resource.
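The ranking rules above can be sketched as a sort. This is an approximation under stated assumptions: "relative to their request" is interpreted as usage minus request, and the Pod type and rank_for_eviction function are hypothetical names, not part of the node.

```python
from dataclasses import dataclass
from typing import List

# Lower rank is evicted earlier.
QOS_RANK = {"BestEffort": 0, "Burstable": 1, "Guaranteed": 2}


@dataclass
class Pod:
    name: str
    qos: str
    usage: int        # consumption of the starved resource
    request: int = 0  # BestEffort pods make no request


def rank_for_eviction(pods: List[Pod]) -> List[Pod]:
    """Order pods so that the first element is the first eviction candidate."""
    ranked: List[Pod] = []
    for qos in sorted(QOS_RANK, key=QOS_RANK.get):
        tier = [p for p in pods if p.qos == qos]
        if any(p.usage > p.request for p in tier):
            # Some pod exceeded its request: rank by the excess over request.
            tier.sort(key=lambda p: p.usage - p.request, reverse=True)
        else:
            # No pod exceeded its request: target the largest consumer.
            tier.sort(key=lambda p: p.usage, reverse=True)
        ranked.extend(tier)
    return ranked
```

Because BestEffort pods have a request of zero in this model, ranking them by excess over request is the same as ranking them by raw consumption.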
A Guaranteed pod will never be evicted because of another pod’s resource consumption unless a system daemon (node, docker, journald, etc.) is consuming more resources than were reserved through system-reserved or kube-reserved allocations, or the node has only Guaranteed pods remaining.
In the latter case, the node evicts the Guaranteed pod that least impacts node stability and limits the impact of the unexpected consumption on other Guaranteed pods.
Local disk is a BestEffort resource. If necessary, the node evicts pods one at a time to reclaim disk when DiskPressure is encountered. The node ranks pods by quality of service. If the node is responding to inode starvation, it reclaims inodes by evicting pods with the lowest quality of service first. If the node is responding to a lack of available disk, it ranks pods within each quality of service by the amount of local disk they consume and evicts the largest consumers first.
At this time, volumes that are backed by local disk are only deleted when a pod is deleted from the API server instead of when the pod is terminated. As a result, if a pod is evicted as a consequence of consuming too much local disk, the disk used by its volumes is not reclaimed until the pod is deleted from the API server. This will be remedied in a future release.
The scheduler views node conditions when placing additional pods on the node. For example, if the node has an eviction threshold like the following:
eviction-hard is "memory.available<500Mi"
and available memory falls below 500Mi, the node reports the MemoryPressure condition as true in Node.Status.Conditions.
Node Condition | Scheduler Behavior |
---|---|
MemoryPressure | If a node reports this condition, the scheduler will not place BestEffort pods on that node. |
DiskPressure | If a node reports this condition, the scheduler will not place any additional pods on that node. |
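The scheduler behavior in the table can be expressed as a simple predicate. This sketch assumes, as in upstream Kubernetes, that MemoryPressure blocks only BestEffort pods while DiskPressure blocks all new pods; the function name scheduler_admits is hypothetical.

```python
from typing import Dict


def scheduler_admits(pod_qos: str, node_conditions: Dict[str, bool]) -> bool:
    """Return True if a pod of the given QoS class may be placed on the node."""
    if node_conditions.get("DiskPressure", False):
        # Disk pressure: no additional pods of any class.
        return False
    if node_conditions.get("MemoryPressure", False) and pod_qos == "BestEffort":
        # Memory pressure: only BestEffort pods are kept off the node (assumption).
        return False
    return True
```

For example, a Burstable pod can still land on a node reporting only MemoryPressure, while nothing is placed on a node reporting DiskPressure.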
Consider the following scenario:
Node memory capacity of 10Gi.
The operator wants to reserve 10% of memory capacity for system daemons (kernel, node, etc.).
The operator wants to evict pods at 95% memory utilization to reduce thrashing and incidence of system OOM.
A node reports two values:
Capacity: How much resource is on the machine.
Allocatable: How much resource is made available for scheduling.
The goal is to allow the scheduler to fully allocate a node and to not have evictions occur.
Evictions should only occur if pods use more than their requested amount of resource.
To facilitate this scenario, the node configuration file (the node-config.yaml file) is modified as follows:
kubeletArguments:
  eviction-hard: # (1)
    - "memory.available<500Mi"
  system-reserved:
    - "memory=1.5Gi"
(1) This threshold can be either eviction-hard or eviction-soft.
Soft eviction usage is more common when you are targeting a certain level of utilization, but can tolerate temporary spikes. It is recommended that the soft eviction threshold is always less than the hard eviction threshold, but the time period is operator specific. The system reservation should also cover the soft eviction threshold.
Implicit in this configuration is the understanding that system-reserved should include the amount of memory covered by the eviction threshold.
To reach that capacity, either some pod is using more than its request, or the system is using more than 1Gi.
If a node has 10 Gi of capacity, and you want to reserve 10% of that capacity for the system daemons, do the following:
capacity = 10 Gi
system-reserved = 10 Gi * .1 = 1 Gi
The node allocatable value in this setting becomes:
allocatable = capacity - system-reserved = 9 Gi
This means that, by default, the scheduler will place pods on that node until their memory requests total 9 Gi.
If you want to turn on eviction so that eviction is triggered when the node observes that available memory falls below 10% of capacity for 30 seconds, or immediately when it falls below 5% of capacity, you need the scheduler to see allocatable as 8Gi. Therefore, ensure your system reservation covers the greater of your eviction thresholds.
capacity = 10 Gi
eviction-threshold = 10 Gi * .1 = 1 Gi
system-reserved = (10 Gi * .1) + eviction-threshold = 2 Gi
allocatable = capacity - system-reserved = 8 Gi
You must set system-reserved equal to the amount of resource you want to reserve for system-daemons, plus the amount of resource you want to reserve before triggering evictions.
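The reservation rule can be checked with a short calculation. This is a worked sketch only: the function name plan_reservation is hypothetical, and percentages are passed as integers so the arithmetic stays exact.

```python
from typing import Iterable, Tuple

GI = 2**30


def plan_reservation(
    capacity_bytes: int,
    daemon_percent: int,
    eviction_percents: Iterable[int],
) -> Tuple[int, int]:
    """Return (system_reserved, allocatable) in bytes.

    system-reserved covers the daemon reservation plus the greater of
    the eviction thresholds, each given as a percentage of capacity.
    """
    daemon_reserve = capacity_bytes * daemon_percent // 100
    greatest_threshold = max(
        (capacity_bytes * p // 100 for p in eviction_percents), default=0
    )
    system_reserved = daemon_reserve + greatest_threshold
    return system_reserved, capacity_bytes - system_reserved


# 10 Gi node, 10% for system daemons, thresholds at 10% (soft) and 5% (hard).
reserved, allocatable = plan_reservation(10 * GI, 10, [10, 5])
```

With a 10% daemon reservation and thresholds at 10% and 5%, the 10 Gi node ends up with 2 Gi reserved and 8 Gi allocatable; with no eviction thresholds it reserves only 1 Gi and leaves 9 Gi allocatable.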
This configuration ensures that the scheduler does not place pods on a node that would immediately induce memory pressure and trigger eviction, assuming those pods use less than their configured request.
If the node experiences a system out of memory (OOM) event before it is able to reclaim memory, the node depends on the OOM killer to respond.
The node sets an oom_score_adj value for each container based on the quality of service for the pod.
Quality of Service | oom_score_adj Value |
---|---|
Guaranteed | -998 |
BestEffort | 1000 |
Burstable | min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999) |
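The table can be turned into a small function to see how the Burstable formula behaves at its limits. This sketch assumes the upstream Kubernetes mapping (Guaranteed to -998, BestEffort to 1000, Burstable to the formula) and integer division; the function name oom_score_adj_for is hypothetical.

```python
def oom_score_adj_for(
    qos: str,
    memory_request_bytes: int = 0,
    machine_memory_capacity_bytes: int = 1,
) -> int:
    """Return the oom_score_adj value for a container of the given QoS class."""
    if qos == "Guaranteed":
        return -998
    if qos == "BestEffort":
        return 1000
    if qos == "Burstable":
        # Larger requests yield a lower (safer) adjustment, clamped to [2, 999].
        # Integer division is an assumption made for this sketch.
        raw = 1000 - (1000 * memory_request_bytes) // machine_memory_capacity_bytes
        return min(max(2, raw), 999)
    raise ValueError(f"unknown quality of service: {qos!r}")
```

The clamp keeps every Burstable container strictly between the Guaranteed and BestEffort values, so the quality-of-service ordering is preserved no matter how large or small the request is.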
If the node is unable to reclaim memory prior to experiencing a system OOM event, the oom_killer calculates an oom_score:
% of node memory a container is using + `oom_score_adj` = `oom_score`
The node then kills the container with the highest score.
Containers with the lowest quality of service that are consuming the largest amount of memory relative to the scheduling request are failed first.
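The scoring rule above can be illustrated as follows. This follows the formula as stated on this page rather than the kernel's internal units, and the names oom_score and pick_oom_victim, along with the sample containers, are hypothetical.

```python
from typing import Dict, Tuple


def oom_score(memory_used_bytes: int, node_memory_bytes: int, oom_score_adj: int) -> float:
    """Percentage of node memory in use plus the container's oom_score_adj."""
    percent_used = 100.0 * memory_used_bytes / node_memory_bytes
    return percent_used + oom_score_adj


def pick_oom_victim(
    containers: Dict[str, Tuple[int, int]],
    node_memory_bytes: int,
) -> str:
    """Return the name of the container with the highest oom_score.

    containers maps a name to (memory_used_bytes, oom_score_adj).
    """
    return max(
        containers,
        key=lambda name: oom_score(
            containers[name][0], node_memory_bytes, containers[name][1]
        ),
    )


GI = 2**30
# Illustrative containers on a 10Gi node: (bytes used, oom_score_adj).
sample = {
    "guaranteed": (4 * GI, -998),
    "burstable": (2 * GI, 600),
    "besteffort": (1 * GI, 1000),
}
victim = pick_oom_victim(sample, 10 * GI)
```

Even though the Guaranteed container uses the most memory here, its strongly negative adjustment keeps its score lowest, so the BestEffort container is chosen.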
Unlike pod eviction, if a pod container is OOM failed, it can be restarted by the node based on its RestartPolicy.
If a node evicts a pod that was created by a DaemonSet, the pod will immediately be recreated and rescheduled back to the same node, because the node has no ability to distinguish a pod created from a DaemonSet versus any other object.
In general, DaemonSets should not create BestEffort pods to avoid being identified as a candidate pod for eviction. Instead, DaemonSets should ideally launch Guaranteed pods.