Overview

The node must preserve node stability when available compute resources are low. This is especially important when dealing with incompressible resources such as memory or disk. If either resource is exhausted, the node becomes unstable.

Failure to disable swap memory makes the node not recognize it is under MemoryPressure.

To take advantage of memory based evictions, operators must disable swap.

Eviction Policy

Using eviction policies, a node can proactively monitor for and prevent against total starvation of a compute resource.

In cases where a node is running low on available resources, it can proactively fail one or more pods in order to reclaim the starved resource using an eviction policy. When the node fails a pod, it terminates all containers in the pod, and the PodPhase is transitioned to Failed.

Platform administrators can configure eviction settings within the node-config.yaml file.

Eviction Signals

The node can be configured to trigger eviction decisions on the signals described in the table below. The value of each signal is described in the description column based on the node summary API.

To view the signals:

curl <certificate details> \
  https://<master>/api/v1/nodes/<node>/proxy/stats/summary
Table 1. Supported Eviction Signals
Eviction Signal Description

memory.available

memory.available = node.status.capacity[memory] - node.stats.memory.workingSet

nodefs.available

nodefs.available = node.stats.fs.available

nodefs.inodesFree

nodefs.inodesFree = node.stats.fs.inodesFree

imagefs.available

imagefs.available = node.stats.runtime.imagefs.available

imagefs.inodesFree

imagefs.inodesFree = node.stats.runtime.imagefs.inodesFree

The node supports two file system partitions when detecting disk pressure.

  • The nodefs file system that the node uses for local disk volumes, daemon logs, and so on (for example, the file system that provides /).

  • The imagefs file system that the container runtime uses for storing images and individual container writable layers.

The node auto-discovers these file systems using cAdvisor.

If you store volumes and logs in a dedicated file system, the node will not monitor that file system at this time.

As of OpenShift Origin 3.4, the node supports the ability to trigger eviction decisions based on disk pressure. Operators must opt in to enable disk-based evictions. Prior to evicting pods due to disk pressure, the node will also perform container and image garbage collection. In future releases, garbage collection will be deprecated in favor of a pure disk eviction based configuration.

Eviction Thresholds

You can configure a node to specify eviction thresholds, which trigger the node to reclaim resources.

Eviction thresholds can be soft, for when you allow a grace period before reclaiming resources, and hard, for when the node takes immediate action when a threshold is met.

Thresholds are configured in the following form:

<eviction_signal><operator><quantity>
  • Valid eviction-signal tokens as defined by eviction signals.

  • Valid operator tokens are <.

  • Valid quantity tokens must match the quantity representation used by Kubernetes.

  • an eviction threshold can be expressed as a percentage if it ends with the % token.

For example, if an operator has a node with 10Gi of memory, and that operator wants to induce eviction if available memory falls below 1Gi, an eviction threshold for memory can be specified as either of the following:


memory.available<1Gi memory.available<10% ---

Soft Eviction Thresholds

A soft eviction threshold pairs an eviction threshold with a required administrator-specified grace period. The node does not reclaim resources associated with the eviction signal until that grace period is exceeded. If no grace period is provided, the node errors on startup.

In addition, if a soft eviction threshold is met, an operator can specify a maximum allowed pod termination grace period to use when evicting pods from the node. If specified, the node uses the lesser value among the pod.Spec.TerminationGracePeriodSeconds and the maximum-allowed grace period. If not specified, the node kills pods immediately with no graceful termination.

To configure soft eviction thresholds, the following flags are supported:

  • eviction-soft: a set of eviction thresholds (for example, memory.available<1.5Gi) that, if met over a corresponding grace period, triggers a pod eviction.

  • eviction-soft-grace-period: a set of eviction grace periods (for example, memory.available=1m30s) that correspond to how long a soft eviction threshold must hold before triggering a pod eviction.

  • eviction-max-pod-grace-period: the maximum-allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.

Hard Eviction Thresholds

A hard eviction threshold has no grace period and, if observed, the node takes immediate action to reclaim the associated starved resource. If a hard eviction threshold is met, the node kills the pod immediately with no graceful termination.

To configure hard eviction thresholds, the following flag is supported:

  • eviction-hard: a set of eviction thresholds (for example, memory.available<1Gi) that, if met, triggers a pod eviction.

Oscillation of Node Conditions

If a node is oscillating above and below a soft eviction threshold, but not exceeding its associated grace period, the corresponding node condition oscillates between true and false, which can confuse the scheduler.

To protect this, set the following flag to control how long the node must wait before transitioning out of a pressure condition:

  • eviction-pressure-transition-period: the duration that the node has to wait before transitioning out of an eviction pressure condition.

Before toggling the condition back to false, the node ensures that it has not observed a met eviction threshold for the specified pressure condition for the period specified.

Eviction Monitoring Interval

The node evaluates and monitors eviction thresholds every 10 seconds and the value can not be modified. This is the housekeeping interval.

Mapping Eviction Signals to Node Conditions

The node can map one or more eviction signals to a corresponding node condition.

If an eviction threshold is met, independent of its associated grace period, the node reports a condition indicating that the node is under pressure.

The following node conditions are defined that correspond to the specified eviction signal.

Table 2. Node Conditions Related to Low Resources
Node Condition Eviction Signal Description

MemoryPressure

memory.available

Available memory on the node has satisfied an eviction threshold.

DiskPressure

nodefs.available, nodefs.inodesFree, imagefs.available, or imagefs.inodesFree

Available disk space and inodes on either the node’s root file system or image file system has satisfied an eviction threshold.

When the above is set the node continues to report node status updates at the frequency specified by the node-status-update-frequency argument, which defaults to 10s.

Reclaiming Node-level Resources

If an eviction criteria is satisfied, the node initiates the process of reclaiming the pressured resource until it observes that the signal has gone below its defined threshold. During this time, the node does not support scheduling any new pods.

The node attempts to reclaim node-level resources prior to evicting end-user pods. If disk pressure is observed, the node reclaims node-level resources differently if the machine has a dedicated imagefs configured for the container runtime.

With Imagefs

If the nodefs file system meets eviction thresholds, the node frees up disk space in the following order:

  • Delete dead pods/containers

If the imagefs file system meets eviction thresholds, the node frees up disk space in the following order:

  • Delete all unused images

Without Imagefs

If the nodefs file system meets eviction thresholds, the node frees up disk space in the following order:

  • Delete dead pods/containers

  • Delete all unused images

Eviction of Pods

If an eviction threshold is met and the grace period is passed, the node initiates the process of evicting pods until it observes the signal going below its defined threshold.

The node ranks pods for eviction by their quality of service, and, among those with the same quality of service, by the consumption of the starved compute resource relative to the pod’s scheduling request.

  • BestEffort: pods that consume the most of the starved resource are failed first.

  • Burstable: pods that consume the most of the starved resource relative to their request for that resource are failed first. If no pod has exceeded its request, the strategy targets the largest consumer of the starved resource.

  • Guaranteed: pods that consume the most of the starved resource relative to their request are failed first. If no pod has exceeded its request, the strategy targets the largest consumer of the starved resource.

A Guaranteed pod will never be evicted because of another pod’s resource consumption unless a system daemon (node, docker, journald, etc) is consuming more resources than were reserved via system-reserved, or kube-reserved allocations or if the node has only Guaranteed pods remaining.

If the latter, the node evicts a Guaranteed pod that least impacts node stability and limits the impact of the unexpected consumption to other Guaranteed pods.

Local disk is a BestEffort resource. If necessary, the node will evict pods one at a time to reclaim disk when DiskPressure is encountered. The node ranks pods by quality of service. If the node is responding to inode starvation, it will reclaim inodes by evicting pods with the lowest quality of service first. If the node is responding to lack of available disk, it will rank pods within a quality of service that consumes the largest amount of local disk, and evict those pods first.

At this time, volumes that are backed by local disk are only deleted when a pod is deleted from the API server instead of when the pod is terminated.

As a result, if a pod is evicted as a consequence of consuming too much disk in an EmptyDir volume, the pod will be evicted, but the local volume usage will not be reclaimed by the node. The node will keep evicting pods on the node to prevent total exhaustion of disk. Operators can reclaim the disk by manually deleting the evicted pods from the node once terminated.

This will be remedied in a future release.

Scheduler

The scheduler views node conditions when placing additional pods on the node. For example, if the node has an eviction threshold like the following:

eviction-hard is "memory.available<500Mi"

and available memory falls below 500Mi, the node reports a value in Node.Status.Conditions as MemoryPressure as true.

Table 3. Node Conditions and Scheduler Behavior
Node Condition Scheduler Behavior

MemoryPressure

If a node reports this condition, the scheduler will not place BestEffort pods on that node.

DiskPressure

If a node reports this condition, the scheduler will not place any additional pods on that node.

Example Scenario

Consider the following scenario:

  • Node memory capacity of 10Gi.

  • The operator wants to reserve 10% of memory capacity for system daemons (kernel, node, etc.).

  • The operator wants to evict pods at 95% memory utilization to reduce thrashing and incidence of system OOM.

A node reports two values:

  • Capacity: How much resource is on the machine

  • Allocatable: How much resource is made available for scheduling.

The goal is to allow the scheduler to fully allocate a node and to not have evictions occur.

Evictions should only occur if pods use more than their requested amount of resource.

To facilitate this scenario, the node configuration file (the node-config.yaml file) is modified as follows:

kubeletArguments:
  eviction-hard: (1)
    - "memory.available<500Mi"
  system-reserved:
    - "1.5Gi"
1 This threshold can either be eviction-hard or eviction-soft.

Soft eviction usage is more common when you are targeting a certain level of utilization, but can tolerate temporary spikes. It is recommended that the soft eviction threshold is always less than the hard eviction threshold, but the time period is operator specific. The system reservation should also cover the soft eviction threshold.

Implicit in this configuration is the understanding that system-reserved should include the amount of memory covered by the eviction threshold.

To reach that capacity, either some pod is using more than its request, or the system is using more than 1Gi.

If a node has 10 Gi of capacity, and you want to reserve 10% of that capacity for the system daemons, do the following:

capacity = 10 Gi
system-reserved = 10 Gi * .01 = 1 Gi

The node allocatable value in this setting becomes:

allocatable = capacity - system-reserved = 9 Gi

This means by default, the scheduler will schedule pods that request 9 Gi of memory to that node.

If you want to turn on eviction so that eviction is triggered when the node observes that available memory falls below 10% of capacity for 30 seconds, or immediately when it falls below 5% of capacity, you need the scheduler to see allocatable as 8Gi. Therefore, ensure your system reservation covers the greater of your eviction thresholds.

capacity = 10 Gi
eviction-threshold = 10 Gi * .05 = .5 Gi
system-reserved = (10Gi * .01) + eviction-threshold = 1.5 Gi
allocatable = capacity - system-reserved = 8.5 Gi

You must set system-reserved equal to the amount of resource you want to reserve for system-daemons, plus the amount of resource you want to reserve before triggering evictions.

This configuration ensures that the scheduler does not place pods on a node that immediately induce memory pressure and trigger eviction assuming those pods use less than their configured request.

Out of Resource and Out of Memory

If the node experiences a system out of memory (OOM) event before it is able to reclaim memory, the node depends on the OOM killer to respond.

The node sets a oom_score_adj value for each container based on the quality of service for the pod.

Table 4. Quality of Service OOM Scores
Quality of Service oom_score_adj Value

Guaranteed

-998

BestEffort

1000

Burstable

min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)

If the node is unable to reclaim memory prior to experiencing a system OOM event, the oom_killer calculates an oom_score:

% of node memory a container is using + `oom_score_adj` = `oom_score`

The node then kills the container with the highest score.

Containers with the lowest quality of service that are consuming the largest amount of memory relative to the scheduling request are failed first.

Unlike pod eviction, if a pod container is OOM failed, it can be restarted by the node based on its RestartPolicy.

DaemonSets and Out of Resource Handling

If a node evicts a pod that was created by a DaemonSet, the pod will immediately be recreated and rescheduled back to the same node, because the node has no ability to distinguish a pod created from a DaemonSet versus any other object.

In general, DaemonSets should not create BestEffort pods to avoid being identified as a candidate pod for eviction. Instead DaemonSets should ideally launch Guaranteed pods.