2. Cluster-Wide Configuration
2.1. Configuration Layout
The cluster is defined by the Cluster Information Base (CIB), which uses XML notation. The simplest CIB, an empty one, looks like this:
An empty configuration
<cib crm_feature_set="3.6.0" validate-with="pacemaker-3.5" epoch="1" num_updates="0" admin_epoch="0">
  <configuration>
    <crm_config/>
    <nodes/>
    <resources/>
    <constraints/>
  </configuration>
  <status/>
</cib>
The empty configuration above contains the major sections that make up a CIB:
cib: The entire CIB is enclosed with a cib element. Certain fundamental settings are defined as attributes of this element.

configuration: This section – the primary focus of this document – contains traditional configuration information such as what resources the cluster serves and the relationships among them.

  crm_config: cluster-wide configuration options

  nodes: the machines that host the cluster

  resources: the services run by the cluster

  constraints: indications of how resources should be placed

status: This section contains the history of each resource on each node. Based on this data, the cluster can construct the complete current state of the cluster. The authoritative source for this section is the local executor (pacemaker-execd process) on each cluster node, and the cluster will occasionally repopulate the entire section. For this reason, it is never written to disk, and administrators are advised against modifying it in any way.
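Each of these sections can be inspected individually with cibadmin, one of the command-line tools shipped with Pacemaker. A brief sketch (the exact output depends on your cluster):

# Dump the entire CIB, including the status section
cibadmin --query

# Dump only the cluster-wide option section
cibadmin --query --scope crm_config

# Dump only the resource definitions
cibadmin --query --scope resources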
In this document, configuration settings will be described as properties or options based on how they are defined in the CIB:
- Properties are XML attributes of an XML element.
- Options are name-value pairs expressed as nvpair child elements of an XML element.
Normally, you will use command-line tools that abstract the XML, so the distinction will be unimportant; both properties and options are cluster settings you can tweak.
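To make the distinction concrete, here is the empty configuration from above with one option added (a sketch; the id values are purely illustrative, and epoch is bumped to reflect the update). Here, validate-with is a property, an attribute of the cib element, while no-quorum-policy is an option, an nvpair child element within crm_config:

<cib crm_feature_set="3.6.0" validate-with="pacemaker-3.5" epoch="2" num_updates="0" admin_epoch="0">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="stop"/>
      </cluster_property_set>
    </crm_config>
    <nodes/>
    <resources/>
    <constraints/>
  </configuration>
  <status/>
</cib>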
2.2. CIB Properties
Certain settings are defined by CIB properties (that is, attributes of the cib tag) rather than with the rest of the cluster configuration in the configuration section.
The reason is simply a matter of parsing. These options are used by the configuration database which is, by design, mostly ignorant of the content it holds. So the decision was made to place them in an easy-to-find location.
Attribute | Description |
---|---|
admin_epoch | When a node joins the cluster, the cluster performs a check to see which node has the best configuration. It asks the node with the highest (admin_epoch, epoch, num_updates) tuple to replace the configuration on all the nodes, which makes setting them, and setting them correctly, very important. Warning: Never set this value to zero. In such cases, the cluster cannot tell the difference between your configuration and the “empty” one used when nothing is found on disk. |
epoch | The cluster increments this every time the configuration is updated (usually by the administrator). |
num_updates | The cluster increments this every time the configuration or status is updated (usually by the cluster) and resets it to 0 when epoch changes. |
validate-with | Determines the type of XML validation that will be done on the configuration. If set to none, the cluster will not require that updates conform to the schema. |
cib-last-written | Indicates when the configuration was last written to disk. Maintained by the cluster; for informational purposes only. |
have-quorum | Indicates if the cluster has quorum. If false, this may mean that the cluster cannot start resources or fence other nodes (see no-quorum-policy below). Maintained by the cluster. |
dc-uuid | Indicates which cluster node is the current leader. Used by the cluster when placing resources and determining the order of some events. Maintained by the cluster. |
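These properties are rarely edited by hand. As a sketch of the usual workflow (the admin_epoch example assumes you really do want to make the copies of the configuration on inactive nodes obsolete):

# Show the cib element and its current properties (the first line of output)
cibadmin --query | head -n 1

# Update validate-with by upgrading the configuration to the latest supported schema
cibadmin --upgrade

# Bump admin_epoch explicitly, e.g. to make the configuration on inactive nodes obsolete
cibadmin --modify --xml-text '<cib admin_epoch="42"/>'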
2.3. Cluster Options
Cluster options, as you might expect, control how the cluster behaves when confronted with various situations.
They are grouped into sets within the crm_config section. In advanced configurations, there may be more than one set. (This will be described later in the chapter on Rules, where we will show how to have the cluster use different sets of options during working hours than during weekends.) For now, we will describe the simple case where each option is present at most once.

You can obtain an up-to-date list of cluster options, including their default values, by running the man pacemaker-schedulerd and man pacemaker-controld commands.
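Although options are stored in the CIB as nvpair elements, they are normally managed with crm_attribute, whose default type is the crm_config section. A sketch using one of the options from the table below:

# Set (or update) a cluster option
crm_attribute --name no-quorum-policy --update stop

# Read the current value back
crm_attribute --name no-quorum-policy --query

# Delete it again, reverting to the default
crm_attribute --name no-quorum-policy --delete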
Option | Default | Description |
---|---|---|
cluster-name | | An (optional) name for the cluster as a whole. This is mostly for users’ convenience for use as desired in administration, but this can be used in the Pacemaker configuration in Rules (as the #cluster-name node attribute). |
dc-version | | Version of Pacemaker on the cluster’s DC. Determined automatically by the cluster. Often includes the hash which identifies the exact Git changeset it was built from. Used for diagnostic purposes. |
cluster-infrastructure | | The messaging stack on which Pacemaker is currently running. Determined automatically by the cluster. Used for informational and diagnostic purposes. |
no-quorum-policy | stop | What to do when the cluster does not have quorum. Allowed values: ignore (continue all resource management), freeze (continue resource management, but don’t recover resources from nodes not in the affected partition), stop (stop all resources in the affected cluster partition), demote (demote promotable resources and stop all other resources in the affected cluster partition), and suicide (fence all nodes in the affected cluster partition). |
batch-limit | 0 | The maximum number of actions that the cluster may execute in parallel across all nodes. The “correct” value will depend on the speed and load of your network and cluster nodes. If zero, the cluster will impose a dynamically calculated limit only when any node has high load. If -1, the cluster will not impose any limit. |
migration-limit | -1 | The number of live migration actions that the cluster is allowed to execute in parallel on a node. A value of -1 means unlimited. |
symmetric-cluster | true | Whether resources can run on any node by default (if false, a resource is allowed to run on a node only if a location constraint enables it) |
stop-all-resources | false | Whether all resources should be disallowed from running (can be useful during maintenance) |
stop-orphan-resources | true | Whether resources that have been deleted from the configuration should be stopped. This value takes precedence over is-managed (that is, even unmanaged resources will be stopped). |
stop-orphan-actions | true | Whether recurring operations that have been deleted from the configuration should be cancelled |
start-failure-is-fatal | true | Whether a failure to start a resource on a particular node prevents further start attempts on that node. If false, the cluster will decide whether the same node is still eligible based on the resource’s current failure count and migration-threshold. |
enable-startup-probes | true | Whether the cluster should check the pre-existing state of resources when the cluster starts |
maintenance-mode | false | Whether the cluster should refrain from monitoring, starting and stopping resources |
stonith-enabled | true | Whether the cluster is allowed to fence nodes (for example, failed nodes and nodes with resources that can’t be stopped). If true, at least one fence device must be configured before resources are allowed to run. If false, unresponsive nodes are immediately assumed to be running no resources, and resource recovery on online nodes starts without any further protection (which can mean data loss if the unresponsive node still accesses shared storage, for example). See also the requires resource meta-attribute. |
stonith-action | reboot | Action the cluster should send to the fence agent when a node must be fenced. Allowed values are reboot and off. |
stonith-timeout | 60s | How long to wait for on, off, and reboot fence actions to complete by default. |
stonith-max-attempts | 10 | How many times fencing can fail for a target before the cluster will no longer immediately re-attempt it. |
stonith-watchdog-timeout | 0 | If nonzero, and the cluster detects have-watchdog as true, watchdog-based self-fencing will be performed via SBD when fencing is required, without requiring a fencing resource explicitly configured. If this is set to a positive value, unseen nodes are assumed to self-fence within this much time. Warning: It must be ensured that this value is larger than the SBD_WATCHDOG_TIMEOUT environment variable on all nodes. If this is set to a negative value, and SBD_WATCHDOG_TIMEOUT is set, twice that value will be used. Warning: In this case, it is essential (and currently not verified by pacemaker) that SBD_WATCHDOG_TIMEOUT is set to the same value on all nodes. |
concurrent-fencing | false | Whether the cluster is allowed to initiate multiple fence actions concurrently. Fence actions initiated externally, such as via the stonith_admin tool, an application such as DLM, or the fencer itself, may be initiated in parallel regardless of this setting. |
fence-reaction | stop | How should a cluster node react if notified of its own fencing? A cluster node may receive notification of its own fencing if fencing is misconfigured, or if fabric fencing is in use that doesn’t cut cluster communication. Allowed values are stop to attempt to immediately stop Pacemaker and stay stopped, or panic to attempt to immediately reboot the local node, falling back to stop on failure. |
priority-fencing-delay | 0 | Apply this delay to any fencing targeting the lost nodes with the highest total resource priority in case we don’t have the majority of the nodes in our cluster partition, so that the more significant nodes potentially win any fencing match (especially meaningful in a split-brain of a 2-node cluster). A promoted resource instance takes the resource’s priority plus 1 if the resource’s priority is not 0. Any static or random delays introduced by pcmk_delay_base and pcmk_delay_max configured for the corresponding fencing resources will be added to this delay. |
cluster-delay | 60s | Estimated maximum round-trip delay over the network (excluding action execution). If the DC requires an action to be executed on another node, it will consider the action failed if it does not get a response from the other node in this time (after considering the action’s own timeout). The “correct” value will depend on the speed and load of your network and cluster nodes. |
dc-deadtime | 20s | How long to wait for a response from other nodes during startup. The “correct” value will depend on the speed/load of your network and the type of switches used. |
cluster-ipc-limit | 500 | The maximum IPC message backlog before one cluster daemon will disconnect another. This is of use in large clusters, for which a good value is the number of resources in the cluster multiplied by the number of nodes. The default of 500 is also the minimum. Raise this if you see “Evicting client” messages for cluster daemon PIDs in the logs. |
pe-error-series-max | -1 | The number of scheduler inputs resulting in errors to save. Used when reporting problems. A value of -1 means unlimited (report all), and 0 means none. |
pe-warn-series-max | 5000 | The number of scheduler inputs resulting in warnings to save. Used when reporting problems. A value of -1 means unlimited (report all), and 0 means none. |
pe-input-series-max | 4000 | The number of “normal” scheduler inputs to save. Used when reporting problems. A value of -1 means unlimited (report all), and 0 means none. |
enable-acl | false | Whether Access Control Lists (ACLs) should be used to authorize modifications to the CIB |
placement-strategy | default | How the cluster should allocate resources to nodes (see Utilization and Placement Strategy). Allowed values are default, utilization, balanced, and minimal. |
node-health-strategy | none | How the cluster should react to node health attributes (see Tracking Node Health). Allowed values are none, migrate-on-red, only-green, progressive, and custom. |
node-health-base | 0 | The base health score assigned to a node. Only used when node-health-strategy is progressive. |
node-health-green | 0 | The score to use for a node health attribute whose value is green. Only used when node-health-strategy is progressive or custom. |
node-health-yellow | 0 | The score to use for a node health attribute whose value is yellow. Only used when node-health-strategy is progressive or custom. |
node-health-red | 0 | The score to use for a node health attribute whose value is red. Only used when node-health-strategy is progressive or custom. |
cluster-recheck-interval | 15min | Pacemaker is primarily event-driven, and looks ahead to know when to recheck the cluster for failure timeouts and most time-based rules (since 2.0.3). However, it will also recheck the cluster after this amount of inactivity. This has two goals: rules with date_spec are only guaranteed to be checked this often, and it also serves as a fail-safe for some kinds of scheduler bugs. A value of 0 disables this polling; positive values are a time interval. |
shutdown-lock | false | The default of false allows active resources to be recovered elsewhere when their node is cleanly shut down, which is what the vast majority of users will want. However, some users prefer to make resources highly available only for failures, with no recovery for clean shutdowns. If this option is true, resources active on a node when it is cleanly shut down are kept “locked” to that node (not allowed to run elsewhere) until they start again on that node after it rejoins (or for at most shutdown-lock-limit, if set). |
shutdown-lock-limit | 0 | If shutdown-lock is true and this is set to a nonzero time duration, locked resources will be allowed to run elsewhere once this much time has passed since the node shutdown was initiated, even if the node has not rejoined. |
remove-after-stop | false | Deprecated Should the cluster remove resources from Pacemaker’s executor after they are stopped? Values other than the default are, at best, poorly tested and potentially dangerous. This option is deprecated and will be removed in a future release. |
startup-fencing | true | Advanced Use Only: Should the cluster fence unseen nodes at start-up? Setting this to false is unsafe, because the unseen nodes could be active and running resources but unreachable. |
election-timeout | 2min | Advanced Use Only: If you need to adjust this value, it probably indicates the presence of a bug. |
shutdown-escalation | 20min | Advanced Use Only: If you need to adjust this value, it probably indicates the presence of a bug. |
join-integration-timeout | 3min | Advanced Use Only: If you need to adjust this value, it probably indicates the presence of a bug. |
join-finalization-timeout | 30min | Advanced Use Only: If you need to adjust this value, it probably indicates the presence of a bug. |
transition-delay | 0s | Advanced Use Only: Delay cluster recovery for the configured interval to allow for additional or related events to occur. This can be useful if your configuration is sensitive to the order in which ping updates arrive. Enabling this option will slow down cluster recovery under all conditions. |
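Putting several of the options above together, a populated crm_config section might look like the following (a sketch; the ids and values are purely illustrative):

<crm_config>
  <cluster_property_set id="cib-bootstrap-options">
    <nvpair id="opt-cluster-name" name="cluster-name" value="example-cluster"/>
    <nvpair id="opt-no-quorum-policy" name="no-quorum-policy" value="stop"/>
    <nvpair id="opt-stonith-enabled" name="stonith-enabled" value="true"/>
    <nvpair id="opt-shutdown-lock" name="shutdown-lock" value="true"/>
    <nvpair id="opt-shutdown-lock-limit" name="shutdown-lock-limit" value="10min"/>
  </cluster_property_set>
</crm_config>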