cluster2 architecture
-------------------------------------------------------------------------------
version 0.1 - March 12, 2008
David Teigland

This document describes the architecture of cluster2, the second generation clustering code from http://sources.redhat.com/cluster.  It's used in RHEL5. cluster1, the first generation code, used in RHEL4, is roughly described in "Symmetric Cluster Architecture" (http://people.redhat.com/teigland/sca.pdf).

The big changes from the first to second generation are:

- moving the clustering code from the kernel to userspace
- using openais as the core membership/messaging system
- largely rewriting the dlm to work much better

One major motivation now driving the evolution of the clustering code is to remove code developed, used and supported only by RH.  It is being replaced with code developed, used and supported by a broad community.  Collaboration on the same code with other developers and companies is crucial for the success of this clustering work.  This means rethinking and reengineering many parts of the cluster architecture to reach the point where community code/infrastructure can replace code specific to our project.  Or, if code can't be immediately replaced, the reengineering should take steps toward removing project-specific code in the next version.

The sections roughly follow the steps to set up and run a cluster, and ultimately gfs.  As an architectural document, the emphasis is on defining the different parts, and describing how the parts interact or relate to each other.

cluster.conf
-------------------------------------------------------------------------------

To get started setting up a cluster, create a cluster.conf file, an xml configuration file defining the basic properties of the desired cluster.

Choose a name for the cluster; it can be up to 16 characters long, and should be unique among the clusters running on the network.

Each node that will be a part of the cluster should have a clusternode entry. The name should resolve to the IP address on the network interface to be used for cluster communication.  Each nodeid must be a positive integer, unique within this cluster.

Copy this file to /etc/cluster/cluster.conf on all the nodes listed.
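For reference, a minimal cluster.conf along the lines described above might look roughly like the following sketch (the cluster name, node names and nodeids are only examples; a real configuration would normally also contain the fencing sections described later in this document):

  <?xml version="1.0"?>
  <cluster name="alpha" config_version="1">
    <clusternodes>
      <clusternode name="node-01" nodeid="1"/>
      <clusternode name="node-02" nodeid="2"/>
      <clusternode name="node-03" nodeid="3"/>
    </clusternodes>
  </cluster>

Each clusternode name should resolve to the address used for cluster communication, and config_version is the counter that ccsd and cman compare when the file is updated.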
Related man pages: cluster.conf(5)

ccs
-------------------------------------------------------------------------------

CCS is the "cluster configuration system".  The ccs daemon, ccsd, is usually started with no command line options.  It has two main jobs.  First, it provides cluster.conf information to local programs/daemons through its own API, libccs.  This is simply an indirect way for programs to read values from cluster.conf.  Second, it can propagate and replace cluster.conf on cluster nodes as it's updated.

libccs

The main job of ccsd is to provide cluster.conf information to other local programs through libccs.  A program needs to connect to ccsd, read config values by providing xpath strings (e.g. "/cluster/@name"), then disconnect. Normal connections are denied if the cluster (cman) does not have quorum. Forced connections should always work.  The ccs_test command can be used to test these capabilities.

updates

The second job of ccsd is to update cluster.conf on members of a running cluster.  When 'ccs_tool update' is run, the local ccsd broadcasts a new cluster.conf file, and the ccsd on each other cluster node replaces its copy with the new one.  Also, when ccsd starts up, it broadcasts its cluster.conf to see if it needs to be replaced by a newer version in use on other nodes.

Online updates to cluster.conf should be limited to adding new nodes to the cluster, which should be infrequent.  When cluster.conf is updated, the config_version should be incremented; ccsd uses this to compare cluster.conf files on different nodes.

Many people wish to manage cluster.conf themselves, and copy an updated cluster.conf to other nodes manually.  To do this, the "update" ccsd feature can be disabled with the -X command line option.  In this case, there's no need for ccsd to do any network communication, or interact with cman at all. (Although I think it still might in some cases, unfortunately.  This should be fixed.)

[Problems can occur in the complicated handling of cluster.conf syncing, even in the absence of any updates.  This will prevent the main libccs functions from working, which prevents the cluster from starting.  So, a problem in a frequently unnecessary feature (the update mechanism) can jeopardize the critically important libccs function, without which a cluster cannot be started. This makes the -X option an important feature to know about.]

cman

- ccsd does its own network communication in the cluster; it cannot use cman for cluster communication because cman depends on reading config values through ccs to start.

- cman also keeps track of the config_version and requires all cluster members to be using the same version.

update procedure

There are three approaches to updating cluster.conf.  When the cluster is online, updates should be limited to adding new nodes and fencing devices for them; changes should not be made to existing nodes.

cluster offline
- edit new cluster.conf
- copy manually to all nodes

cluster online option 1
- edit new cluster.conf, increment config_version
- copy manually to all nodes
- run "cman_tool version -r <new version>" to tell cman the new config_version

cluster online option 2
- edit new cluster.conf, increment config_version
- ccs_tool update
- check that the old cluster.conf has been replaced by the new one on all cluster members; the cluster needs to have quorum (ccs_tool update also updates cman)

In the future, ccs/ccsd will no longer exist; their function will be incorporated into the cluster manager.

Related man pages: ccs(7), ccsd(8), ccs_tool(8), ccs_test(8)

openais
-------------------------------------------------------------------------------

Openais (openais.org) is the central cluster membership and messaging system. Cman is the part of openais we use to configure and interact with aisexec, the openais executive.  The "ais" in the name refers to the set of libraries on top of the central membership system that implement the AIS programming API's.

aisexec

Aisexec is the name of the openais binary that runs on each system.  It loads many of its parts from the plugins in /usr/libexec/lcrso/.  We do not allow aisexec to be executed directly from the command line or from the openais init script.  If it is started directly, it will load its configuration from /etc/openais.conf; we need to configure it differently.

totem

Totem is the core membership/messaging code, based on the published totem algorithm for reliable group communication.  A token is circulated among all members; if one member fails, a token timeout will be detected and the remaining nodes will reconfigure the cluster membership.  The token is unicast; other openais messages and application messages are multicast.

Totem implements "virtual synchrony" (VS), which is an established method for replicating state across a cluster of nodes.
VS (totem) guarantees the ordering of cluster messages and configuration changes, such that all nodes see all messages and changes in the same order. Higher level distributed programs can be built depending on these messaging guarantees, and behave as if all instances were operating synchronously.

http://en.wikipedia.org/wiki/Virtual_synchrony

cman

Cman is a collection of things that make openais more usable.  To a user, cman makes openais look and behave much like the first generation (RHEL4) cman. Cman also adds the quorum feature on top of openais, which is computed from the openais membership.  It uses the traditional quorum model from VMS.

- service_cman.lcrso is an openais plugin; it configures openais according to cman defaults and cluster.conf, it keeps track of quorum, and it supplies information to libcman queries

- libcman is a library with API's mainly for querying cluster state; it communicates with service_cman through a unix socket

- cman_tool is a command line program for starting/stopping aisexec and also provides a CLI for most of the libcman queries

cpg

Processes on cluster nodes can form a "process group" with a given name and send messages to the group.  A "closed process group" (cpg), in particular, has a membership, defined as those processes that have explicitly joined the cpg by name.  The cpg members are identified by a nodeid/pid pair, and are notified of changes to the cpg membership.  The properties of virtual synchrony (see totem above) apply to the cpg messages and configuration changes.  A process can join any number of cpg's.

A cpg is typically used to replicate state among cooperating processes on different cluster nodes.  Named instances of cluster resources can be associated with a unique cpg by using the resource name as the cpg name. This approach fits nicely with the symmetric, server-less architecture our cluster components have traditionally followed.  Managing the state of fencing, dlm and cluster file systems is quite naturally done through the use of cpg's.

[Like most other openais services, the cpg service does not provide a way of querying or inspecting the cpg state.  This is a significant limitation that should be addressed in the future.]

AIS

The SA Forum (saforum.org) develops the "Application Interface Specification" (AIS), which is a software API for interacting with HA/cluster frameworks like ours.  The idea is to provide standardized API's for developing HA/cluster programs.  Openais has a set of libraries that unofficially implement the AIS API's.  The current openais development release contains the following AIS API's in varying stages of completion:

CKPT (version B.01.01) - checkpoint
CLM  (version B.01.01) - membership
EVT  (version B.01.01) - events
LCK  (version B.01.01) - locking
MSG  (version B.01.01) - messages
AMF  (version B.01.02) - managing HA applications

These are all implemented as openais plugins, using the core openais subsystems such as totem.  CKPT is the checkpoint service, and is the only one of the AIS services that we currently use in other components of our architecture.  Other AIS services would probably be appropriate to use in the future.

More information: http://openais.org http://saforum.org

Related man pages: openais_overview(8), cpg_overview(8), cman(5)

cman
-------------------------------------------------------------------------------

As summarized in the previous section, cman is a collection of things on the periphery of openais that allow us to control and interact with it.
As such, they are more visible to the user and require more detailed explanations.

joining the cluster

The command "cman_tool join" starts aisexec, which is how a node joins the cluster.  It sets an environment variable telling aisexec to be configured via cman (OPENAIS_DEFAULT_CONFIG_IFACE=cmanconfig), and then forks/execs aisexec.  (If cman_tool command line options are used to override cluster.conf values, they are also passed to aisexec/service_cman via environment variables.)

Once running, aisexec calls into service_cman for configuration.  Service_cman reads cluster.conf values through libccs.  The settings it looks for are:

. cluster name
. cluster id
. multicast address
. udp port

These things are all related, and involve distinguishing different clusters. Each cluster must be given a unique name in cluster.conf.  The cluster name may be used by higher level systems for ownership or access permission.  The cluster name is not used directly within openais.

The cluster name is hashed to create the 16 bit cluster id.  The cluster id can also be assigned explicitly.  The cluster id is used to isolate clusters that see the same network traffic.

The two bytes of cluster id are used to generate the multicast address to be used by openais: 239.192.x.y where x is the upper 8 bits and y is the lower 8 bits (a cluster id of 0x1234, for example, gives 239.192.18.52).  The multicast address can also be assigned explicitly.

The default UDP ports used by openais are 5405/5404.  These can be changed if the defaults are not available, or to isolate different clusters that need to use the same multicast address.  Setting port="6809" in the cman section of cluster.conf, for example, will cause openais to use ports 6809 and 6808 for internode communication.

. local node name
. local node id

service_cman must first identify the cluster.conf clusternode entry for the local node.  It tries a number of things, like uname and hostnames on available network interfaces, to match one entry.  If it cannot find a match it will not start the cluster.  The nodename must resolve to the IP address on a network interface; this is the address that will be used for cluster communication.  A unique nodeid must be assigned for each node; these id's are how aisexec identifies nodes.

. expected votes
. local votes

The quorum system is based on votes assigned to each cluster member, and it needs to know the vote settings for all possible members when starting up. The default vote settings are typically used, in which case no voting related information needs to be included in cluster.conf.  Each node defaults to one vote, and expected votes defaults to the sum of votes of all nodes in cluster.conf.  See the quorum section for more information.

. encryption key

All openais network traffic is encrypted.  By default cman uses the cluster name as the security key.  If more security is needed, a file can be specified with a key that will be used to encrypt the communication.  Of course, the content of the key file must be the same on all nodes.  A key file must be manually copied to all nodes.

. openais parameters

When openais is started by cman, the openais.conf file is not used.  Many of the configuration parameters listed in openais.conf can be set in cluster.conf instead.  Cman will read openais parameters from the corresponding openais sections in cluster.conf (the totem section, for example) and load them into openais.  See the openais.conf man page for the specific parameters that can be set in these sections.  Note that cman-derived settings will override any comparable settings in the openais sections (in particular, bindnetaddr, mcastaddr, mcastport and nodeid will always be replaced by the values cman derives from cluster.conf).
. token timeout

The token timeout is the number of milliseconds a node has to send the token before the node is considered failed.  Cman sets this to a default of 10000 (10 seconds).  It can be changed through the openais token parameter in cluster.conf's totem section; token="5000", for example, changes it to 5000 (5 seconds).

. logging

Extra debug logging for cman and cpg can be enabled through the logging settings in cluster.conf.

leaving the cluster

The "cman_tool leave" command terminates aisexec and causes the node to leave the cluster.  The leave command will fail and return a "busy" error if other systems running on the node are still actively using the cluster.  "cman_tool leave force" will force the node to leave the cluster even if other systems are using it (it may not be possible to clean up the resulting state without a node reboot.)

Other system daemons that depend on the cluster register with cman when they start, so they'll receive a callback from cman when a leave has been requested.  The daemons report back to cman whether or not they are still using the cluster or can cleanly shut down.  If openais decides to shut down, these other daemons will notice and also exit.

When enough nodes leave the cluster, quorum will be lost, which may block other nodes from shutting down if they still need to stop cluster programs that are blocked by the loss of quorum.  The "cman_tool leave remove" command avoids this problem by reducing the expected votes on the remaining nodes by the votes of the node that left.

analyzing the cluster

The libcman library provides an API for querying information about the cluster, cluster members, and quorum.  It also provides a way for apps to receive callbacks about certain cluster events.  libcman requests are passed to service_cman (and results sent back) through a unix socket.

Most of the information available from libcman can be accessed from the command line by using cman_tool:

cman_tool status - show properties of the cluster
cman_tool nodes  - show properties of cluster nodes

Cman debugging information can be dynamically enabled using cman_tool -d. See the cman_tool man page for more information.

quorum

Quorum is a majority voting scheme where cluster nodes are assigned votes, and they contribute their votes to the cluster while they are members.  If the cluster has a majority of all possible votes, it has quorum (also called quorate), otherwise it does not.  Quorum is a boolean property of the cluster.

. node votes: the number of votes assigned to individual nodes (nodes can be assigned different numbers of votes, each node typically has 1)
. cluster votes: the number of votes the cluster has (the vote sum of all current cluster members)
. expected votes: the total possible votes the cluster could have (the vote sum of all nodes listed in cluster.conf)
. quorum votes: the number of votes the cluster needs to have quorum (expected_votes / 2 + 1, with the division rounded down)

The cluster has quorum if cluster votes >= quorum votes.  For example, five nodes with one vote each give expected votes 5 and quorum votes 3; the cluster is quorate as long as at least three nodes are members.

Cman calculates quorum.  It determines node votes and expected votes by reading cluster.conf, it uses the current membership from aisexec to determine cluster votes, and it simply calculates quorum votes mathematically.  Cman reports the quorum state through libcman to any interested applications.

The quorum state does not affect the general behavior of openais/cman.  Quorum factors into decision making in various programs in different ways (or not at all.)  It can be useful for programs that want to deal with a possible "split brain" condition of the cluster.
Split brain is when two or more instances of the same cluster exist in parallel and do not know about each other (also called a cluster partition). In this situation, the quorum algorithm is guaranteed to report that the cluster has quorum in only one of the partitions.  Using this, a program can make a decision based on the cluster possessing quorum and be guaranteed that no other instances of the program make the same decision.

If no votes value is explicitly assigned to a node in cluster.conf, it will get 1 vote by default.  This can be changed with the "votes" setting on the node's clusternode entry.

The default number of expected votes is the sum of votes from all cluster nodes listed in cluster.conf.  This can be changed with the cman "expected_votes" setting, although it should be done with caution.

In the special case of a two node cluster, quorum can be disabled by setting two_node="1" and expected_votes="1" in the cman section, allowing the cluster to have quorum with just one member.  (This may not be a legitimate exception for some uses of cman.  In the case of clusters used for fencing/dlm/gfs, described below, it can be safe given an appropriate fencing configuration that reboots nodes.)

Expected votes is largely a static value, but it can be changed in a running cluster in some special cases.  Expected votes will grow automatically if a node is added to the cluster.  While running, cman will adopt the largest value of expected votes it sees.  When a new node joins the cluster for the first time (after being added to cluster.conf), it propagates a new, increased, expected votes value that includes its own vote.

Expected votes can also be reduced, but there is some danger in doing so because it can lead to multiple quorate partitions in a partitioned cluster. Because of this, expected votes should not be reduced automatically, but only by manual administration.  With "cman_tool expected -e <votes>", an administrator can lower expected votes to give the cluster quorum when it is known that there is no cluster partition, and reducing quorum won't disturb a partitioned majority cluster.  Similarly, when manually joining nodes to the cluster, "cman_tool join -e <votes>" can be used to give the partially started cluster quorum when the state of the other cluster nodes is known.

When shutting down a node, it is convenient to use "cman_tool leave remove" which will reduce expected votes by the votes of the node being shut down. This should be used when shutting down the entire cluster, or when shutting down a node that will be out of service for some time.  Using "leave remove" more widely may also be acceptable, but would make multiple quorate partitions more likely given a cluster partition.
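As a small illustration of the libcman queries described under "analyzing the cluster" above, a program can check membership and quorum roughly as follows (a minimal sketch; cman_init, cman_get_node_count, cman_is_quorate and cman_finish are assumed here from libcman.h, and error handling is trimmed):

  #include <stdio.h>
  #include <libcman.h>

  int main(void)
  {
          cman_handle_t ch;

          ch = cman_init(NULL);                /* connect to service_cman */
          if (!ch) {
                  perror("cman_init");
                  return 1;
          }

          /* quorum is a boolean property of the cluster */
          printf("nodes: %d, quorate: %s\n",
                 cman_get_node_count(ch),
                 cman_is_quorate(ch) ? "yes" : "no");

          cman_finish(ch);
          return 0;
  }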
More information: http://people.redhat.com/pcaulfie/docs/aiscman.pdf

Related man pages: cman(5), cman_tool(8), cluster.conf(5), openais.conf(5)

cpg
-------------------------------------------------------------------------------

when to use a cpg

In the context of this document, a cluster application is one where separate instances of the app are running in parallel on multiple nodes in the cluster. Simple apps can be run in parallel no differently than if one instance were running.  For more complex apps, however, the instances need to be aware of each other, communicate, and operate synchronously with replicated state. Depending on the app, the instances may operate symmetrically as equal peers, or asymmetrically, with instances doing different jobs (the more symmetrical the design, the better it seems to fit with cpg.)

A cpg, closed process group, is a group of cooperating processes that are aware of each other and can communicate.  The processes can be running on any node in the cluster, and are identified to each other with a nodeid:pid pair.

A cpg is designed to be used for cluster applications as described above, where each instance of the app joins the cpg when it starts.  A cpg may be paired with the app directly, or the app may use multiple cpg's, pairing a separate cpg with each instance of a resource it is managing, for example.  An app can join any number of cpg's.

how to use a cpg

A process joins a cpg through libcpg and provides the name of the cpg it wants to join.  The processes that have joined the cpg are identified in the cpg's explicit list of members.  A process is removed from the member list when it leaves, or if it exits/fails before leaving.  Members are notified through a "configuration change" (confchg) callback when the member list changes.

Each cpg is identified by a unique name which is used to join it.  An app should use its own name as the name of the cpg to avoid collisions with other apps.  Or, if the app uses multiple cpg's, it should combine the app name with the unique name of the resource it's associating with the cpg.

When a process joins a cpg, the first callback it receives is the confchg showing it being added to the member list.  The process is not a member of the cpg until this confchg arrives.  When a process leaves, the final callback it receives is the confchg showing it being removed from the member list.  The process remains a member of the cpg until this confchg arrives.

After joining a cpg, the process can multicast messages to other cpg members. Actions that need to be synchronized (e.g. changes to state that needs to be in sync) are typically done by first sending a message describing the action to the group, and then acting only when the message is received.  Since all members receive all messages in the same order, all instances of the app will take the actions in the same sequence.

why to use a cpg

The biggest benefit of a cpg is the guarantees about the delivery order of messages and confchgs.  The notion of Virtual Synchrony applies to cpg's as explained in the earlier section.  So, if a number of messages and configuration changes all happen at about the same time, all members are guaranteed to see the messages and confchgs occur in the same order.  A cluster app should be designed to exploit this fact.

For example, say that P1 and P2 are members of a cpg, and the following three things happen at the same time: P1 sends message X, P2 sends message Y, P3 joins the cpg.  These three things could be seen in any of the following orders, but if P1 sees (a), then P2 and P3 are also guaranteed to see (a).

a) recv X, recv Y, add P3
b) recv X, add P3, recv Y
c) recv Y, recv X, add P3
d) recv Y, add P3, recv X
e) add P3, recv X, recv Y
f) add P3, recv Y, recv X

(Note that P3 will only recv X or Y if they follow P3 being added.)
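The join/confchg/multicast cycle described above looks roughly like this with libcpg (a minimal sketch; the function and callback names come from cpg.h, but the exact argument types have varied between openais releases, so treat the signatures as approximate):

  #include <stdio.h>
  #include <string.h>
  #include <stdint.h>
  #include <sys/uio.h>
  #include <openais/cpg.h>

  static void confchg_cb(cpg_handle_t h, struct cpg_name *group,
                         struct cpg_address *members, int n_members,
                         struct cpg_address *left, int n_left,
                         struct cpg_address *joined, int n_joined)
  {
          /* the member list changed; all members see this in the same order */
          printf("confchg: %d members (%d joined, %d left)\n",
                 n_members, n_joined, n_left);
  }

  static void deliver_cb(cpg_handle_t h, struct cpg_name *group,
                         uint32_t nodeid, uint32_t pid, void *msg, int len)
  {
          /* messages are delivered to all members in the same order */
          printf("message from %u:%u, %d bytes\n", nodeid, pid, len);
  }

  static cpg_callbacks_t callbacks = {
          .cpg_deliver_fn = deliver_cb,
          .cpg_confchg_fn = confchg_cb,
  };

  int main(void)
  {
          cpg_handle_t handle;
          struct cpg_name name;
          struct iovec iov = { .iov_base = "hello", .iov_len = 5 };

          /* use the app's own name as the cpg name to avoid collisions */
          strcpy(name.value, "example_app");
          name.length = strlen(name.value);

          cpg_initialize(&handle, &callbacks);
          cpg_join(handle, &name);
          cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
          cpg_dispatch(handle, CPG_DISPATCH_ALL);   /* run callbacks */
          cpg_leave(handle, &name);
          cpg_finalize(handle);
          return 0;
  }

A real application would typically run cpg_dispatch from its main loop (using cpg_fd_get to integrate with poll) and, as described above, act on a state change only when its own message describing that change is delivered back to it.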
using openais for cluster storage systems
-------------------------------------------------------------------------------

Openais as described above could be used as a foundation for many different clustering related programs/applications.  The rest of the document focuses on four cluster applications built on openais, all centered around sharing storage in the cluster: an i/o fencing system, a distributed lock manager (dlm), a cluster volume manager (clvm), and a cluster file system (gfs).  These are often used together, but they are independent subsystems that can also be used individually, or in combination with other cluster applications not covered here.

. fencing+dlm may be used in combination with a cluster database
. fencing+dlm may be used for HA application management (see rgmanager)
. fencing+dlm+clvm may be used for managing shared storage with local fs's
. fencing+dlm+clvm may be used with gfs, or a different cfs

In the event that other cluster applications are used in conjunction with some of the subsystems above, it would be important for that app to also use openais as its source for cluster membership.

groupd
-------------------------------------------------------------------------------

The groupd daemon (and libgroup interface) is a compatibility layer that was created in cluster2 to help migrate the fence/dlm/gfs subsystems to the new openais-based infrastructure.  This layer translates between openais/cpg, and fence/dlm/gfs which still expect cluster1-like interfaces and behaviors.  In the future, fence/dlm/gfs will no longer require this complex and cumbersome translation and will use the openais/cpg interfaces directly.

An application uses the libgroup interface to join and leave a cpg.  groupd translates group joins/leaves into cpg joins/leaves, and translates cpg configuration change callbacks into group callbacks.  In general, groupd translates each cpg callback into a sequence of three callbacks to an app.

An application provides a set of callback functions that groupd will use to notify it of changes to the group membership.  There are three callbacks to the app for each group membership change: stop, start, and finish.  The app needs to acknowledge the first two callbacks because groupd executes a barrier after them.  Barrier 1 follows callback 1 (stop), and barrier 2 follows callback 2 (start).  Callback 2 (start) won't happen until barrier 1 completes, and callback 3 (finish) won't happen until barrier 2 completes.

The purpose of callback 1 (stop) is to allow the app to prepare for configuration changes in other subsystems and itself.  The purpose of callback 2 (start) is to tell the app to begin recovery/reconfiguration.  The new group membership is provided with the start callback.  The reason for barrier 1 is that some apps need to suspend operations everywhere before any instance begins operating with a new configuration.  The purpose of barrier 2 is so that apps can safely free any state related to recovery, knowing that it will no longer be needed.

In the case where the change is due to a node failure, the first barrier (after stop) extends to all groups affected by the change, i.e. the first barrier requires the stop callback to be acked by all affected groups on all nodes.  Also, all groups affected by a failure event are restarted in a fixed order after the full stop barrier, i.e. all groups at level 0 finish starting before any groups at level 1 are started, and all groups at level 1 finish starting before any groups at level 2 are started.  Groupd defines fenced to be at level 0, dlm to be at level 1 and gfs to be at level 2.  This simply allows the apps to avoid checking the state of a lower level app directly.

Groupd buffers cpg events, and allows an app to finish processing a join or leave before it is notified about the next join or leave.  Change events for node failures do interrupt the processing of any current event, though, because the completion of the current event may depend on the newly failed node.

Groupd creates its own cpg through which it sends all messages, instead of sending messages through the cpg's for the individual groups.
This is necessary because it processes the cpg events asynchronously; the cpg membership doesn't match the membership being applied to the application. Messages need to be sent to all app members, which may include nodes no longer in the cpg. Groupd does not send apps a callback 2 (start) until the cluster has quorum. This simply allows the apps to avoid checking and waiting for quorum themselves. group_tool The group_tool command displays information about groups that exist on the given node. A circular debug log from groupd can be printed with "group_tool dump". The following shows node1, node2, node3 and node4, with nodeids 1,2,3,4 are members of the dlm lockspace foo. [node1]# group_tool type level name id state dlm 1 foo 00010001 none [1 2 3 4] The -v option displays extra information related to join/leave/failure events (no events in progress in the following): [node1 ~]# group_tool -v type level name id state node id local_done dlm 1 foo 00010001 none [1 2 3 4] . type: "fence" for fence domain, "dlm" for lockspaces, "gfs" for mount groups . level: 0 for fence, 1 for dlm, 2 for gfs . name: name of the group, gfs uses fsname for the mount group and lockspace . id: a global unique id for this group, generated by groupd . state: the status of change events for the group, join/leave/fail . node: the nodeid of the node that initiated the change event . id: the event id . local_done: 1 if local app has acked the stop or start callback, 0 if not . the nodeids of the group members are shown between [ ] groupd logic The following is the sequence of internal states that groupd uses to manage changes to each group. There are JOIN, LEAVE and FAIL versions of each, depending on whether the change is due to a group join, leave or node failure. An "app" refers to fenced, dlm_controld, or gfs_controld running in the context of a particular group; "all apps" refers to the given app/group on all the nodes where it's running. - all apps running normally state: none - groupd notified on all nodes, through a cpg callback, of a join, leave, or failure in a specific group/cpg state: BEGIN all apps sent callback 1 (stop) - local app got callback 1 (stop), hasn't acked it yet state: STOP_WAIT, local_done = 0 The app may be trying to block/suspend its activity at this time. - local app has acked callback 1 state: STOP_WAIT, local_done = 1 groupd waiting on barrier 1 for all apps to ack cb1 groupd collecting APP_STOPPED messages from all group members The app block/suspend has completed on this node. Debug: look for other nodes in the group in STOP_WAIT, local_done = 0 where the app (fenced, dlm_controld, gfs_controld) has not yet acked callback 1 (stop), the app needs to be inspected to find out why. - all apps have acked callback 1 (stop) state: ALL_STOPPED all apps sent callback 2 (start) In the case of a node failure, groupd waits for quorum after ALL_STOPPED before sending callback 2 to apps. In the case of a node leaving the group, the app is sent a special "terminate" callback here instead of a start callback. This is the last callback it will receive. - local app got callback 2 (start), hasn't acked it yet state: START_WAIT, local_done = 0 The start callback is accompanied by a list of group members, which will include a new node if this is a join event, or will be missing a node if this is a leave or failure event. The app is generally doing some kind of recovery/reconfig during this time to adjust to the addition or removal of nodes in the group. 
In the case of a new node joining the group, this start callback is the first it receives, and indicates that it is now in the group. - local app has acked callback 2 state: START_WAIT, local_done = 1 groupd waiting on barrier 2 for all apps to ack cb2 groupd collecting APP_STARTED messages from all group members The app recovery/reconfig completed on this node. Debug: look for nodes in the group in START_WAIT, local_done = 0 where the app (fenced, dlm_controld, gfs_controld) has not yet acked callback 2 (start), the app needs to be inspected to find out why. - all apps have acked callback 2 (start) state: ALL_STARTED all apps sent callback 3 (finish) - all apps running normally state: none fencing ------------------------------------------------------------------------------- What, When, Why I/O fencing is forcibly preventing a machine from writing to shared storage. It is usually done by turning off the power to the machine, or turning off the switch port connecting the machine to storage. Fencing is needed in shared storage clusters where all machines modify shared storage, and uncontrolled or uncoordinated modifications could corrupt the shared storage. - failure fencing A node needs to be fenced when it is using shared storage, fails, and the remaining nodes want to recover the shared storage for it without waiting to confirm that it's been reset. Once they've done the recovery, they can continue using the shared storage. There are two situations where fencing would not be necessary following a node failure: . The remaining nodes are willing to wait to do recovery until they confirm the failed node is in a safe state. This is possible, but usually impractical, because it would often require waiting for an administrator to reset the failed node. (See manual fencing for actually using this option.) . The node failure is due to a hard, terminal failure which would make it impossible for the machine to make any further writes to storage. It's not possible to automatically distinguish this kind of failure from others without administrator intervention. Assuming the cluster does not want to wait indefinitely to confirm that a failed node has been reset, fencing becomes necessary because the failure could be due to one of the following which are indistinguishable from a hard, terminal failure: i) a machine hang ii) a cluster partition Given either of these, the shared storage could easily be corrupted if remaining nodes recover and use shared storage without first fencing. Examples of how corruption could happen in each case without fencing: i) Say that a machine hangs, the membership system declares that it has failed, and the remaining nodes recover for it without fencing. During recovery, the remaining nodes take over the distributed locks the failed node held, write to shared storage to recover what the node was doing, and then continue using shared storage. If the hang clears on the failed node, the node will continue running wherever it was, which could be in code modifying the shared storage. These modifications will be based on old, invalid state, and will corrupt the shared storage. ii) Say that a machine's network connection to the cluster is broken. This creates the most common type of network partition. The disconnected machine sees that all other nodes have failed, and loses quorum. The other nodes see the that the disconnected machine has failed, and retain quorum. 
Due to the loss of quorum, gfs on the disconnected node can no longer acquire new locks to modify new parts of the fs.  However, gfs will continue to modify parts of the fs for locks it already holds.  If the nodes in the quorate partition do recovery for the failed/partitioned node without fencing, then nodes in both partitions may be modifying the same shared storage differently.

- start up fencing

Besides node failure, there's another situation where fencing is needed.  When nodes form a new cluster and want to begin using (initialize/recover) shared storage, they have no knowledge of the previous state of the cluster, and they need to fence any nodes that *may* have been in the previous cluster membership but are now in an unknown state (they are not currently cluster members.)

The nodes with unknown state need to be fenced because they might be:

i) in a hung state, having had gfs mounted in the previous cluster
ii) in a cluster partition with gfs mounted

Given either of these, shared storage could easily be corrupted if nodes in the new cluster begin using shared storage without fencing the unknown nodes. Examples of how corruption could happen in each case without fencing: (In both examples, nodes A, B and C are the only nodes listed in cluster.conf.)

i) Say cluster nodes A,B,C all have gfs mounted.  A,B are both reset at once, and at the same time C hangs.  A,B come back and form a new quorate cluster. A and B know nothing about C.  If A,B begin using gfs without first fencing C, the hang of C could clear allowing C to continue writing to gfs based on old, invalid state, corrupting the fs.

ii) Say cluster nodes A,B,C all have gfs mounted.  A,B are both reset at once, and at the same time are partitioned from C.  A,B come back and form a new quorate cluster.  A and B know nothing about C.  If A,B begin using gfs without first fencing C, they will corrupt the fs because gfs on C may still be modifying parts of the fs it holds locks on.

How

The fencing system implementation has three parts:

1. fence domain: deciding who should be fenced and when
   (the implementation of the requirements described above)
2. fence config: determining how to fence a given node
   (closely related to the way fencing info is specified in cluster.conf)
3. fence agents: doing the actual fencing
   (programs/scripts specific to hardware devices)

The first two are a part of the fencing daemon, fenced.  The third is done by a collection of "agent" programs and scripts, run by fenced, each of which interacts with specific hardware (network power switch, SAN switch).

fence domain

The fence domain initiates fencing operations against nodes; it decides who should be fenced and when.  The domain is a group (cpg) of cluster nodes that want to participate in fencing: being fenced themselves if they fail, and able to fence others.  A node joins the domain after joining the cluster, but before starting systems like gfs that use shared storage.  To join the domain, a node starts fenced, then runs "fence_tool join" which tells fenced to join the domain.

If a member of the domain fails, it will be fenced by another member, usually the member with the lowest nodeid.  fenced is notified of node failures through callbacks in the domain group/cpg.  A node can leave the domain with the "fence_tool leave" command, after which it will no longer be fenced if it fails.  GFS requires fencing be used, so when a gfs mount is attempted, gfs verifies the node is a fence domain member; the fence domain does fencing on behalf of gfs.
Similarly, fence_tool leave first checks that no shared storage systems like gfs are still in use. fence domain: quorum If the cluster does not have quorum, the fence domain will still keep track of failed domain members that need fencing, but it won't actually fence any machines until quorum is regained. This is a crucial application of quorum; it prevents a single malfunctioning node, or minority group of partitioned nodes from disrupting a majority cluster that is operating fine. When the cluster regains quorum, the domain will continue processing fencing operations for any domain members that had failed while there was no quorum. However, before actually fencing a node, fenced always checks whether the node has been reset and rejoined the cluster. If it has, there is no longer a reason to fence it, so the fencing operation is skipped. A common example is when a node in the domain fails, causing the cluster to lose quorum. The domain on the remaining nodes queue a fencing operation for the failed node, but wait to carry it out. If the failed node is reset and rejoins the cluster, causing quorum to be regained, the nodes with the pending fencing operation will continue processing it. However, they will find that the victim has now been reset and rejoined the cluster, so the fencing operation will just complete without any agent being run against the node. fence domain: startup fencing Startup fencing is done when a newly formed domain first acquires quorum. All possible nodes (those listed in cluster.conf) that are not currently cluster members are fenced. fenced uses a configurable delay between the addition of a new victim at startup, and actually carrying out the fencing. If the victim joins the cluster within this delay, it won't be fenced (see above). This avoids unnecessarily fencing nodes in the common scenario where a whole set of nodes are joining the cluster at once. (See post_join_delay.) fence domain: cluster application The following describes how fenced works. The fencing logic is driven by groupd callbacks in the domain (ultimately cpg configuration changes). - At startup, the list of current domain members is initialized to contain all nodes listed in cluster.conf. Startup fencing is the effect of doing this. - groupd callback 1 (stop) not used - groupd callback 2 (start) (groupd does not send this callback until the cluster has quorum, so fenced does not need to check for quorum itself.) Compare the previous list of domain members to the new list of domain members given in the start callback. If there are any nodes in the old list, but not in the new list, they have either left the domain or have failed. The nodes that have failed are added to a "victims" list of nodes to be fenced. Replace the existing list of domain members with the new list. If the local node does not have the low nodeid in the group, send groupd an ack for callback 2 (start), and do no more. If the local node has the low nodeid in the group, it is responsible to carry out fencing on the victims. It looks up the fence agent from cluster.conf, forks/execs the agent against a victim, waits for the result of the agent, retries until the victim has been successfully fenced, then repeats for the next victim. After all victims have been fenced, it sends groupd an ack for callback 2 (start). - groupd callback 3 (finish) By this, fenced knows that fencing has completed by the low nodeid. Clear the list of victims. 
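Putting the start/finish handling together, the victim logic can be summarized with the following self-contained sketch (the data structures, helper names and the stubbed agent call are illustrative only; fenced's real code also distinguishes nodes that left cleanly from nodes that failed, re-checks whether a victim has rejoined the cluster, retries failed agents and cycles through methods):

  #include <stdio.h>

  #define MAX_NODES 16

  static int old_members[MAX_NODES] = { 1, 2, 3, 4 };
  static int num_old = 4;
  static int victims[MAX_NODES];
  static int num_victims;
  static int our_nodeid = 2;

  static int in_list(int id, int *list, int count)
  {
          int i;
          for (i = 0; i < count; i++)
                  if (list[i] == id)
                          return 1;
          return 0;
  }

  static int run_agent(int victim)
  {
          /* fenced forks/execs the agent configured for this node */
          printf("fencing node %d\n", victim);
          return 0;                           /* 0 = success */
  }

  static void start_callback(int *new_members, int count)
  {
          int i;

          /* nodes in the old member list but not the new one become
             victims (a node that cleanly left would not be added) */
          for (i = 0; i < num_old; i++)
                  if (!in_list(old_members[i], new_members, count))
                          victims[num_victims++] = old_members[i];

          for (i = 0; i < count; i++)
                  old_members[i] = new_members[i];
          num_old = count;

          /* only the low nodeid (first entry, assuming a sorted member
             list) carries out fencing; everyone else just acks the start */
          if (our_nodeid != new_members[0])
                  return;

          while (num_victims)
                  if (run_agent(victims[num_victims - 1]) == 0)
                          num_victims--;      /* retry until fenced */
  }

  int main(void)
  {
          int after_failure[] = { 2, 3, 4 };  /* node 1 has failed */
          start_callback(after_failure, 3);
          return 0;
  }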
If the low nodeid fails while fencing, the remaining nodes will not get a finish callback, but will get a stop callback followed by another start.  They will already have nodes in their list of victims, and will add to it.  The new low nodeid will then process its victims list which will contain victims from the previous start and the current start.

A node could end up being fenced more than once if the node that fenced it fails before sending an ack to groupd.  In this case, the next low nodeid will fence it again.  The following is an example of this case:

- cluster.conf contains nodes A,B,C,D,E with nodeids 1,2,3,4,5 respectively

- A,B,C,D,E are all cluster members and fence domain members
  domain members = [A,B,C,D,E], victims = [-]

- A fails

- B,C,D,E get callback 2 (start) with a new list of domain members [B,C,D,E]
  fenced sees that A has been removed and adds A to the victims list
  domain members = [B,C,D,E], victims = [A]

- C,D,E ack callback 2 (start)

- B begins fencing A, but fails
  (whether B actually fences A before failing does not matter)

- C,D,E get callback 2 (start) with a new list of domain members [C,D,E]
  fenced sees that B has been removed and adds B to the victims list
  domain members = [C,D,E], victims = [A,B]

- D,E ack callback 2 (start)

- C fences A, agent completes successfully

- C fences B, agent completes successfully

- C acks callback 2 (start)

- C,D,E get callback 3 (finish)
  fenced knows that fencing of victims has completed and clears victims list
  domain members = [C,D,E], victims = [-]

fence domain: optional configuration

The optional fence_daemon section in cluster.conf can be used to configure the fenced behavior.

. post_join_delay: seconds to delay fencing after a node joins
. post_fail_delay: seconds to delay fencing after a node fails
. clean_start: set to 1 to disable startup fencing (dangerous)

fenced can also be run with corresponding command line options:

. -j: post_join_delay
. -f: post_fail_delay
. -c: clean_start
. -D: run fenced in foreground and print debugging info to stdout

fence domain: notes

1. A node failure is defined as the cluster membership system declaring that a node is no longer a member of the cluster.  It determines this by whatever membership/liveness protocol it uses.  This protocol is not immune to false positives, where a node is declared to be failed when it really isn't. Fencing is often the first obvious consequence of a spurious node failure in the membership system.  To reduce these unwarranted fencing/recovery events, the node failure detection in the membership system needs to reduce false positives.  One common way of doing this is to increase the time that the membership system waits for an unresponsive node before declaring it failed.

2. The fencing what/when/why section, and its implementation by the fence domain, is geared toward the requirements of shared storage clusters like gfs; it may not be ideal for other kinds of cluster apps that are interested in fencing.  Other kinds of clusters, with different fencing requirements, may want to use an alternative to the fence domain, although the fence config and fence agent parts may still be applicable.

3. After fencing a node, fenced tells cman about it, so that cman can display basic fencing information along with a node listing: cman_tool nodes -F

4. fenced keeps a circular debug log that can be dumped: fence_tool dump

fence config

The fencing configuration is by definition closely related to the structure and content of cluster.conf.
When the domain decides to fence a given node, it needs to figure out which agent to use and what parameters to pass that agent so the given node is fenced.

If the domain decides to fence node foo, it looks up the first device entry in the first method of foo's fence section.  A node's device entry refers to a specific device in the fencedevices section, and provides node-specific parameters that need to be used with that device.  The referenced fencedevice entry contains the name of the fence agent to be used.  fenced forks and execs the agent, combines the fencedevice entry and the node's device entry, and passes that data to the agent via stdin.  fenced waits for the agent to exit, and uses the exit value to determine success or failure.

fence config: multiple methods

In advanced configurations, multiple fencing methods can be defined for a node.  If fencing fails using the first method, fenced will try the next method, and continue to cycle through methods until one succeeds.

fence config: dual path, redundant power

Sometimes fencing a node requires disabling two power ports or two i/o paths. This is done by specifying two or more devices within a method.  Each device is handled independently, and all must succeed for the fencing operation to be considered a success.  For example, a node with two i/o paths connecting it to shared storage through separate SAN switches would have a device entry for each switch within a single method.

When using power switches to fence nodes with dual power supplies, the agents must be told to turn off both power ports before restoring power to either port.  The default off-on behavior of the agent could result in the power never being fully disabled to the node.  In such a configuration, the agent is run four times:

1. powers off the first power port
2. powers off the second power port
3. powers on the first power port
4. powers on the second power port

fence config: agent parameters

Each agent will have specific parameters that are required in the fencedevice section and in each node's device section.  Consult the man page for an agent for details on its cluster.conf configuration.
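To make the relationships concrete, a node's fence section and the matching fencedevices section fit together roughly like this (the device name, agent and the agent-specific parameters shown are only examples; each agent documents its own parameters):

  <clusternode name="node-01" nodeid="1">
    <fence>
      <method name="1">
        <device name="apc1" port="1"/>
      </method>
    </fence>
  </clusternode>
  ...
  <fencedevices>
    <fencedevice name="apc1" agent="fence_apc"
                 ipaddr="10.0.0.50" login="apc" passwd="apc"/>
  </fencedevices>

To fence node-01, fenced would combine the fencedevice attributes with the node's device attributes (here the power port) and pass the result to fence_apc on stdin.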
fence agents

The fencing agents interact with a specific piece of hardware to fence a given machine.  They are stand-alone programs or scripts that can take all input parameters from the command line.  They can also take input parameters from stdin, which is what fenced uses.  The exit code of the program/script indicates the success or failure of the fencing operation.  In the case of a failure, the agent should also write an error message (up to 256 bytes) to stdout.  The message will be read by fenced and included in a syslog message from fenced for the failure.  The agents should avoid using any other part of the cluster infrastructure.

The various types of fencing agents include: (Note: some of these agents may no longer even work, they are just given as examples of the different kinds of agents.)

- network power switch
  A node logs into a network power switch, unfortunately this is often via a telnet interface, and turns off the specified power port.
  agents: fence_apc, fence_wti

- machine management interfaces
  Systems management interfaces accessible via the network can often be used to power off a node: ILO (HP Integrated Lights Out card), DRAC (Dell Remote Access Card), RSA (IBM RSA II management interface).
  agents: fence_ilo, fence_drac, fence_rsa, fence_ipmi

- SAN switch
  A node logs into a SAN switch and disables the port belonging to the node being fenced.
  agents: fence_brocade, fence_mcdata, fence_sanbox2

- SCSI persistent reservation
  Using SCSI persistent reservations for fencing can be complicated because commands must be run at start up time to set it up, and the specific SCSI devices need to be derived from the logical volumes that are being used as shared storage.
  agent: fence_scsi

- network block storage server
  If the shared storage is being exported over the network by a server, the server software can potentially be contacted and told to ignore i/o from the node being fenced.
  agent: fence_gnbd; an agent for a server exporting an iscsi target would also be appropriate.

- virtual machine
  Virtual machines can fence each other with the assistance of an external daemon that keeps track of virtual to physical machine mappings.
  agent: fence_xvm

- manual override
  The fence_manual agent writes a syslog message telling the administrator to manually reset the failed node, and then run the fence_ack_manual program after it's reset.  When fence_ack_manual is run, fence_manual will return to fenced with a successful result.  If the failed node rejoins the cluster before fence_ack_manual has been run, fence_manual completes successfully without the ack.

If no fencing method is defined for a node, it is the equivalent of manual fencing; fenced will wait indefinitely for the admin to override the fencing operation via the command "echo victimnodename >> /var/run/cluster/fenced_override", which should not be done until the failed node has been safely turned off or reset.

fence_node

The fence_node command is a small program that can be used to fence a specific node listed in cluster.conf, independent of the fence domain and without any involvement from fenced.  It uses the same code as fenced to look up the appropriate agent and parameters in cluster.conf, and then runs the agent.

It would be simple to also offer this fence_node(nodename) capability in the form of a library call from a future libfence library.  If a different kind of openais/cman cluster has its own requirements for fencing, and its own method for deciding what node to fence and when to fence it, it may prefer to call fence_node(nodename) through a libfence API instead of via a command line "fence_node nodename" call.  Parts 2 and 3 of our fencing system would remain the same, and the fence domain in fenced would simply call libfence:fence_node().

Part 1 of the fencing architecture, the fence domain, should be considered one of multiple applications that make requests for nodes to be fenced.  The domain does not provide fencing capabilities to other applications, but rather is a consumer of fencing capabilities from parts 2 and 3 of the architecture (or libfence above).  The fence_node program is another consumer that parallels the domain.  Yet another consumer, at the same level as the domain and fence_node, may be an application management system.

The fence domain application consumes fencing services on behalf of dlm and gfs specifically.  The style in which the fence domain consumes fencing for dlm/gfs (according to their requirements) *may* also be suitable, incidentally, for other applications.  One should not assume, however, that this is the case.  The fencing requirements of another application could easily be different enough from those of dlm/gfs that the domain system would not be appropriate.  Another method would be needed to decide who is fenced and when, according to the app's requirements.  In particular, if an application requires that a node always be fenced after it fails, or be fenced as soon as possible after it fails, the domain would not be suitable.
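The agent interface described under "fence agents" above amounts to: read parameters, act on the device, and report the result through the exit code (plus an error message on stdout).  A minimal skeleton might look like the following sketch, assuming the conventional one "name=value" parameter per line on stdin (the parameter names and the device action are placeholders):

  /* minimal fence agent skeleton; parameter names and the action are
     placeholders, a real agent talks to an actual device */
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          char line[256], ipaddr[64] = "", port[64] = "";

          /* fenced writes "name=value" parameters, one per line, on stdin */
          while (fgets(line, sizeof(line), stdin)) {
                  line[strcspn(line, "\n")] = '\0';
                  if (!strncmp(line, "ipaddr=", 7))
                          snprintf(ipaddr, sizeof(ipaddr), "%s", line + 7);
                  else if (!strncmp(line, "port=", 5))
                          snprintf(port, sizeof(port), "%s", line + 5);
          }

          if (!ipaddr[0] || !port[0]) {
                  /* on failure, write a short error message to stdout;
                     fenced reads it and logs it via syslog */
                  printf("missing ipaddr or port parameter\n");
                  return 1;
          }

          /* ... log into the device at ipaddr and turn off port ... */

          return 0;       /* exit 0 tells fenced the node was fenced */
  }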
dlm
-------------------------------------------------------------------------------

The following document describes dlm concepts and the API: http://people.redhat.com/ccaulfie/docs/rhdlmbook.pdf

The dlm is a heavy-weight solution for distributed locking.  Other distributed locking services may be better suited for smaller, less demanding applications.  (See the LCK AIS service.)

An architectural overview of the dlm follows, dividing the dlm into three main parts:

1. dlm-kernel: the lock manager itself, which operates in the kernel
2. dlm_controld: the part of the dlm that interacts with the userspace cluster infrastructure and controls dlm-kernel according to cluster events
3. libdlm: the library through which userland applications use the dlm

The mechanisms dlm_controld uses to interact with dlm-kernel will be covered in part 2, and the mechanisms libdlm uses to interact with dlm-kernel will be covered in part 3.

dlm-kernel

The dlm provides a VMS-style distributed locking service that cluster applications can use for synchronization.  The dlm is implemented in the kernel.  The implementation of the lock manager itself has not been documented, but can be found in the linux kernel source under linux/fs/dlm.  The dlm API for kernel applications (like gfs) is defined in include/linux/dlm.h.  The dlm is implemented in the kernel primarily because a cluster file system like gfs uses it so heavily.

During normal operation, the dlm runs on its own; it doesn't make use of services from other cluster components.  Internode communication by the dlm is handled by the dlm itself, using TCP or SCTP.  IP addresses are provided by dlm_controld.

Dlm locking takes place in the context of a "lockspace", which is a group of nodes that share a common namespace for locks.  Separate lockspaces are typically used for different application instances or distinct resources like an individual file system.  The lockspace membership is managed by dlm_controld in userspace.

Any change in the lockspace membership requires the dlm to do lock recovery within the given lockspace.  Recoveries interrupt the normal operation of a lockspace when nodes (processes) join or leave the lockspace.  The recovery for a node removal (by leaving or failing, it doesn't matter) is usually longer than the recovery for a node addition.  During recovery, calls into the dlm will block.

dlm_controld

The userspace dlm_controld daemon is an openais cluster application that uses cpg's (through groupd) to manage lockspace membership.  There are three types of control events where dlm_controld and dlm-kernel need to interact:

1. a local app is joining a lockspace
2. a local app is leaving a lockspace
3. a remote app is joining or leaving a lockspace

The interactions take place through configfs and sysfs files, which should be considered internal dlm interfaces.  For a lockspace named foo, lockspace control files exist under:

. /sys/kernel/dlm/foo/
. /sys/kernel/config/dlm/cluster/spaces/foo/

dlm_controld creates a configfs directory entry to add a node to a lockspace and removes its directory entry to remove it from a lockspace.  dlm_controld blocks normal locking activity in a lockspace by writing to a sysfs file for the lockspace.  This happens before adjusting the membership on any of the nodes.  Locking activity is resumed after the adjustments have been made.
The lockspace's first step in resuming activity is doing recovery.

local join

- app calls into dlm-kernel new_lockspace function
  . creates the necessary kernel structures
  . sends an "online foo" uevent message to dlm_controld
  . blocks waiting for a result to be written to /sys/kernel/dlm/foo/event_done

- dlm_controld
  . receives the "online foo" message and joins the group (cpg) "foo"
  . waits for the join to complete (gets start callback from groupd)
  . adds directory entries for each lockspace member under /sys/kernel/config/dlm/cluster/spaces/foo/nodes/
  . writes "1" to /sys/kernel/dlm/foo/control which tells dlm-kernel to begin recovery in lockspace foo using the set of members under foo/nodes/
  . writes "0" to /sys/kernel/dlm/foo/event_done which tells dlm-kernel that the lockspace join is complete

- app blocked in dlm-kernel new_lockspace function
  . wakes up after "0" is written to event_done
  . returns a lockspace handle to the app that can now be used for standard locking operations

- app begins using the dlm to do locking; calls into the dlm block until lockspace recovery is complete (recovery began above at the write to control, and may or may not be complete by the time the app begins using the dlm)

local leave

- app calls into dlm-kernel release_lockspace function
  . sends an "offline foo" uevent message to dlm_controld
  . blocks waiting for a result to be written to /sys/kernel/dlm/foo/event_done

- dlm_controld
  . receives the "offline foo" message and leaves the group (cpg) "foo"
  . waits for the leave to complete (gets terminate callback from groupd)
  . removes all directory entries from /sys/kernel/config/dlm/cluster/spaces/foo/nodes/
  . writes "0" to /sys/kernel/dlm/foo/event_done which tells dlm-kernel that the lockspace leave is complete

- app blocked in dlm-kernel release_lockspace function
  . wakes up after "0" is written to event_done
  . frees structures for the lockspace
  . returns to app

remote join/leave/fail

- app using lockspace foo is running on three nodes with nodeids 1,2,3
  . configfs entries for foo on all nodes are
    /sys/kernel/config/dlm/cluster/spaces/foo/nodes/1/
    /sys/kernel/config/dlm/cluster/spaces/foo/nodes/2/
    /sys/kernel/config/dlm/cluster/spaces/foo/nodes/3/

- app on node 3 exits (node does not fail)

- dlm_controld (on nodes 1 and 2)
  . gets groupd callback 1 (stop)
  . writes "0" to /sys/kernel/dlm/foo/control which tells dlm-kernel to suspend operations in lockspace foo; the write does not return until all in-progress operations have completed
  . sends groupd an ack for callback 1 (stop)
  . groupd does standard barrier 1; it's important for operations in foo to be suspended on all nodes before any begin recovery
  . gets groupd callback 2 (start)
  . removes node 3 from the lockspace, removing the directory for nodeid 3 /sys/kernel/config/dlm/cluster/spaces/foo/nodes/3
  . writes "1" to /sys/kernel/dlm/foo/control which tells dlm-kernel to begin recovery in lockspace foo using the set of members under foo/nodes/
  . sends groupd an ack for callback 2 (start)
  . groupd does standard barrier 2 and sends callback 3 (finish) when it completes, but it is not used by dlm_controld

- dlm-kernel (on nodes 1 and 2)
  . begins recovery of lockspace foo after "1" is written to control
  . when recovery completes, lock operations from the app are resumed
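The suspend/adjust/resume steps above reduce to a handful of file operations. The following sketch shows the idea for lockspace foo and the removal of nodeid 3 (simplified; dlm_controld builds these paths dynamically and handles errors and orderings that are omitted here):

  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>

  #define SPACES "/sys/kernel/config/dlm/cluster/spaces"

  static void write_str(const char *path, const char *val)
  {
          int fd = open(path, O_WRONLY);
          if (fd < 0)
                  return;
          if (write(fd, val, 1) < 0)
                  perror(path);
          close(fd);
  }

  int main(void)
  {
          /* stop callback: suspend locking in lockspace foo on this node */
          write_str("/sys/kernel/dlm/foo/control", "0");

          /* start callback: drop nodeid 3 from the lockspace membership */
          if (rmdir(SPACES "/foo/nodes/3") < 0)
                  perror("rmdir");

          /* tell dlm-kernel to begin recovery with the remaining members */
          write_str("/sys/kernel/dlm/foo/control", "1");
          return 0;
  }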
node comms

Apart from the control-oriented events above, dlm_controld also provides dlm-kernel with information about cluster nodes, outside the context of a specific lockspace. In particular, the dlm needs to know the IP address of each nodeid present in any lockspace.

For each node in the cluster (regardless of its activity in any lockspace) dlm_controld creates a directory under /sys/kernel/config/dlm/cluster/comms/. In the example above, each node would have these entries:

/sys/kernel/config/dlm/cluster/comms/1/
/sys/kernel/config/dlm/cluster/comms/2/
/sys/kernel/config/dlm/cluster/comms/3/

Each nodeid directory contains three entries: nodeid, addr, local. The nodeid is written to nodeid (the same as the directory name). The IP address is written to addr (in the form of a binary sockaddr_storage structure). A "1" is written to local under the node's own entry.

tuning

Tuning parameters for dlm-kernel also exist in configfs:

# ls /sys/kernel/config/dlm/cluster/
buffer_size  lkbtbl_size  recover_timer  spaces/      toss_secs
comms/       log_debug    rsbtbl_size    tcp_port
dirtbl_size  protocol     scan_secs      timewarn_cs

The protocol, timewarn and log_debug settings can be placed in cluster.conf under the <dlm/> element, and dlm_controld will set them (the others need to be set manually).

. buffer_size
. dirtbl_size
. lkbtbl_size
. log_debug
. protocol
. recover_timer
. rsbtbl_size
. scan_secs
. tcp_port
. timewarn_cs
. toss_secs

libdlm

Userspace programs can use the dlm through the libdlm library. The library interface is defined in /usr/include/libdlm.h. The library uses devices to communicate with dlm-kernel. It passes operations to dlm-kernel by writing to the device and reads the device to get completion results. Writes to the /dev/misc/dlm-control device are used to create and release lockspaces. A separate device, /dev/misc/dlm_<lockspace name>, is created for locking operations in each lockspace. The structures used by libdlm to communicate with dlm-kernel are defined in /usr/include/linux/dlm_device.h. libdlm handles backward compatibility with previous kernel interfaces.

Related man pages: libdlm(3), and a man page for each function defined in libdlm.h.

dlm_tool

The dlm_tool program can be used to display the resources and locks in a lockspace, and to join/leave a lockspace.

The "dlm_tool lockdump foo" command displays all locks held by local processes in lockspace foo. The -M option adds "master copy" (MSTCPY) locks to the dump; these are locks owned by remote processes that are mastered locally. Lockdump reads the list of locks for lockspace foo from dlm-kernel through the debugfs file /sys/kernel/debug/dlm/foo_locks.

The "dlm_tool lockdebug foo" command also dumps resources and locks, but in a different format that includes some additional debugging information.

The "dlm_tool join foo" and "dlm_tool leave foo" commands can be used to create and release lockspace foo. The -m option can be used with join to specify the permission of the resulting lockspace device. This can be useful for creating lockspaces for applications without root permission, or to control the lifetime of a lockspace for some reason.
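As a companion to the node comms description above, the following hypothetical sketch (again, not dlm_controld code) shows how a control daemon could publish a node's address to dlm-kernel by filling in a comms entry; the nodeid and address used here are made up for the example:

  /* hypothetical sketch of filling in a comms entry as described above:
     nodeid, a binary sockaddr_storage and the local flag are written
     into /sys/kernel/config/dlm/cluster/comms/<nodeid>/ */

  #include <stdio.h>
  #include <string.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/stat.h>
  #include <netinet/in.h>
  #include <arpa/inet.h>

  #define COMMS "/sys/kernel/config/dlm/cluster/comms"

  static int write_file(const char *path, const void *buf, size_t len)
  {
          int fd = open(path, O_WRONLY);
          if (fd < 0)
                  return -1;
          int rv = write(fd, buf, len);
          close(fd);
          return (rv < 0) ? -1 : 0;
  }

  /* e.g. add_comm(3, "10.0.0.3", 0) publishes the address of nodeid 3 */
  static int add_comm(int nodeid, const char *ip, int local)
  {
          struct sockaddr_storage ss;
          struct sockaddr_in *sin = (struct sockaddr_in *)&ss;
          char path[256], val[16];

          /* create the configfs directory for this nodeid */
          snprintf(path, sizeof(path), "%s/%d", COMMS, nodeid);
          mkdir(path, 0755);

          snprintf(path, sizeof(path), "%s/%d/nodeid", COMMS, nodeid);
          snprintf(val, sizeof(val), "%d", nodeid);
          write_file(path, val, strlen(val));

          /* addr takes the raw sockaddr_storage bytes */
          memset(&ss, 0, sizeof(ss));
          sin->sin_family = AF_INET;
          inet_pton(AF_INET, ip, &sin->sin_addr);
          snprintf(path, sizeof(path), "%s/%d/addr", COMMS, nodeid);
          write_file(path, &ss, sizeof(ss));

          if (local) {
                  snprintf(path, sizeof(path), "%s/%d/local", COMMS, nodeid);
                  write_file(path, "1", 1);
          }
          return 0;
  }

  int main(void)
  {
          return add_comm(3, "10.0.0.3", 0);
  }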
configuration options

Lockspaces usually use a resource directory to keep track of which node is the master of each resource. The dlm can operate without the resource directory, though, by statically assigning the master of a resource using a hash of the resource name.

The nodir setting can be combined with node weights to create a configuration where select node(s) are the master of all resources/locks. These "master" nodes can be viewed as "lock servers" for the other nodes. Or, lock management will be partitioned among the available masters; there can be any number of masters defined. The designated master nodes will master all resources/locks (according to the resource name hash). When no masters are members of the lockspace, the nodes revert to the common fully-distributed configuration. Recovery is faster, with little disruption, when a non-master node joins/leaves.

There is no special mode in the dlm for this lock server configuration; it's just a natural consequence of combining the "nodir" option with node weights. When a lockspace has master nodes defined, the masters have a default weight of 1 and all non-master nodes have a weight of 0. Explicit non-zero weights can also be assigned to master nodes, e.g. giving node01 a weight of 2 and node02 a weight of 1, in which case node01 will master 2/3 of the total resources and node02 will master the other 1/3.

fencing requirements

The dlm itself does not require fencing for correct operation. Fencing is defined as preventing a node from writing to shared storage, and the dlm does not use shared storage in any way. Shared-storage-related applications that use the dlm, however, may assume that the dlm will not grant locks from failed nodes before a failed node has been fenced. In this way, the dlm is used as a proxy for checking the fencing status of failed nodes. (This is not the case with gfs, which checks fencing explicitly.) For this reason, it is best for the dlm not to begin recovery for a failed node until fencing has occurred.

alternate cluster management systems

dlm_controld uses openais (cpg through groupd and cman) to drive dlm-kernel. An alternative daemon could be written to drive dlm-kernel from a different cluster management system. This alternative daemon would use the same configfs/sysfs control interfaces that dlm-kernel exposes. A second dlm_controld is expected in the future that will use openais API's directly, without the groupd/libgroup translation layer.

gfs
-------------------------------------------------------------------------------

Much like the dlm, gfs has a userland daemon, gfs_controld, that interacts with the cluster infrastructure and controls gfs-kernel through sysfs when necessary. The amount of control needed between gfs-kernel and gfs_controld is less than in the dlm case.

gfs_controld uses cpg's (through groupd) to manage the membership of each "mount group": which nodes have each filesystem mounted. Joining and leaving the mount group is done by mount.gfs and umount.gfs before/after gfs-kernel does the standard filesystem mount/unmount. gfs-kernel does not need to know the mount group membership (in contrast to dlm-kernel, which needs to know the lockspace membership).

Recovery is the only area where gfs-kernel requires control input from gfs_controld. Specifically, gfs_controld suspends/unsuspends gfs locking before/after recoveries, and tells gfs-kernel which journals need recovery. gfs-kernel activity needs to be suspended before dlm recovery begins for the given fs. DLM locks granted before gfs journal recovery has completed cannot be used for normal fs access, only for journal recovery.

gfs_controld maintains state about each member of the mount group. When a node joins (i.e. mounts the fs), gfs_controld needs to sync the state of the existing members to the new member. This state includes the mount options of each member, the journal id used by each member, the recovery status of each journal, and the first-mounter state of the fs.
[References to gfs-kernel include both the gfs and lock_dlm kernel modules. The lock_dlm module is a part of gfs (despite the misleading name). It's a bridge between gfs and the dlm, and between gfs and the userspace cluster infrastructure. GFS has historically isolated its interactions with specific locking/clustering systems within lock modules. This has allowed gfs to more readily switch between different external lock and cluster management systems.]

mkfs

When a gfs file system is created, gfs_mkfs requires two parameters related to the clustering/locking infrastructure, and stores them in the on-disk superblock:

-p <lockproto>
-t <locktable>

The lockproto is the lock module that the fs should use. Two currently exist, lock_nolock and lock_dlm (see above). If the fs will be used as a local fs, lock_nolock should be used. If the fs will be used by the cluster (the whole point of gfs), lock_dlm should be used.

The locktable is not used by gfs directly, but is passed to the lock module. lock_nolock ignores the locktable. lock_dlm requires that the locktable be of the form <clustername>:<fsname>.

clustername is the name of the cluster whose members may mount this fs. A node that is not a member of a cluster, or is a member of a cluster with a different name, will not be permitted to mount this fs. The clustername is the one defined in cluster.conf (e.g. <cluster name="alpha">).

fsname is a unique name chosen at the time of mkfs for this specific file system. This name will be used by the cluster infrastructure (gfs_controld/cpg) to manage mounts of this fs.

mount

The mount(8) command calls mount.gfs, which joins the mount group for the fs before calling mount(2), where control passes to gfs-kernel for the traditional filesystem mount from disk.

(When creating cpg's for gfs mount groups, groupd prefixes the name with 2, e.g. "2:foo". It uses a 1 prefix for dlm lockspace cpg's, e.g. "1:foo".)

- an example mkfs for reference in the examples below:
  gfs_mkfs -p lock_dlm -t alpha:foo -j 4 /dev/foo

- mount /dev/foo /mnt
  . mount(8) calls mount.gfs

- mount.gfs
  . reads the lockproto and locktable from the superblock: lockproto="lock_dlm" and locktable="alpha:foo"
  . asks gfs_controld to join the mount group for this fs; it connects to gfs_controld and sends it the message "join /mnt gfs lock_dlm alpha:foo rw /dev/foo"
  . reads a preliminary result from gfs_controld: "0" if the mount was accepted and has begun, "-EXYZ" if the mount was rejected, in which case it prints an error message and exits
  . waits to receive the final result from gfs_controld, blocking in a read on the connection to gfs_controld

- gfs_controld
  . receives the join request from mount.gfs
  . checks that the local node is a member of the cluster named in the locktable (alpha), and that the local node is a member of the fence domain
  . joins the mount group (cpg) "foo"
  . waits for the join to complete
  . gets groupd callback 2 (start)
  . immediately sends groupd an ack for callback 2 (start)
  . syncs state for mount group foo with other nodes
  . picks an available journal id for the local mount to use
  . sends the result of the join back to mount.gfs
    if an error, the result message begins with "error:"
    if success, the result message is "hostdata=jid=X:id=Y:first=Z"
    X is the journal id that the local mount should use
    Y is a globally unique filesystem identifier used for posix locks
    Z is 1 if this is the first mount of the fs, 0 otherwise

- mount.gfs
  . receives the join result, the hostdata string
  . appends the hostdata to the mount options it will pass to gfs-kernel
  . calls mount(2)

- gfs-kernel
  . does kernel filesystem mount
  . calls dlm_new_lockspace() using fsname (foo) as the lockspace name
  . returns from mount(2)
- mount.gfs
  . sends a message to gfs_controld with the result of the mount(2)
  . adds a line to /etc/mtab for this mount and exits

If mount(2) returns an error, mount.gfs has to leave the group, similar to the unmounting procedure; this is a lot to do to back out at this point, so we want to avoid getting an error back from mount(2) if we can help it.

unmount

The umount(8) command calls umount.gfs, which first calls umount(2) to do the traditional filesystem unmount in gfs-kernel. When the umount(2) completes, umount.gfs leaves the mount group for the fs.

- umount /mnt
  . umount(8) calls umount.gfs
  . umount.gfs calls umount(2)

- gfs-kernel
  . does kernel filesystem unmount
  . calls dlm_release_lockspace()
  . returns from umount(2)

- umount.gfs
  . asks gfs_controld to leave the mount group for this fs; it connects to gfs_controld and sends it the message "leave /mnt foo 0"
  . waits to receive a reply message from gfs_controld

- gfs_controld
  . sends a reply back to umount.gfs
  . leaves the mount group
  . waits for the leave to complete

- umount.gfs
  . receives the reply from gfs_controld, a string containing "0" for ok, or "-EXYZ" for an error
  . removes the /etc/mtab line for this fs and exits

- gfs_controld
  . gets the groupd terminate callback indicating the leave has completed

recovery

- gfs-kernel, filesystem foo (on nodes 1, 2 and 3)
  . running normally
  . nodeid 1 using jid 0, nodeid 2 using jid 1, nodeid 3 using jid 2

- node 3 fails

- gfs_controld (on nodes 1 and 2)
  . gets groupd callback 1 (stop)
  . writes "1" to /sys/fs/gfs/alpha:foo/lock_module/block which tells gfs-kernel to block any new lock requests and not use any locks that are granted from requests in progress
  . sends groupd an ack for callback 1 (stop)
  . groupd does expanded barrier 1, covering all groups affected by the failure
  . groupd completes all callbacks at the fence and dlm levels
  . gets groupd callback 2 (start)
  . writes "2", the jid that was used by node 3, to /sys/fs/gfs/alpha:foo/lock_module/recover, which tells gfs-kernel that journal 2 needs recovery

- gfs-kernel (on nodes 1 and 2)
  . both try to recover journal 2; gfs-kernel sorts out which node actually does the recovery
  . sends a "change foo" uevent message to gfs_controld indicating that recovery for the last write to "recover" has completed

- gfs_controld (on nodes 1 and 2)
  . receives the "change foo" message and reads the result of recovery from:
    /sys/fs/gfs/alpha:foo/lock_module/recover_done - reports the jid to which the following recovery status applies (2 in this case)
    /sys/fs/gfs/alpha:foo/lock_module/recover_status - the result of recovery
      308 (LM_RD_GAVEUP) means recovery failed or didn't happen
      309 (LM_RD_SUCCESS) means recovery completed or was already done
  . if there are more journals to recover, repeats the steps above
  . sends the results of recovery to the group
  . receives its own recovery results
  . sends groupd an ack for callback 2 (start)
  . gets groupd callback 3 (finish)
  . writes "0" to /sys/fs/gfs/alpha:foo/lock_module/block which tells gfs-kernel to unblock new lock requests

- gfs-kernel (on nodes 1 and 2)
  . running normally again
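The sysfs side of this recovery sequence is simple enough to sketch. The following hypothetical example (not gfs_controld code) drives journal 2 recovery for the alpha:foo filesystem using the lock_module files listed above; the real daemon performs these steps only between the groupd callbacks:

  /* hypothetical sketch of the recovery control writes/reads above for
     filesystem alpha:foo; not code from gfs_controld */

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <fcntl.h>
  #include <unistd.h>

  #define LM "/sys/fs/gfs/alpha:foo/lock_module"

  static int write_str(const char *file, const char *val)
  {
          char path[256];
          snprintf(path, sizeof(path), "%s/%s", LM, file);
          int fd = open(path, O_WRONLY);
          if (fd < 0)
                  return -1;
          int rv = write(fd, val, strlen(val));
          close(fd);
          return (rv < 0) ? -1 : 0;
  }

  static int read_int(const char *file)
  {
          char path[256], buf[32];
          int fd, n;
          snprintf(path, sizeof(path), "%s/%s", LM, file);
          fd = open(path, O_RDONLY);
          if (fd < 0)
                  return -1;
          n = read(fd, buf, sizeof(buf) - 1);
          close(fd);
          if (n <= 0)
                  return -1;
          buf[n] = '\0';
          return atoi(buf);
  }

  int main(void)
  {
          write_str("block", "1");     /* block new lock requests */
          write_str("recover", "2");   /* ask gfs-kernel to recover journal 2 */

          /* gfs_controld learns of completion from the "change foo" uevent;
             polling here is only for the sake of the sketch */
          while (read_int("recover_done") != 2)
                  sleep(1);

          if (read_int("recover_status") == 309)   /* LM_RD_SUCCESS */
                  printf("journal 2 recovered\n");

          write_str("block", "0");     /* unblock new lock requests */
          return 0;
  }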
posix locks

gfs-kernel sends posix lock operations to gfs_controld for processing. gfs-kernel and gfs_controld communicate through the /dev/misc/lock_dlm_plock device. The API is defined in include/linux/lock_dlm_plock.h. Processing plocks could potentially be done by a separate process; it is largely independent from the other activities in gfs_controld.

Posix lock state is replicated on all nodes in the mount group. Every plock operation is multicast through a cpg, and processed on all nodes when it is received. On the node that originated the message/operation, the result of the processing is written back to gfs-kernel.

When a new node mounts an fs and joins the mount group, the existing posix locks need to be copied to it. gfs_controld uses the checkpoint AIS service to do the syncing. The low nodeid in the mount group writes all posix lock state to a checkpoint, then notifies the new mounter, which reads the checkpoint and installs all the existing plock state.

debugging

The gfs_controld daemon keeps a circular buffer of debug messages that can be dumped with the 'group_tool dump gfs' command.

The state of all posix locks for a specific fs can also be dumped from gfs_controld with the 'group_tool dump plocks <fsname>' command.