This is a technical description of how a simple gfs mount, gfs unmount
and node recovery work from the perspective of the cluster
infrastructure in RHEL5.  The most difficult and complicated scenarios
in the infrastructure relate to the *combinations* of mounts, unmounts
and node recoveries.  No attempt is made to document how those cases
are handled.

The groupd layer makes things more complicated.  It will be removed in
the future and the gfs/dlm daemons will interact with libcpg directly.


gfs mount
-------------------------------------------------------------------------------

(clustername = foo, fsname = bar
 mount happening on node02 with nodeid 2
 node01 with nodeid 1 already has bar mounted)

mount(8) calls mount.gfs

mount.gfs reads the superblock off the device and gets
  lockproto = lock_dlm
  locktable = foo:bar

mount.gfs sends a message to gfs_controld "join gfs lock_dlm foo:bar"

mount.gfs waits to receive a reply message from gfs_controld

---

gfs_controld verifies the node is a member of cluster "foo"

gfs_controld verifies the node is a member of the fence domain

gfs_controld joins the group "bar" via libgroup:group_join()

groupd receives "join bar" message from gfs_controld

groupd joins the cpg "gfs_bar" via libcpg:cpg_join()

ALL: groupd receives a cpg confchg (configuration change callback) for
     bar with cpg members = 1,2 - sees that 2 has been added

ALL: gfs_controld receives a "stop bar" callback from groupd

ALL: gfs_controld blocks locking in gfs by setting a gfs-kernel flag
     /sys/fs/gfs/foo:bar/lock_module/block to 1

ALL: groupd receives a "stop_done bar" callback from gfs_controld and
     sends a "stopped bar" message to all

ALL: groupd waits to receive stopped messages from all

ALL: gfs_controld receives a "start bar" callback from groupd indicating
     group members = 1,2

ALL: gfs_controld sends a "start_done bar" message back to groupd, syncs
     state for bar among the new nodes, selects a journal id (jid) for
     the new node, and on the node with mount.gfs, sends a reply back to
     mount.gfs

ALL: groupd receives a "start_done bar" callback from gfs_controld and
     sends a "started bar" message to all

ALL: groupd waits to receive started messages from all

ALL: gfs_controld receives a "finish bar" callback from groupd

ALL: gfs_controld unblocks locking in gfs by setting the gfs-kernel flag
     /sys/fs/gfs/foo:bar/lock_module/block to 0

---

mount.gfs reads the reply from gfs_controld, a string containing "0" for
ok, or "-EXXX" for an error

if ok, mount.gfs reads a second message from gfs_controld containing the
string of gfs-specific mount options it should use, e.g.
"hostdata=jid=1:id=196609:first=0"

  jid is the journal that the mounting node should use

  id is a unique, global numeric identifier that gfs can use to
  distinguish between fs's (not very important)

  first is 1 if this is the first node to mount the fs, 0 otherwise; the
  first node to mount the fs checks and recovers all journals during
  mount

mount.gfs does mount(2) system call

gfs-kernel does fs-specific mounting stuff and calls dlm_new_lockspace()
to join the lockspace for this fs
[see below for how joining a lockspace works]

mount(2) returns 0

mount.gfs sends a message to gfs_controld with the result of the mount(2)

mount.gfs adds a line to /etc/mtab for this mount and exits

if mount(2) returns an error, mount.gfs has to leave the group, similar
to the unmounting procedure -- this is a lot to do to back out at this
point, so we want to avoid getting an error back from mount(2) if we can
help it

ALL: gfs_controld receives the mount result

===

This is what the dlm does when joining a lockspace.  The dlm interacts
with the cluster infrastructure on its own, and the caller, e.g. the fs,
doesn't see any of this.
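The mount-option string that gfs_controld hands back to mount.gfs can be
illustrated with a short sketch.  This is not the actual mount.gfs code;
the function name and error handling are hypothetical, only the string
format ("hostdata=jid=1:id=196609:first=0") comes from the description
above:

```python
# Hypothetical sketch of parsing the gfs-specific mount option string
# that gfs_controld sends to mount.gfs; illustrative only, not the
# actual mount.gfs implementation.

def parse_hostdata(opts):
    """Parse a string like "hostdata=jid=1:id=196609:first=0"
    into a dict of integer fields."""
    prefix = "hostdata="
    if not opts.startswith(prefix):
        raise ValueError("unexpected mount option string: %s" % opts)
    fields = {}
    for pair in opts[len(prefix):].split(":"):
        key, _, value = pair.partition("=")
        fields[key] = int(value)
    return fields

opts = parse_hostdata("hostdata=jid=1:id=196609:first=0")
# opts["jid"]   -> journal the mounting node should use
# opts["first"] -> 1 only for the first mounter, which checks and
#                  recovers all journals during mount
```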
dlm_new_lockspace("bar") is called in the kernel
  sends an "online bar" uevent to dlm_controld in userspace
  waits for dlm_controld to write to the sysfs file
  /sys/kernel/dlm/x/event_done indicating that it's done

dlm_controld joins the group "bar" via libgroup:group_join()

groupd receives "join bar" message from dlm_controld
(it distinguishes this dlm group from the gfs group with the same name)

groupd joins the cpg "dlm_bar" via libcpg:cpg_join()

ALL: groupd receives a cpg confchg for bar with cpg members = 1,2

ALL: dlm_controld receives a "stop bar" callback from groupd

ALL: dlm_controld blocks activity in the lockspace by setting a
     dlm-kernel flag, writing 0 to /sys/kernel/dlm/bar/control

ALL: groupd receives a "stop_done bar" callback from dlm_controld and
     sends a "stopped bar" message to all

ALL: groupd waits to receive stopped messages from all

ALL: dlm_controld receives a "start bar" callback from groupd indicating
     group members = 1,2

ALL: dlm_controld tells dlm-kernel the new members of the lockspace by:
     mkdir /sys/kernel/config/dlm/cluster/spaces/bar/nodes/1
     mkdir /sys/kernel/config/dlm/cluster/spaces/bar/nodes/2

ALL: dlm_controld starts recovery in dlm-kernel by writing 1 to
     /sys/kernel/dlm/bar/control

ALL: groupd receives a "start_done bar" callback from dlm_controld and
     sends a "started bar" message to all

ALL: groupd waits to receive started messages from all

dlm_controld tells dlm-kernel that the join event is complete by writing
to /sys/kernel/dlm/x/event_done, which causes dlm_new_lockspace() to
complete and return to the caller

ALL: after dlm-kernel recovery is complete, normal locking activity
     resumes

ALL: dlm_controld receives a "finish bar" callback from groupd, which
     isn't used for anything


gfs unmount
-------------------------------------------------------------------------------

(clustername = foo, fsname = bar
 bar mounted by node01, node02 and node03 with nodeids 1,2,3
 node03 unmounts)

umount(8) calls umount.gfs

umount.gfs gets the device for the specified dir from /proc/mounts

umount.gfs reads the superblock off the device and gets
  lockproto = lock_dlm
  locktable = foo:bar

umount.gfs does umount(2) system call

gfs-kernel does fs-specific unmounting stuff and calls
dlm_release_lockspace() to leave the lockspace for this fs
[see below for how leaving a lockspace works]

umount(2) returns 0

umount.gfs sends a message to gfs_controld "leave foo:bar 0"

umount.gfs waits to receive a reply message from gfs_controld

gfs_controld sends a reply back to umount.gfs

umount.gfs reads the reply from gfs_controld, a string containing "0"
for ok, or "-EXXX" for an error

if ok, umount.gfs removes the /etc/mtab line for this fs and exits

gfs_controld leaves the group "bar" via libgroup:group_leave()

groupd receives "leave bar" message from gfs_controld

groupd leaves the cpg "gfs_bar" via libcpg:cpg_leave()

ALL: groupd receives a cpg confchg (configuration change callback) for
     bar with cpg members = 1,2 - sees that 3 has been removed

ALL: gfs_controld receives a "stop bar" callback from groupd

ALL: gfs_controld blocks locking in gfs by setting a gfs-kernel flag
     /sys/fs/gfs/foo:bar/lock_module/block to 1

ALL: groupd receives a "stop_done bar" callback from gfs_controld and
     sends a "stopped bar" message to all

ALL: groupd waits to receive stopped messages from all

node03: gfs_controld receives a "terminate bar" callback from groupd and
        frees structures for bar

(ALL is now node01 and node02)

ALL: gfs_controld receives a "start bar" callback from groupd indicating
     group members = 1,2

ALL: gfs_controld sends a "start_done bar" message back to groupd

ALL: groupd receives a "start_done bar" callback from gfs_controld and
     sends a "started bar" message to all

ALL: groupd waits to receive started messages from all

ALL: gfs_controld receives a "finish bar" callback from groupd

ALL: gfs_controld unblocks locking in gfs by setting the gfs-kernel flag
     /sys/fs/gfs/foo:bar/lock_module/block to 0

==

(this is what the dlm does when someone leaves the lockspace, like above)

dlm_release_lockspace("bar") is called in the kernel
  sends an "offline bar" uevent to dlm_controld in userspace
  waits for dlm_controld to write to the sysfs file
  /sys/kernel/dlm/x/event_done indicating that it's done

dlm_controld leaves the group "bar" via libgroup:group_leave()

groupd receives "leave bar" message from dlm_controld
(it distinguishes this dlm group from the gfs group with the same name)

groupd leaves the cpg "dlm_bar" via libcpg:cpg_leave()

ALL: groupd receives a cpg confchg for bar with cpg members = 1,2

ALL: dlm_controld receives a "stop bar" callback from groupd

ALL: dlm_controld blocks activity in the lockspace by setting a
     dlm-kernel flag, writing 0 to /sys/kernel/dlm/bar/control

ALL: groupd receives a "stop_done bar" callback from dlm_controld and
     sends a "stopped bar" message to all

ALL: groupd waits to receive stopped messages from all

node03: dlm_controld receives a "terminate bar" callback from groupd and
        writes to /sys/kernel/dlm/x/event_done, which causes
        dlm_release_lockspace() to complete and return to the caller

(ALL is now node01 and node02)

ALL: dlm_controld receives a "start bar" callback from groupd indicating
     group members = 1,2

ALL: dlm_controld tells dlm-kernel the new members of the lockspace by:
     rmdir /sys/kernel/config/dlm/cluster/spaces/bar/nodes/3

ALL: dlm_controld starts recovery in dlm-kernel by writing 1 to
     /sys/kernel/dlm/bar/control

ALL: groupd receives a "start_done bar" callback from dlm_controld and
     sends a "started bar" message to all

ALL: groupd waits to receive started messages from all

ALL: after dlm-kernel recovery is complete, normal locking activity
     resumes

ALL: dlm_controld receives a "finish bar" callback from groupd, which
     isn't used for anything


node recovery
-------------------------------------------------------------------------------

(clustername = foo, fsname = bar
 three nodes have bar mounted: node01, node02 and node03 with
 nodeids 1,2,3)

node03 fails

ALL: groupd receives a cpg confchg (configuration change callback) for
     bar with cpg members = 1,2

ALL: groupd sees that nodeid 3 has been removed due to NODEDOWN and
     stops all groups that 3 was a member of

ALL: fenced receives a "stop default" callback from groupd

ALL: dlm_controld receives a "stop bar" callback from groupd

ALL: gfs_controld receives a "stop bar" callback from groupd

ALL: fenced does nothing with the stop callback

ALL: dlm_controld blocks activity in the lockspace by setting a
     dlm-kernel flag, writing 0 to /sys/kernel/dlm/bar/control

ALL: gfs_controld blocks locking in gfs by setting a gfs-kernel flag
     /sys/fs/gfs/foo:bar/lock_module/block to 1

ALL: groupd receives a "stop_done default" callback from fenced and
     sends a "stopped default" message to all

ALL: groupd receives a "stop_done bar" callback from dlm_controld and
     sends a "stopped bar" message to all

ALL: groupd receives a "stop_done bar" callback from gfs_controld and
     sends a "stopped bar" message to all

ALL: groupd waits to receive fenced stopped messages from all

ALL: groupd waits to receive dlm_controld stopped messages from all

ALL: groupd waits to receive gfs_controld stopped messages from all

ALL: groupd waits for the cluster to gain quorum if it's been lost

ALL: fenced receives a "start default" callback from groupd indicating
     group members = 1,2

ALL: fenced sees node03 has failed, and selects it to be a victim

node01: (lowest nodeid) runs the fence agent against node03

node02: defers fencing to node01 and sends start_done message back to
        groupd

node01: fence agent completes successfully

node01: sends start_done message back to groupd

ALL: groupd receives a "start_done default" callback from fenced and
     sends a "started default" message to all

ALL: groupd waits to receive started messages from all

ALL: fenced receives a "finish default" callback from groupd and removes
     node03 from its victim list

ALL: dlm_controld receives a "start bar" callback from groupd indicating
     group members = 1,2

ALL: dlm_controld tells dlm-kernel about the dead node by:
     rmdir /sys/kernel/config/dlm/cluster/spaces/bar/nodes/3

ALL: dlm_controld starts recovery in dlm-kernel by writing 1 to
     /sys/kernel/dlm/bar/control

ALL: groupd receives a "start_done bar" callback from dlm_controld and
     sends a "started bar" message to all

ALL: groupd waits to receive started messages from all

ALL: after dlm-kernel recovery is complete, normal locking activity
     resumes

ALL: dlm_controld receives a "finish bar" callback from groupd, which
     isn't used for anything

ALL: gfs_controld receives a "start bar" callback from groupd indicating
     group members = 1,2

ALL: gfs_controld tells gfs-kernel to recover the journal that node03
     was using by writing the jid to
     /sys/fs/gfs/foo:bar/lock_module/recover

node01: gfs-kernel does journal recovery for node03

node02: gfs-kernel sees that node01 is doing the journal recovery so
        skips it

ALL: gfs-kernel sends a "change" uevent to gfs_controld in userspace
     when it's done with recovery

ALL: gfs_controld sends the result of the gfs-kernel recovery to all

ALL: groupd receives a "start_done bar" callback from gfs_controld and
     sends a "started bar" message to all

ALL: groupd waits to receive started messages from all

ALL: gfs_controld receives a "finish bar" callback from groupd

ALL: gfs_controld unblocks locking in gfs by setting the gfs-kernel flag
     /sys/fs/gfs/foo:bar/lock_module/block to 0
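The division of labor during fencing (the lowest surviving nodeid runs
the fence agent, the other members defer) can be sketched in a few
lines.  This is a simplified illustration under the assumptions in this
description, not fenced's actual code; the function name is
hypothetical:

```python
# Simplified illustration of fenced's behavior after a node failure:
# the failed nodes become victims, and the member with the lowest
# surviving nodeid is responsible for running the fence agent while
# the others defer.  Hypothetical sketch, not the actual fenced code.

def plan_fencing(old_members, new_members):
    """Return (victims, fencer): the nodeids that dropped out of the
    group, and the nodeid that should run the fence agent."""
    victims = sorted(set(old_members) - set(new_members))
    fencer = min(new_members) if new_members else None
    return victims, fencer

# node03 (nodeid 3) fails out of a group of 1,2,3:
victims, fencer = plan_fencing([1, 2, 3], [1, 2])
# victims == [3], fencer == 1 -> node01 fences node03, node02 defers
```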