This is a technical description of how a simple gfs mount, gfs unmount
and node recovery work from the perspective of the cluster
infrastructure in RHEL5.  The most difficult and complicated scenarios
in the infrastructure relate to the *combinations* of mounts, unmounts
and node recoveries.  No attempt is made to document how those cases
are handled.

The groupd layer makes things more complicated.  It will be removed in
the future and the gfs/dlm daemons will interact with libcpg directly.


gfs mount
-------------------------------------------------------------------------------

(clustername = foo, fsname = bar
 mount happening on node02 with nodeid 2
 node01 with nodeid 1 already has bar mounted)

mount(8) calls mount.gfs

mount.gfs reads the superblock off the device and gets
  lockproto = lock_dlm
  locktable = foo:bar

mount.gfs sends a message to gfs_controld "join gfs lock_dlm foo:bar"

mount.gfs waits to receive a reply message from gfs_controld

---

gfs_controld verifies the node is a member of cluster "foo"

gfs_controld verifies the node is a member of the fence domain

gfs_controld joins the group "bar" via libgroup:group_join()

groupd receives "join bar" message from gfs_controld

groupd joins the cpg "gfs_bar" via libcpg:cpg_join()

ALL: groupd receives a cpg confchg (configuration change callback) for
     bar with cpg members = 1,2 - sees that 2 has been added

ALL: gfs_controld receives a "stop bar" callback from groupd

ALL: gfs_controld blocks locking in gfs by setting a gfs-kernel flag
     /sys/fs/gfs/foo:bar/lock_module/block to 1

ALL: groupd receives a "stop_done bar" callback from gfs_controld and
     sends a "stopped bar" message to all

ALL: groupd waits to receive stopped messages from all

ALL: gfs_controld receives a "start bar" callback from groupd indicating
     group members = 1,2

ALL: gfs_controld sends a "start_done bar" message back to groupd, syncs
     state for bar among the new nodes, selects a journal id (jid) for
     the new node, and on the node with mount.gfs, sends a reply back to
     mount.gfs

ALL: groupd receives a "start_done bar" callback from gfs_controld and
     sends a "started bar" message to all

ALL: groupd waits to receive started messages from all

ALL: gfs_controld receives a "finish bar" callback from groupd

ALL: gfs_controld unblocks locking in gfs by setting the gfs-kernel flag
     /sys/fs/gfs/foo:bar/lock_module/block to 0

---

mount.gfs reads the reply from gfs_controld, a string containing "0" for
ok, or "-EXXX" for an error

if ok, mount.gfs reads a second message from gfs_controld containing the
string of gfs-specific mount options it should use, e.g.
"hostdata=jid=1:id=196609:first=0"

  jid is the journal that the mounting node should use

  id is a unique, global numeric identifier that gfs can use to
  distinguish between fs's (not very important)

  first is 1 if this is the first node to mount the fs, 0 otherwise; the
  first node to mount the fs checks and recovers all journals during
  mount

mount.gfs does mount(2) system call

gfs-kernel does fs-specific mounting stuff and calls dlm_new_lockspace()
to join the lockspace for this fs
[see below for how joining a lockspace works]

mount(2) returns 0

mount.gfs sends a message to gfs_controld with the result of the mount(2)

mount.gfs adds a line to /etc/mtab for this mount and exits

if mount(2) returns an error, mount.gfs has to leave the group, similar
to the unmounting procedure -- this is a lot to do to back out at this
point, so we want to avoid getting an error back from mount(2) if we can
help it

ALL: gfs_controld receives the mount result

===

This is what the dlm does when joining a lockspace.  The dlm interacts
with the cluster infrastructure on its own, and the caller, e.g. the fs,
doesn't see any of this.
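The mount-option string that gfs_controld hands back to mount.gfs can be
illustrated with a short sketch.  This is not the actual mount.gfs code;
the function name and error handling are hypothetical, only the string
format ("hostdata=jid=1:id=196609:first=0") comes from the description
above:

```python
# Hypothetical sketch of parsing the gfs-specific mount option string
# that gfs_controld sends to mount.gfs; illustrative only, not the
# actual mount.gfs implementation.

def parse_hostdata(opts):
    """Parse a string like "hostdata=jid=1:id=196609:first=0"
    into a dict of integer fields."""
    prefix = "hostdata="
    if not opts.startswith(prefix):
        raise ValueError("unexpected mount option string: %s" % opts)
    fields = {}
    for pair in opts[len(prefix):].split(":"):
        key, _, value = pair.partition("=")
        fields[key] = int(value)
    return fields

opts = parse_hostdata("hostdata=jid=1:id=196609:first=0")
# opts["jid"]   -> journal the mounting node should use
# opts["first"] -> 1 only for the first mounter, which checks and
#                  recovers all journals during mount
```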
dlm_new_lockspace("bar") is called in the kernel
  sends an "online bar" uevent to dlm_controld in userspace
  waits for dlm_controld to write to the sysfs file
  /sys/kernel/dlm/x/event_done indicating that it's done

dlm_controld joins the group "bar" via libgroup:group_join()

groupd receives "join bar" message from dlm_controld
(it distinguishes this dlm group from the gfs group with the same name)

groupd joins the cpg "dlm_bar" via libcpg:cpg_join()

ALL: groupd receives a cpg confchg for bar with cpg members = 1,2

ALL: dlm_controld receives a "stop bar" callback from groupd

ALL: dlm_controld blocks activity in the lockspace by setting a
     dlm-kernel flag, writing 0 to /sys/kernel/dlm/bar/control

ALL: groupd receives a "stop_done bar" callback from dlm_controld and
     sends a "stopped bar" message to all

ALL: groupd waits to receive stopped messages from all

ALL: dlm_controld receives a "start bar" callback from groupd indicating
     group members = 1,2

ALL: dlm_controld tells dlm-kernel the new members of the lockspace by:
     mkdir /sys/kernel/config/dlm/cluster/spaces/bar/nodes/1
     mkdir /sys/kernel/config/dlm/cluster/spaces/bar/nodes/2

ALL: dlm_controld starts recovery in dlm-kernel by writing 1 to
     /sys/kernel/dlm/bar/control

ALL: groupd receives a "start_done bar" callback from dlm_controld and
     sends a "started bar" message to all

ALL: groupd waits to receive started messages from all

dlm_controld tells dlm-kernel that the join event is complete by writing
to /sys/kernel/dlm/x/event_done, which causes dlm_new_lockspace() to
complete and return to the caller

ALL: after dlm-kernel recovery is complete, normal locking activity
     resumes

ALL: dlm_controld receives a "finish bar" callback from groupd, which
     isn't used for anything


gfs unmount
-------------------------------------------------------------------------------

(clustername = foo, fsname = bar
 bar mounted by node01, node02 and node03 with nodeids 1,2,3
 node03 unmounts)

umount(8) calls umount.gfs

umount.gfs gets the device for the specified dir from /proc/mounts

umount.gfs reads the superblock off the device and gets
  lockproto = lock_dlm
  locktable = foo:bar

umount.gfs does umount(2) system call

gfs-kernel does fs-specific unmounting stuff and calls
dlm_release_lockspace() to leave the lockspace for this fs
[see below for how leaving a lockspace works]

umount(2) returns 0

umount.gfs sends a message to gfs_controld "leave foo:bar 0"

umount.gfs waits to receive a reply message from gfs_controld

gfs_controld sends a reply back to umount.gfs

umount.gfs reads the reply from gfs_controld, a string containing "0"
for ok, or "-EXXX" for an error

if ok, umount.gfs removes the /etc/mtab line for this fs and exits

gfs_controld leaves the group "bar" via libgroup:group_leave()

groupd receives "leave bar" message from gfs_controld

groupd leaves the cpg "gfs_bar" via libcpg:cpg_leave()

ALL: groupd receives a cpg confchg (configuration change callback) for
     bar with cpg members = 1,2 - sees that 3 has been removed

ALL: gfs_controld receives a "stop bar" callback from groupd

ALL: gfs_controld blocks locking in gfs by setting a gfs-kernel flag
     /sys/fs/gfs/foo:bar/lock_module/block to 1

ALL: groupd receives a "stop_done bar" callback from gfs_controld and
     sends a "stopped bar" message to all

ALL: groupd waits to receive stopped messages from all

node03: gfs_controld receives a "terminate bar" callback from groupd and
        frees structures for bar

(ALL is now node01 and node02)

ALL: gfs_controld receives a "start bar" callback from groupd indicating
     group members = 1,2

ALL: gfs_controld sends a "start_done bar" message back to groupd

ALL: groupd receives a "start_done bar" callback from gfs_controld and
     sends a "started bar" message to all

ALL: groupd waits to receive started messages from all

ALL: gfs_controld receives a "finish bar" callback from groupd

ALL: gfs_controld unblocks locking in gfs by setting the gfs-kernel flag
     /sys/fs/gfs/foo:bar/lock_module/block to 0

==

(this is what the dlm does when someone leaves the lockspace, like above)

dlm_release_lockspace("bar") is called in the kernel
  sends an "offline bar" uevent to dlm_controld in userspace
  waits for dlm_controld to write to the sysfs file
  /sys/kernel/dlm/x/event_done indicating that it's done

dlm_controld leaves the group "bar" via libgroup:group_leave()

groupd receives "leave bar" message from dlm_controld
(it distinguishes this dlm group from the gfs group with the same name)

groupd leaves the cpg "dlm_bar" via libcpg:cpg_leave()

ALL: groupd receives a cpg confchg for bar with cpg members = 1,2

ALL: dlm_controld receives a "stop bar" callback from groupd

ALL: dlm_controld blocks activity in the lockspace by setting a
     dlm-kernel flag, writing 0 to /sys/kernel/dlm/bar/control

ALL: groupd receives a "stop_done bar" callback from dlm_controld and
     sends a "stopped bar" message to all

ALL: groupd waits to receive stopped messages from all

node03: dlm_controld receives a "terminate bar" callback from groupd and
        writes to /sys/kernel/dlm/x/event_done, which causes
        dlm_release_lockspace() to complete and return to the caller

(ALL is now node01 and node02)

ALL: dlm_controld receives a "start bar" callback from groupd indicating
     group members = 1,2

ALL: dlm_controld tells dlm-kernel the new members of the lockspace by:
     rmdir /sys/kernel/config/dlm/cluster/spaces/bar/nodes/3

ALL: dlm_controld starts recovery in dlm-kernel by writing 1 to
     /sys/kernel/dlm/bar/control

ALL: groupd receives a "start_done bar" callback from dlm_controld and
     sends a "started bar" message to all

ALL: groupd waits to receive started messages from all

ALL: after dlm-kernel recovery is complete, normal locking activity
     resumes

ALL: dlm_controld receives a "finish bar" callback from groupd, which
     isn't used for anything


node recovery
-------------------------------------------------------------------------------

(clustername = foo, fsname = bar
 three nodes have bar mounted: node01, node02 and node03 with
 nodeids 1,2,3)

node03 fails

ALL: groupd receives a cpg confchg (configuration change callback) for
     bar with cpg members = 1,2

ALL: groupd sees that nodeid 3 has been removed due to NODEDOWN and
     stops all groups that 3 was a member of

ALL: fenced receives a "stop default" callback from groupd

ALL: dlm_controld receives a "stop bar" callback from groupd

ALL: gfs_controld receives a "stop bar" callback from groupd

ALL: fenced does nothing with the stop callback

ALL: dlm_controld blocks activity in the lockspace by setting a
     dlm-kernel flag, writing 0 to /sys/kernel/dlm/bar/control

ALL: gfs_controld blocks locking in gfs by setting a gfs-kernel flag
     /sys/fs/gfs/foo:bar/lock_module/block to 1

ALL: groupd receives a "stop_done default" callback from fenced and
     sends a "stopped default" message to all

ALL: groupd receives a "stop_done bar" callback from dlm_controld and
     sends a "stopped bar" message to all

ALL: groupd receives a "stop_done bar" callback from gfs_controld and
     sends a "stopped bar" message to all

ALL: groupd waits to receive fenced stopped messages from all

ALL: groupd waits to receive dlm_controld stopped messages from all

ALL: groupd waits to receive gfs_controld stopped messages from all

ALL: groupd waits for the cluster to gain quorum if it's been lost

ALL: fenced receives a "start default" callback from groupd indicating
     group members = 1,2

ALL: fenced sees node03 has failed, and selects it to be a victim

node01: (lowest nodeid) runs the fence agent against node03

node02: defers fencing to node01 and sends start_done message back to
        groupd

node01: fence agent completes successfully

node01: sends start_done message back to groupd

ALL: groupd receives a "start_done default" callback from fenced and
     sends a "started default" message to all

ALL: groupd waits to receive started messages from all

ALL: fenced receives a "finish default" callback from groupd and removes
     node03 from its victim list

ALL: dlm_controld receives a "start bar" callback from groupd indicating
     group members = 1,2

ALL: dlm_controld tells dlm-kernel about the dead node by:
     rmdir /sys/kernel/config/dlm/cluster/spaces/bar/nodes/3

ALL: dlm_controld starts recovery in dlm-kernel by writing 1 to
     /sys/kernel/dlm/bar/control

ALL: groupd receives a "start_done bar" callback from dlm_controld and
     sends a "started bar" message to all

ALL: groupd waits to receive started messages from all

ALL: after dlm-kernel recovery is complete, normal locking activity
     resumes

ALL: dlm_controld receives a "finish bar" callback from groupd, which
     isn't used for anything

ALL: gfs_controld receives a "start bar" callback from groupd indicating
     group members = 1,2

ALL: gfs_controld tells gfs-kernel to recover the journal that node03
     was using by writing the jid to
     /sys/fs/gfs/foo:bar/lock_module/recover

node01: gfs-kernel does journal recovery for node03

node02: gfs-kernel sees that node01 is doing the journal recovery so
        skips it

ALL: gfs-kernel sends a "change" uevent to gfs_controld in userspace
     when it's done with recovery

ALL: gfs_controld sends the result of the gfs-kernel recovery to all

ALL: groupd receives a "start_done bar" callback from gfs_controld and
     sends a "started bar" message to all

ALL: groupd waits to receive started messages from all

ALL: gfs_controld receives a "finish bar" callback from groupd

ALL: gfs_controld unblocks locking in gfs by setting the gfs-kernel flag
     /sys/fs/gfs/foo:bar/lock_module/block to 0
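The division of labor during fencing (the lowest surviving nodeid runs
the fence agent, the other members defer) can be sketched in a few
lines.  This is a simplified illustration under the assumptions in this
description, not fenced's actual code; the function name is
hypothetical:

```python
# Simplified illustration of fenced's behavior after a node failure:
# the failed nodes become victims, and the member with the lowest
# surviving nodeid is responsible for running the fence agent while
# the others defer.  Hypothetical sketch, not the actual fenced code.

def plan_fencing(old_members, new_members):
    """Return (victims, fencer): the nodeids that dropped out of the
    group, and the nodeid that should run the fence agent."""
    victims = sorted(set(old_members) - set(new_members))
    fencer = min(new_members) if new_members else None
    return victims, fencer

# node03 (nodeid 3) fails out of a group of 1,2,3:
victims, fencer = plan_fencing([1, 2, 3], [1, 2])
# victims == [3], fencer == 1 -> node01 fences node03, node02 defers
```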