cluster2 architecture
-------------------------------------------------------------------------------
version 0.1 - March 12, 2008
David Teigland

This document describes the architecture of cluster2, the second generation clustering code from http://sources.redhat.com/cluster.  It's used in RHEL5. cluster1, the first generation code, used in RHEL4, is roughly described in "Symmetric Cluster Architecture" (http://people.redhat.com/teigland/sca.pdf).

The big changes from the first to second generation are:

- moving the clustering code from the kernel to userspace
- using openais as the core membership/messaging system
- largely rewriting the dlm to work much better

One major motivation now driving the evolution of the clustering code is to remove code developed, used and supported only by RH.  It is being replaced with code developed, used and supported by a broad community.  Collaboration on the same code with other developers and companies is crucial for the success of this clustering work.  This means rethinking and reengineering many parts of the cluster architecture to reach the point where community code/infrastructure can replace code specific to our project.  Or, if code can't be immediately replaced, the reengineering should take steps toward removing project-specific code in the next version.

The sections roughly follow the steps to set up and run a cluster, and ultimately gfs.  As an architectural document, the emphasis is on defining the different parts, and describing how the parts interact or relate to each other.

cluster.conf
-------------------------------------------------------------------------------

To get started setting up a cluster, create a cluster.conf file, an xml configuration file defining the basic properties of the desired cluster.

Choose a name for the cluster; it can be up to 16 characters long, and should be unique among the clusters running on the network.

Each node that will be a part of the cluster should have a clusternode entry. The name should resolve to the IP address on the network interface to be used for cluster communication.  Each nodeid must be a positive integer, unique within this cluster.

Copy this file to /etc/cluster/cluster.conf on all the nodes listed.
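For reference, a minimal cluster.conf along the lines described above might look roughly like the following sketch (the cluster name, node names and nodeids are only examples; a real configuration would normally also contain the fencing sections described later in this document):

  <?xml version="1.0"?>
  <cluster name="alpha" config_version="1">
    <clusternodes>
      <clusternode name="node-01" nodeid="1"/>
      <clusternode name="node-02" nodeid="2"/>
      <clusternode name="node-03" nodeid="3"/>
    </clusternodes>
  </cluster>

Each clusternode name should resolve to the address used for cluster communication, and config_version is the counter that ccsd and cman compare when the file is updated.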
Related man pages: cluster.conf(5)

ccs
-------------------------------------------------------------------------------

CCS is the "cluster configuration system".  The ccs daemon, ccsd, is usually started with no command line options.  It has two main jobs.  First, it provides cluster.conf information to local programs/daemons through its own API, libccs.  This is simply an indirect way for programs to read values from cluster.conf.  Second, it can propagate and replace cluster.conf on cluster nodes as it's updated.

libccs

The main job of ccsd is to provide cluster.conf information to other local programs through libccs.  A program needs to connect to ccsd, read config values by providing xpath strings (e.g. "/cluster/@name"), then disconnect. Normal connections are denied if the cluster (cman) does not have quorum. Forced connections should always work.  The ccs_test command can be used to test these capabilities.

updates

The second job of ccsd is to update cluster.conf on members of a running cluster.  When 'ccs_tool update' is run, the local ccsd broadcasts a new cluster.conf file, and the ccsd on each other cluster node replaces its copy with the new one.  Also, when ccsd starts up, it broadcasts its cluster.conf to see if it needs to be replaced by a newer version in use on other nodes.

Online updates to cluster.conf should be limited to adding new nodes to the cluster, which should be infrequent.  When cluster.conf is updated, the config_version should be incremented; ccsd uses this to compare cluster.conf files on different nodes.

Many people wish to manage cluster.conf themselves, and copy an updated cluster.conf to other nodes manually.  To do this, the "update" ccsd feature can be disabled with the -X command line option.  In this case, there's no need for ccsd to do any network communication, or interact with cman at all. (Although I think it still might in some cases, unfortunately.  This should be fixed.)

[Problems can occur in the complicated handling of cluster.conf syncing, even in the absence of any updates.  This will prevent the main libccs functions from working, which prevents the cluster from starting.  So, a problem in a frequently unnecessary feature (the update mechanism) can jeopardize the critically important libccs function, without which a cluster cannot be started. This makes the -X option an important feature to know about.]

cman

- ccsd does its own network communication in the cluster; it cannot use cman for cluster communication because cman depends on reading config values through ccs to start.

- cman also keeps track of the config_version and requires all cluster members to be using the same version.

update procedure

There are three approaches to updating cluster.conf.  When the cluster is online, updates should be limited to adding new nodes and fencing devices for them; changes should not be made to existing nodes.

cluster offline
- edit new cluster.conf
- copy manually to all nodes

cluster online option 1
- edit new cluster.conf, increment config_version
- copy manually to all nodes
- run "cman_tool version -r <new version>" to tell cman the new config_version

cluster online option 2
- edit new cluster.conf, increment config_version
- ccs_tool update
- check that the old cluster.conf has been replaced by the new one on all cluster members; the cluster needs to have quorum (ccs_tool update also updates cman)

In the future, ccs/ccsd will no longer exist; their function will be incorporated into the cluster manager.

Related man pages: ccs(7), ccsd(8), ccs_tool(8), ccs_test(8)

openais
-------------------------------------------------------------------------------

Openais (openais.org) is the central cluster membership and messaging system. Cman is the part of openais we use to configure and interact with aisexec, the openais executive.  The "ais" in the name refers to the set of libraries on top of the central membership system that implement the AIS programming API's.

aisexec

Aisexec is the name of the openais binary that runs on each system.  It loads many of its parts from the plugins in /usr/libexec/lcrso/.  We do not allow aisexec to be executed directly from the command line or from the openais init script.  If it is started directly, it will load its configuration from /etc/openais.conf; we need to configure it differently.

totem

Totem is the core membership/messaging code, based on the published totem algorithm for reliable group communication.  A token is circulated among all members; if one member fails, a token timeout will be detected and the remaining nodes will reconfigure the cluster membership.  The token is unicast; other openais messages and application messages are multicast.

Totem implements "virtual synchrony" (VS), which is an established method for replicating state across a cluster of nodes.
VS (totem) guarantees the ordering of cluster messages and configuration changes, such that all nodes see all messages and changes in the same order. Higher level distributed programs can be built depending on these messaging guarantees, and behave as if all instances were operating synchronously.

http://en.wikipedia.org/wiki/Virtual_synchrony

cman

Cman is a collection of things that make openais more usable.  To a user, cman makes openais look and behave much like the first generation (RHEL4) cman. Cman also adds the quorum feature on top of openais, which is computed from the openais membership.  It uses the traditional quorum model from VMS.

- service_cman.lcrso is an openais plugin; it configures openais according to cman defaults and cluster.conf, it keeps track of quorum, and it supplies information to libcman queries

- libcman is a library with API's mainly for querying cluster state; it communicates with service_cman through a unix socket

- cman_tool is a command line program for starting/stopping aisexec and also provides a CLI for most of the libcman queries

cpg

Processes on cluster nodes can form a "process group" with a given name and send messages to the group.  A "closed process group" (cpg), in particular, has a membership, defined as those processes that have explicitly joined the cpg by name.  The cpg members are identified by a nodeid/pid pair, and are notified of changes to the cpg membership.  The properties of virtual synchrony (see totem above) apply to the cpg messages and configuration changes.  A process can join any number of cpg's.

A cpg is typically used to replicate state among cooperating processes on different cluster nodes.  Named instances of cluster resources can be associated with a unique cpg by using the resource name as the cpg name. This approach fits nicely with the symmetric, server-less architecture our cluster components have traditionally followed.  Managing the state of fencing, dlm and cluster file systems is quite naturally done through the use of cpg's.

[Like most other openais services, the cpg service does not provide a way of querying or inspecting the cpg state.  This is a significant limitation that should be addressed in the future.]

AIS

The SA Forum (saforum.org) develops the "Application Interface Specification" (AIS), which is a software API for interacting with HA/cluster frameworks like ours.  The idea is to provide standardized API's for developing HA/cluster programs.  Openais has a set of libraries that unofficially implement the AIS API's.  The current openais development release contains the following AIS API's in varying stages of completion:

CKPT (version B.01.01) - checkpoint
CLM  (version B.01.01) - membership
EVT  (version B.01.01) - events
LCK  (version B.01.01) - locking
MSG  (version B.01.01) - messages
AMF  (version B.01.02) - managing HA applications

These are all implemented as openais plugins, using the core openais subsystems such as totem.  CKPT is the checkpoint service, and is the only one of the AIS services that we currently use in other components of our architecture.  Other AIS services would probably be appropriate to use in the future.

More information: http://openais.org http://saforum.org

Related man pages: openais_overview(8), cpg_overview(8), cman(5)

cman
-------------------------------------------------------------------------------

As summarized in the previous section, cman is a collection of things on the periphery of openais that allow us to control and interact with it.
As such, they are more visible to the user and require more detailed explanations.

joining the cluster

The command "cman_tool join" starts aisexec, which is how a node joins the cluster.  It sets an environment variable telling aisexec to be configured via cman (OPENAIS_DEFAULT_CONFIG_IFACE=cmanconfig), and then forks/execs aisexec.  (If cman_tool command line options are used to override cluster.conf values, they are also passed to aisexec/service_cman via environment variables.)

Once running, aisexec calls into service_cman for configuration.  Service_cman reads cluster.conf values through libccs.  The settings it looks for are:

. cluster name
. cluster id
. multicast address
. udp port

These things are all related, and involve distinguishing different clusters. Each cluster must be given a unique name in cluster.conf.  The cluster name may be used by higher level systems for ownership or access permission.  The cluster name is not used directly within openais.

The cluster name is hashed to create the 16 bit cluster id.  The cluster id can also be assigned explicitly.  The cluster id is used to isolate clusters that see the same network traffic.

The two bytes of cluster id are used to generate the multicast address to be used by openais: 239.192.x.y where x is the upper 8 bits and y is the lower 8 bits (a cluster id of 0x1234, for example, gives 239.192.18.52).  The multicast address can also be assigned explicitly.

The default UDP ports used by openais are 5405/5404.  These can be changed if the defaults are not available, or to isolate different clusters that need to use the same multicast address.  Setting port="6809" in the cman section of cluster.conf, for example, will cause openais to use ports 6809 and 6808 for internode communication.

. local node name
. local node id

service_cman must first identify the cluster.conf clusternode entry for the local node.  It tries a number of things, like uname and hostnames on available network interfaces, to match one entry.  If it cannot find a match it will not start the cluster.  The nodename must resolve to the IP address on a network interface; this is the address that will be used for cluster communication.  A unique nodeid must be assigned for each node; these id's are how aisexec identifies nodes.

. expected votes
. local votes

The quorum system is based on votes assigned to each cluster member, and it needs to know the vote settings for all possible members when starting up. The default vote settings are typically used, in which case no voting related information needs to be included in cluster.conf.  Each node defaults to one vote, and expected votes defaults to the sum of votes of all nodes in cluster.conf.  See the quorum section for more information.

. encryption key

All openais network traffic is encrypted.  By default cman uses the cluster name as the security key.  If more security is needed, a file can be specified with a key that will be used to encrypt the communication.  Of course, the content of the key file must be the same on all nodes.  A key file must be manually copied to all nodes.

. openais parameters

When openais is started by cman, the openais.conf file is not used.  Many of the configuration parameters listed in openais.conf can be set in cluster.conf instead.  Cman will read openais parameters from the corresponding openais sections in cluster.conf (the totem section, for example) and load them into openais.  See the openais.conf man page for the specific parameters that can be set in these sections.  Note that cman-derived settings will override any comparable settings in the openais sections (in particular, bindnetaddr, mcastaddr, mcastport and nodeid will always be replaced by the values cman derives from cluster.conf).
. token timeout

The token timeout is the number of milliseconds a node has to send the token before the node is considered failed.  Cman sets this to a default of 10000 (10 seconds).  It can be changed through the openais token parameter in cluster.conf's totem section; token="5000", for example, changes it to 5000 (5 seconds).

. logging

Extra debug logging for cman and cpg can be enabled through the logging settings in cluster.conf.

leaving the cluster

The "cman_tool leave" command terminates aisexec and causes the node to leave the cluster.  The leave command will fail and return a "busy" error if other systems running on the node are still actively using the cluster.  "cman_tool leave force" will force the node to leave the cluster even if other systems are using it (it may not be possible to clean up the resulting state without a node reboot.)

Other system daemons that depend on the cluster register with cman when they start, so they'll receive a callback from cman when a leave has been requested.  The daemons report back to cman whether or not they are still using the cluster or can cleanly shut down.  If openais decides to shut down, these other daemons will notice and also exit.

When enough nodes leave the cluster, quorum will be lost, which may block other nodes from shutting down if they still need to stop cluster programs that are blocked by the loss of quorum.  The "cman_tool leave remove" command avoids this problem by reducing the expected votes on the remaining nodes by the votes of the node that left.

analyzing the cluster

The libcman library provides an API for querying information about the cluster, cluster members, and quorum.  It also provides a way for apps to receive callbacks about certain cluster events.  libcman requests are passed to service_cman (and results sent back) through a unix socket.

Most of the information available from libcman can be accessed from the command line by using cman_tool:

cman_tool status - show properties of the cluster
cman_tool nodes  - show properties of cluster nodes

Cman debugging information can be dynamically enabled using cman_tool -d. See the cman_tool man page for more information.

quorum

Quorum is a majority voting scheme where cluster nodes are assigned votes, and they contribute their votes to the cluster while they are members.  If the cluster has a majority of all possible votes, it has quorum (also called quorate), otherwise it does not.  Quorum is a boolean property of the cluster.

. node votes: the number of votes assigned to individual nodes (nodes can be assigned different numbers of votes, each node typically has 1)
. cluster votes: the number of votes the cluster has (the vote sum of all current cluster members)
. expected votes: the total possible votes the cluster could have (the vote sum of all nodes listed in cluster.conf)
. quorum votes: the number of votes the cluster needs to have quorum (expected_votes / 2 + 1, with the division rounded down)

The cluster has quorum if cluster votes >= quorum votes.  For example, five nodes with one vote each give expected votes 5 and quorum votes 3; the cluster is quorate as long as at least three nodes are members.

Cman calculates quorum.  It determines node votes and expected votes by reading cluster.conf, it uses the current membership from aisexec to determine cluster votes, and it simply calculates quorum votes mathematically.  Cman reports the quorum state through libcman to any interested applications.

The quorum state does not affect the general behavior of openais/cman.  Quorum factors into decision making in various programs in different ways (or not at all.)  It can be useful for programs that want to deal with a possible "split brain" condition of the cluster.
Split brain is when two or more instances of the same cluster exist in parallel and do not know about each other (also called a cluster partition). In this situation, the quorum algorithm is guaranteed to report that the cluster has quorum in only one of the partitions.  Using this, a program can make a decision based on the cluster possessing quorum and be guaranteed that no other instances of the program make the same decision.

If no votes value is explicitly assigned to a node in cluster.conf, it will get 1 vote by default.  This can be changed with the "votes" setting on the node's clusternode entry.

The default number of expected votes is the sum of votes from all cluster nodes listed in cluster.conf.  This can be changed with the cman "expected_votes" setting, although it should be done with caution.

In the special case of a two node cluster, quorum can be disabled by setting two_node="1" and expected_votes="1" in the cman section, allowing the cluster to have quorum with just one member.  (This may not be a legitimate exception for some uses of cman.  In the case of clusters used for fencing/dlm/gfs, described below, it can be safe given an appropriate fencing configuration that reboots nodes.)

Expected votes is largely a static value, but it can be changed in a running cluster in some special cases.  Expected votes will grow automatically if a node is added to the cluster.  While running, cman will adopt the largest value of expected votes it sees.  When a new node joins the cluster for the first time (after being added to cluster.conf), it propagates a new, increased, expected votes value that includes its own vote.

Expected votes can also be reduced, but there is some danger in doing so because it can lead to multiple quorate partitions in a partitioned cluster. Because of this, expected votes should not be reduced automatically, but only by manual administration.  With "cman_tool expected -e <votes>", an administrator can lower expected votes to give the cluster quorum when it is known that there is no cluster partition, and reducing quorum won't disturb a partitioned majority cluster.  Similarly, when manually joining nodes to the cluster, "cman_tool join -e <votes>" can be used to give the partially started cluster quorum when the state of the other cluster nodes is known.

When shutting down a node, it is convenient to use "cman_tool leave remove" which will reduce expected votes by the votes of the node being shut down. This should be used when shutting down the entire cluster, or when shutting down a node that will be out of service for some time.  Using "leave remove" more widely may also be acceptable, but would make multiple quorate partitions more likely given a cluster partition.
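As a small illustration of the libcman queries described under "analyzing the cluster" above, a program can check membership and quorum roughly as follows (a minimal sketch; cman_init, cman_get_node_count, cman_is_quorate and cman_finish are assumed here from libcman.h, and error handling is trimmed):

  #include <stdio.h>
  #include <libcman.h>

  int main(void)
  {
          cman_handle_t ch;

          ch = cman_init(NULL);                /* connect to service_cman */
          if (!ch) {
                  perror("cman_init");
                  return 1;
          }

          /* quorum is a boolean property of the cluster */
          printf("nodes: %d, quorate: %s\n",
                 cman_get_node_count(ch),
                 cman_is_quorate(ch) ? "yes" : "no");

          cman_finish(ch);
          return 0;
  }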
More information: http://people.redhat.com/pcaulfie/docs/aiscman.pdf

Related man pages: cman(5), cman_tool(8), cluster.conf(5), openais.conf(5)

cpg
-------------------------------------------------------------------------------

when to use a cpg

In the context of this document, a cluster application is one where separate instances of the app are running in parallel on multiple nodes in the cluster. Simple apps can be run in parallel no differently than if one instance were running.  For more complex apps, however, the instances need to be aware of each other, communicate, and operate synchronously with replicated state. Depending on the app, the instances may operate symmetrically as equal peers, or asymmetrically, with instances doing different jobs (the more symmetrical the design, the better it seems to fit with cpg.)

A cpg, closed process group, is a group of cooperating processes that are aware of each other and can communicate.  The processes can be running on any node in the cluster, and are identified to each other with a nodeid:pid pair.

A cpg is designed to be used for cluster applications as described above, where each instance of the app joins the cpg when it starts.  A cpg may be paired with the app directly, or the app may use multiple cpg's, pairing a separate cpg with each instance of a resource it is managing, for example.  An app can join any number of cpg's.

how to use a cpg

A process joins a cpg through libcpg and provides the name of the cpg it wants to join.  The processes that have joined the cpg are identified in the cpg's explicit list of members.  A process is removed from the member list when it leaves, or if it exits/fails before leaving.  Members are notified through a "configuration change" (confchg) callback when the member list changes.

Each cpg is identified by a unique name which is used to join it.  An app should use its own name as the name of the cpg to avoid collisions with other apps.  Or, if the app uses multiple cpg's, it should combine the app name with the unique name of the resource it's associating with the cpg.

When a process joins a cpg, the first callback it receives is the confchg showing it being added to the member list.  The process is not a member of the cpg until this confchg arrives.  When a process leaves, the final callback it receives is the confchg showing it being removed from the member list.  The process remains a member of the cpg until this confchg arrives.

After joining a cpg, the process can multicast messages to other cpg members. Actions that need to be synchronized (e.g. changes to state that needs to be in sync) are typically done by first sending a message describing the action to the group, and then acting only when the message is received.  Since all members receive all messages in the same order, all instances of the app will take the actions in the same sequence.

why to use a cpg

The biggest benefit of a cpg is the guarantees about the delivery order of messages and confchgs.  The notion of Virtual Synchrony applies to cpg's as explained in the earlier section.  So, if a number of messages and configuration changes all happen at about the same time, all members are guaranteed to see the messages and confchgs occur in the same order.  A cluster app should be designed to exploit this fact.

For example, say that P1 and P2 are members of a cpg, and the following three things happen at the same time: P1 sends message X, P2 sends message Y, P3 joins the cpg.  These three things could be seen in any of the following orders, but if P1 sees (a), then P2 and P3 are also guaranteed to see (a).

a) recv X, recv Y, add P3
b) recv X, add P3, recv Y
c) recv Y, recv X, add P3
d) recv Y, add P3, recv X
e) add P3, recv X, recv Y
f) add P3, recv Y, recv X

(Note that P3 will only recv X or Y if they follow P3 being added.)
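The join/confchg/multicast cycle described above looks roughly like this with libcpg (a minimal sketch; the function and callback names come from cpg.h, but the exact argument types have varied between openais releases, so treat the signatures as approximate):

  #include <stdio.h>
  #include <string.h>
  #include <stdint.h>
  #include <sys/uio.h>
  #include <openais/cpg.h>

  static void confchg_cb(cpg_handle_t h, struct cpg_name *group,
                         struct cpg_address *members, int n_members,
                         struct cpg_address *left, int n_left,
                         struct cpg_address *joined, int n_joined)
  {
          /* the member list changed; all members see this in the same order */
          printf("confchg: %d members (%d joined, %d left)\n",
                 n_members, n_joined, n_left);
  }

  static void deliver_cb(cpg_handle_t h, struct cpg_name *group,
                         uint32_t nodeid, uint32_t pid, void *msg, int len)
  {
          /* messages are delivered to all members in the same order */
          printf("message from %u:%u, %d bytes\n", nodeid, pid, len);
  }

  static cpg_callbacks_t callbacks = {
          .cpg_deliver_fn = deliver_cb,
          .cpg_confchg_fn = confchg_cb,
  };

  int main(void)
  {
          cpg_handle_t handle;
          struct cpg_name name;
          struct iovec iov = { .iov_base = "hello", .iov_len = 5 };

          /* use the app's own name as the cpg name to avoid collisions */
          strcpy(name.value, "example_app");
          name.length = strlen(name.value);

          cpg_initialize(&handle, &callbacks);
          cpg_join(handle, &name);
          cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
          cpg_dispatch(handle, CPG_DISPATCH_ALL);   /* run callbacks */
          cpg_leave(handle, &name);
          cpg_finalize(handle);
          return 0;
  }

A real application would typically run cpg_dispatch from its main loop (using cpg_fd_get to integrate with poll) and, as described above, act on a state change only when its own message describing that change is delivered back to it.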
using openais for cluster storage systems
-------------------------------------------------------------------------------

Openais as described above could be used as a foundation for many different clustering related programs/applications.  The rest of the document focuses on four cluster applications built on openais, all centered around sharing storage in the cluster: an i/o fencing system, a distributed lock manager (dlm), a cluster volume manager (clvm), and a cluster file system (gfs).  These are often used together, but they are independent subsystems that can also be used individually, or in combination with other cluster applications not covered here.

. fencing+dlm may be used in combination with a cluster database
. fencing+dlm may be used for HA application management (see rgmanager)
. fencing+dlm+clvm may be used for managing shared storage with local fs's
. fencing+dlm+clvm may be used with gfs, or a different cfs

In the event that other cluster applications are used in conjunction with some of the subsystems above, it would be important for that app to also use openais as its source for cluster membership.

groupd
-------------------------------------------------------------------------------

The groupd daemon (and libgroup interface) is a compatibility layer that was created in cluster2 to help migrate the fence/dlm/gfs subsystems to the new openais-based infrastructure.  This layer translates between openais/cpg, and fence/dlm/gfs which still expect cluster1-like interfaces and behaviors.  In the future, fence/dlm/gfs will no longer require this complex and cumbersome translation and will use the openais/cpg interfaces directly.

An application uses the libgroup interface to join and leave a cpg.  groupd translates group joins/leaves into cpg joins/leaves, and translates cpg configuration change callbacks into group callbacks.  In general, groupd translates each cpg callback into a sequence of three callbacks to an app.

An application provides a set of callback functions that groupd will use to notify it of changes to the group membership.  There are three callbacks to the app for each group membership change: stop, start, and finish.  The app needs to acknowledge the first two callbacks because groupd executes a barrier after them.  Barrier 1 follows callback 1 (stop), and barrier 2 follows callback 2 (start).  Callback 2 (start) won't happen until barrier 1 completes, and callback 3 (finish) won't happen until barrier 2 completes.

The purpose of callback 1 (stop) is to allow the app to prepare for configuration changes in other subsystems and itself.  The purpose of callback 2 (start) is to tell the app to begin recovery/reconfiguration.  The new group membership is provided with the start callback.  The reason for barrier 1 is that some apps need to suspend operations everywhere before any instance begins operating with a new configuration.  The purpose of barrier 2 is so that apps can safely free any state related to recovery, knowing that it will no longer be needed.

In the case where the change is due to a node failure, the first barrier (after stop) extends to all groups affected by the change, i.e. the first barrier requires the stop callback to be acked by all affected groups on all nodes.  Also, all groups affected by a failure event are restarted in a fixed order after the full stop barrier, i.e. all groups at level 0 finish starting before any groups at level 1 are started, and all groups at level 1 finish starting before any groups at level 2 are started.  Groupd defines fenced to be at level 0, dlm to be at level 1 and gfs to be at level 2.  This simply allows the apps to avoid checking the state of a lower level app directly.

Groupd buffers cpg events, and allows an app to finish processing a join or leave before it is notified about the next join or leave.  Change events for node failures do interrupt the processing of any current event, though, because the completion of the current event may depend on the newly failed node.

Groupd creates its own cpg through which it sends all messages, instead of sending messages through the cpg's for the individual groups.
This is necessary because it processes the cpg events asynchronously; the cpg membership doesn't match the membership being applied to the application. Messages need to be sent to all app members, which may include nodes no longer in the cpg. Groupd does not send apps a callback 2 (start) until the cluster has quorum. This simply allows the apps to avoid checking and waiting for quorum themselves. group_tool The group_tool command displays information about groups that exist on the given node. A circular debug log from groupd can be printed with "group_tool dump". The following shows node1, node2, node3 and node4, with nodeids 1,2,3,4 are members of the dlm lockspace foo. [node1]# group_tool type level name id state dlm 1 foo 00010001 none [1 2 3 4] The -v option displays extra information related to join/leave/failure events (no events in progress in the following): [node1 ~]# group_tool -v type level name id state node id local_done dlm 1 foo 00010001 none [1 2 3 4] . type: "fence" for fence domain, "dlm" for lockspaces, "gfs" for mount groups . level: 0 for fence, 1 for dlm, 2 for gfs . name: name of the group, gfs uses fsname for the mount group and lockspace . id: a global unique id for this group, generated by groupd . state: the status of change events for the group, join/leave/fail . node: the nodeid of the node that initiated the change event . id: the event id . local_done: 1 if local app has acked the stop or start callback, 0 if not . the nodeids of the group members are shown between [ ] groupd logic The following is the sequence of internal states that groupd uses to manage changes to each group. There are JOIN, LEAVE and FAIL versions of each, depending on whether the change is due to a group join, leave or node failure. An "app" refers to fenced, dlm_controld, or gfs_controld running in the context of a particular group; "all apps" refers to the given app/group on all the nodes where it's running. - all apps running normally state: none - groupd notified on all nodes, through a cpg callback, of a join, leave, or failure in a specific group/cpg state: BEGIN all apps sent callback 1 (stop) - local app got callback 1 (stop), hasn't acked it yet state: STOP_WAIT, local_done = 0 The app may be trying to block/suspend its activity at this time. - local app has acked callback 1 state: STOP_WAIT, local_done = 1 groupd waiting on barrier 1 for all apps to ack cb1 groupd collecting APP_STOPPED messages from all group members The app block/suspend has completed on this node. Debug: look for other nodes in the group in STOP_WAIT, local_done = 0 where the app (fenced, dlm_controld, gfs_controld) has not yet acked callback 1 (stop), the app needs to be inspected to find out why. - all apps have acked callback 1 (stop) state: ALL_STOPPED all apps sent callback 2 (start) In the case of a node failure, groupd waits for quorum after ALL_STOPPED before sending callback 2 to apps. In the case of a node leaving the group, the app is sent a special "terminate" callback here instead of a start callback. This is the last callback it will receive. - local app got callback 2 (start), hasn't acked it yet state: START_WAIT, local_done = 0 The start callback is accompanied by a list of group members, which will include a new node if this is a join event, or will be missing a node if this is a leave or failure event. The app is generally doing some kind of recovery/reconfig during this time to adjust to the addition or removal of nodes in the group. 
In the case of a new node joining the group, this start callback is the first it receives, and indicates that it is now in the group. - local app has acked callback 2 state: START_WAIT, local_done = 1 groupd waiting on barrier 2 for all apps to ack cb2 groupd collecting APP_STARTED messages from all group members The app recovery/reconfig completed on this node. Debug: look for nodes in the group in START_WAIT, local_done = 0 where the app (fenced, dlm_controld, gfs_controld) has not yet acked callback 2 (start), the app needs to be inspected to find out why. - all apps have acked callback 2 (start) state: ALL_STARTED all apps sent callback 3 (finish) - all apps running normally state: none fencing ------------------------------------------------------------------------------- What, When, Why I/O fencing is forcibly preventing a machine from writing to shared storage. It is usually done by turning off the power to the machine, or turning off the switch port connecting the machine to storage. Fencing is needed in shared storage clusters where all machines modify shared storage, and uncontrolled or uncoordinated modifications could corrupt the shared storage. - failure fencing A node needs to be fenced when it is using shared storage, fails, and the remaining nodes want to recover the shared storage for it without waiting to confirm that it's been reset. Once they've done the recovery, they can continue using the shared storage. There are two situations where fencing would not be necessary following a node failure: . The remaining nodes are willing to wait to do recovery until they confirm the failed node is in a safe state. This is possible, but usually impractical, because it would often require waiting for an administrator to reset the failed node. (See manual fencing for actually using this option.) . The node failure is due to a hard, terminal failure which would make it impossible for the machine to make any further writes to storage. It's not possible to automatically distinguish this kind of failure from others without administrator intervention. Assuming the cluster does not want to wait indefinitely to confirm that a failed node has been reset, fencing becomes necessary because the failure could be due to one of the following which are indistinguishable from a hard, terminal failure: i) a machine hang ii) a cluster partition Given either of these, the shared storage could easily be corrupted if remaining nodes recover and use shared storage without first fencing. Examples of how corruption could happen in each case without fencing: i) Say that a machine hangs, the membership system declares that it has failed, and the remaining nodes recover for it without fencing. During recovery, the remaining nodes take over the distributed locks the failed node held, write to shared storage to recover what the node was doing, and then continue using shared storage. If the hang clears on the failed node, the node will continue running wherever it was, which could be in code modifying the shared storage. These modifications will be based on old, invalid state, and will corrupt the shared storage. ii) Say that a machine's network connection to the cluster is broken. This creates the most common type of network partition. The disconnected machine sees that all other nodes have failed, and loses quorum. The other nodes see the that the disconnected machine has failed, and retain quorum. 
Due to the loss of quorum, gfs on the disconnected node can no longer acquire new locks to modify new parts of the fs.  However, gfs will continue to modify parts of the fs for locks it already holds.  If the nodes in the quorate partition do recovery for the failed/partitioned node without fencing, then nodes in both partitions may be modifying the same shared storage differently.

- start up fencing

Besides node failure, there's another situation where fencing is needed.  When nodes form a new cluster and want to begin using (initialize/recover) shared storage, they have no knowledge of the previous state of the cluster, and they need to fence any nodes that *may* have been in the previous cluster membership but are now in an unknown state (they are not currently cluster members.)

The nodes with unknown state need to be fenced because they might be:

i) in a hung state, having had gfs mounted in the previous cluster
ii) in a cluster partition with gfs mounted

Given either of these, shared storage could easily be corrupted if nodes in the new cluster begin using shared storage without fencing the unknown nodes. Examples of how corruption could happen in each case without fencing: (In both examples, nodes A, B and C are the only nodes listed in cluster.conf.)

i) Say cluster nodes A,B,C all have gfs mounted.  A,B are both reset at once, and at the same time C hangs.  A,B come back and form a new quorate cluster. A and B know nothing about C.  If A,B begin using gfs without first fencing C, the hang of C could clear allowing C to continue writing to gfs based on old, invalid state, corrupting the fs.

ii) Say cluster nodes A,B,C all have gfs mounted.  A,B are both reset at once, and at the same time are partitioned from C.  A,B come back and form a new quorate cluster.  A and B know nothing about C.  If A,B begin using gfs without first fencing C, they will corrupt the fs because gfs on C may still be modifying parts of the fs it holds locks on.

How

The fencing system implementation has three parts:

1. fence domain: deciding who should be fenced and when
   (the implementation of the requirements described above)
2. fence config: determining how to fence a given node
   (closely related to the way fencing info is specified in cluster.conf)
3. fence agents: doing the actual fencing
   (programs/scripts specific to hardware devices)

The first two are a part of the fencing daemon, fenced.  The third is done by a collection of "agent" programs and scripts, run by fenced, each of which interacts with specific hardware (network power switch, SAN switch).

fence domain

The fence domain initiates fencing operations against nodes; it decides who should be fenced and when.  The domain is a group (cpg) of cluster nodes that want to participate in fencing: being fenced themselves if they fail, and able to fence others.  A node joins the domain after joining the cluster, but before starting systems like gfs that use shared storage.  To join the domain, a node starts fenced, then runs "fence_tool join" which tells fenced to join the domain.

If a member of the domain fails, it will be fenced by another member, usually the member with the lowest nodeid.  fenced is notified of node failures through callbacks in the domain group/cpg.  A node can leave the domain with the "fence_tool leave" command, after which it will no longer be fenced if it fails.  GFS requires fencing be used, so when a gfs mount is attempted, gfs verifies the node is a fence domain member; the fence domain does fencing on behalf of gfs.
Similarly, fence_tool leave first checks that no shared storage systems like gfs are still in use. fence domain: quorum If the cluster does not have quorum, the fence domain will still keep track of failed domain members that need fencing, but it won't actually fence any machines until quorum is regained. This is a crucial application of quorum; it prevents a single malfunctioning node, or minority group of partitioned nodes from disrupting a majority cluster that is operating fine. When the cluster regains quorum, the domain will continue processing fencing operations for any domain members that had failed while there was no quorum. However, before actually fencing a node, fenced always checks whether the node has been reset and rejoined the cluster. If it has, there is no longer a reason to fence it, so the fencing operation is skipped. A common example is when a node in the domain fails, causing the cluster to lose quorum. The domain on the remaining nodes queue a fencing operation for the failed node, but wait to carry it out. If the failed node is reset and rejoins the cluster, causing quorum to be regained, the nodes with the pending fencing operation will continue processing it. However, they will find that the victim has now been reset and rejoined the cluster, so the fencing operation will just complete without any agent being run against the node. fence domain: startup fencing Startup fencing is done when a newly formed domain first acquires quorum. All possible nodes (those listed in cluster.conf) that are not currently cluster members are fenced. fenced uses a configurable delay between the addition of a new victim at startup, and actually carrying out the fencing. If the victim joins the cluster within this delay, it won't be fenced (see above). This avoids unnecessarily fencing nodes in the common scenario where a whole set of nodes are joining the cluster at once. (See post_join_delay.) fence domain: cluster application The following describes how fenced works. The fencing logic is driven by groupd callbacks in the domain (ultimately cpg configuration changes). - At startup, the list of current domain members is initialized to contain all nodes listed in cluster.conf. Startup fencing is the effect of doing this. - groupd callback 1 (stop) not used - groupd callback 2 (start) (groupd does not send this callback until the cluster has quorum, so fenced does not need to check for quorum itself.) Compare the previous list of domain members to the new list of domain members given in the start callback. If there are any nodes in the old list, but not in the new list, they have either left the domain or have failed. The nodes that have failed are added to a "victims" list of nodes to be fenced. Replace the existing list of domain members with the new list. If the local node does not have the low nodeid in the group, send groupd an ack for callback 2 (start), and do no more. If the local node has the low nodeid in the group, it is responsible to carry out fencing on the victims. It looks up the fence agent from cluster.conf, forks/execs the agent against a victim, waits for the result of the agent, retries until the victim has been successfully fenced, then repeats for the next victim. After all victims have been fenced, it sends groupd an ack for callback 2 (start). - groupd callback 3 (finish) By this, fenced knows that fencing has completed by the low nodeid. Clear the list of victims. 
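Putting the start/finish handling together, the victim logic can be summarized with the following self-contained sketch (the data structures, helper names and the stubbed agent call are illustrative only; fenced's real code also distinguishes nodes that left cleanly from nodes that failed, re-checks whether a victim has rejoined the cluster, retries failed agents and cycles through methods):

  #include <stdio.h>

  #define MAX_NODES 16

  static int old_members[MAX_NODES] = { 1, 2, 3, 4 };
  static int num_old = 4;
  static int victims[MAX_NODES];
  static int num_victims;
  static int our_nodeid = 2;

  static int in_list(int id, int *list, int count)
  {
          int i;
          for (i = 0; i < count; i++)
                  if (list[i] == id)
                          return 1;
          return 0;
  }

  static int run_agent(int victim)
  {
          /* fenced forks/execs the agent configured for this node */
          printf("fencing node %d\n", victim);
          return 0;                           /* 0 = success */
  }

  static void start_callback(int *new_members, int count)
  {
          int i;

          /* nodes in the old member list but not the new one become
             victims (a node that cleanly left would not be added) */
          for (i = 0; i < num_old; i++)
                  if (!in_list(old_members[i], new_members, count))
                          victims[num_victims++] = old_members[i];

          for (i = 0; i < count; i++)
                  old_members[i] = new_members[i];
          num_old = count;

          /* only the low nodeid (first entry, assuming a sorted member
             list) carries out fencing; everyone else just acks the start */
          if (our_nodeid != new_members[0])
                  return;

          while (num_victims)
                  if (run_agent(victims[num_victims - 1]) == 0)
                          num_victims--;      /* retry until fenced */
  }

  int main(void)
  {
          int after_failure[] = { 2, 3, 4 };  /* node 1 has failed */
          start_callback(after_failure, 3);
          return 0;
  }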
If the low nodeid fails while fencing, the remaining nodes will not get a finish callback, but will get a stop callback followed by another start.  They will already have nodes in their list of victims, and will add to it.  The new low nodeid will then process its victims list which will contain victims from the previous start and the current start.

A node could end up being fenced more than once if the node that fenced it fails before sending an ack to groupd.  In this case, the next low nodeid will fence it again.  The following is an example of this case:

- cluster.conf contains nodes A,B,C,D,E with nodeids 1,2,3,4,5 respectively

- A,B,C,D,E are all cluster members and fence domain members
  domain members = [A,B,C,D,E], victims = [-]

- A fails

- B,C,D,E get callback 2 (start) with a new list of domain members [B,C,D,E]
  fenced sees that A has been removed and adds A to the victims list
  domain members = [B,C,D,E], victims = [A]

- C,D,E ack callback 2 (start)

- B begins fencing A, but fails
  (whether B actually fences A before failing does not matter)

- C,D,E get callback 2 (start) with a new list of domain members [C,D,E]
  fenced sees that B has been removed and adds B to the victims list
  domain members = [C,D,E], victims = [A,B]

- D,E ack callback 2 (start)

- C fences A, agent completes successfully

- C fences B, agent completes successfully

- C acks callback 2 (start)

- C,D,E get callback 3 (finish)
  fenced knows that fencing of victims has completed and clears victims list
  domain members = [C,D,E], victims = [-]

fence domain: optional configuration

The optional fence_daemon section in cluster.conf can be used to configure the fenced behavior.

. post_join_delay: seconds to delay fencing after a node joins
. post_fail_delay: seconds to delay fencing after a node fails
. clean_start: set to 1 to disable startup fencing (dangerous)

fenced can also be run with corresponding command line options:

. -j: post_join_delay
. -f: post_fail_delay
. -c: clean_start
. -D: run fenced in foreground and print debugging info to stdout

fence domain: notes

1. A node failure is defined as the cluster membership system declaring that a node is no longer a member of the cluster.  It determines this by whatever membership/liveness protocol it uses.  This protocol is not immune to false positives, where a node is declared to be failed when it really isn't. Fencing is often the first obvious consequence of a spurious node failure in the membership system.  To reduce these unwarranted fencing/recovery events, the node failure detection in the membership system needs to reduce false positives.  One common way of doing this is to increase the time that the membership system waits for an unresponsive node before declaring it failed.

2. The fencing what/when/why section, and its implementation by the fence domain, is geared toward the requirements of shared storage clusters like gfs; it may not be ideal for other kinds of cluster apps that are interested in fencing.  Other kinds of clusters, with different fencing requirements, may want to use an alternative to the fence domain, although the fence config and fence agent parts may still be applicable.

3. After fencing a node, fenced tells cman about it, so that cman can display basic fencing information along with a node listing: cman_tool nodes -F

4. fenced keeps a circular debug log that can be dumped: fence_tool dump

fence config

The fencing configuration is by definition closely related to the structure and content of cluster.conf.
When the domain decides to fence a given node, it needs to figure out which agent to use and what parameters to pass that agent so the given node is fenced.

If the domain decides to fence node foo, it looks up the first device entry in the first method of foo's fence section.  A node's device entry refers to a specific device in the fencedevices section, and provides node-specific parameters that need to be used with that device.  The referenced fencedevice entry contains the name of the fence agent to be used.  fenced forks and execs the agent, combines the fencedevice entry and the node's device entry, and passes that data to the agent via stdin.  fenced waits for the agent to exit, and uses the exit value to determine success or failure.

fence config: multiple methods

In advanced configurations, multiple fencing methods can be defined for a node.  If fencing fails using the first method, fenced will try the next method, and continue to cycle through methods until one succeeds.

fence config: dual path, redundant power

Sometimes fencing a node requires disabling two power ports or two i/o paths. This is done by specifying two or more devices within a method.  Each device is handled independently, and all must succeed for the fencing operation to be considered a success.  For example, a node with two i/o paths connecting it to shared storage through separate SAN switches would have a device entry for each switch within a single method.

When using power switches to fence nodes with dual power supplies, the agents must be told to turn off both power ports before restoring power to either port.  The default off-on behavior of the agent could result in the power never being fully disabled to the node.  In such a configuration, the agent is run four times:

1. powers off the first power port
2. powers off the second power port
3. powers on the first power port
4. powers on the second power port

fence config: agent parameters

Each agent will have specific parameters that are required in the fencedevice section and in each node's device section.  Consult the man page for an agent for details on its cluster.conf configuration.
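To make the relationships concrete, a node's fence section and the matching fencedevices section fit together roughly like this (the device name, agent and the agent-specific parameters shown are only examples; each agent documents its own parameters):

  <clusternode name="node-01" nodeid="1">
    <fence>
      <method name="1">
        <device name="apc1" port="1"/>
      </method>
    </fence>
  </clusternode>
  ...
  <fencedevices>
    <fencedevice name="apc1" agent="fence_apc"
                 ipaddr="10.0.0.50" login="apc" passwd="apc"/>
  </fencedevices>

To fence node-01, fenced would combine the fencedevice attributes with the node's device attributes (here the power port) and pass the result to fence_apc on stdin.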
fence agents

The fencing agents interact with a specific piece of hardware to fence a given machine.  They are stand-alone programs or scripts that can take all input parameters from the command line.  They can also take input parameters from stdin, which is what fenced uses.  The exit code of the program/script indicates the success or failure of the fencing operation.  In the case of a failure, the agent should also write an error message (up to 256 bytes) to stdout.  The message will be read by fenced and included in a syslog message from fenced for the failure.  The agents should avoid using any other part of the cluster infrastructure.

The various types of fencing agents include: (Note: some of these agents may no longer even work, they are just given as examples of the different kinds of agents.)

- network power switch
  A node logs into a network power switch, unfortunately this is often via a telnet interface, and turns off the specified power port.
  agents: fence_apc, fence_wti

- machine management interfaces
  Systems management interfaces accessible via the network can often be used to power off a node: ILO (HP Integrated Lights Out card), DRAC (Dell Remote Access Card), RSA (IBM RSA II management interface).
  agents: fence_ilo, fence_drac, fence_rsa, fence_ipmi

- SAN switch
  A node logs into a SAN switch and disables the port belonging to the node being fenced.
  agents: fence_brocade, fence_mcdata, fence_sanbox2

- SCSI persistent reservation
  Using SCSI persistent reservations for fencing can be complicated because commands must be run at start up time to set it up, and the specific SCSI devices need to be derived from the logical volumes that are being used as shared storage.
  agent: fence_scsi

- network block storage server
  If the shared storage is being exported over the network by a server, the server software can potentially be contacted and told to ignore i/o from the node being fenced.
  agent: fence_gnbd; an agent for a server exporting an iscsi target would also be appropriate.

- virtual machine
  Virtual machines can fence each other with the assistance of an external daemon that keeps track of virtual to physical machine mappings.
  agent: fence_xvm

- manual override
  The fence_manual agent writes a syslog message telling the administrator to manually reset the failed node, and then run the fence_ack_manual program after it's reset.  When fence_ack_manual is run, fence_manual will return to fenced with a successful result.  If the failed node rejoins the cluster before fence_ack_manual has been run, fence_manual completes successfully without the ack.

If no fencing method is defined for a node, it is the equivalent of manual fencing; fenced will wait indefinitely for the admin to override the fencing operation via the command "echo victimnodename >> /var/run/cluster/fenced_override", which should not be done until the failed node has been safely turned off or reset.

fence_node

The fence_node command is a small program that can be used to fence a specific node listed in cluster.conf, independent of the fence domain and without any involvement from fenced.  It uses the same code as fenced to look up the appropriate agent and parameters in cluster.conf, and then runs the agent.

It would be simple to also offer this fence_node(nodename) capability in the form of a library call from a future libfence library.  If a different kind of openais/cman cluster has its own requirements for fencing, and its own method for deciding what node to fence and when to fence it, it may prefer to call fence_node(nodename) through a libfence API instead of via a command line "fence_node nodename" call.  Parts 2 and 3 of our fencing system would remain the same, and the fence domain in fenced would simply call libfence:fence_node().

Part 1 of the fencing architecture, the fence domain, should be considered one of multiple applications that make requests for nodes to be fenced.  The domain does not provide fencing capabilities to other applications, but rather is a consumer of fencing capabilities from parts 2 and 3 of the architecture (or libfence above).  The fence_node program is another consumer that parallels the domain.  Yet another consumer, at the same level as the domain and fence_node, may be an application management system.

The fence domain application consumes fencing services on behalf of dlm and gfs specifically.  The style in which the fence domain consumes fencing for dlm/gfs (according to their requirements) *may* also be suitable, incidentally, for other applications.  One should not assume, however, that this is the case.  The fencing requirements of another application could easily be different enough from those of dlm/gfs that the domain system would not be appropriate.  Another method would be needed to decide who is fenced and when, according to the app's requirements.  In particular, if an application requires that a node always be fenced after it fails, or be fenced as soon as possible after it fails, the domain would not be suitable.
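The agent interface described under "fence agents" above amounts to: read parameters, act on the device, and report the result through the exit code (plus an error message on stdout).  A minimal skeleton might look like the following sketch, assuming the conventional one "name=value" parameter per line on stdin (the parameter names and the device action are placeholders):

  /* minimal fence agent skeleton; parameter names and the action are
     placeholders, a real agent talks to an actual device */
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          char line[256], ipaddr[64] = "", port[64] = "";

          /* fenced writes "name=value" parameters, one per line, on stdin */
          while (fgets(line, sizeof(line), stdin)) {
                  line[strcspn(line, "\n")] = '\0';
                  if (!strncmp(line, "ipaddr=", 7))
                          snprintf(ipaddr, sizeof(ipaddr), "%s", line + 7);
                  else if (!strncmp(line, "port=", 5))
                          snprintf(port, sizeof(port), "%s", line + 5);
          }

          if (!ipaddr[0] || !port[0]) {
                  /* on failure, write a short error message to stdout;
                     fenced reads it and logs it via syslog */
                  printf("missing ipaddr or port parameter\n");
                  return 1;
          }

          /* ... log into the device at ipaddr and turn off port ... */

          return 0;       /* exit 0 tells fenced the node was fenced */
  }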
dlm
-------------------------------------------------------------------------------

The following document describes dlm concepts and the API: http://people.redhat.com/ccaulfie/docs/rhdlmbook.pdf

The dlm is a heavy-weight solution for distributed locking.  Other distributed locking services may be better suited for smaller, less demanding applications.  (See the LCK AIS service.)

An architectural overview of the dlm follows, dividing the dlm into three main parts:

1. dlm-kernel: the lock manager itself, which operates in the kernel
2. dlm_controld: the part of the dlm that interacts with the userspace cluster infrastructure and controls dlm-kernel according to cluster events
3. libdlm: the library through which userland applications use the dlm

The mechanisms dlm_controld uses to interact with dlm-kernel will be covered in part 2, and the mechanisms libdlm uses to interact with dlm-kernel will be covered in part 3.

dlm-kernel

The dlm provides a VMS-style distributed locking service that cluster applications can use for synchronization.  The dlm is implemented in the kernel.  The implementation of the lock manager itself has not been documented, but can be found in the linux kernel source under linux/fs/dlm.  The dlm API for kernel applications (like gfs) is defined in include/linux/dlm.h.  The dlm is implemented in the kernel primarily because a cluster file system like gfs uses it so heavily.

During normal operation, the dlm runs on its own; it doesn't make use of services from other cluster components.  Internode communication by the dlm is handled by the dlm itself, using TCP or SCTP.  IP addresses are provided by dlm_controld.

Dlm locking takes place in the context of a "lockspace", which is a group of nodes that share a common namespace for locks.  Separate lockspaces are typically used for different application instances or distinct resources like an individual file system.  The lockspace membership is managed by dlm_controld in userspace.

Any change in the lockspace membership requires the dlm to do lock recovery within the given lockspace.  Recoveries interrupt the normal operation of a lockspace when nodes (processes) join or leave the lockspace.  The recovery for a node removal (by leaving or failing, it doesn't matter) is usually longer than the recovery for a node addition.  During recovery, calls into the dlm will block.

dlm_controld

The userspace dlm_controld daemon is an openais cluster application that uses cpg's (through groupd) to manage lockspace membership.  There are three types of control events where dlm_controld and dlm-kernel need to interact:

1. a local app is joining a lockspace
2. a local app is leaving a lockspace
3. a remote app is joining or leaving a lockspace

The interactions take place through configfs and sysfs files, which should be considered internal dlm interfaces.  For a lockspace named foo, lockspace control files exist under:

. /sys/kernel/dlm/foo/
. /sys/kernel/config/dlm/cluster/spaces/foo/

dlm_controld creates a configfs directory entry to add a node to a lockspace and removes its directory entry to remove it from a lockspace.  dlm_controld blocks normal locking activity in a lockspace by writing to a sysfs file for the lockspace.  This happens before adjusting the membership on any of the nodes.  Locking activity is resumed after the adjustments have been made.
The lockspace's first step in resuming activity is doing recovery.

local join

- app calls into dlm-kernel new_lockspace function
  . creates the necessary kernel structures
  . sends an "online foo" uevent message to dlm_controld
  . blocks waiting for a result to be written to /sys/kernel/dlm/foo/event_done

- dlm_controld
  . receives the "online foo" message and joins the group (cpg) "foo"
  . waits for the join to complete (gets start callback from groupd)
  . adds directory entries for each lockspace member under /sys/kernel/config/dlm/cluster/spaces/foo/nodes/
  . writes "1" to /sys/kernel/dlm/foo/control which tells dlm-kernel to begin recovery in lockspace foo using the set of members under foo/nodes/
  . writes "0" to /sys/kernel/dlm/foo/event_done which tells dlm-kernel that the lockspace join is complete

- app blocked in dlm-kernel new_lockspace function
  . wakes up after "0" is written to event_done
  . returns a lockspace handle to the app that can now be used for standard locking operations

- app begins using the dlm to do locking; calls into the dlm block until lockspace recovery is complete (recovery began above at the write to control, and may or may not be complete by the time the app begins using the dlm)

local leave

- app calls into dlm-kernel release_lockspace function
  . sends an "offline foo" uevent message to dlm_controld
  . blocks waiting for a result to be written to /sys/kernel/dlm/foo/event_done

- dlm_controld
  . receives the "offline foo" message and leaves the group (cpg) "foo"
  . waits for the leave to complete (gets terminate callback from groupd)
  . removes all directory entries from /sys/kernel/config/dlm/cluster/spaces/foo/nodes/
  . writes "0" to /sys/kernel/dlm/foo/event_done which tells dlm-kernel that the lockspace leave is complete

- app blocked in dlm-kernel release_lockspace function
  . wakes up after "0" is written to event_done
  . frees structures for the lockspace
  . returns to app

remote join/leave/fail

- app using lockspace foo is running on three nodes with nodeids 1,2,3
  . configfs entries for foo on all nodes are
    /sys/kernel/config/dlm/cluster/spaces/foo/nodes/1/
    /sys/kernel/config/dlm/cluster/spaces/foo/nodes/2/
    /sys/kernel/config/dlm/cluster/spaces/foo/nodes/3/

- app on node 3 exits (node does not fail)

- dlm_controld (on nodes 1 and 2)
  . gets groupd callback 1 (stop)
  . writes "0" to /sys/kernel/dlm/foo/control which tells dlm-kernel to suspend operations in lockspace foo; the write does not return until all in-progress operations have completed
  . sends groupd an ack for callback 1 (stop)
  . groupd does standard barrier 1; it's important for operations in foo to be suspended on all nodes before any begin recovery
  . gets groupd callback 2 (start)
  . removes node 3 from the lockspace, removing the directory for nodeid 3 /sys/kernel/config/dlm/cluster/spaces/foo/nodes/3
  . writes "1" to /sys/kernel/dlm/foo/control which tells dlm-kernel to begin recovery in lockspace foo using the set of members under foo/nodes/
  . sends groupd an ack for callback 2 (start)
  . groupd does standard barrier 2 and sends callback 3 (finish) when it completes, but it is not used by dlm_controld

- dlm-kernel (on nodes 1 and 2)
  . begins recovery of lockspace foo after "1" is written to control
  . when recovery completes, lock operations from the app are resumed
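The suspend/adjust/resume steps above reduce to a handful of file operations. The following sketch shows the idea for lockspace foo and the removal of nodeid 3 (simplified; dlm_controld builds these paths dynamically and handles errors and orderings that are omitted here):

  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>

  #define SPACES "/sys/kernel/config/dlm/cluster/spaces"

  static void write_str(const char *path, const char *val)
  {
          int fd = open(path, O_WRONLY);
          if (fd < 0)
                  return;
          if (write(fd, val, 1) < 0)
                  perror(path);
          close(fd);
  }

  int main(void)
  {
          /* stop callback: suspend locking in lockspace foo on this node */
          write_str("/sys/kernel/dlm/foo/control", "0");

          /* start callback: drop nodeid 3 from the lockspace membership */
          if (rmdir(SPACES "/foo/nodes/3") < 0)
                  perror("rmdir");

          /* tell dlm-kernel to begin recovery with the remaining members */
          write_str("/sys/kernel/dlm/foo/control", "1");
          return 0;
  }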
node comms

Apart from the control-oriented events above, dlm_controld also provides dlm-kernel with information about cluster nodes, outside the context of a specific lockspace. In particular, the dlm needs to know the IP address of each nodeid present in any lockspace.

For each node in the cluster (regardless of its activity in any lockspace) dlm_controld creates a directory under /sys/kernel/config/dlm/cluster/comms/. In the example above, each node would have these entries:

/sys/kernel/config/dlm/cluster/comms/1/
/sys/kernel/config/dlm/cluster/comms/2/
/sys/kernel/config/dlm/cluster/comms/3/

Each nodeid directory contains three entries: nodeid, addr, local. The nodeid is written to nodeid (the same as the directory name). The IP address is written to addr (in the form of a binary sockaddr_storage structure). A "1" is written to local under the node's own entry.

tuning

Tuning parameters for dlm-kernel also exist in configfs:

# ls /sys/kernel/config/dlm/cluster/
buffer_size  lkbtbl_size  recover_timer  spaces/      toss_secs
comms/       log_debug    rsbtbl_size    tcp_port
dirtbl_size  protocol     scan_secs      timewarn_cs

The protocol, timewarn and log_debug settings can be placed in cluster.conf under the <dlm/> element, and dlm_controld will set them (the others need to be set manually).

. buffer_size
. dirtbl_size
. lkbtbl_size
. log_debug
. protocol
. recover_timer
. rsbtbl_size
. scan_secs
. tcp_port
. timewarn_cs
. toss_secs

libdlm

Userspace programs can use the dlm through the libdlm library. The library interface is defined in /usr/include/libdlm.h. The library uses devices to communicate with dlm-kernel. It passes operations to dlm-kernel by writing to the device and reads the device to get completion results. Writes to the /dev/misc/dlm-control device are used to create and release lockspaces. A separate device, /dev/misc/dlm_<lockspace name>, is created for locking operations in each lockspace. The structures used by libdlm to communicate with dlm-kernel are defined in /usr/include/linux/dlm_device.h. libdlm handles backward compatibility with previous kernel interfaces.

Related man pages: libdlm(3), and a man page for each function defined in libdlm.h.

dlm_tool

The dlm_tool program can be used to display the resources and locks in a lockspace, and to join/leave a lockspace.

The "dlm_tool lockdump foo" command displays all locks held by local processes in lockspace foo. The -M option adds "master copy" (MSTCPY) locks to the dump; these are locks owned by remote processes that are mastered locally. Lockdump reads the list of locks for lockspace foo from dlm-kernel through the debugfs file /sys/kernel/debug/dlm/foo_locks.

The "dlm_tool lockdebug foo" command also dumps resources and locks, but in a different format that includes some additional debugging information.

The "dlm_tool join foo" and "dlm_tool leave foo" commands can be used to create and release lockspace foo. The -m option can be used with join to specify the permission of the resulting lockspace device. This can be useful for creating lockspaces for applications without root permission, or to control the lifetime of a lockspace for some reason.
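As a companion to the node comms description above, the following hypothetical sketch (again, not dlm_controld code) shows how a control daemon could publish a node's address to dlm-kernel by filling in a comms entry; the nodeid and address used here are made up for the example:

  /* hypothetical sketch of filling in a comms entry as described above:
     nodeid, a binary sockaddr_storage and the local flag are written
     into /sys/kernel/config/dlm/cluster/comms/<nodeid>/ */

  #include <stdio.h>
  #include <string.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/stat.h>
  #include <netinet/in.h>
  #include <arpa/inet.h>

  #define COMMS "/sys/kernel/config/dlm/cluster/comms"

  static int write_file(const char *path, const void *buf, size_t len)
  {
          int fd = open(path, O_WRONLY);
          if (fd < 0)
                  return -1;
          int rv = write(fd, buf, len);
          close(fd);
          return (rv < 0) ? -1 : 0;
  }

  /* e.g. add_comm(3, "10.0.0.3", 0) publishes the address of nodeid 3 */
  static int add_comm(int nodeid, const char *ip, int local)
  {
          struct sockaddr_storage ss;
          struct sockaddr_in *sin = (struct sockaddr_in *)&ss;
          char path[256], val[16];

          /* create the configfs directory for this nodeid */
          snprintf(path, sizeof(path), "%s/%d", COMMS, nodeid);
          mkdir(path, 0755);

          snprintf(path, sizeof(path), "%s/%d/nodeid", COMMS, nodeid);
          snprintf(val, sizeof(val), "%d", nodeid);
          write_file(path, val, strlen(val));

          /* addr takes the raw sockaddr_storage bytes */
          memset(&ss, 0, sizeof(ss));
          sin->sin_family = AF_INET;
          inet_pton(AF_INET, ip, &sin->sin_addr);
          snprintf(path, sizeof(path), "%s/%d/addr", COMMS, nodeid);
          write_file(path, &ss, sizeof(ss));

          if (local) {
                  snprintf(path, sizeof(path), "%s/%d/local", COMMS, nodeid);
                  write_file(path, "1", 1);
          }
          return 0;
  }

  int main(void)
  {
          return add_comm(3, "10.0.0.3", 0);
  }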
configuration options

Lockspaces usually use a resource directory to keep track of which node is the master of each resource. The dlm can operate without the resource directory, though, by statically assigning the master of a resource using a hash of the resource name.

The nodir setting can be combined with node weights to create a configuration where select node(s) are the master of all resources/locks. These "master" nodes can be viewed as "lock servers" for the other nodes. Or, lock management will be partitioned among the available masters; there can be any number of masters defined. The designated master nodes will master all resources/locks (according to the resource name hash). When no masters are members of the lockspace, the nodes revert to the common fully-distributed configuration. Recovery is faster, with little disruption, when a non-master node joins/leaves.

There is no special mode in the dlm for this lock server configuration; it's just a natural consequence of combining the "nodir" option with node weights. When a lockspace has master nodes defined, the masters have a default weight of 1 and all non-master nodes have a weight of 0. Explicit non-zero weights can also be assigned to master nodes, e.g. giving node01 a weight of 2 and node02 a weight of 1, in which case node01 will master 2/3 of the total resources and node02 will master the other 1/3.

fencing requirements

The dlm itself does not require fencing for correct operation. Fencing is defined as preventing a node from writing to shared storage, and the dlm does not use shared storage in any way. Shared-storage-related applications that use the dlm, however, may assume that the dlm will not grant locks from failed nodes before a failed node has been fenced. In this way, the dlm is used as a proxy for checking the fencing status of failed nodes. (This is not the case with gfs, which checks fencing explicitly.) For this reason, it is best for the dlm not to begin recovery for a failed node until fencing has occurred.

alternate cluster management systems

dlm_controld uses openais (cpg through groupd and cman) to drive dlm-kernel. An alternative daemon could be written to drive dlm-kernel from a different cluster management system. This alternative daemon would use the same configfs/sysfs control interfaces that dlm-kernel exposes. A second dlm_controld is expected in the future that will use openais API's directly, without the groupd/libgroup translation layer.

gfs
-------------------------------------------------------------------------------

Much like the dlm, gfs has a userland daemon, gfs_controld, that interacts with the cluster infrastructure and controls gfs-kernel through sysfs when necessary. The amount of control needed between gfs-kernel and gfs_controld is less than in the dlm case.

gfs_controld uses cpg's (through groupd) to manage the membership of each "mount group": which nodes have each filesystem mounted. Joining and leaving the mount group is done by mount.gfs and umount.gfs before/after gfs-kernel does the standard filesystem mount/unmount. gfs-kernel does not need to know the mount group membership (in contrast to dlm-kernel, which needs to know the lockspace membership).

Recovery is the only area where gfs-kernel requires control input from gfs_controld. Specifically, gfs_controld suspends/unsuspends gfs locking before/after recoveries, and tells gfs-kernel which journals need recovery. gfs-kernel activity needs to be suspended before dlm recovery begins for the given fs. DLM locks granted before gfs journal recovery has completed cannot be used for normal fs access, only for journal recovery.

gfs_controld maintains state about each member of the mount group. When a node joins (i.e. mounts the fs), gfs_controld needs to sync the state of the existing members to the new member. This state includes the mount options of each member, the journal id used by each member, the recovery status of each journal, and the first-mounter state of the fs.
[References to gfs-kernel include both the gfs and lock_dlm kernel modules. The lock_dlm module is a part of gfs (despite the misleading name). It's a bridge between gfs and the dlm, and between gfs and the userspace cluster infrastructure. GFS has historically isolated its interactions with specific locking/clustering systems within lock modules. This has allowed gfs to more readily switch between different external lock and cluster management systems.]

mkfs

When a gfs file system is created, gfs_mkfs requires two parameters related to the clustering/locking infrastructure, and stores them in the on-disk superblock:

-p <lockproto>
-t <locktable>

The lockproto is the lock module that the fs should use. Two currently exist, lock_nolock and lock_dlm (see above). If the fs will be used as a local fs, lock_nolock should be used. If the fs will be used by the cluster (the whole point of gfs), lock_dlm should be used.

The locktable is not used by gfs directly, but is passed to the lock module. lock_nolock ignores the locktable. lock_dlm requires that the locktable be of the form <clustername>:<fsname>.

clustername is the name of the cluster whose members may mount this fs. A node that is not a member of a cluster, or is a member of a cluster with a different name, will not be permitted to mount this fs. The clustername is the one defined in cluster.conf (e.g. <cluster name="alpha">).

fsname is a unique name chosen at the time of mkfs for this specific file system. This name will be used by the cluster infrastructure (gfs_controld/cpg) to manage mounts of this fs.

mount

The mount(8) command calls mount.gfs, which joins the mount group for the fs before calling mount(2), where control passes to gfs-kernel for the traditional filesystem mount from disk.

(When creating cpg's for gfs mount groups, groupd prefixes the name with 2, e.g. "2:foo". It uses a 1 prefix for dlm lockspace cpg's, e.g. "1:foo".)

- an example mkfs for reference in the examples below:
  gfs_mkfs -p lock_dlm -t alpha:foo -j 4 /dev/foo

- mount /dev/foo /mnt
  . mount(8) calls mount.gfs

- mount.gfs
  . reads the lockproto and locktable from the superblock: lockproto="lock_dlm" and locktable="alpha:foo"
  . asks gfs_controld to join the mount group for this fs; it connects to gfs_controld and sends it the message "join /mnt gfs lock_dlm alpha:foo rw /dev/foo"
  . reads a preliminary result from gfs_controld: "0" if the mount was accepted and has begun, "-EXYZ" if the mount was rejected, in which case it prints an error message and exits
  . waits to receive the final result from gfs_controld, blocking in a read on the connection to gfs_controld

- gfs_controld
  . receives the join request from mount.gfs
  . checks that the local node is a member of the cluster named in the locktable (alpha), and that the local node is a member of the fence domain
  . joins the mount group (cpg) "foo"
  . waits for the join to complete
  . gets groupd callback 2 (start)
  . immediately sends groupd an ack for callback 2 (start)
  . syncs state for mount group foo with other nodes
  . picks an available journal id for the local mount to use
  . sends the result of the join back to mount.gfs
    if an error, the result message begins with "error:"
    if success, the result message is "hostdata=jid=X:id=Y:first=Z"
    X is the journal id that the local mount should use
    Y is a globally unique filesystem identifier used for posix locks
    Z is 1 if this is the first mount of the fs, 0 otherwise

- mount.gfs
  . receives the join result, the hostdata string
  . appends the hostdata to the mount options it will pass to gfs-kernel
  . calls mount(2)

- gfs-kernel
  . does kernel filesystem mount
  . calls dlm_new_lockspace() using fsname (foo) as the lockspace name
  . returns from mount(2)
- mount.gfs
  . sends a message to gfs_controld with the result of the mount(2)
  . adds a line to /etc/mtab for this mount and exits

If mount(2) returns an error, mount.gfs has to leave the group, similar to the unmounting procedure; this is a lot to do to back out at this point, so we want to avoid getting an error back from mount(2) if we can help it.

unmount

The umount(8) command calls umount.gfs, which first calls umount(2) to do the traditional filesystem unmount in gfs-kernel. When the umount(2) completes, umount.gfs leaves the mount group for the fs.

- umount /mnt
  . umount(8) calls umount.gfs
  . umount.gfs calls umount(2)

- gfs-kernel
  . does kernel filesystem unmount
  . calls dlm_release_lockspace()
  . returns from umount(2)

- umount.gfs
  . asks gfs_controld to leave the mount group for this fs; it connects to gfs_controld and sends it the message "leave /mnt foo 0"
  . waits to receive a reply message from gfs_controld

- gfs_controld
  . sends a reply back to umount.gfs
  . leaves the mount group
  . waits for the leave to complete

- umount.gfs
  . receives the reply from gfs_controld, a string containing "0" for ok, or "-EXYZ" for an error
  . removes the /etc/mtab line for this fs and exits

- gfs_controld
  . gets the groupd terminate callback indicating the leave has completed

recovery

- gfs-kernel, filesystem foo (on nodes 1, 2 and 3)
  . running normally
  . nodeid 1 using jid 0, nodeid 2 using jid 1, nodeid 3 using jid 2

- node 3 fails

- gfs_controld (on nodes 1 and 2)
  . gets groupd callback 1 (stop)
  . writes "1" to /sys/fs/gfs/alpha:foo/lock_module/block which tells gfs-kernel to block any new lock requests and not use any locks that are granted from requests in progress
  . sends groupd an ack for callback 1 (stop)
  . groupd does expanded barrier 1, covering all groups affected by the failure
  . groupd completes all callbacks at the fence and dlm levels
  . gets groupd callback 2 (start)
  . writes "2", the jid that was used by node 3, to /sys/fs/gfs/alpha:foo/lock_module/recover, which tells gfs-kernel that journal 2 needs recovery

- gfs-kernel (on nodes 1 and 2)
  . both try to recover journal 2; gfs-kernel sorts out which node actually does the recovery
  . sends a "change foo" uevent message to gfs_controld indicating that recovery for the last write to "recover" has completed

- gfs_controld (on nodes 1 and 2)
  . receives the "change foo" message and reads the result of recovery from:
    /sys/fs/gfs/alpha:foo/lock_module/recover_done - reports the jid to which the following recovery status applies (2 in this case)
    /sys/fs/gfs/alpha:foo/lock_module/recover_status - the result of recovery
      308 (LM_RD_GAVEUP) means recovery failed or didn't happen
      309 (LM_RD_SUCCESS) means recovery completed or was already done
  . if there are more journals to recover, repeats the steps above
  . sends the results of recovery to the group
  . receives its own recovery results
  . sends groupd an ack for callback 2 (start)
  . gets groupd callback 3 (finish)
  . writes "0" to /sys/fs/gfs/alpha:foo/lock_module/block which tells gfs-kernel to unblock new lock requests

- gfs-kernel (on nodes 1 and 2)
  . running normally again
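The sysfs side of this recovery sequence is simple enough to sketch. The following hypothetical example (not gfs_controld code) drives journal 2 recovery for the alpha:foo filesystem using the lock_module files listed above; the real daemon performs these steps only between the groupd callbacks:

  /* hypothetical sketch of the recovery control writes/reads above for
     filesystem alpha:foo; not code from gfs_controld */

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <fcntl.h>
  #include <unistd.h>

  #define LM "/sys/fs/gfs/alpha:foo/lock_module"

  static int write_str(const char *file, const char *val)
  {
          char path[256];
          snprintf(path, sizeof(path), "%s/%s", LM, file);
          int fd = open(path, O_WRONLY);
          if (fd < 0)
                  return -1;
          int rv = write(fd, val, strlen(val));
          close(fd);
          return (rv < 0) ? -1 : 0;
  }

  static int read_int(const char *file)
  {
          char path[256], buf[32];
          int fd, n;
          snprintf(path, sizeof(path), "%s/%s", LM, file);
          fd = open(path, O_RDONLY);
          if (fd < 0)
                  return -1;
          n = read(fd, buf, sizeof(buf) - 1);
          close(fd);
          if (n <= 0)
                  return -1;
          buf[n] = '\0';
          return atoi(buf);
  }

  int main(void)
  {
          write_str("block", "1");     /* block new lock requests */
          write_str("recover", "2");   /* ask gfs-kernel to recover journal 2 */

          /* gfs_controld learns of completion from the "change foo" uevent;
             polling here is only for the sake of the sketch */
          while (read_int("recover_done") != 2)
                  sleep(1);

          if (read_int("recover_status") == 309)   /* LM_RD_SUCCESS */
                  printf("journal 2 recovered\n");

          write_str("block", "0");     /* unblock new lock requests */
          return 0;
  }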
posix locks

gfs-kernel sends posix lock operations to gfs_controld for processing. gfs-kernel and gfs_controld communicate through the /dev/misc/lock_dlm_plock device. The API is defined in include/linux/lock_dlm_plock.h. Processing plocks could potentially be done by a separate process; it is largely independent from the other activities in gfs_controld.

Posix lock state is replicated on all nodes in the mount group. Every plock operation is multicast through a cpg, and processed on all nodes when it is received. On the node that originated the message/operation, the result of the processing is written back to gfs-kernel.

When a new node mounts an fs and joins the mount group, the existing posix locks need to be copied to it. gfs_controld uses the checkpoint AIS service to do the syncing. The low nodeid in the mount group writes all posix lock state to a checkpoint, then notifies the new mounter, which reads the checkpoint and installs all the existing plock state.

debugging

The gfs_controld daemon keeps a circular buffer of debug messages that can be dumped with the 'group_tool dump gfs' command.

The state of all posix locks for a specific fs can also be dumped from gfs_controld with the 'group_tool dump plocks <fsname>' command.