GFS1 Fast Statfs() Implementation
wcheng@redhat.com
03/02/2007

Installation and Run Script
===========================

Changes checked into CVS, archived at:
https://www.redhat.com/archives/cluster-devel/2007-March/msg00124.html

The usage is a little awkward: after mount, the "gfs_tool" command has to
be run on each and every node to turn this option on. After any unclean
umount (e.g. a node crash) and re-mount, the procedure has to be repeated
(i.e. run gfs_tool on each and every node again).

  shell> gfs_tool settune statfs_fast 1

The old behavior can be restored dynamically at any time with:

  shell> gfs_tool settune statfs_fast 0

A quick test on a quiet cluster gave these results:

  dhcp145 (1 cpu HP):    old df took 0.875 seconds, new df 0.008 seconds
  dhcp146 (4 cpus DELL): old df took 0.808 seconds, new df 0.006 seconds

The Problem and GFS2's Approach
===============================

GFS disk blocks are managed by Resource Groups (RGs) in a distributed
manner. The filesystem is divided into sections of 256MB per RG. Each RG
manages its own disk blocks and stores the block usage statistics in its
own control structures. The "statfs" system call (a frequently invoked
function used by popular commands such as "df") walks through all RGs and
adds the usage counts together. In a 1-TB filesystem, this means each
statfs() call has to scan a total of 4096 (1024*1024/256) RGs to obtain
the required data. The most troublesome aspect of this implementation is
that it has to obtain 4096 shared (read) RG locks across the cluster
before the call can complete.

GFS2 alleviates the issue by writing the local (per-node) statfs changes
into a per-node file whenever disk blocks change. Every
"gt_statfs_quantum" seconds (a tunable, 30 by default), the "quotad"
daemon adds the local changes into a cluster-wide master file (one per
filesystem) and then zeroes out its local copy.
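
The delta-plus-master scheme described above can be illustrated with a
small user-space model. This is only a sketch, not the GFS2 kernel code:
the names StatfsCounts, Node and rgs_to_scan are hypothetical, the master
file is stood in for by a shared Python object, and no cluster locking is
modeled.

```python
from dataclasses import dataclass, replace

RG_SIZE_MB = 256  # resource group size given in the text


def rgs_to_scan(fs_size_mb: int) -> int:
    """Number of RGs (and shared RG locks) the old statfs has to visit."""
    return fs_size_mb // RG_SIZE_MB


@dataclass
class StatfsCounts:
    total: int = 0
    free: int = 0

    def plus(self, other: "StatfsCounts") -> "StatfsCounts":
        return StatfsCounts(self.total + other.total, self.free + other.free)


class Node:
    """One cluster node: a local delta plus a cached copy of the master."""

    def __init__(self, master: StatfsCounts):
        self.master = master           # stand-in for the on-disk master file
        self.cached = replace(master)  # master contents as last read in
        self.local = StatfsCounts()    # changes not yet synced

    def change(self, free_delta: int) -> None:
        # Called on block allocation (negative) or deallocation (positive).
        self.local.free += free_delta

    def sync(self) -> None:
        # Periodic step: fold the local delta into the master, zero the
        # delta, and refresh the cached master copy.
        merged = self.master.plus(self.local)
        self.master.total, self.master.free = merged.total, merged.free
        self.local = StatfsCounts()
        self.cached = replace(self.master)

    def statfs(self) -> StatfsCounts:
        # No cluster-wide work: cached master adjusted by the local delta.
        return self.cached.plus(self.local)
```

In this model a node sees its own changes immediately, while changes made
on other nodes become visible only after both sides have synced, which is
the trade-off the scheme accepts.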

The original author commented: "The end effect is that a df can be
completed without any network access (just a local spinlock) without
affecting [de]allocation performance. What you give up is the ability to
see statfs changes that have happened very recently on other nodes in the
cluster. I believe it will be good enough for most uses."

GFS1 Port
=========

A few compromises were made while porting the GFS2 approach over to GFS1,
mostly to avoid on-disk structure changes. Note that GFS2 allocates
(number-of-nodes + 1) physical files on disk at mkfs time, but GFS1 has
only one spare slot (the unused license file) for this purpose. We
deviate from the GFS2 implementation by writing the local per-node
changes into a memory buffer. This, in turn, creates a recovery issue:
upon unclean shutdown (say, a node crashes before it can sync its changes
into the master file), the local in-memory changes are lost.

There are a few possible approaches currently on the table to handle
this. One of them is to add an on-disk version number as part of the
master file contents. After an unclean umount, right after journal
recovery, the on-disk version number is bumped up by one and the master
copy is updated with the statfs data obtained from the old method.
Whenever a node is ready to flush its changes and sees that the on-disk
version number is higher than its local (saved) version number, it should
zero out the local copy and bump up its local version number instead of
adding its local changes into the master file, on the assumption that the
local changes have already been incorporated into the data obtained via
the old method. The side effect of this approach is that after each
unclean umount, the statistics will be off (hopefully on a negligible
scale) from that point on.
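
The version-number proposal above can be sketched as follows. This is a
minimal model of one possible reading, not the checked-in code: the names
MasterFile, NodeDelta and recover_after_unclean_umount are hypothetical,
only a free-block count is tracked, and setting the local version equal
to the on-disk one is an assumption about what "bump up" means.

```python
class MasterFile:
    """Stand-in for the on-disk master file with a version number."""

    def __init__(self, free_blocks: int):
        self.version = 0
        self.free_blocks = free_blocks


def recover_after_unclean_umount(master: MasterFile, rescanned_free: int) -> None:
    # Runs right after journal recovery: bump the on-disk version and
    # reseed the master copy with data from the old (full RG scan) method.
    master.version += 1
    master.free_blocks = rescanned_free


class NodeDelta:
    """Per-node in-memory delta plus the last master version it saw."""

    def __init__(self, master: MasterFile):
        self.version = master.version
        self.delta = 0

    def flush(self, master: MasterFile) -> bool:
        """Return True if the delta was added to the master file."""
        if master.version > self.version:
            # The master was reseeded after a crash elsewhere. Assume the
            # rescan already reflects this delta, so drop it.
            self.delta = 0
            self.version = master.version  # assumption: catch up fully
            return False
        master.free_blocks += self.delta
        self.delta = 0
        return True
```

Any allocation a node makes between the rescan and its next flush is
counted by neither side in this model, which is one source of the drift
mentioned above.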

Whether the side effect of the version-number approach is truly
"negligible" is debatable, but one could argue that given GFS's
distributed nature (there is no centralized metadata server), the
statistics are always an approximation no matter what we do, even with
the current performance-plagued old method (where each lock is released
asynchronously as soon as its RG data is read).

Nevertheless, the current code works as follows:

1. Upon each mount, the local copy is zeroed but the fast statfs logic is
   not triggered.

2. Fast statfs is started by issuing the "gfs_tool settune" command on
   each and every node after mount.

     shell> gfs_tool settune statfs_fast 1

   Changes made by nodes on which fast statfs has not been started are
   not collected (seen) by the fast statfs system call on any node in
   the cluster. The start call:

   2.1 Invokes the old method to obtain the "almost-correct" statfs info.
   2.2 Obtains the master file exclusive glock.
   2.3 Writes the statfs data (from 2.1) into the master file.
   2.4 If everything goes well, sets the gt_statfs_fast flag to 1.
   2.5 Local changes start to get picked up (based on the gt_statfs_fast
       flag).
   2.6 The local change (delta) is synced to disk whenever the quota
       daemon wakes up (at a tunable interval, 5 seconds by default). It
       is then zeroed out.
   2.7 Repeats from step 2.5 as long as gt_statfs_fast is non-zero.

3. Whenever the statfs() system call is invoked and gt_statfs_fast is on,
   the call returns the master file contents as last read in, adjusted by
   the local (delta) changes. If gt_statfs_fast is zero, the old method
   is invoked.

4. Upon node recovery (after an unclean shutdown), "gfs_tool settune" can
   be invoked on each and every node to resume statfs activities. If this
   is not done on a relatively quiet cluster (one with negligible write
   activity), the statfs data could be off to an unspecified degree. Note
   that each call into "gfs_tool settune" restarts the statistics
   collection by repeating the steps described in step 2.

5. Fast statfs can be turned off dynamically (at any time) by using the
   gfs_tool command to set gt_statfs_fast back to zero on each node.

     shell> gfs_tool settune statfs_fast 0

6. On and off can be mixed and repeated after mounts. However, the user
   is expected to understand how step 2 works in order to fully interpret
   the fast statfs statistics.

To Do Items
===========

Research ways to implement a cman-based (or any other) command to start
and stop cluster-wide fast statfs from one node. This should greatly
reduce the awkwardness of using this implementation.

-------------- end of write-up