Requirements for pNFS SCSI

To use pNFS SCSI layouts, both the client and the server must have access to a SCSI block device containing an XFS filesystem. The SCSI device must support SCSI Persistent Reservations as described in the SCSI Primary Commands - 3 (SPC-3) specification. Before the server issues layouts to clients, it reserves the SCSI device to ensure that only registered clients may access the device.

Check for a compatible SCSI device

Check that both the server and the client have the proper SCSI device support:

[root@rhel7_pnfs ~]# sg_persist --in --report-capabilities -v /dev/sda
    inquiry cdb: 12 00 00 00 24 00 
  LIO-ORG   block_1           4.0
  Peripheral device type: disk
    Persistent Reservation In cmd: 5e 02 00 00 00 00 00 20 00 00 
Report capabilities response:
  Compatible Reservation Handling(CRH): 1
  Specify Initiator Ports Capable(SIP_C): 1
  All Target Ports Capable(ATP_C): 1
  Persist Through Power Loss Capable(PTPL_C): 1
  Type Mask Valid(TMV): 1
  Allow Commands: 1
  Persist Through Power Loss Active(PTPL_A): 1
    Support indicated in Type mask:
      Write Exclusive, all registrants: 1
      Exclusive Access, registrants only: 1
      Write Exclusive, registrants only: 1
      Exclusive Access: 1
      Write Exclusive: 1
      Exclusive Access, all registrants: 1

Specifically, you should ensure that the Persist Through Power Loss Active (PTPL_A) bit is set.
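
For a quick scripted check (reusing the same /dev/sda device node from the example above), you can filter the capabilities report directly:

[root@rhel7_pnfs ~]# sg_persist --in --report-capabilities /dev/sda | grep PTPL_A
  Persist Through Power Loss Active(PTPL_A): 1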

Server setup

The server must mount the XFS filesystem and then export it via NFS. The server must be configured to serve NFS version 4.1 or higher. When exporting, ensure the ‘pnfs’ option is set on the export.
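
As a sketch, assuming the shared SCSI device is /dev/sda and the export point is /exports/scsi_lun_0 (the paths used in the examples below), the server setup might look like the following. The ‘pnfs’ export option is what allows the server to hand out SCSI layouts; on RHEL 7, NFS version 4.1 is typically enabled by adding -V 4.1 to RPCNFSDARGS in /etc/sysconfig/nfs.

[root@rhel7_pnfs_server ~]# mkdir -p /exports/scsi_lun_0
[root@rhel7_pnfs_server ~]# mount -t xfs /dev/sda /exports/scsi_lun_0
[root@rhel7_pnfs_server ~]# cat /etc/exports
/exports/scsi_lun_0  *(rw,pnfs)
[root@rhel7_pnfs_server ~]# exportfs -r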

Client setup

Clients must not mount the XFS filesystem directly; instead, they must mount the filesystem via NFS from the server. NFS version 4.1 or higher must be used.
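
For example, using the server address and export path that appear in the examples below:

[root@rhel7_pnfs ~]# mount -t nfs -o nfsvers=4.1 192.168.122.73:/exports/scsi_lun_0 /mnt/rhel7/scsi_lun_0

You can confirm that a SCSI layout type was negotiated by looking for pnfs=LAYOUT_SCSI in the mount’s /proc/self/mountstats entry, as shown in the mountstats section below.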

Check for proper operation of SCSI layouts

When the NFS client wants to read from or write to a file, it performs a LAYOUTGET operation. The server responds with the location of the file on the SCSI device. The client may need to perform an additional GETDEVICEINFO operation to determine which SCSI device to use. If these operations work correctly, the client can issue I/O directly to the SCSI device instead of sending READ and WRITE operations to the server.

Sometimes, errors or contention between clients will cause the server to recall layouts or to decline to issue them. In those cases, the clients fall back to sending READ and WRITE operations to the server instead of issuing I/O directly to the SCSI device.

Because of this, we suggest several methods for monitoring pNFS SCSI layout functionality:

Checking pNFS SCSI operation from the server

You can use the ‘nfsstat’ utility to monitor operations serviced by the NFS server. If the server is serving layouts, the ‘layoutget’, ‘layoutreturn’, and ‘layoutcommit’ counters will increment. If the clients are performing I/O directly to the SCSI devices, the server’s ‘read’ and ‘write’ op counters will not increment.

[root@rhel7_pnfs_server ~]# watch -d "nfsstat -s | egrep -A1 read\|write\|layout"
Every 2.0s: nfsstat -s | egrep -A1 read\|write\|layout                    

putrootfh    read         readdir      readlink     remove	 rename
2         0% 0         0% 1         0% 0         0% 0         0% 0         0%
--
setcltidconf verify	  write        rellockowner bc_ctl	 bind_conn
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
--
getdevlist   layoutcommit layoutget    layoutreturn secinfononam sequence
0         0% 29        1% 49        1% 5         0% 0         0% 2435     86%

Checking pNFS SCSI operation via wire capture

Another way to test for pNFS SCSI operation is to capture wire traffic using wireshark or tshark. If the client is performing I/O directly to the SCSI device, the typical NFS operation pattern for a simple open, read or write, and close will show no READ or WRITE operations on the wire. Instead, you should expect to see OPEN, LAYOUTGET, LAYOUTCOMMIT, and CLOSE:

[root@rhel7_pnfs_server ~]# tshark -i eth0 -w/tmp/pcap -P port 2049
Running as user "root" and group "root". This could be dangerous.
Capturing on 'eth0'
  1 0.000000000 192.168.122.110 -> 192.168.122.73 NFS 190 V4 Call SEQUENCE
  2 0.000101340 192.168.122.73 -> 192.168.122.110 NFS 150 V4 Reply (Call In 1) SEQUENCE
  3 0.000385732 192.168.122.110 -> 192.168.122.73 TCP 66 939 > nfs [ACK] Seq=125 Ack=85 Win=1394 Len=0 TSval=15546560 TSecr=15281808
  4 3.250235477 192.168.122.110 -> 192.168.122.73 NFS 254 V4 Call ACCESS FH: 0xd909126a, [Check: RD LU MD XT DL]
  5 3.250408390 192.168.122.73 -> 192.168.122.110 NFS 238 V4 Reply (Call In 4) ACCESS, [Allowed: RD LU MD XT DL]
  6 3.250659904 192.168.122.110 -> 192.168.122.73 TCP 66 939 > nfs [ACK] Seq=313 Ack=257 Win=1393 Len=0 TSval=15549810 TSecr=15285058
  7 3.250794941 192.168.122.110 -> 192.168.122.73 NFS 246 V4 Call ACCESS FH: 0xfad4f1c2, [Check: RD LU MD XT DL]
  8 3.250879674 192.168.122.73 -> 192.168.122.110 NFS 238 V4 Reply (Call In 7) ACCESS, [Allowed: RD LU MD XT DL]
  9 3.251282500 192.168.122.110 -> 192.168.122.73 NFS 250 V4 Call GETATTR FH: 0x8d1f26a0
 10 3.251359826 192.168.122.73 -> 192.168.122.110 NFS 318 V4 Reply (Call In 9) GETATTR
 11 3.251820209 192.168.122.110 -> 192.168.122.73 NFS 318 V4 Call OPEN DH: 0x8d1f26a0/
 12 3.251922063 192.168.122.73 -> 192.168.122.110 NFS 394 V4 Reply (Call In 11) OPEN StateID: 0x8ebb
 13 3.252480647 192.168.122.110 -> 192.168.122.73 NFS 290 V4 Call SETATTR FH: 0x8d1f26a0
 14 3.261844152 192.168.122.73 -> 192.168.122.110 NFS 342 V4 Reply (Call In 13) SETATTR
 15 3.301863674 192.168.122.110 -> 192.168.122.73 TCP 66 939 > nfs [ACK] Seq=1153 Ack=1285 Win=1394 Len=0 TSval=15549862 TSecr=15285069
 16 3.391964840 192.168.122.110 -> 192.168.122.73 NFS 294 V4 Call LAYOUTGET
 17 3.394264032 192.168.122.73 -> 192.168.122.110 NFS 266 V4 Reply (Call In 16) LAYOUTGET
 18 3.394446036 192.168.122.110 -> 192.168.122.73 TCP 66 939 > nfs [ACK] Seq=1381 Ack=1485 Win=1393 Len=0 TSval=15549954 TSecr=15285202
 19 4.896868762 192.168.122.110 -> 192.168.122.73 NFS 334 V4 Call LAYOUTCOMMIT
 20 4.936073306 192.168.122.73 -> 192.168.122.110 TCP 66 nfs > 939 [ACK] Seq=1485 Ack=1649 Win=1374 Len=0 TSval=15286744 TSecr=15551456
 21 6.901678513 192.168.122.73 -> 192.168.122.110 NFS 242 V4 Reply (Call In 19) LAYOUTCOMMIT
 22 6.901991838 192.168.122.110 -> 192.168.122.73 TCP 66 939 > nfs [ACK] Seq=1649 Ack=1661 Win=1393 Len=0 TSval=15553462 TSecr=15288709
 23 6.902175386 192.168.122.110 -> 192.168.122.73 NFS 330 V4 Call LAYOUTRETURN | CLOSE StateID: 0x8ebb
 24 6.902203058 192.168.122.73 -> 192.168.122.110 TCP 66 nfs > 939 [ACK] Seq=1661 Ack=1913 Win=1382 Len=0 TSval=15288710 TSecr=15553462
 25 6.902346274 192.168.122.73 -> 192.168.122.110 NFS 258 V4 Reply (Call In 23) LAYOUTRETURN | CLOSE
 26 6.942075039 192.168.122.110 -> 192.168.122.73 TCP 66 939 > nfs [ACK] Seq=1913 Ack=1853 Win=1394 Len=0 TSval=15553502 TSecr=15288710
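
The capture saved to /tmp/pcap can also be inspected after the fact. For example, a display filter limits the output to NFS traffic (older tshark releases use -R in place of -Y):

[root@rhel7_pnfs_server ~]# tshark -r /tmp/pcap -Y nfs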

Checking pNFS SCSI operation on the client via tracepoints

On NFS clients, enabling the nfs4:nfs4_pnfs_read and nfs4:nfs4_pnfs_write tracepoints shows the result of each pNFS I/O attempt by the client. For example:

[root@rhel7_pnfs ~]# echo nfs4:nfs4_pnfs_{read,write} > /sys/kernel/debug/tracing/set_event
[root@rhel7_pnfs ~]# echo hi > /mnt/rhel7/scsi_lun_0/foo
[root@rhel7_pnfs ~]# echo 3 >/proc/sys/vm/drop_caches 
[root@rhel7_pnfs ~]# cat /mnt/rhel7/scsi_lun_0/foo
hi
[root@rhel7_pnfs ~]# cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 2/2   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
	 kworker/2:1-4702  [002] .... 16101.588128: nfs4_pnfs_write: error=0 (OK) fileid=00:2a:103 fhandle=0x8d1f26a0 offset=0 count=3 stateid=1:0x4efae1ff
	 kworker/2:1-4702  [002] .... 16231.753310: nfs4_pnfs_read: error=0 (OK) fileid=00:2a:103 fhandle=0x8d1f26a0 offset=0 count=3 stateid=1:0x792411cd
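
When you are finished, the tracepoints can be disabled and the trace buffer cleared by writing empty values to the same tracing control files:

[root@rhel7_pnfs ~]# echo > /sys/kernel/debug/tracing/set_event
[root@rhel7_pnfs ~]# echo > /sys/kernel/debug/tracing/trace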

Checking pNFS SCSI operation on the client via mountstats

On NFS clients, the per-mount operation counters can be viewed in /proc/self/mountstats. Monitoring the READ, WRITE, and LAYOUT operation counters for a given workload can indicate whether pNFS is in use:

[root@rhel7_pnfs ~]# cat /proc/self/mountstats | awk /scsi_lun_0/,/^$/ | egrep device\|READ\|WRITE\|LAYOUT
device 192.168.122.73:/exports/scsi_lun_0 mounted on /mnt/rhel7/scsi_lun_0 with fstype nfs4 statvers=1.1
    nfsv4:  bm0=0xfdffbfff,bm1=0x40f9be3e,bm2=0x803,acl=0x3,sessions,pnfs=LAYOUT_SCSI
            READ: 0 0 0 0 0 0 0 0
           WRITE: 0 0 0 0 0 0 0 0
        READLINK: 0 0 0 0 0 0 0 0
         READDIR: 0 0 0 0 0 0 0 0
       LAYOUTGET: 49 49 0 11172 9604 2 19448 19454
    LAYOUTCOMMIT: 28 28 0 7776 4808 0 24719 24722
    LAYOUTRETURN: 0 0 0 0 0 0 0 0
     LAYOUTSTATS: 0 0 0 0 0 0 0 0
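
As with the server-side ‘nfsstat’ check, this pipeline can be wrapped in ‘watch’ to see the counters move while a workload runs:

[root@rhel7_pnfs ~]# watch -d "cat /proc/self/mountstats | awk /scsi_lun_0/,/^$/ | egrep READ\|WRITE\|LAYOUT"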

Checking pNFS SCSI operation on the client via iostat

If your NFS client is performing I/O directly to the SCSI block device, you can expect the I/O counters for that block device to change. The ‘iostat’ utility can show that your client is using the block device directly:

[root@rhel7_pnfs ~]# dd if=/dev/zero of=/mnt/rhel7/scsi_lun_0/foo bs=1M count=4096 
[root@rhel7_pnfs ~]# iostat 2 /dev/sda
Linux 3.10.0.ecb4e8c7c0 (rhel7_pnfs) 	05/08/2019 	_x86_64_	(4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.07    0.00    0.04    0.07    0.02   99.81

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.07         0.18       283.44       3128    4813304

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    2.40   14.77    0.13   82.70

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0

...

4096+0 records out
4294967296 bytes (4.3 GB) copied, 42.1628 s, 102 MB/s

Releasing the server’s SCSI reservation

Once the NFS server has reserved the SCSI device, many operations on that device will fail for clients that can issue commands to the device but are not registered with it. For example, the ‘blkid’ command will fail to show the UUID of the XFS filesystem if the client has not been given a layout for that device.

The server does not remove its own persistent reservation. This protects the data within the filesystem on the device across restarts of clients and servers. To re-purpose the SCSI device, you may need to manually remove the NFS server’s persistent reservation.

Use the sg_persist command from the sg3_utils package. You must remove the registration from the server itself; it cannot be removed from a different I_T nexus.

Example - using sg_persist to query an existing reservation:

[root@rhel7 ~]# sg_persist -r /dev/sda
  LIO-ORG   block_1           4.0
  Peripheral device type: disk
  PR generation=0x8, Reservation follows:
    Key=0x100000000000000
    scope: LU_SCOPE,  type: Exclusive Access, registrants only

Example - using sg_persist to remove an existing reservation:

[root@rhel7 ~]# sg_persist --out --release --param-rk=0x100000000000000 --prout-type=6 /dev/sda
  LIO-ORG   block_1           4.0
  Peripheral device type: disk
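
Note that releasing the reservation leaves the server’s registration on the device. If the registration also needs to be removed, one approach (a sketch, reusing the reservation key shown in the query above) is to register a service action key of zero, which unregisters the I_T nexus:

[root@rhel7 ~]# sg_persist --out --register --param-rk=0x100000000000000 --param-sark=0 /dev/sda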