pNFS Setup on RHEL7
Requirements for pNFS SCSI
To use pNFS SCSI layouts, both the client and the server must have access to a SCSI block device containing an XFS filesystem. The SCSI device must support SCSI Persistent Reservations as described in the SCSI-3 Primary Commands specification. Before the server issues layouts to clients, it reserves the SCSI device to ensure that only registered clients may access it.
Check for compatible SCSI device
Check for the proper SCSI device support from both your server and client:
[root@rhel7_pnfs ~]# sg_persist --in --report-capabilities -v /dev/sda
inquiry cdb: 12 00 00 00 24 00
LIO-ORG block_1 4.0
Peripheral device type: disk
Persistent Reservation In cmd: 5e 02 00 00 00 00 00 20 00 00
Report capabilities response:
Compatible Reservation Handling(CRH): 1
Specify Initiator Ports Capable(SIP_C): 1
All Target Ports Capable(ATP_C): 1
Persist Through Power Loss Capable(PTPL_C): 1
Type Mask Valid(TMV): 1
Allow Commands: 1
Persist Through Power Loss Active(PTPL_A): 1
Support indicated in Type mask:
Write Exclusive, all registrants: 1
Exclusive Access, registrants only: 1
Write Exclusive, registrants only: 1
Exclusive Access: 1
Write Exclusive: 1
Exclusive Access, all registrants: 1
Specifically, ensure that the Persist Through Power Loss Active (PTPL_A) bit is set to 1.
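The PTPL_A check can also be scripted. The following is a minimal sketch rather than part of the original procedure: the helper name ptpl_active is hypothetical, and it assumes the sg_persist output format shown above.

```shell
# Hypothetical helper: reads sg_persist capability output on stdin and
# succeeds only if the "Persist Through Power Loss Active(PTPL_A)" line
# reports 1.
ptpl_active() {
    grep -Eq 'PTPL_A\):[[:space:]]*1[[:space:]]*$'
}

# Example use on a real device (needs root and sg3_utils):
#   sg_persist --in --report-capabilities /dev/sda | ptpl_active \
#       && echo "PTPL_A is set"
```

Reading from stdin keeps the helper usable on both the server and the client, and on saved output.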
Server setup
The server should mount the XFS filesystem, and then be configured to export that filesystem via NFS. The server must be configured to serve NFS version 4.1 or higher. When exporting, ensure the ‘pnfs’ option is set on the export.
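As an illustration, an /etc/exports entry for this setup might look like "/exports/scsi_lun_0 *(rw,pnfs)". The export path matches the one that appears in the mountstats output in this document; the "*" client specification and the "rw" option are assumptions. The sketch below uses a hypothetical helper, export_has_pnfs, to confirm that an exports file carries the ‘pnfs’ option for a given path.

```shell
# Hypothetical helper: reads exports(5)-format text on stdin and succeeds
# if the export named in $1 lists the "pnfs" option for at least one client.
export_has_pnfs() {
    awk -v path="$1" '
        $1 == path {
            for (i = 2; i <= NF; i++) {
                # Each remaining field looks like client(opt1,opt2,...).
                if (match($i, /\(.*\)/)) {
                    opts = "," substr($i, RSTART + 1, RLENGTH - 2) ","
                    if (index(opts, ",pnfs,")) found = 1
                }
            }
        }
        END { exit !found }
    '
}

# Example use on the server:
#   cat /etc/exports | export_has_pnfs /exports/scsi_lun_0 && echo "pnfs enabled"
```

The surrounding commas in the option match keep "no_pnfs" from being mistaken for "pnfs".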
Client setup
Clients must not mount the XFS filesystem directly. Instead, they mount the filesystem via NFS from the server, using NFS version 4.1 or higher.
Check for proper operation of SCSI layouts
When the NFS client wishes to read from or write to a file, it performs a LAYOUTGET operation. The server responds with the location of the file on the SCSI device. The client may need to perform an additional GETDEVICEINFO operation to determine which SCSI device to use. If these operations work correctly, the client can issue I/O directly to the SCSI device instead of sending READ and WRITE operations to the server.
Sometimes, errors or contention between clients will cause the server to recall layouts or not issue them to the clients. In those cases, the clients will fall back to issuing READ and WRITE to the server instead of sending I/O directly to the SCSI device.
Because of this, we suggest several methods for monitoring pNFS SCSI layout functionality:
Checking pNFS SCSI Operation from the server
You can use the ‘nfsstat’ utility to monitor operations serviced by the NFS server. If the server is serving layouts, the ‘layoutget’, ‘layoutreturn’, and ‘layoutcommit’ counters will increment. If the clients are performing I/O directly to the SCSI devices, the server’s ‘read’ and ‘write’ op counters will not increment.
[root@rhel7_pnfs_server ~]# watch -d "nfsstat -s | egrep -A1 read\|write\|layout"
Every 2.0s: nfsstat -s | egrep -A1 read\|write\|layout
putrootfh read readdir readlink remove rename
2 0% 0 0% 1 0% 0 0% 0 0% 0 0%
--
setcltidconf verify write rellockowner bc_ctl bind_conn
0 0% 0 0% 0 0% 0 0% 0 0% 0 0%
--
getdevlist layoutcommit layoutget layoutreturn secinfononam sequence
0 0% 29 1% 49 1% 5 0% 0 0% 2435 86%
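To compare counters before and after a workload without watching the screen, a single counter can be pulled out of the nfsstat output. This is a rough sketch: nfsstat_counter is a hypothetical helper, and it assumes the two-row layout shown above (a row of operation names followed by a row of count and percentage pairs).

```shell
# Hypothetical helper: reads "nfsstat -s" output on stdin and prints the
# count for the operation named in $1, once per section in which the
# name appears.
nfsstat_counter() {
    awk -v op="$1" '
        # A value row starts with a digit; look the operation up in the
        # name row that immediately preceded it.
        /^[[:space:]]*[0-9]/ {
            for (i = 1; i <= n; i++)
                if (names[i] == op) print $(2 * i - 1)
            n = 0
            next
        }
        # Any other row is treated as a candidate name row.
        { n = split($0, names, " ") }
    '
}

# Example use on the server:
#   nfsstat -s | nfsstat_counter layoutget
```

If ‘layoutget’ grows across a workload while ‘read’ and ‘write’ stay flat, clients are most likely doing their I/O against the SCSI device.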
Checking pNFS SCSI operation via wire capture
Another way to test for pNFS SCSI operation is to capture wire traffic using wireshark or tshark. The typical NFS operation pattern for a simple open, read or write, close will show no READ or WRITE operations on the wire if the client is writing to the SCSI device. Instead, you should expect to see OPEN, LAYOUTGET, LAYOUTCOMMIT, CLOSE:
[root@rhel7_pnfs_server ~]# tshark -i eth0 -w/tmp/pcap -P port 2049
Running as user "root" and group "root". This could be dangerous.
Capturing on 'eth0'
1 0.000000000 192.168.122.110 -> 192.168.122.73 NFS 190 V4 Call SEQUENCE
2 0.000101340 192.168.122.73 -> 192.168.122.110 NFS 150 V4 Reply (Call In 1) SEQUENCE
3 0.000385732 192.168.122.110 -> 192.168.122.73 TCP 66 939 > nfs [ACK] Seq=125 Ack=85 Win=1394 Len=0 TSval=15546560 TSecr=15281808
4 3.250235477 192.168.122.110 -> 192.168.122.73 NFS 254 V4 Call ACCESS FH: 0xd909126a, [Check: RD LU MD XT DL]
5 3.250408390 192.168.122.73 -> 192.168.122.110 NFS 238 V4 Reply (Call In 4) ACCESS, [Allowed: RD LU MD XT DL]
6 3.250659904 192.168.122.110 -> 192.168.122.73 TCP 66 939 > nfs [ACK] Seq=313 Ack=257 Win=1393 Len=0 TSval=15549810 TSecr=15285058
7 3.250794941 192.168.122.110 -> 192.168.122.73 NFS 246 V4 Call ACCESS FH: 0xfad4f1c2, [Check: RD LU MD XT DL]
8 3.250879674 192.168.122.73 -> 192.168.122.110 NFS 238 V4 Reply (Call In 7) ACCESS, [Allowed: RD LU MD XT DL]
9 3.251282500 192.168.122.110 -> 192.168.122.73 NFS 250 V4 Call GETATTR FH: 0x8d1f26a0
10 3.251359826 192.168.122.73 -> 192.168.122.110 NFS 318 V4 Reply (Call In 9) GETATTR
11 3.251820209 192.168.122.110 -> 192.168.122.73 NFS 318 V4 Call OPEN DH: 0x8d1f26a0/
12 3.251922063 192.168.122.73 -> 192.168.122.110 NFS 394 V4 Reply (Call In 11) OPEN StateID: 0x8ebb
13 3.252480647 192.168.122.110 -> 192.168.122.73 NFS 290 V4 Call SETATTR FH: 0x8d1f26a0
14 3.261844152 192.168.122.73 -> 192.168.122.110 NFS 342 V4 Reply (Call In 13) SETATTR
15 3.301863674 192.168.122.110 -> 192.168.122.73 TCP 66 939 > nfs [ACK] Seq=1153 Ack=1285 Win=1394 Len=0 TSval=15549862 TSecr=15285069
16 3.391964840 192.168.122.110 -> 192.168.122.73 NFS 294 V4 Call LAYOUTGET
17 3.394264032 192.168.122.73 -> 192.168.122.110 NFS 266 V4 Reply (Call In 16) LAYOUTGET
18 3.394446036 192.168.122.110 -> 192.168.122.73 TCP 66 939 > nfs [ACK] Seq=1381 Ack=1485 Win=1393 Len=0 TSval=15549954 TSecr=15285202
19 4.896868762 192.168.122.110 -> 192.168.122.73 NFS 334 V4 Call LAYOUTCOMMIT
20 4.936073306 192.168.122.73 -> 192.168.122.110 TCP 66 nfs > 939 [ACK] Seq=1485 Ack=1649 Win=1374 Len=0 TSval=15286744 TSecr=15551456
21 6.901678513 192.168.122.73 -> 192.168.122.110 NFS 242 V4 Reply (Call In 19) LAYOUTCOMMIT
22 6.901991838 192.168.122.110 -> 192.168.122.73 TCP 66 939 > nfs [ACK] Seq=1649 Ack=1661 Win=1393 Len=0 TSval=15553462 TSecr=15288709
23 6.902175386 192.168.122.110 -> 192.168.122.73 NFS 330 V4 Call LAYOUTRETURN | CLOSE StateID: 0x8ebb
24 6.902203058 192.168.122.73 -> 192.168.122.110 TCP 66 nfs > 939 [ACK] Seq=1661 Ack=1913 Win=1382 Len=0 TSval=15288710 TSecr=15553462
25 6.902346274 192.168.122.73 -> 192.168.122.110 NFS 258 V4 Reply (Call In 23) LAYOUTRETURN | CLOSE
26 6.942075039 192.168.122.110 -> 192.168.122.73 TCP 66 939 > nfs [ACK] Seq=1913 Ack=1853 Win=1394 Len=0 TSval=15553502 TSecr=15288710
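The same check can be applied to a saved capture by counting READ and WRITE calls in the one-line packet summaries. The helper below, count_rw_calls, is a hypothetical sketch that assumes the summary format shown above; a result of 0 suggests that data I/O bypassed the server.

```shell
# Hypothetical helper: reads tshark one-line summaries on stdin and prints
# the number of NFSv4 calls that contain a READ or WRITE operation.
count_rw_calls() {
    # -w matches whole words only, so READDIR and READLINK are not counted.
    # grep -c exits non-zero when the count is 0; that is not an error here.
    grep 'V4 Call' | grep -Ewc 'READ|WRITE' || true
}

# Example use on the capture written above:
#   tshark -r /tmp/pcap | count_rw_calls
```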
Checking pNFS SCSI operation on the client via tracepoints
On NFS clients, enabling the nfs4:nfs4_pnfs_read and nfs4:nfs4_pnfs_write tracepoints shows the result of each pNFS I/O the client attempts. For example:
[root@rhel7_pnfs ~]# echo nfs4:nfs4_pnfs_{read,write} > /sys/kernel/debug/tracing/set_event
[root@rhel7_pnfs ~]# echo hi > /mnt/rhel7/scsi_lun_0/foo
[root@rhel7_pnfs ~]# echo 3 >/proc/sys/vm/drop_caches
[root@rhel7_pnfs ~]# cat /mnt/rhel7/scsi_lun_0/foo
hi
[root@rhel7_pnfs ~]# cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 2/2 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
kworker/2:1-4702 [002] .... 16101.588128: nfs4_pnfs_write: error=0 (OK) fileid=00:2a:103 fhandle=0x8d1f26a0 offset=0 count=3 stateid=1:0x4efae1ff
kworker/2:1-4702 [002] .... 16231.753310: nfs4_pnfs_read: error=0 (OK) fileid=00:2a:103 fhandle=0x8d1f26a0 offset=0 count=3 stateid=1:0x792411cd
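On a busy client the trace buffer can grow long, so it may help to count only the pNFS reads and writes that did not report error=0. The sketch below uses a hypothetical helper, pnfs_trace_failures; the "error=0 (OK)" success format comes from the trace above, and treating every other error value as a failure is an assumption.

```shell
# Hypothetical helper: reads the ftrace buffer on stdin and prints how many
# nfs4_pnfs_read / nfs4_pnfs_write events reported something other than
# "error=0".
pnfs_trace_failures() {
    # grep -c exits non-zero when the count is 0; that is not an error here.
    grep -E 'nfs4_pnfs_(read|write): ' | grep -vc 'error=0 ' || true
}

# Example use on the client:
#   cat /sys/kernel/debug/tracing/trace | pnfs_trace_failures
# To disable the tracepoints again when finished:
#   echo > /sys/kernel/debug/tracing/set_event
```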
Checking pNFS SCSI operation on the client via mountstats
On NFS clients, the per-mount operation counters can be viewed in mountstats. Monitoring READ, WRITE, and the LAYOUT operation counters for a given workload can indicate the use of pNFS:
[root@rhel7_pnfs ~]# cat /proc/self/mountstats | awk /scsi_lun_0/,/^$/ | egrep device\|READ\|WRITE\|LAYOUT
device 192.168.122.73:/exports/scsi_lun_0 mounted on /mnt/rhel7/scsi_lun_0 with fstype nfs4 statvers=1.1
nfsv4: bm0=0xfdffbfff,bm1=0x40f9be3e,bm2=0x803,acl=0x3,sessions,pnfs=LAYOUT_SCSI
READ: 0 0 0 0 0 0 0 0
WRITE: 0 0 0 0 0 0 0 0
READLINK: 0 0 0 0 0 0 0 0
READDIR: 0 0 0 0 0 0 0 0
LAYOUTGET: 49 49 0 11172 9604 2 19448 19454
LAYOUTCOMMIT: 28 28 0 7776 4808 0 24719 24722
LAYOUTRETURN: 0 0 0 0 0 0 0 0
LAYOUTSTATS: 0 0 0 0 0 0 0 0
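The "pnfs=" field on the nfsv4 line is a quick indicator of which layout type, if any, the client negotiated for a mount. The following sketch extracts it for one mount point; mount_layout_type is a hypothetical helper and assumes the mountstats layout shown above.

```shell
# Hypothetical helper: reads mountstats text on stdin and prints the value
# of the "pnfs=" field for the mount point given in $1.
mount_layout_type() {
    awk -v mp="$1" '
        # "device <server>:<path> mounted on <mountpoint> ..." starts a block.
        $1 == "device" { in_mount = ($5 == mp) }
        in_mount && /pnfs=/ {
            sub(/.*pnfs=/, "")   # drop everything up to the value
            sub(/,.*/, "")       # drop any options that follow it
            print
            exit
        }
    '
}

# Example use on the client:
#   cat /proc/self/mountstats | mount_layout_type /mnt/rhel7/scsi_lun_0
```

For the mount shown above this prints LAYOUT_SCSI.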
Checking pNFS SCSI operation on the client via iostat
If your NFS client is performing I/O directly to the SCSI block device, the I/O counters for that block device will change. The ‘iostat’ utility can show that your client is using the block device directly:
[root@rhel7_pnfs ~]# dd if=/dev/zero of=/mnt/rhel7/scsi_lun_0/foo bs=1M count=4096
[root@rhel7_pnfs ~]# iostat 2 /dev/sda
Linux 3.10.0.ecb4e8c7c0 (rhel7_pnfs) 05/08/2019 _x86_64_ (4 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
0.07 0.00 0.04 0.07 0.02 99.81
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 0.07 0.18 283.44 3128 4813304
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 2.40 14.77 0.13 82.70
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 0.00 0.00 0.00 0 0
...
4096+0 records out
4294967296 bytes (4.3 GB) copied, 42.1628 s, 102 MB/s
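When a single number is easier to work with than a running iostat display, the kernel's per-device counters in /proc/diskstats can be sampled before and after the workload. This is a sketch using a hypothetical helper, sectors_written; it relies on the tenth column of /proc/diskstats being the count of 512-byte sectors written.

```shell
# Hypothetical helper: reads /proc/diskstats-format text on stdin and prints
# the sectors-written counter (column 10) for the device named in $1.
sectors_written() {
    awk -v dev="$1" '$3 == dev { print $10 }'
}

# Example use on the client, around a write workload:
#   before=$(cat /proc/diskstats | sectors_written sda)
#   dd if=/dev/zero of=/mnt/rhel7/scsi_lun_0/foo bs=1M count=4096
#   after=$(cat /proc/diskstats | sectors_written sda)
#   echo "$(( (after - before) / 2 )) KiB written directly to sda"
```

A delta close to the size of the workload indicates that the writes went to the block device rather than over NFS.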
Releasing Server’s SCSI Reservation
Once the NFS server has reserved the SCSI device, many operations on that device will fail when issued by clients that can reach the device but are not registered with it. For example, the “blkid” command will fail to show the UUID of the XFS filesystem if the client has not been given a layout for that device.
The server will not remove its own persistent reservation. This protects the data within the filesystem on the device across restarts of clients and servers. In order to re-purpose the SCSI device, you may need to manually remove the NFS server’s persistent reservation.
Use the sg_persist command from the sg3_utils package. You must remove the registration from the server itself; it cannot be removed from a different I_T nexus.
Example - using sg_persist to query an existing reservation:
[root@rhel7 ~]# sg_persist -r /dev/sda
LIO-ORG block_1 4.0
Peripheral device type: disk
PR generation=0x8, Reservation follows:
Key=0x100000000000000
scope: LU_SCOPE, type: Exclusive Access, registrants only
Example - using sg_persist to remove an existing reservation:
[root@rhel7 ~]# sg_persist --out --release --param-rk=0x100000000000000 --prout-type=6 /dev/sda
LIO-ORG block_1 4.0
Peripheral device type: disk
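The query and release steps can be chained so that the key does not have to be copied by hand. The following is a minimal sketch: reservation_key is a hypothetical helper that assumes the "Key=0x..." line format shown in the query output above, and the release must still be run on the NFS server that holds the reservation.

```shell
# Hypothetical helper: reads "sg_persist -r" output on stdin and prints the
# first reservation key it finds, or nothing if no reservation is reported.
reservation_key() {
    sed -n 's/^[[:space:]]*Key=\(0x[0-9a-fA-F]*\).*/\1/p' | head -n 1
}

# Example use on the server (needs root and sg3_utils). The --prout-type
# value must match the reservation type reported by the query; 6 corresponds
# to "Exclusive Access, registrants only" as in the example above.
#   key=$(sg_persist -r /dev/sda | reservation_key)
#   [ -n "$key" ] && sg_persist --out --release --param-rk="$key" --prout-type=6 /dev/sda
```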