DMMP BIO vs Request Performance

DMMP currently supports two different kernel IO interfaces: the BIO interface[1] (struct bio) and the Request interface[2] (struct request). By default DMMP uses the Request interface, and over the years much work has been done to test and improve the performance of the DMMP Request interface. DMMP can also be manually configured to use the BIO interface. The DMMP BIO interface is supported, but little work has been done to test and improve its performance. DMMP is currently the only upstream component which continues to use the Request interface for submitting IO.

At the ALPSS 2024 conference last October we discussed the possibility of deprecating and eventually removing support for the Request interface as a kernel API. Such a change could impact DMMP, so I was asked if Red Hat would be willing to support the effort by measuring the performance of DMMP's BIO interface[3] and comparing it to its Request-based performance. Having such a comparative performance analysis would be very helpful in determining what further changes might be needed to move DMMP away from the Request interface. This would help with the overall effort to improve BIO interface performance and eventually remove support for Request-based IO as a kernel API.

In this presentation I will share the preliminary results of Red Hat's DMMP BIO vs Request performance tests[4] and discuss possible next steps for moving forward.

The tests and performance graphs in this presentation were developed and run by Samuel Petrovic (spetrovi@redhat.com). Credit goes to Samuel for creating these performance tests and many thanks to Benjamin Marzinski (bmarzins@redhat.com), Mikulas Patocka (mpatocka@redhat.com) and others on the Red Hat DMMP and Performance teams who contributed to this work.

[1] https://lwn.net/Articles/736534/
[2] https://lwn.net/Articles/738449/
[3] https://lore.kernel.org/linux-scsi/643e61a8-b0cb-4c9d-831a-879aa86d888e@redhat.com
[4] https://people.redhat.com/jmeneghi/LSFMM_2025/DMMP_BIOvsRequest/

Configuration Information

These tests were run on an intermediate-sized host platform with a mid-tier storage array, using a 32Gb Fibre Channel SAN and one or more FCP LUNs.

Host Platform

root@rhel-storage-105:~# lsmem
RANGE                                 SIZE  STATE REMOVABLE BLOCK
0x0000000000000000-0x000000007fffffff   2G online       yes     0
0x0000000100000000-0x000000107fffffff  62G online       yes  2-32

Memory block size:         2G
Total online memory:      64G
Total offline memory:      0B
root@rhel-storage-105:~# lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 57 bits virtual
  Byte Order:             Little Endian
CPU(s):                   80
  On-line CPU(s) list:    0-79
Vendor ID:                GenuineIntel
  BIOS Vendor ID:         Intel
  Model name:             Intel(R) Xeon(R) Silver 4316 CPU @ 2.30GHz
    BIOS Model name:      Intel(R) Xeon(R) Silver 4316 CPU @ 2.30GHz  CPU @ 2.3GHz
    BIOS CPU family:      179
    CPU family:           6
    Model:                106
    Thread(s) per core:   2
    Core(s) per socket:   20
    Socket(s):            2
    Stepping:             6
    CPU(s) scaling MHz:   93%
    CPU max MHz:          3400.0000
    CPU min MHz:          800.0000
    BogoMIPS:             4600.00
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_pe
                          rfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 
                          x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shado
                          w flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd s
                          ha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts vnmi avx512vbmi u
                          mip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities
Virtualization features:  
  Virtualization:         VT-x
Caches (sum of all):      
  L1d:                    1.9 MiB (40 instances)
  L1i:                    1.3 MiB (40 instances)
  L2:                     50 MiB (40 instances)
  L3:                     60 MiB (2 instances)
NUMA:                     
  NUMA node(s):           2
  NUMA node0 CPU(s):      0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78
  NUMA node1 CPU(s):      1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79
Vulnerabilities:          
  Gather data sampling:   Mitigation; Microcode
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Mitigation; Clear CPU buffers; SMT vulnerable
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
  Srbds:                  Not affected
  Tsx async abort:        Not affected

Test Configuration and Setup

You cannot switch queue modes on an existing multipath device. The easiest way to switch to bio mode is to:

1. Get your multipath device's information

If you have user_friendly_names set in /etc/multipath.conf (you probably
do; it's the default), then when you run "multipath -l" the top line
for the device will look like:

mpathX ("WWID") dm-Y "vendor","product"

If you don't have user_friendly_names set (and you didn't explicitly set
up an alias in /etc/multipath.conf), the device name is the same as the
device WWID, so the top line for the device will look like:

"WWID" dm-Y "vendor","product"

2. Change features in /etc/multipath.conf

If you want to change all of the multipath devices on your machine to use
the bio queue_mode, then add

features "2 queue_mode bio"

to the defaults section of /etc/multipath.conf
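
For example, a minimal defaults section might look like this (a sketch; any
other default settings you already have would stay in place alongside it):

defaults {
	features "2 queue_mode bio"
}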

If you have multipath devices that you don't want changed, you can
instead set this per multipath device vendor and product by adding a devices
section that looks like:

devices {
	device {
		vendor "vendor"
		product "product"
		features "2 queue_mode bio"
	}
}

Or you can set this for specific multipath devices by adding a
multipaths section for each device. For example:

multipaths {
	multipath {
		wwid "WWID"
		features "2 queue_mode bio"
	}
	multipath {
		wwid "ANOTHER_WWID"
		features "2 queue_mode bio"
	}
}

3. Delete the multipath device

To remove one multipath device:
# multipath -f "device_name"

To remove all multipath devices:
# multipath -F

4. Reload the configuration for multipathd

# systemctl reload multipathd.service

This will also recreate the multipath device. If you run "multipath -l",
you should now see that it has "queue_mode bio" in the features line.
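
For example, the first two lines of the "multipath -l" output for the device
should now look something like this (the size, hardware handler and write
permission fields are illustrative and will differ on your system):

mpathX ("WWID") dm-Y "vendor","product"
size=100G features='2 queue_mode bio' hwhandler='1 alua' wp=rw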

When you want to switch back to request mode, you can just comment out
that "features" line, like so:

# features "2 queue_mode bio"

And then do steps 3 & 4 again.

Test Results

The following are all preliminary test results and measurements from the Red Hat IO Performance lab.

Preliminary Tests

preliminary_raw_io_tests

In preliminary raw device testing we can see that bio is slightly to significantly worse than request-based IO. The workload used sequential reads and sequential writes. Small (4k) blocks show roughly a 30% performance drop, while larger blocks show a 3-9% drop, which is only barely outside the statistical error of this test.

preliminary_fs_tests

In our preliminary file system tests, the worst case for bio is the combination of a single-file test with small (4k) blocks, which results in a 50-70% loss. Single-file tests with larger blocks see only about an 11-35% loss, and tests with many files show no statistically significant difference.

Preliminary Test Modifications

The following changes were made to the original tests to decrease the likelihood of storage array interference.

Storage Preparation:

  1. Eliminated the blkdiscard and overwrites
  2. Disabled "discard" on all of our filesystems (see the example mount command below)

FIO Test:

  1. Run each subtest as a single, longer FIO job. Rather than running three 1-minute jobs per subtest, run a single 6-minute job and allow some time between subtest runs for things to settle down on the storage array.
  2. Add --ramp_time=120s to all fio jobs
  3. Use --zero_buffers with all fio jobs (see the example fio invocation below)
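
For example, a single 6-minute sequential read subtest might be invoked
roughly like this (a sketch only; the device name, block size, ioengine and
queue depth are illustrative and not the exact parameters used in the Red Hat
tests):

# fio --name=seqread --filename=/dev/mapper/mpathX --rw=read --bs=4k \
      --ioengine=libaio --iodepth=32 --direct=1 \
      --time_based --runtime=360 --ramp_time=120s --zero_buffers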

After making these changes, the following tests were run.

File Systems Tests - Baseline

This report displays some file system tests with very few merges:

baseline_file_system_tests

These results display what a person might see during a test of BIO vs. Request-based IO with DMMP, but the workload barely invoked any merges and there is very little difference between the test runs. To see the real difference in DMMP performance we needed to come up with a workload that creates many more merge requests.

Raw IO Tests

Raw IO tests with the DMMP IO scheduler set to none:

raw_io_scheduler_none_tests

Raw IO tests with the DMMP IO scheduler set to mq-deadline:

raw_io_scheduler_mq-deadline_tests

As you can see, with mq-deadline the performance difference is reduced.
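
For reference, the IO scheduler for a block device is inspected and changed
through sysfs, roughly as follows (a sketch with a placeholder device name;
the list of available schedulers depends on the kernel configuration, and
which queue the scheduler applies to - the multipath device itself or its
underlying path devices - depends on the queue_mode in use):

# cat /sys/block/<device>/queue/scheduler
[none] mq-deadline
# echo mq-deadline > /sys/block/<device>/queue/scheduler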

File Systems Tests

Request vs bio with the 'mq-deadline' scheduler - better for bio:

file_system_request_vs_bio_mq-deadline_tests

Request vs bio with the none scheduler - better for request:

file_system_request_vs_bio_none_tests

Bio none vs bio mq-deadline - provides a huge performance gain:

file_system_bio_vs_bio_mq-deadline_tests

Request none vs request mq-deadline - shows no difference:

file_system_request_vs_request_mq-deadline_tests

Next Steps

  1. More test development and improvements?
  2. Replace the Storage Array with a Linux soft target backed by /dev/null storage? (See the sketch below.)
  3. Patches and improvements to the DM/Block layer
  4. Iterate on the performance tests
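
For example, one possible way to prototype item 2 (a sketch, not a committed
plan; the module parameters shown are illustrative) would be to create a
memory-backed null_blk device on a Linux target host and export it over the
SAN with the kernel LIO target in place of the array LUNs:

# modprobe null_blk nr_devices=1 gb=64 memory_backed=1
# ls /dev/nullb0
/dev/nullb0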