In this technical write-up, I explain the method I devised to benchmark the connection tracking performance of the Open vSwitch (OvS) userspace datapath, and how I used it to characterize a recent patch series, included in OvS 3.0.0, that aims at improving multi-thread scalability.

What Is Connection Tracking?

Connection tracking (conntrack) is the process of keeping track of logical network connections (also named flows) and thereby identifying all packets that make up each flow so that they can be handled consistently together.

Conntrack is a requirement for Network Address Translation (NAT), for example in IP address masquerading (described in detail in RFC 3022). It is also required for stateful firewalls, load balancers, intrusion detection/prevention systems and deep packet inspection engines. More specifically, OvS conntrack rules are used to implement isolation between OpenStack virtual networks (A.K.A. security groups).

Connection tracking is usually implemented by storing known connection entries in a table indexed by a bi-directional 5-tuple (protocol, source address, destination address, source port, destination port). Each entry also carries a state as seen from the connection tracking system. The state (new, established, closed, etc.) is updated every time a packet matching the 5-tuple is processed. If a received packet does not match any existing conntrack entry, a new one must be created and inserted into the table.
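
To make this concrete, here is a minimal, purely illustrative Python sketch of such a table, keyed by a direction-agnostic 5-tuple. This is only a toy model to picture the data structure; it is not how OvS implements conntrack (which uses optimized concurrent hash maps in C).

from dataclasses import dataclass, field
import time

@dataclass
class ConnEntry:
    state: str = "new"          # new, established, closed, ...
    last_seen: float = field(default_factory=time.monotonic)

conntrack = {}  # 5-tuple -> ConnEntry

def conn_key(proto, src_ip, src_port, dst_ip, dst_port):
    # normalize so that both directions of a flow map to the same key
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (proto,) + (a + b if a <= b else b + a)

def track(proto, src_ip, src_port, dst_ip, dst_port, tcp_flags=()):
    key = conn_key(proto, src_ip, src_port, dst_ip, dst_port)
    entry = conntrack.get(key)
    if entry is None:
        # first packet of an unknown flow: create and insert a new entry
        entry = conntrack[key] = ConnEntry()
    # update the state seen by the tracker (grossly simplified)
    if "SYN" in tcp_flags and "ACK" in tcp_flags:
        entry.state = "established"
    elif "FIN" in tcp_flags or "RST" in tcp_flags:
        entry.state = "closed"
    entry.last_seen = time.monotonic()
    return entry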

Performance Aspects

There are two aspects to consider when we measure conntrack performance.

Connection rate

How many new connections can be handled per second?

This is directly determined by:

  • What is the cost for looking up an existing connection entry for each received packet?
  • Can multiple threads insert/destroy conntrack entries concurrently?
  • What is the cost of creating one conntrack entry for new connections?
  • How many packets are exchanged per connection?

Maximum number of concurrent connections

How many concurrent connections can the system support?

This is directly determined by:

  • What is the size of the conntrack table?
  • What is the duration of each individual connection?
  • After a connection has been closed, how long does the conntrack entry linger in the table before it is expunged to make room for new connections? What if the connection is not closed but no longer exchanges traffic (client or server crashed or disconnected)?
  • What happens when the conntrack table is full?

These two aspects are somewhat connected: even a low rate of new, very long-lived connections will eventually cause the conntrack table to fill up.

In order to properly size the connection tracking table, one needs to know the expected average number of new connections per second and their average duration. One also needs to tune the various timeout values of the conntrack engine.
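
As a rough back-of-the-envelope illustration (the numbers below are made up, not measurements), the steady-state table occupancy can be estimated with Little's law: average entries ≈ new connections per second × (average connection duration + the time a finished connection lingers before being expunged).

# hypothetical traffic profile, for illustration only
new_conns_per_sec = 50_000   # average connection rate
avg_duration = 10.0          # seconds of active traffic per connection
linger_time = 30.0           # e.g. established/time_wait timeouts

# Little's law: average number of entries present in the table
avg_entries = new_conns_per_sec * (avg_duration + linger_time)
print(f"expected table occupancy: ~{avg_entries:,.0f} entries")  # ~2,000,000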

Benchmarking Process

We need a way to simulate clients and servers: specify how many of each there are, how many connections per second they create, how long the connections last and how much data is exchanged in each connection.

There are a few commercial traffic generators that have these capabilities, more or less refined. Today, I will describe how to do this with an Open Source traffic generator based on the DPDK framework: TRex.

TRex has multiple modes of operation. I will focus on the Advanced Stateful (ASTF) mode which allows simulating lightweight TCP/UDP clients. I have tailored a script using the TRex Python API to perform RFC 2544-like benchmarks but focusing on the new connections per second performance.

Basically, this script connects to a running TRex server started in ASTF mode and creates TCP/UDP connection profiles. These profiles are state machines representing clients and servers with dynamic IP addresses and TCP ports. You can define the number of data exchanges, their size, add some arbitrary wait time to simulate network latency, etc. TRex takes care of translating this into real TCP traffic.

Here is a simplified example of a TCP connection profile:

from trex.astf.api import *  # TRex ASTF Python API

# num_messages, message_size and server_wait are parameters of the script
client = ASTFProgram(stream=True)  # stream=True simulates TCP
server = ASTFProgram(stream=True)
for _ in range(num_messages):
    # each iteration is one request from the client and one reply from the server
    client.send(message_size * b"x")
    server.recv(message_size)
    if server_wait > 0:
        server.delay(server_wait * 1000)  # trex wants microseconds
    server.send(message_size * b"y")
    client.recv(message_size)

tcp_profile = ASTFTemplate(
    client_template=ASTFTCPClientTemplate(
        program=client,
        port=8080,
        cps=99, # base value which is changed during the binary search
        cont=True,
    ),
    server_template=ASTFTCPServerTemplate(
        program=server, assoc=ASTFAssociationRule(port=8080)
    ),
)
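
For context, the RFC 2544-like binary search performed by cps_ndr.py boils down to something like the following sketch. This is hypothetical code, not the actual script: measure_drop_ratio() stands for loading the profile on TRex at a given connection rate, sampling the statistics and returning the observed drop ratio.

def find_max_cps(measure_drop_ratio, lower, upper,
                 error_threshold, max_iterations=8):
    # Return the highest connection rate whose drop ratio stays below
    # error_threshold, by binary search between lower and upper bounds.
    best = lower
    for _ in range(max_iterations):
        current = (lower + upper) / 2
        if measure_drop_ratio(current) <= error_threshold:
            best = lower = current   # passed: search higher rates
        else:
            upper = current          # failed: search lower rates
    return best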

Setup

The Device Under Test (DUT) will run the Open vSwitch daemon (ovs-vswitchd) with the userspace DPDK datapath. The same kind of setup can be used to benchmark any connection tracking device. It is overly simplified and does not represent an actual production workload, but it allows stressing the connection tracking code path without worrying about external details.

[Figure: benchmark topology]

Base System

Both the OvS userspace datapath and TRex use DPDK. The following settings are common to both machines.

DPDK requires compatible network interfaces. In this example, I will be running on the last two ports of an Intel® X710 PCI network interface.

[root@* ~]# lscpu | grep -e "^Model name:" -e "^NUMA" -e MHz
NUMA node(s):        1
Model name:          Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
CPU MHz:             2700.087
NUMA node0 CPU(s):   0-23
[root@* ~]# grep ^MemTotal /proc/meminfo
MemTotal:       65373528 kB
[root@* ~]# lspci | grep X710 | tail -n2
18:00.2 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
18:00.3 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)

The CPUs used by TRex and OvS need to be isolated to minimize disturbance from the other tasks running on Linux. I isolate the CPUs of the NUMA node to which the PCI NIC is connected. CPUs 0 and 12 are left to Linux.

dnf install -y tuned tuned-profiles-cpu-partitioning
cat > /etc/tuned/cpu-partitioning-variables.conf <<EOF
isolated_cores=1-11,13-23
no_balance_cores=1-11,13-23
EOF
tuned-adm profile cpu-partitioning

Finally, DPDK applications require huge pages. It is best to allocate them on boot to ensure that they are all mapped to contiguous chunks of memory.

cat >> /etc/default/grub <<EOF
GRUB_CMDLINE_LINUX="\$GRUB_CMDLINE_LINUX intel_iommu=on iommu=pt"
GRUB_CMDLINE_LINUX="\$GRUB_CMDLINE_LINUX hugepagesz=1G hugepages=32"
EOF
grub2-mkconfig -o /etc/grub2.cfg
dnf install -y driverctl
driverctl set-override 0000:18:00.2 vfio-pci
driverctl set-override 0000:18:00.3 vfio-pci
# reboot is required to apply isolcpus and allocate hugepages on boot
systemctl reboot

Traffic Generator

TRex needs to be compiled from source:

dnf install -y python3 git numactl-devel zlib-devel gcc-c++ gcc
git clone https://github.com/cisco-system-traffic-generator/trex-core ~/trex
cd ~/trex/linux_dpdk
./b configure
taskset 0xffffffffff ./b build

We will use the following configuration in /etc/trex_cfg.yaml:

- version: 2
  interfaces:
    - "18:00.2"
    - "18:00.3"
  rx_desc: 4096
  tx_desc: 4096
  port_info:
    - dest_mac: "04:3f:72:f2:8f:33"
      src_mac:  "04:3f:72:f2:8f:32"
    - dest_mac: "04:3f:72:f2:8f:32"
      src_mac:  "04:3f:72:f2:8f:33"

  c: 22
  memory:
    mbuf_64: 30000
    mbuf_128: 500000
    mbuf_256: 30717
    mbuf_512: 30720
    mbuf_1024: 30720
    mbuf_2048: 4096

  platform:
    master_thread_id: 0
    latency_thread_id: 12
    dual_if:
      - socket: 0
        threads: [
           1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11,
          13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
        ]
[root@tgen ~]# cd ~/trex/scripts
[root@tgen scripts]# ./t-rex-64 -i --astf
...

The TRex daemon will run in the foreground. The cps_ndr.py script will connect to it via the JSON-RPC API in a separate terminal.
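
For reference, a minimal sketch of what such a client script does with the TRex ASTF Python API could look like the code below. It reuses the tcp_profile template from the earlier example; the IP ranges, multiplier and duration are arbitrary placeholders and this is not the actual cps_ndr.py script.

from trex.astf.api import (ASTFClient, ASTFIPGen, ASTFIPGenDist,
                           ASTFIPGenGlobal, ASTFProfile)

# client/server IP ranges used to generate flows (arbitrary values)
ip_gen = ASTFIPGen(
    glob=ASTFIPGenGlobal(ip_offset="1.0.0.0"),
    dist_client=ASTFIPGenDist(ip_range=["16.0.0.1", "16.0.0.254"]),
    dist_server=ASTFIPGenDist(ip_range=["48.0.0.1", "48.0.0.254"]),
)
profile = ASTFProfile(default_ip_gen=ip_gen, templates=tcp_profile)

c = ASTFClient(server="127.0.0.1")  # local TRex daemon started with --astf
c.connect()
try:
    c.reset()
    c.load_profile(profile)
    c.clear_stats()
    c.start(mult=1000, duration=30)  # scale the template cps, run for 30s
    c.wait_on_traffic()
    stats = c.get_stats()            # aggregated client/server counters
finally:
    c.disconnect()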

Device Under Test

First, let’s compile and install DPDK:

dnf install -y git meson ninja-build gcc python3-pyelftools
git clone -b v21.11 https://github.com/DPDK/dpdk ~/dpdk
cd ~/dpdk
meson build
taskset 0xffffff ninja -C ~/dpdk/build install

Then, compile and install OVS. In the following console excerpt, I explicitly check out version 2.17.2. Version 3.0.0 will be built the same way before running all the tests again:

dnf install -y gcc-c++ make libtool autoconf automake
git clone -b v2.17.2 https://github.com/openvswitch/ovs ~/ovs
cd ~/ovs
./boot.sh
PKG_CONFIG_PATH="/usr/local/lib64/pkgconfig" ./configure --with-dpdk=static
taskset 0xffffff make install -j24
/usr/local/share/openvswitch/scripts/ovs-ctl start

Here I enable the DPDK user space datapath and configure a bridge with two ports. For now, there is only one RX queue per port and one CPU is assigned to poll them. I will update these parameters along the way.

I set the conntrack table size to a relatively large value (5M entries) to reduce the risk of it getting full during tests. Also, I configure the various timeout policies to match the traffic profiles I am about to send. These aggressive timeouts will help prevent the table from getting full. The default timeout values are very conservative and too long to achieve high numbers of connections per second without filling the conntrack table.

ovs-vsctl set open_vswitch . other_config:dpdk-init=true
ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x4"
/usr/local/share/openvswitch/scripts/ovs-ctl restart
ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
ovs-vsctl add-port br0 port0 -- \
    set Interface port0 type=dpdk options:dpdk-devargs=0000:18:00.2
ovs-vsctl add-port br0 port1 -- \
    set Interface port1 type=dpdk options:dpdk-devargs=0000:18:00.3

ovs-appctl dpctl/ct-set-maxconns 5000000
# creating an empty datapath record is required to add a zone timeout policy
ovs-vsctl -- --id=@m create Datapath datapath_version=0 -- \
    set Open_vSwitch . datapaths:"netdev"=@m
ovs-vsctl add-zone-tp netdev zone=0 \
    udp_first=1 udp_single=1 udp_multiple=30 tcp_syn_sent=1 \
    tcp_syn_recv=1 tcp_fin_wait=1 tcp_time_wait=1 tcp_close=1 \
    tcp_established=30

cat > ~/ct-flows.txt << EOF
priority=1 ip ct_state=-trk                   actions=ct(table=0)
priority=1 ip ct_state=+trk+new in_port=port0 actions=ct(commit),normal
priority=1 ip ct_state=+trk+est               actions=normal
priority=0 actions=drop
EOF

Test Procedure

The cps_ndr.py script that I have written has multiple parameters to control the nature of the generated connections:

  • Ratio of TCP/UDP connections.
  • Number of data messages (request + response) exchanged per connection (excluding protocol overhead).
  • Size of data messages in bytes (to emulate TCP maximum segment size).
  • Time in milliseconds that the simulated servers wait before sending a response to a request.

Note: In the context of this benchmark, I will intentionally keep the data message size fixed at 20 bytes to avoid being limited by the 10Gbit/s link bandwidth.
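
To see why, here is a rough estimate of the wire bandwidth (overheads are approximate and pure ACK/handshake packets are ignored):

# approximate per-packet overhead on the wire, in bytes
l2_l4_overhead = 14 + 20 + 20 + 4   # Ethernet + IPv4 + TCP + FCS
payload = 20                        # fixed data message size used here

pkt_bits = (payload + l2_l4_overhead) * 8
# even at 5 million data packets per second, the link is far from full
print(f"{5e6 * pkt_bits / 1e9:.1f} Gbit/s")  # ~3.1 Gbit/s

# with 1460-byte payloads instead, ~0.8M pkt/s would already fill 10G
print(f"{10e9 / ((1460 + l2_l4_overhead) * 8) / 1e6:.2f}M pkt/s")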

I will use these parameters to stress different parts of the connection tracking code path:

Short-lived connections

40 data bytes per connection (1 request + 1 reply), with no wait by the server before sending the reply. This stresses the conntrack creation & destruction code path.

Example run:

[root@tgen scripts]# ./cps_ndr.py --sample-time 30 --max-iterations 8 \
>    --error-threshold 0.02 --udp-percent 1 --num-messages 1 \
>    --message-size 20 --server-wait 0 -m 1k -M 100k
... iteration #1: lower=1.0K current=50.5K upper=100K
▼▼▼ Flows: active 26.8K (50.1K/s) TX: 215Mb/s (345Kp/s) RX: 215Mb/s (345Kp/s) Size: ~4.5B
err dropped: 1.6K pkts (1.6K/s) ~ 0.4746%
... iteration #2: lower=1.0K current=25.8K upper=50.5K
▲▲▲ Flows: active 12.9K (25.7K/s) TX: 112Mb/s (179Kp/s) RX: 112Mb/s (179Kp/s) Size: ~4.5B
... iteration #3: lower=25.8K current=38.1K upper=50.5K
▲▲▲ Flows: active 19.1K (38.1K/s) TX: 166Mb/s (266Kp/s) RX: 166Mb/s (266Kp/s) Size: ~4.5B
... iteration #4: lower=38.1K current=44.3K upper=50.5K
▼▼▼ Flows: active 22.2K (44.2K/s) TX: 192Mb/s (307Kp/s) RX: 191Mb/s (307Kp/s) Size: ~4.5B
err dropped: 1.3K pkts (125/s) ~ 0.0408%
... iteration #5: lower=38.1K current=41.2K upper=44.3K
▲▲▲ Flows: active 20.7K (41.2K/s) TX: 178Mb/s (286Kp/s) RX: 178Mb/s (286Kp/s) Size: ~4.5B
... iteration #6: lower=41.2K current=42.8K upper=44.3K
▼▼▼ Flows: active 21.5K (42.6K/s) TX: 185Mb/s (296Kp/s) RX: 185Mb/s (296Kp/s) Size: ~4.5B
err dropped: 994 pkts (99/s) ~ 0.0335%
... iteration #7: lower=41.2K current=42.0K upper=42.8K
▼▼▼ Flows: active 21.0K (41.8K/s) TX: 181Mb/s (290Kp/s) RX: 181Mb/s (290Kp/s) Size: ~4.5B
err dropped: 877 pkts (87/s) ~ 0.0301%
... iteration #8: lower=41.2K current=41.6K upper=42.0K
▲▲▲ Flows: active 20.9K (41.4K/s) TX: 180Mb/s (289Kp/s) RX: 180Mb/s (289Kp/s) Size: ~4.5B

Long-lived connections

20K data bytes per connection (500 requests + 500 replies), spread over 25 seconds. This stresses the conntrack lookup code path.

Example run:

[root@tgen scripts]# ./cps_ndr.py --sample-time 120 --max-iterations 8 \
>    --error-threshold 0.02 --udp-percent 1 --num-messages 500 \
>    --message-size 20 --server-wait 50 -m 500 -M 2k
... iteration #1: lower=500 current=1.2K upper=2.0K
▼▼▼ Flows: active 48.5K (1.2K/s) TX: 991Mb/s (1.5Mp/s) RX: 940Mb/s (1.4Mp/s) Size: ~13.3B
err dropped: 1.8M pkts (30.6K/s) ~ 2.4615%
... iteration #2: lower=500 current=875 upper=1.2K
▲▲▲ Flows: active 22.5K (871/s) TX: 871Mb/s (1.3Mp/s) RX: 871Mb/s (1.3Mp/s) Size: ~13.3B
... iteration #3: lower=875 current=1.1K upper=1.2K
▼▼▼ Flows: active 33.8K (1.1K/s) TX: 967Mb/s (1.4Mp/s) RX: 950Mb/s (1.4Mp/s) Size: ~13.3B
err dropped: 621K pkts (10.3K/s) ~ 0.7174%
... iteration #4: lower=875 current=968 upper=1.1K
▲▲▲ Flows: active 24.9K (965/s) TX: 961Mb/s (1.4Mp/s) RX: 962Mb/s (1.4Mp/s) Size: ~13.3B
... iteration #5: lower=968 current=1.0K upper=1.1K
▼▼▼ Flows: active 29.8K (1.0K/s) TX: 965Mb/s (1.4Mp/s) RX: 957Mb/s (1.4Mp/s) Size: ~13.3B
err dropped: 334K pkts (5.6K/s) ~ 0.3830%
... iteration #6: lower=968 current=992 upper=1.0K
▼▼▼ Flows: active 25.5K (989/s) TX: 964Mb/s (1.4Mp/s) RX: 964Mb/s (1.4Mp/s) Size: ~13.3B
err dropped: 460 pkts (460/s) ~ 0.0314%
... iteration #7: lower=968 current=980 upper=992
▼▼▼ Flows: active 25.3K (977/s) TX: 962Mb/s (1.4Mp/s) RX: 962Mb/s (1.4Mp/s) Size: ~13.3B
err dropped: 397 pkts (397/s) ~ 0.0272%
... iteration #8: lower=968 current=974 upper=980
▲▲▲ Flows: active 25.1K (971/s) TX: 969Mb/s (1.5Mp/s) RX: 969Mb/s (1.5Mp/s) Size: ~13.3B

Results

Both the short-lived and long-lived connection profiles will be tested against OVS versions 2.17.2 and 3.0.0. Different configurations will be tested to check whether the performance scales with the number of CPUs and receive queues. The actual numbers that I measured should be taken with a grain of salt: connection tracking performance is highly dependent on hardware, traffic profile and overall system load. I only provide them to give a general idea of the improvement brought by OVS 3.0.0.

Traffic Generator Calibration

This demonstrates the maximum performance that TRex can achieve with this configuration and hardware. The tests were executed with a cable connected between port0 and port1 of the traffic generator machine.

Type        | Connection Rate | Active Flows | Bandwidth  | Packet Rate
Short-Lived | 1.8M conn/s     | 1.7M         | 8.4G bit/s | 12.7M pkt/s
Long-Lived  | 11.1K conn/s    | 898K         | 8.0G bit/s | 11.4M pkt/s

1 CPU, 1 queue per port, without connection tracking

ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x4"
ovs-vsctl set Interface port0 options:n_rxq=1
ovs-vsctl set Interface port1 options:n_rxq=1
ovs-ofctl del-flows br0
ovs-ofctl add-flow br0 action=normal

Version | Short-Lived Connections | Active Flows | Bandwidth  | Packet Rate | Difference
2.17.2  | 1.0M conn/s             | 524.8K       | 4.5G bit/s | 7.3M pkt/s  |
3.0.0   | 1.0M conn/s             | 513.1K       | 4.5G bit/s | 7.1M pkt/s  | -1.74%

Version | Long-Lived Connections | Active Flows | Bandwidth  | Packet Rate | Difference
2.17.2  | 3.1K conn/s            | 79.9K        | 3.1G bit/s | 4.7M pkt/s  |
3.0.0   | 2.8K conn/s            | 71.9K        | 2.8G bit/s | 4.2M pkt/s  | -9.82%

There is a performance drop between v2.17.2 and v3.0.0 even without connection tracking enabled. It is completely unrelated to the conntrack optimization patch series I am focusing on. It may be caused by discrepancies in the test procedure, but it might also have been introduced by another patch series between the two tested versions.

1 CPU, 1 queue per port

ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x4"
ovs-vsctl set Interface port0 options:n_rxq=1
ovs-vsctl set Interface port1 options:n_rxq=1
ovs-ofctl del-flows br0
ovs-ofctl add-flows br0 ~/ct-flows.txt

Version | Short-Lived Connections | Active Flows | Bandwidth    | Packet Rate  | Difference
2.17.2  | 39.7K conn/s            | 20.0K        | 172.0M bit/s | 275.8K pkt/s |
3.0.0   | 48.2K conn/s            | 24.3K        | 208.9M bit/s | 334.9K pkt/s | +21.36%

Version | Long-Lived Connections | Active Flows | Bandwidth    | Packet Rate | Difference
2.17.2  | 959 conn/s             | 24.7K        | 956.6M bit/s | 1.4M pkt/s  |
3.0.0   | 1.2K conn/s            | 31.5K        | 1.2G bit/s   | 1.8M pkt/s  | +28.15%

Already here, we can see that the patch series improves the single threaded performance of connection tracking, in both the creation & destruction and the lookup code paths. This should be kept in mind when looking at improvements in multi-threaded performance.

2 CPUs, 1 queue per port

ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x2002"
ovs-vsctl set Interface port0 options:n_rxq=1
ovs-vsctl set Interface port1 options:n_rxq=1
ovs-ofctl del-flows br0
ovs-ofctl add-flows br0 ~/ct-flows.txt

Version | Short-Lived Connections | Active Flows | Bandwidth    | Packet Rate  | Difference
2.17.2  | 39.9K conn/s            | 20.0K        | 172.8M bit/s | 277.0K pkt/s |
3.0.0   | 46.8K conn/s            | 23.5K        | 202.7M bit/s | 325.0K pkt/s | +17.28%

Version | Long-Lived Connections | Active Flows | Bandwidth    | Packet Rate | Difference
2.17.2  | 885 conn/s             | 22.7K        | 883.1M bit/s | 1.3M pkt/s  |
3.0.0   | 1.1K conn/s            | 28.6K        | 1.1G bit/s   | 1.7M pkt/s  | +25.19%

It is worth noting that assigning twice as many CPUs to packet processing does not double the performance. Far from it, in fact: the numbers are essentially the same as with only one CPU, if not lower.

This may be because there is only one RX queue per port and each CPU processes a single port.

2 CPUs, 2 queues per port

ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x2002"
ovs-vsctl set Interface port0 options:n_rxq=2
ovs-vsctl set Interface port1 options:n_rxq=2
ovs-ofctl del-flows br0
ovs-ofctl add-flows br0 ~/ct-flows.txt

Version | Short-Lived Connections | Active Flows | Bandwidth    | Packet Rate  | Difference
2.17.2  | 48.3K conn/s            | 24.3K        | 208.8M bit/s | 334.8K pkt/s |
3.0.0   | 65.9K conn/s            | 33.2K        | 286.8M bit/s | 459.9K pkt/s | +36.41%

Version | Long-Lived Connections | Active Flows | Bandwidth  | Packet Rate | Difference
2.17.2  | 1.1K conn/s            | 29.1K        | 1.1G bit/s | 1.7M pkt/s  |
3.0.0   | 1.4K conn/s            | 37.0K        | 1.4G bit/s | 2.2M pkt/s  | +26.77%

For short-lived connections, we begin to see improvement beyond the single-threaded performance gain: lock contention was reduced in the insertion/deletion of conntrack entries.

With two CPUs and two queues, once the single-threaded gain is factored out, there seems to be no improvement yet in multi-threaded conntrack lookup.

4 CPUs, 2 queues per port

ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x6006"
ovs-vsctl set Interface port0 options:n_rxq=2
ovs-vsctl set Interface port1 options:n_rxq=2
ovs-ofctl del-flows br0
ovs-ofctl add-flows br0 ~/ct-flows.txt

Version | Short-Lived Connections | Active Flows | Bandwidth    | Packet Rate  | Difference
2.17.2  | 47.4K conn/s            | 23.9K        | 206.2M bit/s | 330.6K pkt/s |
3.0.0   | 49.1K conn/s            | 24.7K        | 212.1M bit/s | 340.1K pkt/s | +3.53%

Version | Long-Lived Connections | Active Flows | Bandwidth    | Packet Rate | Difference
2.17.2  | 981 conn/s             | 25.2K        | 977.7M bit/s | 1.5M pkt/s  |
3.0.0   | 2.0K conn/s            | 52.4K        | 2.0G bit/s   | 3.1M pkt/s  | +108.31%

With 4 CPUs and 2 queues per port, the short-lived connection rate of 3.0.0 drops compared to the 2 CPUs configuration. This is not a fluke: the numbers are consistent across multiple runs. It would warrant some scrutiny, but it does not invalidate all the work that has been done.

4 CPUs, 4 queues per port

ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x6006"
ovs-vsctl set Interface port0 options:n_rxq=4
ovs-vsctl set Interface port1 options:n_rxq=4
ovs-ofctl del-flows br0
ovs-ofctl add-flows br0 ~/ct-flows.txt

Version | Short-Lived Connections | Active Flows | Bandwidth    | Packet Rate  | Difference
2.17.2  | 66.1K conn/s            | 33.2K        | 286.4M bit/s | 459.2K pkt/s |
3.0.0   | 100.8K conn/s           | 50.6K        | 437.0M bit/s | 700.6K pkt/s | +52.55%

Version | Long-Lived Connections | Active Flows | Bandwidth    | Packet Rate | Difference
2.17.2  | 996 conn/s             | 25.9K        | 994.2M bit/s | 1.5M pkt/s  |
3.0.0   | 2.6K conn/s            | 67.0K        | 2.6G bit/s   | 3.9M pkt/s  | +162.89%

8 CPUs, 4 queues per port

ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x1e01e"
ovs-vsctl set Interface port0 options:n_rxq=4
ovs-vsctl set Interface port1 options:n_rxq=4
ovs-ofctl del-flows br0
ovs-ofctl add-flows br0 ~/ct-flows.txt

Version | Short-Lived Connections | Active Flows | Bandwidth    | Packet Rate  | Difference
2.17.2  | 62.2K conn/s            | 31.3K        | 269.8M bit/s | 432.5K pkt/s |
3.0.0   | 90.1K conn/s            | 45.2K        | 390.9M bit/s | 626.7K pkt/s | +44.89%

Version | Long-Lived Connections | Active Flows | Bandwidth    | Packet Rate  | Difference
2.17.2  | 576 conn/s             | 17.1K        | 567.2M bit/s | 852.5K pkt/s |
3.0.0   | 3.8K conn/s            | 97.8K        | 3.8G bit/s   | 5.7M pkt/s   | +562.76%

8 CPUs, 8 queues per port

ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x1e01e"
ovs-vsctl set Interface port0 options:n_rxq=8
ovs-vsctl set Interface port1 options:n_rxq=8
ovs-ofctl del-flows br0
ovs-ofctl add-flows br0 ~/ct-flows.txt

Version | Short-Lived Connections | Active Flows | Bandwidth    | Packet Rate  | Difference
2.17.2  | 50.6K conn/s            | 25.5K        | 219.5M bit/s | 351.9K pkt/s |
3.0.0   | 100.9K conn/s           | 50.7K        | 436.0M bit/s | 698.9K pkt/s | +99.36%

Version | Long-Lived Connections | Active Flows | Bandwidth    | Packet Rate  | Difference
2.17.2  | 541 conn/s             | 14.0K        | 539.2M bit/s | 810.3K pkt/s |
3.0.0   | 4.8K conn/s            | 124.1K       | 4.8G bit/s   | 7.2M pkt/s   | +792.83%

Analysis

Scaling

[Figure: short-lived connections scaling (short-lived-scaling.svg)]

Apart from the small blip with 4 CPUs and 2 queues per port, the conntrack insertion & deletion code path has improved consistently in OvS 3.0.0. The multi-threaded lock contention remains, albeit less noticeable than with OvS 2.17.2.

[Figure: long-lived connections scaling (long-lived-scaling.svg)]

This is where the optimizations done in OvS 3.0.0 really shine. The reduction in multi-threaded lock contention with conntrack lookup makes the performance scale significantly better with the number of CPUs.

Profiling

Here is a stripped-down perf report of both versions running with 8 CPUs and 8 RX queues per port, under maximum long-lived connection load with conntrack flows enabled. Only the events of a single CPU were captured.

perf record -g -C 1 sleep 60
perf report -U --no-children | grep '\[[\.k]\]' | head -15 > profile-$version.txt
# also generate flame graphs for interactive visual inspection of hot spots
git clone https://github.com/brendangregg/FlameGraph
perf script | ./FlameGraph/stackcollapse-perf.pl |
./FlameGraph/flamegraph.pl --title "OvS $version Connection Tracking" \
  --subtitle "1 PMD thread shown out of 8" --minwidth 1 --width 2000 \
  --height 30 --fontsize 13 --fonttype sans-serif --hash --bgcolors grey \
  - > ovs-conntrack-$version.svg

I have manually annotated the lines that are directly related to acquiring mutexes (they start with a * character). While a CPU is waiting to acquire a mutex, it is not processing any network traffic; it is waiting for another CPU to release the lock.

2.17.2

The profiled CPU spends almost 40% of its cycles acquiring locks and waiting for other CPUs to release locks.

* 30.99%  pmd-c01/id:5  libc.so.6          [.] pthread_mutex_lock@@GLIBC_2.2.5
  12.27%  pmd-c01/id:5  ovs-vswitchd       [.] dp_netdev_process_rxq_port
   5.18%  pmd-c01/id:5  ovs-vswitchd       [.] netdev_dpdk_rxq_recv
   4.24%  pmd-c01/id:5  ovs-vswitchd       [.] pmd_thread_main
   3.93%  pmd-c01/id:5  ovs-vswitchd       [.] pmd_perf_end_iteration
*  3.63%  pmd-c01/id:5  libc.so.6          [.] __GI___pthread_mutex_unlock_usercnt
   3.62%  pmd-c01/id:5  ovs-vswitchd       [.] i40e_recv_pkts_vec_avx2
*  2.76%  pmd-c01/id:5  [kernel.kallsyms]  [k] syscall_exit_to_user_mode
*  0.91%  pmd-c01/id:5  libc.so.6          [.] __GI___lll_lock_wait
*  0.18%  pmd-c01/id:5  [kernel.kallsyms]  [k] __x64_sys_futex
*  0.17%  pmd-c01/id:5  [kernel.kallsyms]  [k] futex_wait
*  0.12%  pmd-c01/id:5  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
*  0.11%  pmd-c01/id:5  libc.so.6          [.] __GI___lll_lock_wake
*  0.08%  pmd-c01/id:5  [kernel.kallsyms]  [k] do_syscall_64
*  0.06%  pmd-c01/id:5  [kernel.kallsyms]  [k] do_futex

Full flame graph

3.0.0

It is obvious that 3.0.0 has much less lock contention and therefore will scale better with the number of CPUs.

  15.30%  pmd-c01/id:5  ovs-vswitchd       [.] dp_netdev_input__
   8.62%  pmd-c01/id:5  ovs-vswitchd       [.] conn_key_lookup
   7.88%  pmd-c01/id:5  ovs-vswitchd       [.] miniflow_extract
   7.75%  pmd-c01/id:5  ovs-vswitchd       [.] cmap_find
*  6.92%  pmd-c01/id:5  libc.so.6          [.] pthread_mutex_lock@@GLIBC_2.2.5
   5.15%  pmd-c01/id:5  ovs-vswitchd       [.] dpcls_subtable_lookup_mf_u0w4_u1w1
   4.16%  pmd-c01/id:5  ovs-vswitchd       [.] cmap_find_batch
   4.10%  pmd-c01/id:5  ovs-vswitchd       [.] tcp_conn_update
   3.86%  pmd-c01/id:5  ovs-vswitchd       [.] dpcls_subtable_lookup_mf_u0w5_u1w1
   3.51%  pmd-c01/id:5  ovs-vswitchd       [.] conntrack_execute
   3.42%  pmd-c01/id:5  ovs-vswitchd       [.] i40e_xmit_fixed_burst_vec_avx2
   0.77%  pmd-c01/id:5  ovs-vswitchd       [.] dp_execute_cb
   0.72%  pmd-c01/id:5  ovs-vswitchd       [.] netdev_dpdk_rxq_recv
   0.07%  pmd-c01/id:5  ovs-vswitchd       [.] i40e_xmit_pkts_vec_avx2
   0.04%  pmd-c01/id:5  ovs-vswitchd       [.] dp_netdev_input

Full flame graph

Final Words

I hope this gave you some ideas for benchmarking and profiling connection tracking with TRex and perf. Please let me know if you have any questions.

Kudos to Paolo Valerio and Gaëtan Rivet for their work on optimizing the user space OvS conntrack implementation.