In this technical write-up, I explain how I came up with a method to benchmark
the Open vSwitch (OvS) userspace datapath connection tracking performance, and
how it was used to characterize this recent patch series included in OvS 3.0.0,
which aims at improving multi-thread scalability.
What Is Connection Tracking?
Connection tracking (conntrack) is the process of keeping track of logical network connections (also named flows) and thereby identifying all packets that make up each flow so that they can be handled consistently together.
Conntrack is a requirement for Network Address Translation (NAT), for example in IP address masquerading (described in detail in RFC 3022). It is also required for stateful firewalls, load balancers, intrusion detection/prevention systems and deep packet inspection engines. More specifically, OvS conntrack rules are used to implement isolation between OpenStack virtual networks (A.K.A. security groups).
Connection tracking is usually implemented by storing known connection entries in a table, indexed by a bidirectional 5-tuple (protocol, source address, destination address, source port, destination port). Each entry also has a state as seen from the connection tracking system. The state (new, established, closed, etc.) is updated every time a packet matching the entry's 5-tuple is processed. If a received packet does not match any existing conntrack entry, a new one must be created and inserted into the table.
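To give a concrete idea of what this means, here is a minimal Python sketch of such a table, keyed by a normalized bidirectional 5-tuple. It is purely illustrative and does not reflect how OvS actually implements conntrack:
import time

conntrack_table = {}  # key: normalized 5-tuple, value: [state, last_seen]

def conn_key(proto, src_ip, src_port, dst_ip, dst_port):
    # Normalize the key so that both directions of a connection
    # map to the same table entry.
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (proto, *a, *b) if a <= b else (proto, *b, *a)

def track(proto, src_ip, src_port, dst_ip, dst_port):
    key = conn_key(proto, src_ip, src_port, dst_ip, dst_port)
    entry = conntrack_table.get(key)
    if entry is None:
        # Unknown 5-tuple: create and insert a new entry.
        entry = conntrack_table[key] = ["new", time.monotonic()]
    entry[1] = time.monotonic()  # a real tracker would also update the state
    return entry                 # here (established, closed, ...) and expire entries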
Performance Aspects
There are two aspects to consider when we measure conntrack performance.
Connection rate
How many new connections can be handled per second?
This is directly determined by:
- What is the cost for looking up an existing connection entry for each received packet?
- Can multiple threads insert/destroy conntrack entries concurrently?
- What is the cost of creating one conntrack entry for new connections?
- How many packets are exchanged per connection?
Maximum number of concurrent connections
How many concurrent connections can the system support?
This is directly determined by:
- What is the size of the conntrack table?
- What is the duration of each individual connection?
- After a connection has been closed, for how long does the conntrack entry linger in the table before it is expunged to make room for new connections? What if the connection is not closed but no longer carries traffic (e.g. the client or server crashed or disconnected)?
- What happens when the conntrack table is full?
These two aspects are somewhat connected, since even a low rate of new, very long-lived connections will eventually cause the conntrack table to fill up.
In order to properly size the connection tracking table, one needs to know the average number of new connections per second and their average duration. One also needs to tune the various timeout values of the conntrack engine.
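As a rough rule of thumb (hypothetical numbers below, not measurements), the steady-state table occupancy can be estimated with Little's law from the connection rate, the average connection duration and the time an entry lingers after the connection ends:
# Hypothetical averages for a given deployment:
new_conns_per_sec = 50_000  # average new connections per second
avg_duration_sec = 25       # average connection lifetime
linger_sec = 30             # how long an entry stays after the connection ends
                            # (depends on the configured timeout policy)

# Little's law: entries in the table ~= arrival rate x time spent in the table.
expected_entries = new_conns_per_sec * (avg_duration_sec + linger_sec)
print(f"~{expected_entries:,} conntrack entries")  # ~2,750,000 entries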
Benchmarking Process
We need a way to simulate clients and servers: specify how many of each there are, how many connections per second they are creating, how long the connections are and how much data is exchanged in each connection.
There are a few commercial traffic generators that have these capabilities, more or less refined. Today, I will describe how to do this with an Open Source traffic generator based on the DPDK framework: TRex.
TRex has multiple modes of operation. I will focus on the Advanced Stateful (ASTF) mode which allows simulating lightweight TCP/UDP clients. I have tailored a script using the TRex Python API to perform RFC 2544-like benchmarks but focusing on the new connections per second performance.
Basically, this script connects to a running TRex server started in ASTF mode and creates TCP/UDP connection profiles. These profiles are state machines representing clients and servers with dynamic IP addresses and TCP ports. You can define the number of data exchanges, their size, add some arbitrary wait time to simulate network latency, etc. TRex takes care of translating this into real TCP traffic.
Here is a simplified example of a TCP connection profile:
client = ASTFProgram(stream=True)
server = ASTFProgram(stream=True)
for _ in range(num_messages):
    client.send(message_size * b"x")
    server.recv(message_size)
    if server_wait > 0:
        server.delay(server_wait * 1000)  # trex wants microseconds
    server.send(message_size * b"y")
    client.recv(message_size)

tcp_profile = ASTFTemplate(
    client_template=ASTFTCPClientTemplate(
        program=client,
        port=8080,
        cps=99,  # base value which is changed during the binary search
        cont=True,
    ),
    server_template=ASTFTCPServerTemplate(
        program=server, assoc=ASTFAssociationRule(port=8080)
    ),
)
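For completeness, here is roughly how such a template can be pushed to the TRex server with the ASTF Python API. This is a simplified sketch, not the actual cps_ndr.py code: the server address, IP ranges, multiplier and duration are placeholder values, and it assumes the TRex interactive API package (trex.astf.api) is on the Python path:
from trex.astf.api import (ASTFClient, ASTFProfile, ASTFIPGen,
                           ASTFIPGenDist, ASTFIPGenGlobal)

# IP ranges used to generate many distinct client/server 5-tuples.
ip_gen = ASTFIPGen(
    glob=ASTFIPGenGlobal(ip_offset="1.0.0.0"),
    dist_client=ASTFIPGenDist(ip_range=["16.0.0.1", "16.0.0.254"]),
    dist_server=ASTFIPGenDist(ip_range=["48.0.0.1", "48.0.0.254"]),
)
profile = ASTFProfile(default_ip_gen=ip_gen, templates=tcp_profile)

c = ASTFClient(server="127.0.0.1")  # TRex daemon started with --astf
c.connect()
try:
    c.reset()
    c.load_profile(profile)
    c.clear_stats()
    c.start(mult=1000, duration=30)  # mult scales the cps value of the template
    c.wait_on_traffic()
    stats = c.get_stats()            # per-port counters, flows, drops, ...
finally:
    c.disconnect()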
Setup
The Device Under Test (DUT) will run the Open vSwitch daemon (ovs-vswitchd) with the userspace DPDK datapath. The same kind of setup can be used to benchmark any connection tracking device. This is overly simplified and does not represent an actual production workload. However, it will allow stressing the connection tracking code path without bothering with external details.
Base System
Both the OvS userspace datapath and TRex use DPDK. The following settings are common to both machines.
DPDK requires compatible network interfaces. In this example, I will be using the last two ports of an Intel® X710 PCI network card.
[root@* ~]# lscpu | grep -e "^Model name:" -e "^NUMA" -e MHz
NUMA node(s): 1
Model name: Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
CPU MHz: 2700.087
NUMA node0 CPU(s): 0-23
[root@* ~]# grep ^MemTotal /proc/meminfo
MemTotal: 65373528 kB
[root@* ~]# lspci | grep X710 | tail -n2
18:00.2 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
18:00.3 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
The CPUs used by TRex and OvS need to be isolated to minimize disturbance from the other tasks running on Linux. I isolate the CPUs of the NUMA node to which the PCI NIC is connected. CPUs 0 and 12 are left to Linux.
dnf install -y tuned tuned-profiles-cpu-partitioning
cat > /etc/tuned/cpu-partitioning-variables.conf <<EOF
isolated_cores=1-11,13-23
no_balance_cores=1-11,13-23
EOF
tuned-adm profile cpu-partitioning
Finally, DPDK applications require huge pages. It is best to allocate them on boot to ensure that they are all mapped to contiguous chunks of memory.
cat >> /etc/default/grub <<EOF
GRUB_CMDLINE_LINUX="\$GRUB_CMDLINE_LINUX intel_iommu=on iommu=pt"
GRUB_CMDLINE_LINUX="\$GRUB_CMDLINE_LINUX hugepagesz=1G hugepages=32"
EOF
grub2-mkconfig -o /etc/grub2.cfg
dnf install -y driverctl
driverctl set-override 0000:18:00.2 vfio-pci
driverctl set-override 0000:18:00.3 vfio-pci
# reboot is required to apply isolcpus and allocate hugepages on boot
systemctl reboot
Traffic Generator
TRex needs to be compiled from source:
dnf install -y python3 git numactl-devel zlib-devel gcc-c++ gcc
git clone https://github.com/cisco-system-traffic-generator/trex-core ~/trex
cd ~/trex/linux_dpdk
./b configure
taskset 0xffffffffff ./b build
We will use the following configuration in /etc/trex_cfg.yaml:
- version: 2
  interfaces:
    - "18:00.2"
    - "18:00.3"
  rx_desc: 4096
  tx_desc: 4096
  port_info:
    - dest_mac: "04:3f:72:f2:8f:33"
      src_mac: "04:3f:72:f2:8f:32"
    - dest_mac: "04:3f:72:f2:8f:32"
      src_mac: "04:3f:72:f2:8f:33"
  c: 22
  memory:
    mbuf_64: 30000
    mbuf_128: 500000
    mbuf_256: 30717
    mbuf_512: 30720
    mbuf_1024: 30720
    mbuf_2048: 4096
  platform:
    master_thread_id: 0
    latency_thread_id: 12
    dual_if:
      - socket: 0
        threads: [
          1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
          13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
        ]
[root@tgen ~]# cd ~/trex/scripts
[root@tgen scripts]# ./t-rex-64 -i --astf
...
The TRex daemon will run in the foreground. The cps_ndr.py script will connect to it via the JSON-RPC API in a separate terminal.
Device Under Test
First, let’s compile and install DPDK:
dnf install -y git meson ninja-build gcc python3-pyelftools
git clone -b v21.11 https://github.com/DPDK/dpdk ~/dpdk
cd ~/dpdk
meson build
taskset 0xffffff ninja -C ~/dpdk/build install
Then, compile and install OVS. In the following console excerpt, I explicitly check out version 2.17.2. OVS will later be recompiled at version 3.0.0 before running all the tests again:
dnf install -y gcc-c++ make libtool autoconf automake
git clone -b v2.17.2 https://github.com/openvswitch/ovs ~/ovs
cd ~/ovs
./boot.sh
PKG_CONFIG_PATH="/usr/local/lib64/pkgconfig" ./configure --with-dpdk=static
taskset 0xffffff make install -j24
/usr/local/share/openvswitch/scripts/ovs-ctl start
Here I enable the DPDK user space datapath and configure a bridge with two ports. For now, there is only one RX queue per port and one CPU is assigned to poll them. I will update these parameters along the way.
I set the conntrack table size to a relatively large value (5M entries) to reduce the risk of it getting full during tests. Also, I configure the various timeout policies to match the traffic profiles I am about to send. These aggressive timeouts will help prevent the table from getting full. The default timeout values are very conservative and too long to achieve high numbers of connections per second without filling the conntrack table.
ovs-vsctl set open_vswitch . other_config:dpdk-init=true
ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x4"
/usr/local/share/openvswitch/scripts/ovs-ctl restart
ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
ovs-vsctl add-port br0 port0 -- \
set Interface port0 type=dpdk options:dpdk-devargs=0000:18:00.2
ovs-vsctl add-port br0 port1 -- \
set Interface port1 type=dpdk options:dpdk-devargs=0000:18:00.3
ovs-appctl dpctl/ct-set-maxconns 5000000
# creating an empty datapath record is required to add a zone timeout policy
ovs-vsctl -- --id=@m create Datapath datapath_version=0 -- \
set Open_vSwitch . datapaths:"netdev"=@m
ovs-vsctl add-zone-tp netdev zone=0 \
udp_first=1 udp_single=1 udp_multiple=30 tcp_syn_sent=1 \
tcp_syn_recv=1 tcp_fin_wait=1 tcp_time_wait=1 tcp_close=1 \
tcp_established=30
These OpenFlow rules send untracked IP packets through conntrack, commit new connections initiated from port0, allow packets from established connections in both directions, and drop everything else:
cat > ~/ct-flows.txt << EOF
priority=1 ip ct_state=-trk actions=ct(table=0)
priority=1 ip ct_state=+trk+new in_port=port0 actions=ct(commit),normal
priority=1 ip ct_state=+trk+est actions=normal
priority=0 actions=drop
EOF
Test Procedure
The cps_ndr.py script that I have written has multiple parameters to control the nature of the generated connections:
- Ratio of TCP/UDP connections.
- Number of data messages (request + response) exchanged per connection (excluding protocol overhead).
- Size of data messages in bytes (to emulate TCP maximum segment size).
- Time in milliseconds that the simulated servers wait before sending a response to a request.
Note: In the context of this benchmark, I will intentionally keep the data message size fixed at 20 bytes to avoid being limited by the 10 Gbit/s link bandwidth.
I will use these parameters to stress different parts of the connection tracking code path:
Short-lived connections
40 data bytes per connection (1 request + 1 reply), with no wait by the server before sending the reply. These will allow stressing the conntrack creation & destruction code path.
Example run:
[root@tgen scripts]# ./cps_ndr.py --sample-time 30 --max-iterations 8 \
> --error-threshold 0.02 --udp-percent 1 --num-messages 1 \
> --message-size 20 --server-wait 0 -m 1k -M 100k
... iteration #1: lower=1.0K current=50.5K upper=100K
▼▼▼ Flows: active 26.8K (50.1K/s) TX: 215Mb/s (345Kp/s) RX: 215Mb/s (345Kp/s) Size: ~4.5B
err dropped: 1.6K pkts (1.6K/s) ~ 0.4746%
... iteration #2: lower=1.0K current=25.8K upper=50.5K
▲▲▲ Flows: active 12.9K (25.7K/s) TX: 112Mb/s (179Kp/s) RX: 112Mb/s (179Kp/s) Size: ~4.5B
... iteration #3: lower=25.8K current=38.1K upper=50.5K
▲▲▲ Flows: active 19.1K (38.1K/s) TX: 166Mb/s (266Kp/s) RX: 166Mb/s (266Kp/s) Size: ~4.5B
... iteration #4: lower=38.1K current=44.3K upper=50.5K
▼▼▼ Flows: active 22.2K (44.2K/s) TX: 192Mb/s (307Kp/s) RX: 191Mb/s (307Kp/s) Size: ~4.5B
err dropped: 1.3K pkts (125/s) ~ 0.0408%
... iteration #5: lower=38.1K current=41.2K upper=44.3K
▲▲▲ Flows: active 20.7K (41.2K/s) TX: 178Mb/s (286Kp/s) RX: 178Mb/s (286Kp/s) Size: ~4.5B
... iteration #6: lower=41.2K current=42.8K upper=44.3K
▼▼▼ Flows: active 21.5K (42.6K/s) TX: 185Mb/s (296Kp/s) RX: 185Mb/s (296Kp/s) Size: ~4.5B
err dropped: 994 pkts (99/s) ~ 0.0335%
... iteration #7: lower=41.2K current=42.0K upper=42.8K
▼▼▼ Flows: active 21.0K (41.8K/s) TX: 181Mb/s (290Kp/s) RX: 181Mb/s (290Kp/s) Size: ~4.5B
err dropped: 877 pkts (87/s) ~ 0.0301%
... iteration #8: lower=41.2K current=41.6K upper=42.0K
▲▲▲ Flows: active 20.9K (41.4K/s) TX: 180Mb/s (289Kp/s) RX: 180Mb/s (289Kp/s) Size: ~4.5B
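The lower/current/upper values above come from an RFC 2544-like binary search on the offered connection rate. Here is a minimal sketch of the idea; the actual cps_ndr.py script is more involved (it samples TRex statistics over --sample-time seconds and derives the drop ratio from them):
def find_max_rate(measure_drop_ratio, lower, upper,
                  max_iterations=8, error_threshold=0.02):
    # Binary search for the highest connection rate whose packet drop
    # ratio stays below the acceptable error threshold.
    best = lower
    for _ in range(max_iterations):
        current = (lower + upper) / 2
        if measure_drop_ratio(current) <= error_threshold:
            best = lower = current   # passed: search higher
        else:
            upper = current          # failed: search lower
    return best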
Long-lived connections
20K data bytes per connection (500 requests + 500 replies) spread over 25 seconds. These will allow stressing the conntrack lookup code path.
Example run:
[root@tgen scripts]# ./cps_ndr.py --sample-time 120 --max-iterations 8 \
> --error-threshold 0.02 --udp-percent 1 --num-messages 500 \
> --message-size 20 --server-wait 50 -m 500 -M 2k
... iteration #1: lower=500 current=1.2K upper=2.0K
▼▼▼ Flows: active 48.5K (1.2K/s) TX: 991Mb/s (1.5Mp/s) RX: 940Mb/s (1.4Mp/s) Size: ~13.3B
err dropped: 1.8M pkts (30.6K/s) ~ 2.4615%
... iteration #2: lower=500 current=875 upper=1.2K
▲▲▲ Flows: active 22.5K (871/s) TX: 871Mb/s (1.3Mp/s) RX: 871Mb/s (1.3Mp/s) Size: ~13.3B
... iteration #3: lower=875 current=1.1K upper=1.2K
▼▼▼ Flows: active 33.8K (1.1K/s) TX: 967Mb/s (1.4Mp/s) RX: 950Mb/s (1.4Mp/s) Size: ~13.3B
err dropped: 621K pkts (10.3K/s) ~ 0.7174%
... iteration #4: lower=875 current=968 upper=1.1K
▲▲▲ Flows: active 24.9K (965/s) TX: 961Mb/s (1.4Mp/s) RX: 962Mb/s (1.4Mp/s) Size: ~13.3B
... iteration #5: lower=968 current=1.0K upper=1.1K
▼▼▼ Flows: active 29.8K (1.0K/s) TX: 965Mb/s (1.4Mp/s) RX: 957Mb/s (1.4Mp/s) Size: ~13.3B
err dropped: 334K pkts (5.6K/s) ~ 0.3830%
... iteration #6: lower=968 current=992 upper=1.0K
▼▼▼ Flows: active 25.5K (989/s) TX: 964Mb/s (1.4Mp/s) RX: 964Mb/s (1.4Mp/s) Size: ~13.3B
err dropped: 460 pkts (460/s) ~ 0.0314%
... iteration #7: lower=968 current=980 upper=992
▼▼▼ Flows: active 25.3K (977/s) TX: 962Mb/s (1.4Mp/s) RX: 962Mb/s (1.4Mp/s) Size: ~13.3B
err dropped: 397 pkts (397/s) ~ 0.0272%
... iteration #8: lower=968 current=974 upper=980
▲▲▲ Flows: active 25.1K (971/s) TX: 969Mb/s (1.5Mp/s) RX: 969Mb/s (1.5Mp/s) Size: ~13.3B
Results
Both the short-lived and long-lived connection profiles will be tested against OVS versions 2.17.2 and 3.0.0. Different configurations will be tested to check whether the performance scales with the number of CPUs and receive queues. The actual numbers that I measured should be taken with a grain of salt: connection tracking performance is highly dependent on hardware, traffic profile and overall system load. I only provide them to give a general idea of the improvement brought by OVS 3.0.0.
Traffic Generator Calibration
This is to demonstrate the maximum performance that TRex is able to achieve with this configuration and hardware. The tests were executed with a cable connected between port0 and port1 of the traffic generator machine.
Type | Connection Rate | Active Flows | Bandwidth | Packet Rate |
---|---|---|---|---|
Short-Lived | 1.8M conn/s | 1.7M | 8.4G bit/s | 12.7M pkt/s |
Long-Lived | 11.1K conn/s | 898K | 8.0G bit/s | 11.4M pkt/s |
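As a quick sanity check, these calibration figures are internally consistent. For example, some back-of-the-envelope arithmetic on the short-lived row:
# Short-lived calibration: 1.8M conn/s, 12.7M pkt/s, 8.4 Gbit/s.
conn_rate = 1.8e6
pkt_rate = 12.7e6
bit_rate = 8.4e9

print(round(pkt_rate / conn_rate, 1))  # ~7.1 packets per connection
                                       # (TCP handshake, 1 request/reply, teardown)
print(round(bit_rate / 8 / pkt_rate))  # ~83 bytes per frame on average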
1 CPU, 1 queue per port, without connection tracking
ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x4"
ovs-vsctl set Interface port0 options:n_rxq=1
ovs-vsctl set Interface port1 options:n_rxq=1
ovs-ofctl del-flows br0
ovs-ofctl add-flow br0 action=normal
Version | Short-Lived Connections | Active Flows | Bandwidth | Packet Rate | Difference |
---|---|---|---|---|---|
2.17.2 | 1.0M conn/s | 524.8K | 4.5G bit/s | 7.3M pkt/s | |
3.0.0 | 1.0M conn/s | 513.1K | 4.5G bit/s | 7.1M pkt/s | -1.74% |
Version | Long-Lived Connections | Active Flows | Bandwidth | Packet Rate | Difference |
---|---|---|---|---|---|
2.17.2 | 3.1K conn/s | 79.9K | 3.1G bit/s | 4.7M pkt/s | |
3.0.0 | 2.8K conn/s | 71.9K | 2.8G bit/s | 4.2M pkt/s | -9.82% |
There is a performance drop without connection tracking enabled between v2.17.2 and v3.0.0. This is completely unrelated to the conntrack optimization patch series I am focusing on. It may be caused by some discrepancies in the test procedure, but it might also have been introduced by another patch series between the two tested versions.
1 CPU, 1 queue per port
ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x4"
ovs-vsctl set Interface port0 options:n_rxq=1
ovs-vsctl set Interface port1 options:n_rxq=1
ovs-ofctl del-flows br0
ovs-ofctl add-flows br0 ~/ct-flows.txt
Version | Short-Lived Connections | Active Flows | Bandwidth | Packet Rate | Difference |
---|---|---|---|---|---|
2.17.2 | 39.7K conn/s | 20.0K | 172.0M bit/s | 275.8K pkt/s | |
3.0.0 | 48.2K conn/s | 24.3K | 208.9M bit/s | 334.9K pkt/s | +21.36% |
Version | Long-Lived Connections | Active Flows | Bandwidth | Packet Rate | Difference |
---|---|---|---|---|---|
2.17.2 | 959 conn/s | 24.7K | 956.6M bit/s | 1.4M pkt/s | |
3.0.0 | 1.2K conn/s | 31.5K | 1.2G bit/s | 1.8M pkt/s | +28.15% |
Already here, we can see that the patch series improves the single threaded performance of connection tracking, in both the creation & destruction and the lookup code paths. This should be kept in mind when looking at improvements in multi-threaded performance.
2 CPUs, 1 queue per port
ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x2002"
ovs-vsctl set Interface port0 options:n_rxq=1
ovs-vsctl set Interface port1 options:n_rxq=1
ovs-ofctl del-flows br0
ovs-ofctl add-flows br0 ~/ct-flows.txt
Version | Short-Lived Connections | Active Flows | Bandwidth | Packet Rate | Difference |
---|---|---|---|---|---|
2.17.2 | 39.9K conn/s | 20.0K | 172.8M bit/s | 277.0K pkt/s | |
3.0.0 | 46.8K conn/s | 23.5K | 202.7M bit/s | 325.0K pkt/s | +17.28% |
Version | Long-Lived Connections | Active Flows | Bandwidth | Packet Rate | Difference |
---|---|---|---|---|---|
2.17.2 | 885 conn/s | 22.7K | 883.1M bit/s | 1.3M pkt/s | |
3.0.0 | 1.1K conn/s | 28.6K | 1.1G bit/s | 1.7M pkt/s | +25.19% |
It is worth noting that assigning twice as many CPUs to packet processing does not double the performance. Far from it, in fact: the numbers are essentially the same as (if not lower than) with only one CPU.
This may be because there is only one RX queue per port and each CPU processes a single port.
2 CPUs, 2 queues per port
ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x2002"
ovs-vsctl set Interface port0 options:n_rxq=2
ovs-vsctl set Interface port1 options:n_rxq=2
ovs-ofctl del-flows br0
ovs-ofctl add-flows br0 ~/ct-flows.txt
Version | Short-Lived Connections | Active Flows | Bandwidth | Packet Rate | Difference |
---|---|---|---|---|---|
2.17.2 | 48.3K conn/s | 24.3K | 208.8M bit/s | 334.8K pkt/s | |
3.0.0 | 65.9K conn/s | 33.2K | 286.8M bit/s | 459.9K pkt/s | +36.41% |
Version | Long-Lived Connections | Active Flows | Bandwidth | Packet Rate | Difference |
---|---|---|---|---|---|
2.17.2 | 1.1K conn/s | 29.1K | 1.1G bit/s | 1.7M pkt/s | |
3.0.0 | 1.4K conn/s | 37.0K | 1.4G bit/s | 2.2M pkt/s | +26.77% |
For short-lived connections, we begin to see improvement beyond the single threaded performance gain. Lock contention was reduced in the insertion/deletion of conntrack entries.
With two CPUs and two queues, if we take the single threaded performance out of the picture, there seems to be no improvement in conntrack lookup with multiple threads.
4 CPUs, 2 queues per port
ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x6006"
ovs-vsctl set Interface port0 options:n_rxq=2
ovs-vsctl set Interface port1 options:n_rxq=2
ovs-ofctl del-flows br0
ovs-ofctl add-flows br0 ~/ct-flows.txt
Version | Short-Lived Connections | Active Flows | Bandwidth | Packet Rate | Difference |
---|---|---|---|---|---|
2.17.2 | 47.4K conn/s | 23.9K | 206.2M bit/s | 330.6K pkt/s | |
3.0.0 | 49.1K conn/s | 24.7K | 212.1M bit/s | 340.1K pkt/s | +3.53% |
Version | Long-Lived Connections | Active Flows | Bandwidth | Packet Rate | Difference |
---|---|---|---|---|---|
2.17.2 | 981 conn/s | 25.2K | 977.7M bit/s | 1.5M pkt/s | |
3.0.0 | 2.0K conn/s | 52.4K | 2.0G bit/s | 3.1M pkt/s | +108.31% |
With this configuration, the short-lived connection rate of 3.0.0 has dropped compared to 2 CPUs and 2 queues per port. This is not a fluke; the numbers are consistent across multiple runs. This would warrant some scrutiny, but it does not invalidate all the work that has been done.
4 CPUs, 4 queues per port
ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x6006"
ovs-vsctl set Interface port0 options:n_rxq=4
ovs-vsctl set Interface port1 options:n_rxq=4
ovs-ofctl del-flows br0
ovs-ofctl add-flows br0 ~/ct-flows.txt
Version | Short-Lived Connections | Active Flows | Bandwidth | Packet Rate | Difference |
---|---|---|---|---|---|
2.17.2 | 66.1K conn/s | 33.2K | 286.4M bit/s | 459.2K pkt/s | |
3.0.0 | 100.8K conn/s | 50.6K | 437.0M bit/s | 700.6K pkt/s | +52.55% |
Version | Long-Lived Connections | Active Flows | Bandwidth | Packet Rate | Difference |
---|---|---|---|---|---|
2.17.2 | 996 conn/s | 25.9K | 994.2M bit/s | 1.5M pkt/s | |
3.0.0 | 2.6K conn/s | 67.0K | 2.6G bit/s | 3.9M pkt/s | +162.89% |
8 CPUs, 4 queues per port
ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x1e01e"
ovs-vsctl set Interface port0 options:n_rxq=4
ovs-vsctl set Interface port1 options:n_rxq=4
ovs-ofctl del-flows br0
ovs-ofctl add-flows br0 ~/ct-flows.txt
Version | Short-Lived Connections | Active Flows | Bandwidth | Packet Rate | Difference |
---|---|---|---|---|---|
2.17.2 | 62.2K conn/s | 31.3K | 269.8M bit/s | 432.5K pkt/s | |
3.0.0 | 90.1K conn/s | 45.2K | 390.9M bit/s | 626.7K pkt/s | +44.89% |
Version | Long-Lived Connections | Active Flows | Bandwidth | Packet Rate | Difference |
---|---|---|---|---|---|
2.17.2 | 576 conn/s | 17.1K | 567.2M bit/s | 852.5K pkt/s | |
3.0.0 | 3.8K conn/s | 97.8K | 3.8G bit/s | 5.7M pkt/s | +562.76% |
8 CPUs, 8 queues per port
ovs-vsctl set open_vswitch . other_config:pmd-cpu-mask="0x1e01e"
ovs-vsctl set Interface port0 options:n_rxq=8
ovs-vsctl set Interface port1 options:n_rxq=8
ovs-ofctl del-flows br0
ovs-ofctl add-flows br0 ~/ct-flows.txt
Version | Short-Lived Connections | Active Flows | Bandwidth | Packet Rate | Difference |
---|---|---|---|---|---|
2.17.2 | 50.6K conn/s | 25.5K | 219.5M bit/s | 351.9K pkt/s | |
3.0.0 | 100.9K conn/s | 50.7K | 436.0M bit/s | 698.9K pkt/s | +99.36% |
Version | Long-Lived Connections | Active Flows | Bandwidth | Packet Rate | Difference |
---|---|---|---|---|---|
2.17.2 | 541 conn/s | 14.0K | 539.2M bit/s | 810.3K pkt/s | |
3.0.0 | 4.8K conn/s | 124.1K | 4.8G bit/s | 7.2M pkt/s | +792.83% |
Analysis
Scaling
Apart from the small blip with 4 CPUs and 2 queues per port, the conntrack insertion & deletion code path has improved consistently in OvS 3.0.0. The multi-threaded lock contention remains, albeit less noticeable than with OvS 2.17.2.
The conntrack lookup code path is where the optimizations done in OvS 3.0.0 really shine. The reduction in multi-threaded lock contention makes the performance scale significantly better with the number of CPUs.
Profiling
Here is a stripped-down perf report of both versions while running with 8 CPUs and 8 RX queues per port under maximum long-lived connection load with conntrack flows enabled. Only the events of a single CPU were captured.
perf record -g -C 1 sleep 60
perf report -U --no-children | grep '\[[\.k]\]' | head -15 > profile-$version.txt
# also generate flame graphs for interactive visual inspection of hot spots
git clone https://github.com/brendangregg/FlameGraph
perf script | ./FlameGraph/stackcollapse-perf.pl |
./FlameGraph/flamegraph.pl --title "OvS $version Connection Tracking" \
--subtitle "1 PMD thread shown out of 8" --minwidth 1 --width 2000 \
--height 30 --fontsize 13 --fonttype sans-serif --hash --bgcolors grey \
- > ovs-conntrack-$version.svg
I have manually annotated the lines that are directly related to acquiring mutexes (they start with a * character). When a CPU is waiting to acquire a mutex, it is not processing any network traffic; it is waiting for another CPU to release the lock.
2.17.2
The profiled CPU spends almost 40% of its cycles acquiring locks and waiting for other CPUs to release locks.
* 30.99% pmd-c01/id:5 libc.so.6 [.] pthread_mutex_lock@@GLIBC_2.2.5
12.27% pmd-c01/id:5 ovs-vswitchd [.] dp_netdev_process_rxq_port
5.18% pmd-c01/id:5 ovs-vswitchd [.] netdev_dpdk_rxq_recv
4.24% pmd-c01/id:5 ovs-vswitchd [.] pmd_thread_main
3.93% pmd-c01/id:5 ovs-vswitchd [.] pmd_perf_end_iteration
* 3.63% pmd-c01/id:5 libc.so.6 [.] __GI___pthread_mutex_unlock_usercnt
3.62% pmd-c01/id:5 ovs-vswitchd [.] i40e_recv_pkts_vec_avx2
* 2.76% pmd-c01/id:5 [kernel.kallsyms] [k] syscall_exit_to_user_mode
* 0.91% pmd-c01/id:5 libc.so.6 [.] __GI___lll_lock_wait
* 0.18% pmd-c01/id:5 [kernel.kallsyms] [k] __x64_sys_futex
* 0.17% pmd-c01/id:5 [kernel.kallsyms] [k] futex_wait
* 0.12% pmd-c01/id:5 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
* 0.11% pmd-c01/id:5 libc.so.6 [.] __GI___lll_lock_wake
* 0.08% pmd-c01/id:5 [kernel.kallsyms] [k] do_syscall_64
* 0.06% pmd-c01/id:5 [kernel.kallsyms] [k] do_futex
3.0.0
It is obvious that 3.0.0
has much less lock contention and therefore will
scale better with the number of CPUs.
15.30% pmd-c01/id:5 ovs-vswitchd [.] dp_netdev_input__
8.62% pmd-c01/id:5 ovs-vswitchd [.] conn_key_lookup
7.88% pmd-c01/id:5 ovs-vswitchd [.] miniflow_extract
7.75% pmd-c01/id:5 ovs-vswitchd [.] cmap_find
* 6.92% pmd-c01/id:5 libc.so.6 [.] pthread_mutex_lock@@GLIBC_2.2.5
5.15% pmd-c01/id:5 ovs-vswitchd [.] dpcls_subtable_lookup_mf_u0w4_u1w1
4.16% pmd-c01/id:5 ovs-vswitchd [.] cmap_find_batch
4.10% pmd-c01/id:5 ovs-vswitchd [.] tcp_conn_update
3.86% pmd-c01/id:5 ovs-vswitchd [.] dpcls_subtable_lookup_mf_u0w5_u1w1
3.51% pmd-c01/id:5 ovs-vswitchd [.] conntrack_execute
3.42% pmd-c01/id:5 ovs-vswitchd [.] i40e_xmit_fixed_burst_vec_avx2
0.77% pmd-c01/id:5 ovs-vswitchd [.] dp_execute_cb
0.72% pmd-c01/id:5 ovs-vswitchd [.] netdev_dpdk_rxq_recv
0.07% pmd-c01/id:5 ovs-vswitchd [.] i40e_xmit_pkts_vec_avx2
0.04% pmd-c01/id:5 ovs-vswitchd [.] dp_netdev_input
Final Words
I hope this gave you some ideas for benchmarking and profiling connection tracking with TRex and perf. Please let me know if you have any questions.
Kudos to Paolo Valerio and Gaëtan Rivet for their work on optimizing the user space OvS conntrack implementation.