VIP failover and TCP persist

Background

Let me start with a statement that everything that follows hinges on:

Suppose a host has an interface with IP address X. Suppose this host has an established TCP session open on socket S, which is bound to X. Now remove X from the host. If data is then written to S, that data is ACKed and the advertised window is set to 0.

If someone can definitively confirm this, I would appreciate it. Because of the nature of the problem, I am having a difficult time getting hard data: I can't observe the behavior via tcpdump, which I think is because the packets never actually traverse an interface; it all happens inside the stack, where I can't watch it directly.

What I can offer are observations that seem to confirm it. Here's my reproducer:

  1. Add a "VIP" for our server to bind to:

    [root@rhel7 ~]# ip addr add 192.168.202.254/24 dev eth1
    [root@rhel7 ~]# ip a
    ...
    5: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
        link/ether 52:54:00:58:02:4c brd ff:ff:ff:ff:ff:ff
        inet 192.168.202.235/24 brd 192.168.202.255 scope global dynamic eth1
           valid_lft 3544sec preferred_lft 3544sec
        inet 192.168.202.254/24 scope global secondary eth1
           valid_lft forever preferred_lft forever
          
  2. Start a server:

    [root@rhel7 ~]# socat tcp-listen:12345,bind=192.168.202.254 -
          
  3. Poke a hole in the firewall:

    [root@rhel7 ~]# iptables -I INPUT 1 -p tcp -d 192.168.202.254 -j ACCEPT
          
  4. Connect a client from another host:

    [root@mac5254002a5804 ~]# socat tcp:192.168.202.254:12345 -
          

    (Side note: if you connect from the same host, you can hang both ends of the connection!)

  5. Confirm connection:

    [root@rhel7 ~]# ss -ntpo src :12345
    State       Recv-Q Send-Q       Local Address:Port         Peer Address:Port 
    ESTAB       0      0          192.168.202.254:12345     192.168.202.161:33242  users:(("socat",12107,4))
          
  6. Remove the IP address the server is bound to:

    [root@rhel7 ~]# ip addr del 192.168.202.254/24 dev eth1
    [root@rhel7 ~]# ip addr
    ...
    5: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
        link/ether 52:54:00:58:02:4c brd ff:ff:ff:ff:ff:ff
        inet 192.168.202.235/24 brd 192.168.202.255 scope global dynamic eth1
           valid_lft 3147sec preferred_lft 3147sec
          
  7. Write some data from the server:

    [root@rhel7 ~]# fg
    socat tcp-listen:12345,bind=192.168.202.254 -
    writing anything here will trigger the persist timer!
          
  8. Confirm the connection now has the persist timer armed:

    [root@rhel7 ~]# ss -ntpo src :12345
    State       Recv-Q Send-Q       Local Address:Port         Peer Address:Port 
    ESTAB       0      54         192.168.202.254:12345     192.168.202.161:33242  timer:(persist,15sec,7) users:(("socat",12107,4))
    

At this point, the connection is hung until either the VIP comes back or the persist timer runs out of retries. How long that takes is a function of the TCP RTO and the tcp_retries2 sysctl, and it generally works out to some tens of minutes.
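
For reference, the commands below show how I'd watch this play out; they are illustrative and not part of the capture above. net.ipv4.tcp_retries2 caps the number of window probes (the Linux default is 15), and re-running the ss command from step 8 shows the probe counter climbing while the interval roughly doubles, capped at the kernel's 120-second maximum RTO -- which is where "tens of minutes" comes from.

    # Window probes are limited by tcp_retries2 (15 is the stock Linux default):
    [root@rhel7 ~]# sysctl net.ipv4.tcp_retries2
    net.ipv4.tcp_retries2 = 15

    # Watch the timer:(persist,<time-to-next-probe>,<probe-count>) field from
    # step 8 as the backoff grows:
    [root@rhel7 ~]# watch -n 1 'ss -ntpo src :12345'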

The fact that the persist timer fires at all is what leads me to believe the original statement is true. As far as I know, a zero-window ACK is the only thing that can arm it. Granted, internal to the kernel there may never be an actual ACK segment, but it's functionally the same thing here.

How this manifests in the real world

We are using Pacemaker and HAProxy in RHEL-OSP to provide HA OpenStack services. We also use TCP keepalives to help detect failures and close connections. Here's what usually happens during a VIP failover with an idle session:

     client           haproxy          backend
       |<----ESTB------->|<----ESTB------>|
       |                 |                |
                  (VIP failover)
 ("client" henceforth is really local kernel on haproxy)
                                         
       |                 |                |
DROP   |<---KEEPALIVE----|                |
...    |       ...       |                |
DROP   |<---KEEPALIVE----|                |
                (Keepalive Timeout)
       |connection close |                |
       |                 |-----FIN,etc--->|
  

In this scenario, the connection back to the client gets closed in under 10 seconds thanks to the aggressive keepalives we've configured. There is one very important thing to note here -- TCP keepalive packets do not trigger zero-window ACKs or the persist timer, because they carry no payload. This part works exactly as expected.
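
For concreteness, "aggressive keepalives" means kernel-level tuning along the following lines. The values shown are illustrative, not our exact production settings: HAProxy simply enables keepalives on its sockets (option clitcpka / option srvtcpka), and the probe timing comes from these sysctls.

    # Illustrative values: start probing after 3s idle, send 3 probes 2s apart,
    # so a dead peer is detected in roughly 9 seconds.
    [root@rhel7 ~]# sysctl -w net.ipv4.tcp_keepalive_time=3
    [root@rhel7 ~]# sysctl -w net.ipv4.tcp_keepalive_intvl=2
    [root@rhel7 ~]# sysctl -w net.ipv4.tcp_keepalive_probes=3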

However, if there is an active request during the VIP failover, it is likely that the backend will respond before the keepalive timeout is reached. This causes a non-zero-length payload to be written back to the client socket, triggering the persist timer:

     client           haproxy          backend
       |<----ESTB------->|<----ESTB------>|
       |----request----->|                |
       |                 |----request---->|
                  (VIP failover)
 ("client" henceforth is really local kernel on haproxy)
       |                 |                |
DROP   |<---KEEPALIVE----|                |
       |                 |<---response----|
       |<---response-----|                |
       |---ACK win=0---->|                |
                  (persist timer)
DROP   |<--window-probe--|                |
...    |       ...       |                |
DROP   |<--window-probe--|                |
               (Window Probe Timeout)
       |connection close |                |
       |                 |-----FIN,etc--->|                                   
  

In this scenario, the connection may take around 30 minutes to finally close. While the client connection is hung, haproxy holds the backend connection open as well.

This has particularly bad implications when the backend server is RabbitMQ. Queues can be declared auto-delete, meaning they are deleted once every client consuming from them has disconnected. That invites the (probably very poor) assumption that if a queue is present, there is at least one consumer actively processing messages from it. If that consumer happens to be stuck in the state described here, the queue can hang around for quite some time with nobody processing messages from it.
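
One way to see the knock-on effect from the RabbitMQ side (illustrative; the hostname below is hypothetical) is to list each queue's auto_delete flag next to its consumer count -- a queue whose only consumer is stuck behind a hung haproxy connection still reports a live consumer, so auto-delete never kicks in:

    # List queue name, auto_delete flag, and current consumer count; a stuck
    # consumer still counts here even though nothing is draining the queue.
    [root@rabbit ~]# rabbitmqctl list_queues name auto_delete consumers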