How long does it take for the OS to detect a disconnect between an EMS client and server?

Article ID: KB0093739

Products: TIBCO Enterprise Message Service
Versions: Not Applicable

Description

Resolution:

When the network between the client and server is disconnected, or when the remote machine is powered down, the length of time it takes for the OS to detect the disconnect is controlled by the retransmission timeout.


When a client or server does not receive an ACK from the remote machine after sending a packet, the OS starts retransmitting the packet. When the retransmission timeout is reached, or when the OS receives a reply from the remote machine that rejects the connection (for example, an RST), it notifies the application that the connection is broken. The retransmission timeout may differ between operating systems.

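Retransmission logic is built into the OS. A timer called the Retransmission Timeout (RTO) controls how long the sender waits before retransmitting. Its classic initial value is three seconds (RFC 6298 later lowered the initial value to one second). After each retransmission the RTO is doubled, so the sender waits roughly 3, 6, 12, 24, ... seconds, up to an OS-imposed ceiling (120 seconds on Linux), until the maximum number of retries (15 by default on Linux) is reached. On an established connection, TCP also takes the measured round-trip time (RTT) between the two machines into account when setting the RTO, so the total retransmission duration is typically shorter when the two machines are close together.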

On Linux, the default number of retries is 15, which corresponds to a duration of approximately 13 to 30 minutes, as described in the tcp(7) excerpt below. The actual duration can be measured by running tcpdump (Linux) or Wireshark/WinDump (Windows) and unplugging the network cable between the machines.
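
The backoff arithmetic can be sketched as follows. This is a simplified model, not the kernel's actual code: it assumes the RTO doubles after every attempt, is capped at a maximum (120 seconds is the Linux ceiling), and that the connection is given up after the wait that follows the last retry. The function name and its parameters are illustrative only.

```python
def retransmission_schedule(initial_rto, retries=15, max_rto=120.0):
    """Return the list of waits (in seconds) before the connection is given up.

    Simplified model (an assumption, not kernel code): the RTO doubles after
    each attempt, is capped at max_rto, and one final wait follows the last
    retransmission, giving retries + 1 intervals in total.
    """
    waits = []
    rto = initial_rto
    for _ in range(retries + 1):
        waits.append(min(rto, max_rto))  # apply the ceiling to this wait
        rto *= 2                         # exponential backoff
    return waits


if __name__ == "__main__":
    # 0.2 s is the Linux minimum RTO; 3 s is the classic initial value.
    for start in (0.2, 3.0):
        waits = retransmission_schedule(start)
        print(f"initial RTO {start}s: total {sum(waits) / 60:.1f} minutes")
```

Under these assumptions, a 200 ms starting RTO gives a total of roughly 15 minutes and a 3 second starting RTO gives roughly 23 minutes, both inside the 13 to 30 minute range quoted for 15 retries.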

For example, on Linux:

======
tcp_retries1 (integer; default: 3; since Linux 2.2)


The number of times TCP will attempt to retransmit a packet on an established connection normally, without the extra effort of getting the network layers involved. Once we exceed this number of retransmits, we first have the network layer update the route if possible before each new retransmit. The default is the RFC specified minimum of 3.


tcp_retries2 (integer; default: 15; since Linux 2.2)


The maximum number of times a TCP packet is retransmitted in established state before giving up. The default value is 15, which corresponds to a duration of approximately between 13 to 30 minutes, depending on the retransmission timeout. The RFC 1122 specified minimum limit of 100 seconds is typically deemed too short.

======

For example, on Red Hat Enterprise Linux 6.7:

======
sysctl -a | grep retries

net.ipv4.tcp_syn_retries = 5
net.ipv4.tcp_synack_retries = 5
net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15
net.ipv4.tcp_orphan_retries = 0
======
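
The same values can also be read programmatically. The sketch below assumes a Linux host where each `net.ipv4.*` sysctl is exposed as a file under `/proc/sys/net/ipv4`; the function name `read_tcp_retry_settings` is hypothetical and not part of EMS.

```python
from pathlib import Path


def read_tcp_retry_settings(base="/proc/sys/net/ipv4"):
    """Return {name: value} for every tcp_*retries* entry under base."""
    settings = {}
    for entry in sorted(Path(base).glob("tcp_*retries*")):
        # Each sysctl file holds a single integer followed by a newline.
        settings[entry.name] = int(entry.read_text().strip())
    return settings


if __name__ == "__main__":
    for name, value in read_tcp_retry_settings().items():
        print(f"net.ipv4.{name} = {value}")
```

The `base` parameter exists only so the function can be pointed at a different directory, for example in a test.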

In summary, the OS notifies EMS that a connection is broken in the following situations:

- The OS receives an RST packet from the remote machine. For example, if the network resumes while the OS is still retransmitting a packet, the remote machine may answer the next retransmission with an RST. Because the retransmission interval grows with each attempt, the longer the network takes to resume, the longer the OS waits before sending the next retransmission.

- The retransmission timeout is reached. For example, the remote machine is powered down and never replies.
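
At the socket level, these two situations typically surface as different error codes. The sketch below is an illustration on a raw TCP socket, not how the EMS client libraries report errors. It assumes the usual mapping in which a received RST produces `ECONNRESET` and an exhausted retransmission timer produces `ETIMEDOUT`. The function names are hypothetical.

```python
import errno


def classify_disconnect(exc):
    """Map a socket OSError to the detection path that likely produced it."""
    if exc.errno == errno.ECONNRESET:
        return "RST received from the remote machine"
    if exc.errno == errno.ETIMEDOUT:
        return "retransmission timeout reached"
    return "other error"


def read_or_report(sock):
    """Read from a connected socket, reporting why the connection broke."""
    try:
        return sock.recv(4096)
    except OSError as exc:
        print("connection broken:", classify_disconnect(exc))
        return b""
```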
 
To enable the client or server to detect a broken connection faster, configure the heartbeat and timeout parameters in tibemsd.conf (for example, client_heartbeat_server and server_timeout_client_connection). Refer to KB 31944 and the EMS User's Guide.

Issue/Introduction

How long does it take for the OS to detect a disconnect between an EMS client and server?

Additional Information

http://linux.die.net/man/7/tcp
KB 33918
KB 31944
EMS User's Guide