Consider a network problem on the primary EMS host: the connections from the primary EMS server to the NAS server and from the primary host to the secondary host are both cut, while the connection from the secondary EMS host to the NAS remains stable.
Assuming ft_heartbeat and ft_activation are set to their default values of 3 and 10 seconds, the secondary EMS server tries to activate itself after 10 seconds without a heartbeat. The activation does not succeed immediately. How long it takes depends on two variables: the NFS server lease time, and the time at which the active EMS host last renewed its lease with the NFSv4 server. The renewal is performed by the NFSv4 client. Mainstream NFS/NAS vendors set the default lease time to 90 seconds.
As stated in the NFSv4 specification (RFC 3530):
Lease Renewal
The purpose of a lease is to allow a server to remove stale locks that are held by a client that has crashed or is otherwise unreachable. It is not a mechanism for cache consistency and lease renewals may not be denied if the lease interval has not expired.
The following events cause implicit renewal of all of the leases for a given client (i.e., all those sharing a given clientid). Each of these is a positive indication that the client is still active and that the associated state held at the server, for the client, is still valid.
......
Ideally, system calls such as read(), write(), and fcntl() renew the lease automatically. Suppose the last interaction between the primary EMS host and the NFS server took place 10 seconds before the network failed. The standby server can then acquire the lock on the shared datastore file 80 seconds after the primary EMS host loses its network, because 10 + 80 = 90: the NFS server releases the lock once the lease time has elapsed without a renewal request from the current holder. In other words, the standby server should wait no longer than 90 seconds before it becomes active. However, if the previously active server recovers from the network loss after the secondary server has become active, it also continues as an active server. This is the "Dual Active" situation.
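The timing above can be worked through in a few lines of Python. This is only a simplified model, not EMS or NFS code: the function name and its parameters are made up for illustration, and it assumes the standby gets the lock as soon as both the ft_activation interval and the remaining lease time have passed.

```python
def seconds_until_takeover(lease_time, last_renewal_age, ft_activation=10):
    """Estimate the seconds between the network failure and standby activation.

    lease_time       -- NFS server lease time in seconds (90 by default on many servers)
    last_renewal_age -- seconds between the primary's last lease renewal and the failure
    ft_activation    -- seconds without a heartbeat before the standby tries to activate

    Assumption: the standby succeeds as soon as it is both trying to activate
    and the primary's lease has expired on the NFS server.
    """
    if not 0 <= last_renewal_age <= lease_time:
        raise ValueError("last_renewal_age must be between 0 and lease_time")
    lease_remaining = lease_time - last_renewal_age
    return max(ft_activation, lease_remaining)


# Last renewal 10 seconds before the failure, 90-second lease:
print(seconds_until_takeover(90, 10))  # 80
# Worst case: the lease was renewed at the moment the network failed.
print(seconds_until_takeover(90, 0))   # 90
```

Under these assumptions the result never exceeds the lease time, which matches the 90-second upper bound stated above.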
The problem lies in the NFSv4 client implementation on Red Hat Enterprise Linux. Upon network recovery, the NFSv4 client should check its state and, because the lease has already expired, return an error on any read() or write() to the shared datastore file. On affected kernels this error is not returned, so the previously active server keeps running.
This issue is fixed in kernel 2.6.32-431.28.1.el6 and later. With the fixed kernel, write() returns EIO to the caller, and the EMS instance that lost the lease exits on the write() error.
Note: This problem should not occur with Red Hat Enterprise Linux 7. For more information, see https://access.redhat.com/solutions/1179643 or contact Red Hat Support.
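The following Python sketch shows the general pattern of treating EIO on a write to the shared store as fatal. It is not EMS source code: StoreWriteFailed and write_store are hypothetical names, and the only assumption taken from the text is that write() fails with EIO once the lease has been lost.

```python
import errno
import os


class StoreWriteFailed(Exception):
    """Raised when a write to the shared store fails with EIO (hypothetical name)."""


def write_store(fd, data, write=os.write):
    """Write data to fd and turn an EIO failure into StoreWriteFailed.

    The write argument exists only so the failure path can be exercised
    without a real NFS mount.
    """
    try:
        return write(fd, data)
    except OSError as exc:
        if exc.errno == errno.EIO:
            # Assumption from the text: EIO here means the lease was lost,
            # so the caller should stop acting as the active server.
            raise StoreWriteFailed("write to shared store failed with EIO") from exc
        raise  # Any other error is left for the caller to handle.
```

A caller would typically catch StoreWriteFailed at the top level and shut down, so that only one server remains active.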
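A host's kernel release can be compared against the fixed version with a short Python check. This is a rough sketch: kernel_key and has_lease_fix are hypothetical helper names, and the comparison looks only at the numeric fields before ".el", which is simpler than full RPM version comparison.

```python
import re

FIXED_RELEASE = "2.6.32-431.28.1.el6"


def kernel_key(release):
    """Turn a release such as '2.6.32-431.28.1.el6.x86_64' into a tuple of ints."""
    head = release.split(".el", 1)[0]  # drop the distribution tag and architecture
    return tuple(int(n) for n in re.findall(r"\d+", head))


def has_lease_fix(release):
    """Return True if the release is at or above the fixed kernel version."""
    return kernel_key(release) >= kernel_key(FIXED_RELEASE)


print(has_lease_fix("2.6.32-431.el6.x86_64"))       # False
print(has_lease_fix("2.6.32-431.28.1.el6.x86_64"))  # True
print(has_lease_fix("3.10.0-123.el7.x86_64"))       # True
```

On a live system the release string can be obtained with platform.release() from the standard library, or with uname -r.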