Products | Versions |
---|---|
TIBCO Enterprise Message Service | - |
Not Applicable | - |
Resolution:
Description:
The file-share implementation must support file locking in a distributed environment, where processes running on different nodes request a lock on the same file. Explicit tests or guarantees from the vendor are needed to ensure that while one running process holds the lock on a file, no other process on the same or a different machine can obtain the lock on that file.
Test 1: Only One Process Can Lock the File
1. Configure two EMS servers on different systems with the same data store located on a NAS (network-attached storage) device. The second EMS server to start has its "ft_active" parameter set to a non-existent EMS server URL.
2. Start the first EMS server and wait until it becomes active.
3. Start the second EMS server, which will try to obtain a lock on the data store file upon startup.
Expected Result:
The file-locking attempt by the second server should fail, and the second server should not become active, because the first EMS server on a different machine has already locked the file.
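Before running Test 1 with real EMS servers, the cross-node exclusion it depends on can be probed directly. The following is a minimal sketch, assuming the shared file system honours POSIX advisory locks taken through Python's `fcntl.lockf` on a Unix-like host. This article does not state which locking call EMS itself uses, and `try_exclusive_lock` is a hypothetical helper name chosen for illustration.

```python
import errno
import fcntl
import os


def try_exclusive_lock(path):
    """Try to take an exclusive, non-blocking lock on `path`.

    Returns an open file descriptor that holds the lock, or None if
    another process already holds a conflicting lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        os.close(fd)
        if exc.errno in (errno.EACCES, errno.EAGAIN):
            return None  # the lock is held by another process
        raise  # any other error (for example ENOLCK) is a real failure
    return fd  # keep this descriptor open for as long as the lock is needed
```

To use it, call the function on one node and keep the returned descriptor open, then call it on a second node against the same scratch file on the NAS mount (not the live EMS data store). If the storage behaves as Test 1 requires, the second call returns `None`.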
The storage system should release the file lock when the owner process has terminated. A simple test such as killing the primary EMS server process can be done to verify the failover to the standby server.
Test 2: File Lock Release Upon Process Failure
1. Run two EMS servers as a primary and standby pair on two nodes with proper configuration and data store on NAS.
2. Terminate the primary EMS server process.
Expected Result:
The standby server detects the failure of the primary server and succeeds in taking over the active role.
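The release-on-termination behaviour can also be exercised without EMS. The sketch below assumes POSIX `fcntl` locks on a Unix-like host: it forks a child that stands in for the primary server and holds the lock, so the caller can kill it and check that the lock becomes available. `spawn_lock_holder` and `lock_is_free` are hypothetical helpers, not part of any EMS tooling.

```python
import errno
import fcntl
import os
import signal


def spawn_lock_holder(path):
    """Fork a child that locks `path` and sleeps; return the child's pid."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: stands in for the primary server
        try:
            fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
            fcntl.lockf(fd, fcntl.LOCK_EX)
            os.write(w, b"1")  # tell the parent the lock is held
            signal.pause()     # hold the lock until the process is killed
        finally:
            os._exit(1)
    os.close(w)
    ready = os.read(r, 1)  # block until the child reports the lock
    os.close(r)
    if ready != b"1":
        raise RuntimeError("lock holder failed to start")
    return pid


def lock_is_free(path):
    """Return True if an exclusive lock on `path` can be taken right now."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError as exc:
        if exc.errno in (errno.EACCES, errno.EAGAIN):
            return False
        raise
    finally:
        os.close(fd)  # closing the descriptor drops the probe's own lock
```

Calling `spawn_lock_holder`, sending the child `SIGKILL`, and then calling `lock_is_free` mirrors the two steps of Test 2. On a single machine this only checks process-level release; to say anything about the NAS, the holder and the probe would need to run on different nodes against a scratch file on the shared mount.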
However, in a network environment, a temporary loss of network connectivity can occur between the NAS device and the machine running the EMS server. In this case, the lock owner, the primary EMS server, is still operational. Thus, the correct behavior is for the NAS device to maintain the file lock for the primary EMS server and prevent the standby EMS server from obtaining the lock.
Test 3: Maintain File Lock During Network Disconnection
1. Run two EMS servers as a fault-tolerant primary/standby pair with proper configuration and data store on NAS.
2. Unplug the network cable of the machine on which the primary EMS server is running.
Expected Result:
The standby EMS server on another machine will try to obtain the file lock after missing heartbeats. The correct behavior is for the storage file system to maintain the lock even if the primary EMS server is temporarily unreachable, so that the standby server fails to acquire the lock and remains in standby.
The file lock needs to be released when there is a hardware failure on the machine running the primary EMS server. However, with file-sharing protocols such as NFS, the NFS server may not release the lock until the failed NFS client node is rebooted, preventing the standby server from taking over for as long as the primary server machine is down. In general, NAS storage systems may have difficulty distinguishing between file-client unreachability caused by a network disconnection and file-client hardware failure. In the former case, the NAS device needs to maintain the file lock as in Test 3; in the latter, the lock needs to be released.
Test 4: Release File Lock in Case of Hardware Failure
1. Run two EMS servers as a fault-tolerant primary/standby pair with the proper configuration and data store on NAS.
2. Cause an ungraceful shutdown of the machine on which the primary EMS server is running.
Expected Result:
The standby EMS server on another machine acquires the lock and becomes active without the primary server machine being rebooted.
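Tests 3 and 4 differ only in whether the lock should stay held or become available, so it can help to record how long the storage takes to release it. The following sketch assumes POSIX `fcntl` locks and a stand-in lock holder running on the primary machine against a scratch file on the shared mount. `wait_for_lock` and its timeout values are illustrative assumptions, not EMS parameters.

```python
import errno
import fcntl
import os
import time


def wait_for_lock(path, timeout, interval=0.5):
    """Poll for an exclusive lock on `path` for up to `timeout` seconds.

    Returns the seconds elapsed when the lock was first obtained, or None
    if it was still held elsewhere when the timeout expired.
    """
    start = time.monotonic()
    while True:
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return time.monotonic() - start
        except OSError as exc:
            if exc.errno not in (errno.EACCES, errno.EAGAIN):
                raise  # not a lock conflict, so report it
        finally:
            os.close(fd)  # also releases the lock if it was obtained
        if time.monotonic() - start >= timeout:
            return None
        time.sleep(interval)
```

Run from the standby machine, a result of `None` over the observation window matches the expectation of Test 3, while a numeric result matches Test 4 and shows roughly how long the release took. Because the probe briefly takes the lock itself, it is meant for a scratch file rather than the live data store.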
When a NAS/SAN device reboots or fails over, it is required to provide an all-or-nothing recovery policy for all open file metadata, such as file handles and locks. If the storage system is configured to recover file handles and attributes after the reboot or failover, then it should also guarantee the recovery of the file locks. That is, the storage system should recover either all open file metadata, including file locks, or none of it, but never a partial subset.
Test 5: File Lock Upon NAS/SAN Failover or Reboot
1. Run an EMS server with FT settings and data store located on the NAS device.
2. Perform a reboot or failover on the NAS system.
3. If this EMS server encounters an I/O error after the NAS reboot and the storage system could not recover the open file properties, the test is complete. If the first EMS server still operates correctly, start a second EMS server on another machine with the "ft_active" parameter set to a non-existent EMS server URL.
Expected Result:
The second server will try to obtain the lock on the data store file upon startup. The expected behavior is that the storage and file-sharing system still maintains the file lock after the reboot, preventing the second (standby) server from becoming active while the first (primary) server is still running.
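The all-or-nothing policy behind Test 5 reduces to two observations: whether the open file handle on the first node still works after the storage event, and whether the lock is still held when probed from a second node. The sketch below encodes that decision. `handle_still_usable` and `classify_recovery` are hypothetical helpers, and using `os.fstat` and `os.fsync` as the handle check is an assumption about what a reasonable probe looks like, not a description of what EMS does.

```python
import os


def handle_still_usable(fd):
    """Return True if the open descriptor still works after the storage event."""
    try:
        os.fstat(fd)
        os.fsync(fd)
        return True
    except OSError:  # for example ESTALE or EIO after a failed recovery
        return False


def classify_recovery(handle_ok, lock_held):
    """Map the two observations of Test 5 onto the all-or-nothing policy."""
    if handle_ok and lock_held:
        return "all recovered: acceptable"
    if not handle_ok and not lock_held:
        return "nothing recovered: acceptable"
    return "partial recovery: violates the policy"
```

The case Test 5 is designed to catch is `classify_recovery(True, False)`: the first server keeps writing through a recovered handle while the lock has been lost, which would let a second server become active on the same data store.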