| Products | Versions |
| --- | --- |
| TIBCO Silver Fabric | Not Applicable |
Because of the large number of files that must be shared among Silver Fabric Brokers and Engines, it is tempting to use a shared file system (such as NFS) to save the space and time associated with downloading the files locally to every Engine. However, NFS can be unreliable on congested networks, and applications like the Silver Fabric Broker and its Engines have not been implemented to handle the complexities of recovering from disk I/O failures caused by underlying transient network problems. The problems are compounded when files that were present a moment ago suddenly disappear, only to reappear seconds later, by which time the damage is done: logs fill with error messages or, worse, components are deactivated.
Physical disks tend to either work or fail catastrophically. Network-based disks are different: they can fail temporarily due to a transient network issue and then suddenly work again. When a physical disk fails, the machine in which it resides needs repair, and if Silver Fabric is running on that machine it will be down along with every other application there; everyone will know it is broken and in need of manual intervention. Because of this all-or-nothing reliability profile, applications do not normally include extra logic to retry failed I/O operations; that responsibility is delegated to the disk hardware and its software driver. With a network disk, the responsibility is delegated to the network file system driver instead. Given the inherent nature of networks (packets are lost, connections are dropped, cables are mistakenly cut by heavy machinery in the street), the network file system driver is far more likely to bubble errors up to the application level, in this case the Silver Fabric Broker or Engine. In theory a physical disk driver can bubble errors up to the application layer too, but in practice it almost never does, and when it does it is usually the last gasp before the disk fails catastrophically and you find yourself installing a new disk and restoring from backup.
As stated, an individual I/O operation on a physical disk can, in theory, fail all the way back to the application level, allowing the application to recover, possibly by retrying the operation and hoping for a better outcome. In practice this is not done: if the underlying system call cannot succeed despite all the retry logic in the disk controller and the layers of driver software, then retrying from the application is pointless; such is the nature of physical disks. With shared network disk technology such as NFS this is not so, because a network read or write that timed out a second ago, due to packets lost or delayed by a transient network condition, may succeed if retried. The NFS driver is largely transparent, exposing network endpoints to the application as if they were local physical disks, but because network transmissions are not as fast, consistent, or reliable as physical disk electronics, the NFS driver occasionally returns potentially recoverable I/O errors to the application. The issue is that an application not implemented with this in mind is likely to treat any such I/O error as catastrophic or, in the worst case, misinterpret its result.
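To make the distinction concrete, the sketch below shows the kind of bounded retry loop an NFS-aware application would need around its file I/O. This is illustrative only, not something Silver Fabric implements; the class name, file path, attempt count, and backoff are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Minimal sketch of NFS-tolerant I/O: retry on the assumption that an
// error caused by a transient network condition may clear on its own.
// Silver Fabric contains no such logic, which is the point of this article.
public final class TransientIoRetry {

    static byte[] readWithRetry(Path file, int maxAttempts, long backoffMillis)
            throws IOException, InterruptedException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return Files.readAllBytes(file);
            } catch (IOException e) {
                last = e; // on NFS this may be recoverable; on local disk it rarely is
                Thread.sleep(backoffMillis); // back off before retrying
            }
        }
        throw last; // all attempts failed; treat as a genuine failure
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical shared path on an NFS mount.
        byte[] data = readWithRetry(Path.of("/mnt/nfs/shared/config.properties"), 5, 2000);
        System.out.println("read " + data.length + " bytes");
    }
}
```

Even this simple wrapper has to decide how many attempts are enough and what to do when they are exhausted; wrapping every I/O path in a large application this way is a significant implementation burden, which is why most software, Silver Fabric included, simply does not do it.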
Due to this behavior of NFS file systems, and because the Silver Fabric software has not been implemented to counter NFS unreliability, we cannot recommend its use with Silver Fabric. If you choose to disregard this recommendation, you should at least consider the implications of transient NFS failures and plan and/or tune your software and systems accordingly to minimize the impact.
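One such tuning measure on Linux clients is to mount the share "hard", so the kernel retries indefinitely rather than surfacing I/O errors to the application. The server name, export path, and timing values below are illustrative and must be tuned for your environment; note that a hard mount can cause processes to hang for as long as the server is unreachable.

```
# /etc/fstab — illustrative NFS mount tuned to mask transient failures.
# "hard" makes the kernel retry indefinitely rather than return I/O errors;
# timeo (in deciseconds) and retrans control the retry timing.
fileserver:/export/silverfabric  /mnt/silverfabric  nfs  hard,timeo=600,retrans=5  0  0
```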