Products | Versions |
---|---|
TIBCO ActiveSpaces | - |
Not Applicable | - |
Using the "member timeout" and "cluster suspend threshold" features in AS 2.1.5HF3 to survive machines and processes being suspended for extended periods of time, as can happen in virtual environments.
By default, the "member timeout" and "cluster suspend threshold" settings make ActiveSpaces react to faults as they would happen in a physical environment. When a member does not hear from a peer node for more than 30 seconds, it considers the node "lost" and moves on, rather than waiting longer in case the node was only "disconnected" and will come back. In physical environments, it is rare for more than one machine to crash at exactly the same time. If replication is set to one (for example), declaring a single node "lost" is not a problem: replication means no data is lost, so the cluster might as well assume the node crashed, "move on" without it, and redistribute/re-replicate.
The situation can be different in virtualized environments. VMs can become "suspended" for extended periods of time (longer than 30 seconds) because of administrative operations at the VM layer, namely "snapshots" and "VMotion". Because there may be more than one VM per physical host, some operations can affect all VMs on a particular physical host.
What this means is that in VM deployments, a number of machines can suddenly disappear from the network at the same time and then reappear minutes later. Those machines and their processes are not restarted: the processes are still running and still have all of their data in memory. Because they are only paused for some period of time, you do not want to consider a host "lost" forever after just 30 seconds. And because more than one host can be suspended at the same time, surviving several nodes being considered lost at once would require increasing the replication degree to a point that may not be practical.
Because of this, new attributes were introduced in AS 2.1.4 and 2.1.5 that allow you to configure a Metaspace such that if some nodes disappear, instead of considering them lost after 30 seconds, the Metaspace can be "suspended" for some period of time while waiting for those nodes to come back. This means that if the nodes were indeed just suspended rather than crashed, no redistribution or re-replication is needed, and there is no need to increase the replication degree to a high value to cope with VMs being temporarily suspended.
This also means that while the Metaspace is suspended, operations on some keys and queries can be blocked until one of the following happens: the suspended node(s) come back (if they are Seeders), the member timeout is reached and the cluster decides to consider them actually lost and moves on, or an Administrator intervenes.
================================
The following explains how these new settings work.
- member_timeout: How long the cluster will remain "suspended" after one or more nodes/hosts are deemed "suspect" because they have not been heard from in over 30 seconds.
- cluster_suspend_threshold: When the member_timeout is reached and there are some "suspect" nodes/hosts, if the number of those suspect hosts/nodes is less than or equal to the threshold value, those members are then considered lost and the Metaspace takes its losses (if any) and resumes itself. If the number is higher than the threshold value, then the cluster remains suspended until either the suspect hosts/nodes come back, or the Administrator forces the cluster to consider them lost and to unsuspend.
When a cluster is "suspended", no members can join/leave the cluster and operations on some of the keys (if suspect nodes are seeders for those keys) can block until they time out or the cluster is unsuspended.
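The threshold rule described above can be sketched in a few lines of Python. This is only an illustration of the decision as this article describes it, not ActiveSpaces code: the function name `on_member_timeout` and the returned strings are hypothetical, and -1 is treated as "infinite" as in the examples later in this article.

```python
INFINITE = -1  # assumption: -1 means "no limit", as stated in this article


def on_member_timeout(suspect_count: int, cluster_suspend_threshold: int) -> str:
    """Hypothetical model of what the cluster does when member_timeout
    expires while some nodes/hosts are still "suspect"."""
    if suspect_count <= 0:
        # Nothing is suspect any more, so there is nothing to wait for.
        return "resume"
    if (cluster_suspend_threshold == INFINITE
            or suspect_count <= cluster_suspend_threshold):
        # Few enough suspects: take the losses (if any) and carry on.
        return "declare_lost_and_resume"
    # Too many suspects: wait for them to return or for an Administrator.
    return "stay_suspended"


print(on_member_timeout(1, 1))   # one suspect, threshold 1
print(on_member_timeout(2, 1))   # two suspects, threshold 1
```

With a threshold of 1, a single suspect is written off when the timer expires, while two suspects keep the cluster suspended.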
Extended version:
The first thing that AS monitors is that there is an active connection between all of the directly connected Metaspace members. There are two ways that AS can detect that a node is possibly down or being suspended: the TCP connection gets closed (typically when the process dies), or there is a heartbeat timeout with no data flowing. The heartbeat timeout occurs when the whole machine the process is on dies, when there is a network disconnection, or when the process (or the machine it is running on) gets suspended.
- Note that this initial "heartbeat timeout" is around 30 seconds and is currently not adjustable.
Once the node is considered "suspect", i.e. potentially down (it may have crashed, or it may have been suspended and will come back later), the Metaspace goes into "suspended" mode. Reaching that point can take between 0 and 30 seconds depending on the scenario.
A "suspended" Metaspace means:
- Operations on the space still go through and are still being serviced, with the exception described below.
- If the "suspect" node(s) is a Seeder, operations that it should service will be blocked and could eventually time out, depending on the operation timeout. This means that some put/get/take operations on the space will go through, but others may be blocked, and that queries are most likely going to block.
- No new directly connected processes can join or leave the Metaspace.
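To make the blocking behaviour concrete, here is a minimal Python sketch. It assumes a hypothetical, simplified view in which each key is serviced by exactly one Seeder; the `key_to_seeder` mapping, the member names, and both function names are illustrative and are not part of the ActiveSpaces API.

```python
from typing import Dict, Set


def blocked_keys(key_to_seeder: Dict[str, str], suspects: Set[str]) -> Set[str]:
    """Keys whose put/get/take operations would block because the
    Seeder responsible for them is currently "suspect"."""
    return {key for key, seeder in key_to_seeder.items() if seeder in suspects}


def query_may_block(key_to_seeder: Dict[str, str], suspects: Set[str]) -> bool:
    """A query has to consult every Seeder, so in this simplified model
    a single suspect Seeder is enough to stall it."""
    return any(seeder in suspects for seeder in key_to_seeder.values())


# Hypothetical distribution of three keys over two Seeders.
mapping = {"k1": "host1.agent1", "k2": "host2.agent1", "k3": "host1.agent1"}
suspects = {"host1.agent1"}

print(sorted(blocked_keys(mapping, suspects)))  # keys owned by the suspect Seeder
print(query_may_block(mapping, suspects))
```

Operations on "k2" still go through because its Seeder is reachable, while operations on "k1" and "k3" wait until the Metaspace resumes or the operation timeout is reached.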
Things that can take a Metaspace out of "suspension" are:
- A "suspect" node(s) with which communication was lost comes back and reconnects to the Metaspace (for example, after the VM suspend or VMotion completes). In that case, there is no data loss and no redistribution, and operations that were blocked are automatically resumed unless they already timed out because the operation timeout was reached.
- A node with the same member name as the "suspect" but a different instance number joins the space. In this case, we know that the "suspect" node was not suspended but was instead restarted, therefore the "suspect" node is processed as a member leaving suddenly or crashing. This means there could be redistribution or data loss and the new instance of the process can rejoin the Metaspace.
- The administrative command "cluster resume" is issued to forcefully resume the Metaspace, in which case the "suspect" members are considered lost, which could trigger redistribution and data loss.
Once the Metaspace goes into suspended mode because of loss of contact with a member node, the "member_timeout" timer gets started. If the Metaspace is still suspended when the "member_timeout" timer expires, the "cluster_suspend_threshold" value is evaluated:
- If the number of "suspect" nodes is greater than the cluster_suspend_threshold, the Metaspace remains suspended until either enough of the "suspect" nodes rejoin the cluster or the cluster is resumed administratively.
- If the number of "suspect" node(s) is less than or equal to the cluster_suspend_threshold value, then the cluster processes those suspect nodes as lost nodes and resumes/unsuspends itself.
Important note: If you are using "host aware distribution" then you should replace "node" above with "host". For example, if you lose two agents on the same host (e.g., “host1.agent1” and “host1.agent2”), it is like losing only one "node" with regard to the cluster_suspend_threshold.
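The host-aware counting rule can be illustrated with a short Python sketch. It assumes member names follow the "host.agent" pattern used in the example above (the part before the first dot is taken as the host); the function name `effective_suspect_count` is hypothetical and is not an ActiveSpaces call.

```python
from typing import Iterable


def effective_suspect_count(suspect_members: Iterable[str], host_aware: bool) -> int:
    """Number of "suspects" to compare with cluster_suspend_threshold.

    Without host aware distribution every member counts on its own.
    With it, members sharing a host (the text before the first ".",
    an assumption based on names such as "host1.agent1") count once.
    """
    members = set(suspect_members)
    if not host_aware:
        return len(members)
    return len({name.split(".", 1)[0] for name in members})


lost = ["host1.agent1", "host1.agent2"]
print(effective_suspect_count(lost, host_aware=True))   # counted per host
print(effective_suspect_count(lost, host_aware=False))  # counted per member
```

With host aware distribution the two agents on "host1" count as a single suspect, so a cluster_suspend_threshold of 1 would still let the cluster resume.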
Examples:
- If cluster_suspend_threshold is set to -1 (infinite) and member_timeout to 60: between 0 and 30 seconds after contact is lost, the node is considered "suspect" and the Metaspace becomes "suspended". If the node gets un-suspended within the next 60 seconds, operations resume without any impact. If the node was restarted rather than suspended, or once the 60 seconds have elapsed, the "suspect" node is considered crashed and the Metaspace gets un-suspended.
- If cluster_suspend_threshold is set to 1 and member_timeout to 0, and you suspend a node, after about 30 seconds the node will be considered suspect. Since member_timeout is 0, the cluster_suspend_threshold goes into effect immediately. Since only one node is suspect, which is less than or equal to the threshold value, the node is considered lost (this may trigger redistribution or data loss) and the cluster never gets suspended.
- If cluster_suspend_threshold is set to 1 and member_timeout to 60, and you suspend a node, after about 30 seconds the node will be considered suspect and the cluster will become "suspended" for the next 60 seconds. After 60 seconds, the cluster_suspend_threshold goes into effect, and since only one node is suspect (which is less than or equal to the threshold value) the node is considered lost (this may trigger redistribution or data loss) and the cluster resumes.
- If you set cluster_suspend_threshold to 0 and member_timeout to 0, and you suspend a single node, after about 30 seconds the node will be considered suspect, and since member_timeout is 0 the cluster_suspend_threshold goes into effect immediately. Since one node is suspect (greater than the threshold value), the cluster gets suspended and stays suspended until either the node rejoins the cluster, i.e. is un-suspended, or the administrator uses the "cluster resume" administrative command.
- If you set cluster_suspend_threshold to 1 and member_timeout to 0 while using host aware distribution, and you suspend members called “A.1” and “A.2”, the cluster will not become suspended after the two members are marked as "suspect". Although you suspended two members, they are both part of the same host, so only one host is affected. Since that is less than or equal to the threshold value, the cluster does not get suspended.
The default value of cluster_suspend_threshold is -1, which means that when the member_timeout value is reached, the cluster automatically declares the suspect members as lost and unsuspends, no matter how many nodes are lost.
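The examples above can be replayed with a small Python model. This is a sketch under stated assumptions rather than a description of ActiveSpaces internals: times are in seconds as in the examples, they are measured from the moment the nodes become "suspect", and the function name `outcome`, its parameters, and the result labels are all hypothetical.

```python
from typing import Optional, Tuple


def outcome(threshold: int, member_timeout: int, suspects: int,
            back_after: Optional[int] = None) -> Tuple[str, Optional[int]]:
    """Return (result, seconds the Metaspace spent suspended).

    back_after is how long after being marked "suspect" the nodes
    reconnect, or None if they never do. A threshold of -1 is treated
    as infinite, as in the first example.
    """
    # The nodes return before member_timeout expires: nothing is lost.
    if back_after is not None and back_after < member_timeout:
        return ("resumed_no_loss", back_after)
    # member_timeout expired: apply cluster_suspend_threshold.
    if threshold == -1 or suspects <= threshold:
        return ("suspects_declared_lost", member_timeout)
    # Too many suspects: stay suspended until they return...
    if back_after is not None:
        return ("resumed_no_loss", back_after)
    # ...or until an Administrator issues "cluster resume".
    return ("suspended_until_admin_resume", None)


print(outcome(-1, 60, 1, back_after=45))  # node comes back within member_timeout
print(outcome(1, 0, 1))                   # threshold applies immediately
print(outcome(1, 60, 1))                  # suspended for 60 seconds, then node lost
print(outcome(0, 0, 1))                   # stays suspended until someone intervenes
```

When host aware distribution is in use, `suspects` should be the number of affected hosts rather than the number of members.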