Recovery of system space, such as MasterId or user space, fails with "SYS_ERROR Not_Enough_Hosts" when restarting Cache cluster without any changes.

Article ID: KB0086989

Products: TIBCO BusinessEvents Enterprise Edition
Versions: Not Applicable

Description

Resolution:
When restarting the cluster, the following exception is reported:

Caused by: com.tibco.as.space.ASException: SYS_ERROR (recovery_failed - Recovery for space repl-unlimited-<Cluster_Name>--MasterId failed due to not_enough_hosts, current = 1, hosts/seeders at shutdown = 2
Current = <Agent1_Name>
Seeders at shutdown = <Agent1_Name>,<Agent2_Name>
To force recovery, issue recover with dataloss option)
at Native.waitForLoading(SpaceStateSpace.cpp:698)
at Native.waitForLoading(SpaceStateSpace.cpp:699)
at Native.recoverSpace(RecoveryManager.cpp:145)
at Native.recoverSpace(SpaceManager.cpp:1577)
at Native.recoverSpace(Metaspace.cpp:1155)
at Native.API_Metaspace_RecoverSpace(ApiMetaspace.cpp:484)
at Native.Java_com_tibco_as_space_impl_NativeImpl_metaspaceRecoverSpace(MetaspaceMessage.cpp:202)
at com.tibco.as.space.impl.NativeImpl.metaspaceRecoverSpace(Native Method)
at com.tibco.as.space.impl.ASMetaspace.recoverSpace(ASMetaspace.java:877)
at com.tibco.cep.as.kit.map.SpaceMapCreator.performSpaceRecovery(SourceFile:231)
at com.tibco.cep.runtime.service.dao.impl.tibas.ASControlDao.waitUntilReady(SourceFile:367)
at com.tibco.cep.runtime.service.dao.impl.tibas.ASControlDao.start(SourceFile:358)
... 8 more

This error indicates that the cluster configuration at startup did not match the cluster configuration at the last shutdown, or that not enough engines were started to complete recovery without data loss. Matching the cluster configuration exactly is not mandatory for recovery to succeed, especially when recovering system spaces, but there are certain cases where this mismatch causes the error.

Error seen on System Space.
============================
When you see the "NOT_ENOUGH_HOSTS" error, examine the seeders listed under "Current" and "Seeders at shutdown" in the message. These values indicate which agents must be running for recovery to start successfully.

Example: You have two cache agents and one Inference agent in the cluster on two machines, and replication is set to 0. Because system spaces are involved, the Inference agent also registers as a seeder. When you stop the cluster, you stop all cache agents together and then stop the Inference agent a little later, so it continues to run after the cache agents are down. When you restart the cluster, you start only one cache agent. With this startup sequence, the not_enough_hosts error is sometimes seen. To resolve the issue, either match the startup sequence to the shutdown sequence or start the agents listed under "Seeders at shutdown" in the error message.


To resolve the issue, do one of the following:

- Start all the cache agents, or start the agents listed in the error message.
- Start with DATA_LOSS recovery mode. To do that, set the property "be.engine.cluster.recovery.distributed.strategy" at the cluster level in the CDD and set its value to "data_loss".
- Refer to the BE 5.x Release Notes for all the recovery options and detailed descriptions.
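As an illustration, a cluster-level property entry of this kind might look as follows in the CDD. This is only a sketch of the general property syntax; the exact element names and nesting in your CDD file may differ, so verify against your own project before applying it.

```xml
<!-- Illustrative CDD fragment: forces distributed recovery to proceed
     even when not all seeders from the last shutdown are present. -->
<property name="be.engine.cluster.recovery.distributed.strategy" value="data_loss"/>
```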

Error seen on User Space.
==========================

If the not_enough_hosts error is seen on user spaces, the cause is more straightforward. Consider a scenario with three cache agents and one Inference agent, and replication set to 1. In this case, each cache agent hosts its own copy of each entity plus one replicated copy. With replication set to 1, some entities are not present on all three cache agents, so there is a chance of data loss if too few agents are started. By default, BE cache agents start with a NO_DATA_LOSS recovery policy. If you stop all three cache agents together and start only one, the not_enough_hosts error is reported because there are not enough seeders to recover the space without data loss. This is expected behavior.

To resolve it, do one of the following:
- Match the seeder count at startup to the seeder count at shutdown as indicated by the error.
- Start at least two cache agents. With replication set to 1, starting two caches should satisfy the no_data_loss policy.
- Start all three caches together.
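The replication arithmetic behind these options can be sketched as a quick check. This is a minimal illustration of the counting argument only (not a TIBCO API); the function name and inputs are assumptions for the example.

```python
def min_agents_for_no_data_loss(total_cache_agents: int, replication: int) -> int:
    """Each entry lives on one seeder plus `replication` replica agents,
    so any (total - replication) agents together still hold at least one
    copy of every entry. Starting fewer than that risks data loss."""
    return max(1, total_cache_agents - replication)

# Scenario from the article: three cache agents, replication = 1.
# Two agents are enough to recover without data loss; one is not.
print(min_agents_for_no_data_loss(3, 1))
```

This also explains the general recommendation below: with three cache agents and replication set to 2, even a single agent is enough to recover without data loss.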

General Information:
======================
- To avoid issues of this sort, make sure your cluster is configured with adequate replication. This provides the flexibility to start any number of agents. For instance, in a cluster with three cache agents, setting replication to 2 lets you start one, two, or three cache agents after shutdown without any data loss.

- You can also set the quorum to a larger value so that the cache quorum must be maintained. This ensures more agents participate in recovery, reducing the recovery time.

Issue/Introduction

Recovery of system space, such as MasterId or user space, fails with "SYS_ERROR Not_Enough_Hosts" when restarting Cache cluster without any changes.

Additional Information

SR 691220, SR 681447