How to approach a rolling restart of an RVCMQ distributed queue without losing messages.

Products	Versions
TIBCO Rendezvous	-
Not Applicable	-

Description:
= = = = = = =

1). If you use RVCMQ with reliable messages, messages may be lost. If you cannot tolerate message loss, use RVCM.

2). Reliable messages could be lost when using RVCMQ:

a). Reliable messages could be lost before they reach the RVCMQ scheduler.

b). If a worker exits and RVCM is not in place, the scheduler does not reassign the uncompleted tasks. Messages that the worker accepted but did not complete may be lost.

c). If the scheduler exits or loses network communication and RVCM is not in place, another member replaces it as the active scheduler, but messages that had not yet been accepted and messages sent while no scheduler was active could be lost.

Solution:
= = = = = =

1). Use RVCM messages.

2). To reduce the possibility of data loss with RVCMQ during a rolling restart, the following is suggested:

a). Rolling restarts should be limited to off-hours or a scheduled shutdown when data loss is not going to cause operational problems.

b). Make your applications more intelligent.

The member acting as the scheduler can learn its role from the QUEUE.SCHEDULER.ACTIVE advisory. For example, with a more sophisticated shutdown signaling mechanism, the scheduler could ignore the shutdown request.

Implement your own protocol using RV messaging to coordinate the shutdown, that is, make sure data stops flowing and outstanding work is complete before applications begin exiting.

Make sure to call tibrvcmTransport_DestroyEx (C API) or TibrvCmTransport.destroyEx() (Java API) before shutting down the scheduler. This call should allow a scheduler to be removed gracefully from a distributed queue (DQ). Refer to the corresponding API reference for more details.
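The application-level checks described in b) can be sketched as follows. This is a minimal illustration in Python, not Rendezvous API code: the class MemberShutdownState, its methods, and the policy that the scheduler ignores a shutdown request while other members are still running are all assumptions made for this example. A real application would drive these methods from its own advisory listener and its own RV coordination messages.

```python
from dataclasses import dataclass, field


@dataclass
class MemberShutdownState:
    """Hypothetical per-member bookkeeping for a coordinated shutdown."""

    is_scheduler: bool = False
    accepting_work: bool = True
    shutdown_requested: bool = False
    outstanding_tasks: set = field(default_factory=set)

    def on_scheduler_active_advisory(self) -> None:
        # Call this from the listener that receives QUEUE.SCHEDULER.ACTIVE.
        self.is_scheduler = True

    def on_task_accepted(self, task_id: str) -> bool:
        # Refuse new work once a shutdown request has been honored.
        if not self.accepting_work:
            return False
        self.outstanding_tasks.add(task_id)
        return True

    def on_task_completed(self, task_id: str) -> None:
        self.outstanding_tasks.discard(task_id)

    def request_shutdown(self, other_members_running: bool) -> bool:
        # Assumed policy: the scheduler ignores the request while other
        # members are still running, so it is the last member to restart.
        if self.is_scheduler and other_members_running:
            return False
        self.shutdown_requested = True
        self.accepting_work = False
        return True

    def may_exit(self) -> bool:
        # Safe to exit only after the request was honored and work drained.
        return self.shutdown_requested and not self.outstanding_tasks
```

In this sketch a worker that has honored a shutdown request keeps running until may_exit() returns True, and only then destroys its transport and exits. The scheduler would call the destroyEx variant named above at the same point.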

c). The scheduler heartbeatInterval and activationInterval values are used for activating new schedulers. That is, when the time since the last heartbeat from the scheduler reaches the activation interval, Rendezvous fault tolerance software instructs the ranking inactive member to activate as the new scheduler. In most cases, the workers can detect that the scheduler has exited sooner than that, from other RV protocol messages.

You could use a shorter heartbeatInterval/activationInterval to reduce the possibility of message loss. However, you should test for the proper value in your target environment, because a low activationInterval may result in duplicate schedulers and hence duplicate messages. In general, we do not suggest that customers modify this value to address data loss. We recommend an activation interval of no less than 3 seconds, even though Rendezvous fault tolerance software accepts lower values. If your application is distributed across a WAN, we recommend an activation interval of no less than 10 seconds.
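The recommended lower bounds can be checked before a configuration is deployed. The following Python sketch is an assumption-based helper, not part of Rendezvous: the function name check_scheduler_intervals and its parameters are hypothetical. The 3-second and 10-second thresholds come from the recommendation above. The heartbeat comparison is a consistency check that follows from the description in c), since a scheduler that sends heartbeats less often than the activation interval would repeatedly appear to have failed.

```python
def check_scheduler_intervals(heartbeat_s, activation_s, over_wan=False):
    """Return a list of warnings for candidate scheduler intervals (seconds)."""
    warnings = []

    # Recommended floor: 3 seconds in general, 10 seconds across a WAN.
    minimum_activation_s = 10.0 if over_wan else 3.0
    if activation_s < minimum_activation_s:
        warnings.append(
            f"activation interval {activation_s}s is below the recommended "
            f"minimum of {minimum_activation_s}s; duplicate schedulers and "
            "duplicate messages become more likely"
        )

    # The heartbeat must arrive before the activation interval elapses.
    if heartbeat_s >= activation_s:
        warnings.append(
            f"heartbeat interval {heartbeat_s}s is not shorter than the "
            f"activation interval {activation_s}s"
        )

    return warnings
```

An empty list means the values respect the recommendation. It does not replace testing in the target environment.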
