Solving error "Found unpartitioned instance...when installing mapper"

Article ID: KB0078128

Products: TIBCO Streaming
Versions: 10

Description

When two nodes that share a Query Table in Transactional Memory are started and the application loads data into the table, one or more nodes may occasionally fail at startup with an error like the following:
A.X/logs/default-engine-for-com.example.MyApp.log:2019-05-07 11:16:18.049000-0400 [18480:EventFlow Fragment] 
ERROR com.tibco.ep.sb.rt.launcher.EventFlowFragment: java.lang.Exception: Fragment com.example.MyApp terminated 
with status -1: Aborting transaction default: com.kabira.platform.ResourceUnavailableException: Found 
unpartitioned instance 'default.QueryTable:992 (1544710558:3463062280:3232918196964:992)' when installing mapper 
for type default.QueryTable: Found unpartitioned instance 'default.QueryTable:992 
(1544710558:3463062280:3232918196964:992)' when installing mapper for type default.QueryTable

The nodes which did not report the error may or may not be functional. Subsequent epadmin commands 'stop node', 'terminate node', 'remove node', or 'kill node' against any node in the cluster may hang and never complete. If an epadmin command hangs, the node processes 'dtm-engine', 'swcoord', and 'swcoordadmin' must be terminated individually using operating-system administrative commands.
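
For example, on a Linux host the leftover node processes can be located and terminated with standard operating-system tools. This is a sketch only; take the actual process IDs from the ps output (the placeholders below are illustrative):

$ ps -ef | egrep 'dtm-engine|swcoord|swcoordadmin'
$ kill -9 <dtm-engine-pid> <swcoord-pid> <swcoordadmin-pid>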

How can we avoid this error?

Issue/Introduction

Administrative guidance

Resolution

This is a known issue caused by a race condition: a node begins executing the EventFlow before the shared query tables in transactional memory are fully connected to the other nodes, often because those nodes are themselves still starting up.

Starting all the nodes at the same time using the command "epadmin servicename={cluster-name-only} start node" may trigger this problem.

Avoid this problem by starting each node to completion, waiting for the 'epadmin start node' command to report "Node started" before attempting to start another node, like so:
$ epadmin servicename=A.X start node
[A.X]   Starting node
[A.X]           Engine application::default-engine-for-com.example.MyApp started
[A.X]           Loading node configuration
[A.X]           Auditing node security
[A.X]           Host name mysystem
[A.X]           Administration port is 55139
[A.X]           Discovery Service running on port 54321
[A.X]           Service name is A.X
[A.X]   Node started
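
If node startup is scripted, the same discipline can be applied by starting the nodes strictly one at a time. Below is a minimal sketch assuming a two-node cluster with nodes A.X and B.X (the node names are illustrative); it relies on 'epadmin start node' not returning until the node reports "Node started", as shown in the transcript above:

#!/bin/sh
# Start each node to completion before starting the next.
# 'epadmin start node' blocks until the node is started (or fails),
# so a sequential loop guarantees each node is fully up before the
# next one begins starting.
for node in A.X B.X; do
    epadmin servicename=$node start node || exit 1
done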

This issue is most often triggered when the application in the node starts writing to the shared Query Table immediately after startup, for example when loading historical data using a CSV File Reader adapter configured to start reading immediately, or table writes triggered by a Once operator. These activities should be delayed until the node has joined the existing cluster, as sketched below.
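
One way to enforce that delay from a startup script is to poll cluster membership before triggering the load. The following is a sketch only: it assumes the 'epadmin display cluster' command is available in your version, that its output reports the remote node's state (the exact text matched by grep may differ on your release), and that a hypothetical load-history.sh script triggers the table writes:

# Wait until node B.X reports its remote node as connected,
# then kick off the historical data load.
until epadmin servicename=B.X display cluster | grep -q 'Up'; do
    sleep 1
done
./load-history.sh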

You may also see the warning:
[A.X] default-engine-for-com.example.MyApp:2019-05-07 14:50:11.000000-0400 [10624] WARN  
com.tibco.ep.dtm.highavailability.distribution: (csmarshal.cpp:2175) Request from remote node B.X failed. 
Version mismatch detected for partition default-cluster-wide-availability-zone_VP_26, active version 
(v13382222546198132), object version (v13382209213472372), for operation 'create' on object 
'default.QueryTable:459 (1544710558:3463062280:3266951585314:459)', concurrent migration detected.

This indicates that there were concurrent changes to reconcile as the new node's table object was created. The platform resolves this automatically without error; the message is logged to assist with troubleshooting other errors (if any).