The main resources to examine are the node logs.
How to Get a Snapshot of Node Logs
For a stopped node with unavailable or stopped cluster services:

epadmin [debug] create snapshot destination=. includeclasspaths=true installpath=.\A.X

For a failing node with available cluster services:

epadmin [debug] servicename=A.X create snapshot destination=.

This produces a ZIP file of many node log files.
The "debug" switch is optional. The destination must be to a file-system path local to where 'epadmin' is being run, but if cluster services are running, the node may be remote on the network.
It is best to run the 'create snapshot' command under the same user account that runs the node, so that most permission problems are avoided. This user also needs write permission to the specified snapshot output destination.
If the 'create snapshot' command does not produce a useful snapshot, run the command again with the "debug" option to see whether an 'epadmin' error is preventing the snapshot from being created. The error messages should be clear enough to identify a network, memory, user-permission, or disk-space problem, which should be corrected first.
If you contact Support, please provide the node snapshot ZIP file.
How To Use a Snapshot ZIP Archive
To examine a snapshot ZIP archive:
1. Unzip it to a new folder location.
2. Find in all files the lines containing the word ERROR or WARN (a script that automates steps 2 through 4 is sketched after this list).
3. Find the timestamp of the word "Listening", since this indicates the application has started.
4. Find the timestamp of the words "shut down" or "shutdown", since this indicates when shut-down started. Error messages after shut-down starts are typically not part of the underlying problem, but side-effects of it.
5. See what ERROR and WARN messages occurred first in the session, as these may indicate a condition which is fundamental to a more severe error that happens later.
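The searches in steps 2 through 4 can be scripted. Below is a minimal sketch in Python; the folder name, the "*.log" file pattern, and the marker list are illustrative assumptions, not part of the product.

```python
# Minimal sketch: scan an unzipped snapshot folder for the markers described
# in steps 2-4 above. Folder name and "*.log" pattern are assumptions; some
# snapshot files may use other extensions.
from pathlib import Path

SNAPSHOT_DIR = Path("./snapshot_unzipped")   # hypothetical folder from step 1
MARKERS = ("ERROR", "WARN", "Listening", "shut down", "shutdown")

for logfile in sorted(SNAPSHOT_DIR.rglob("*.log")):
    for lineno, line in enumerate(
            logfile.read_text(errors="replace").splitlines(), start=1):
        if any(marker in line for marker in MARKERS):
            print(f"{logfile.name}:{lineno}: {line.strip()}")
```

Reading the output in timestamp order makes it easier to apply step 5: the earliest ERROR and WARN lines are usually the most informative.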
The Main Log Files
logs/default-engine-for-com.example.project.log
- This is the main console log for the server. Examine this file first.
logs/System_swcoordadmin.log
- This reports the startup of the node services and typically contains no further information after startup.
logs/System_administration.log
- This reports ongoing transactional memory partition handling throughout the runtime of the node.
logs/audit.log
- This reports the configurations used and changed throughout the runtime of the node.
logs/deadlock.log
- This file is normally empty. It reports transactional memory deadlocks and their resolution. These are not JVM thread deadlocks, but distributed transaction deadlocks which may occur as nodes enter and leave the cluster (forcing partition changes) and are resolved after a timeout of a few minutes at most.
The other log files are useful if a platform-internal failure is identified in one of the main log files and additional information is needed by Engineering.
How to Understand an Error Message in the Console Log
Within the default-engine-for-*.log, an error will look similar to this:
(1)2018-10-01 10:00:00.155000-0400 (2)[29577:OperatorThread(BusCommon.consumeSensorRawValue.KafkaConsumer)] ERROR (3)consumeSensorRawValue: (4) Action: Convert Object: raw_equipment_sensorvalue_tempsensor_v1 Message: Error converting message for topic 'raw_equipment_sensorvalue_tempsensor_v1': Null value for field process, keys cannot have null values
(5)com.kabira.platform.KeyNullValueError: (6)Null value for field process, keys cannot have null values
(7) at com.kabira.platform.KeyFieldValueRangeList.add(KeyFieldValueRangeList.java:48)
at com.streambase.sb.qt.dtm.runtime.DTMIndex.addRangeField(DTMIndex.java:379)
at com.streambase.sb.qt.dtm.runtime.DTMIndex._rangeIterator(DTMIndex.java:439)
at com.streambase.sb.qt.dtm.runtime.DTMIndex.rangeIterator(DTMIndex.java:481)
at com.streambase.sb.qt.dtm.runtime.DTMRangeCursor.createManagedObjectIterator(DTMRangeCursor.java:50)
at com.streambase.sb.qt.dtm.runtime.DTMCursor.next(DTMCursor.java:98)
at com.streambase.sb.qt.runtime.SynchronizedCursor.next(SynchronizedCursor.java:31)
at com.streambase.sb.qt.runtime.ChainingCursor.next(ChainingCursor.java:74)
at com.streambase.sb.qt.runtime.PredicateCursorBase.next(PredicateCursorBase.java:46)
at com.streambase.sb.qt.runtime.ChainingCursor.next(ChainingCursor.java:64)
at com.streambase.sb.qt.runtime.LimitCursor.next(LimitCursor.java:39)
at com.streambase.sb.qt.runtime.ChainingCursor.next(ChainingCursor.java:64)
at com.streambase.sb.qt.runtime.MaterializableCursor.next(MaterializableCursor.java:76)
(8) at cm.__Bus_Common_.com.example.dataprocessor.common.feed.BusCommon.op__get_Current_Batch__0(Unknown Source)
at cm.__Bus_Common_.com.example.dataprocessor.common.feed.BusCommon.op__get_Current_Process__0(Unknown Source)
at cm.__Bus_Common_.com.example.dataprocessor.common.feed.BusCommon.op___Clean_Kafka__0(Unknown Source)
at cm.__Bus_Common_.com.example.dataprocessor.common.feed.BusCommon.op___Split__0(Unknown Source)
at cm.__Bus_Common_.com.example.dataprocessor.common.feed.BusCommon$oplocal__java_1consume_Sensor_Raw_Value_1care_Taker_._handle_Outputs_10_(Unknown Source)
at cm.__Bus_Common_.com.example.dataprocessor.common.feed.BusCommon$oplocal__java_1consume_Sensor_Raw_Value_1care_Taker_.sendOutputImplOther(Unknown Source)
...
Legend:
(1) The timestamp of the error
(2) The process-ID and thread-name
(3) The name of the main operator of the thread (may not be the operator that failed)
(4) The messages associated with the error
(5) The Java class of the error
(6) The specific error message associated with this class
(7) The beginning of the stack trace. The method at the top of the trace is typically part of the error handler, not where the error originated.
Note: Classes which start with com.kabira, com.tibco, and com.streambase are platform-internal classes.
(8) The classes and methods associated with specific EventFlow operators invoked in the thread leading up to the error.
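When many such errors need to be reviewed, the numbered parts above can be pulled apart with a regular expression. This is a minimal sketch that assumes the single-line layout shown in the sample above; the pattern and field names are illustrative only.

```python
# Minimal sketch: split a console-log error line into the numbered parts of
# the legend above. The regex is an assumption based only on the sample
# line shown; real log layouts may differ.
import re

SAMPLE = ("2018-10-01 10:00:00.155000-0400 "
          "[29577:OperatorThread(BusCommon.consumeSensorRawValue.KafkaConsumer)] "
          "ERROR consumeSensorRawValue: Action: Convert Object: "
          "raw_equipment_sensorvalue_tempsensor_v1 Message: Error converting message")
          # (abridged from the sample above)

LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} [\d:.+-]+)\s+"   # (1) timestamp
    r"\[(?P<pid>\d+):(?P<thread>[^\]]+)\]\s+"           # (2) process-ID and thread name
    r"(?P<level>ERROR|WARN|INFO|DEBUG|TRACE)\s+"        #     severity
    r"(?P<operator>[^:]+):\s*"                          # (3) main operator of the thread
    r"(?P<message>.*)$"                                 # (4) message text
)

match = LOG_LINE.match(SAMPLE)
if match:
    for part, value in match.groupdict().items():
        print(f"{part:>9}: {value}")
```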
The Parts of an EventFlow Class Name
Given an EventFlow class name:
cm.__Bus_Common_.com.example.dataprocessor.common.feed.BusCommon.op__get_Current_Process__0

The following are the parts:

Class Name Parts

| Part Name | Example |
|---|---|
| Project | cm.__Bus_Common_ |
| Package | com.example.dataprocessor |
| Main module | common |
| Child module | feed |
| Child module | BusCommon |
| Operator | op__get_Current_Process__ |
| Instance | 0 |
The actual operator name is "getCurrentProcess".
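If many such names need to be translated back to operator names, the de-mangling can be approximated in a few lines. This is a guess based only on the example above (strip the "op__" prefix, the trailing instance number, and the underscores); it is not a documented naming contract.

```python
# Minimal sketch of recovering a readable operator name from the mangled
# form shown above. The rules are inferred from this one example and may
# not hold for every generated name.
import re

def operator_name(mangled: str) -> str:
    core = re.sub(r"^op__", "", mangled)   # drop the "op__" prefix
    core = re.sub(r"__\d*$", "", core)     # drop the trailing "__" and instance number
    return core.replace("_", "")           # join the words back together

print(operator_name("op__get_Current_Process__0"))   # -> getCurrentProcess
```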
General Troubleshooting Process
After identifying the area of the system, product, or project associated with the error, these steps often help confirm what caused the problem:
A) Exclude that area from running and attempt to reproduce the problem.
B) Instrument that area with additional logging to confirm its participation in the problem.
C) Choose different settings for the features used in that area and see if they have a material effect on the problem.
D) Change the resource (if any) identified in the error messages and monitor for behavior changes.
Additional notes:
- Most operator errors cause the operator to drop only the one tuple it was working on.
- Connectivity errors are usually resolved by a reconnect event.
- Platform internal errors may have no operators in the stack.
- Later errors are often side-effects of a previous problem condition, so pay more attention to errors early in the log, even if later errors appear more severe.