System reboot due to failure in engine_tcp_parser in TIBCO LogLogic LMI

Products	Versions
TIBCO LogLogic Log Management Intelligence	all versions up to 5.1.0

Description

Engine_tcp_parser is used to process data received via LogLogic TCP (LLTCP). LLTCP is used to forward data in file format from one LMI appliance to another. This is typically for file-based event data, but it can be used for syslog-based events as well. As with all engines, there is a heartbeat timeout that determines when the engine is considered to have stopped responding to mtask. Occasionally, when the upstream LogLogic appliance forwards a large amount of data to the destination appliance in a short timespan, the destination appliance becomes overwhelmed parsing the data, and engine_tcp_parser appears unresponsive while it works through the large volume of events. After several attempts at restarting engine_tcp_parser (and some associated engines), the system will eventually reboot to try to clear the condition.

Note: This article does not apply to LMI versions after 5.1.0 because engine_tcp_parser functionality was moved to engine_rcollector starting in LMI 5.2.0. This issue only affects one version of LMI EVA because the EVA had only been released for LMI 5.1.0 before the engine was removed in 5.2.0.

Resolution

One suggested remedy is to increase the heartbeat timeout that mtask allows engine_tcp_parser for processing data. The default is 120 seconds, which can be stepped up to 300 or 600 seconds to mitigate the issue.

This involves editing a system file in vi via shell access (logged in as toor).

1. SSH to the appliance and log in as "toor".
2. Make a backup copy of /loglogic/conf/node_config.xml:
$ cp /loglogic/conf/node_config.xml /loglogic/conf/node_config.xml_bak

3. Next, edit node_config.xml with vi:
$ vi /loglogic/conf/node_config.xml

4. Find the section that matches your appliance platform, then find the sub-section for engine_tcp_parser within it. Note: Make sure to edit the engine_tcp_parser sub-section that is inside the section matching your appliance model. The appliance model can be found in /etc/platform. The engine_tcp_parser sub-section will look like the following:
    <service
        group="BACKEND"
        start_cmd="/loglogic/bin/engine_tcp_parser"
        startup_timeout="600"
        shutdown_timeout="30"
        heartbeat_timeout="120"
        escalation="GROUP_RESTART,NODE_REBOOT,GROUP_DISABLE"
        runlevel="5"/>
5. Edit the heartbeat_timeout value from "120" to "300" (or "600" if 300 wasn't sufficient).
6. Save and exit from vi.
7. Then run:
$ mtask stop; mtask start
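
Before restarting mtask in step 7, the edit can be checked from the shell. The following is a minimal sketch, not part of the official procedure: the function name show_tcp_parser_timeouts is hypothetical, and the script assumes the attribute layout shown in step 4 (start_cmd appears before heartbeat_timeout, and the entry closes with "/>"). It only reads the file and prints each heartbeat_timeout line that belongs to an engine_tcp_parser entry, with its line number, so you can confirm that the entry for your platform carries the new value.

```shell
# Read-only check: list the heartbeat_timeout line (with line number) of
# every engine_tcp_parser <service> entry in a node_config.xml-style file.
# Assumes the layout shown in step 4: start_cmd precedes heartbeat_timeout
# and the entry is closed by "/>".
show_tcp_parser_timeouts() {
    awk '
        # Entering an engine_tcp_parser entry
        /start_cmd="\/loglogic\/bin\/engine_tcp_parser"/ { in_block = 1 }
        # Print the timeout attribute, stripped of leading whitespace
        in_block && /heartbeat_timeout=/ {
            gsub(/^[ \t]+/, "")
            print FILENAME ":" NR ": " $0
        }
        # The closing "/>" ends the entry
        in_block && /\/>/ { in_block = 0 }
    ' "$1"
}

# Usage on the appliance:
#   show_tcp_parser_timeouts /loglogic/conf/node_config.xml
```

Because the file holds an engine_tcp_parser entry in more than one platform section, expect several lines of output; only the one in your platform's section should show the changed value. Running diff against the backup made in step 2 (diff /loglogic/conf/node_config.xml_bak /loglogic/conf/node_config.xml) is another way to confirm that only the intended line changed.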

Note: For an HA pair, edit the file on each appliance individually. To restart mtask without triggering an unnecessary failover, run $ mtask stop on the standby first, then $ mtask stop; mtask start on the master, and finish with $ mtask start on the standby. There will still be a few moments of downtime for data collection.
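
The HA restart order can be sketched as a small script. This is an illustration under stated assumptions, not a supported tool: the hostnames, the run_on helper, and the ha_mtask_restart function are hypothetical, and running the commands over non-interactive SSH as toor is assumed to be possible in your environment. By default the script only prints the steps (DRY_RUN=1), which is a safe way to review the order before running the commands by hand on each appliance.

```shell
# Hypothetical hostnames; replace with your real standby and master appliances.
STANDBY="${STANDBY:-lmi-standby.example.com}"
MASTER="${MASTER:-lmi-master.example.com}"

# Run a command on a host. With DRY_RUN=1 (the default) the step is only
# printed. Executing over ssh as toor is an assumption, not a documented method.
run_on() {
    host="$1"; shift
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "[$host] $*"
    else
        ssh "toor@$host" "$*"
    fi
}

ha_mtask_restart() {
    run_on "$STANDBY" "mtask stop"               # 1. stop mtask on the standby first
    run_on "$MASTER"  "mtask stop; mtask start"  # 2. restart mtask on the master
    run_on "$STANDBY" "mtask start"              # 3. start mtask on the standby again
}

# Usage (prints the three steps in order without executing them):
#   ha_mtask_restart
```

Stopping the standby before touching the master is what keeps the standby from taking over while the master restarts.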

Issue/Introduction

This article explains how to mitigate the issue of engine_tcp_parser failures causing system reboots.
