How to troubleshoot the Spotfire Statistica Monitoring and Alerting server issues

Article ID: KB0137867

Products and Versions
Spotfire Statistica: 14.0 and higher
Spotfire Statistica - All Servers: 14.0 and higher
Spotfire Data Science - Workbench: 14.0 and higher

Description

For MAS (Monitoring and Alerting Server) troubleshooting, the MAS log files are typically located in "C:\Windows\System32\LogFiles\MAS\" and reflect the logging level set in the "MAS Configuration" tool. This article walks through an example of reading the MAS logs to identify the process ID of the runner for a particular taskset that experiences a timeout. The same approach also applies to other scenarios that call for analyzing MAS logs.

Environment

OS: Windows

Resolution

The following MAS log example has unrelated intermediate rows removed:
-----------------------------------------------------------------------------
INFO     2019-10-31 03:30:03.333  7448  4848  Created runner process id 5548
INFO     2019-10-31 03:30:06.146  5548  6836  Runner: Running Monitor workspace1 Process Unit Data in Taskset taskset1 Process Unit Data
INFO     2019-10-31 04:30:00.161  7448  4848  ROS: Shutting down runner process 5548
INFO     2019-10-31 04:30:06.161  7448  4848  ROS: runner process 5548 did not shut down in a timely fashion, killing the process
ERROR    2019-10-31 04:30:06.161  7448  4848  Exception detected in CPooledObj::ExecuteRunningObj. HRESULT = 0x80004005 Description = [Task Execution timed out; most likely exceeded the configured Taskset maximum run time]
-----------------------------------------------------------------------------

Note:  Monitor refers to an analysis, usually a workspace.


Log analysis:
In each log entry, the first number after the timestamp is the process ID (PID) and the second is the thread ID.
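As a rough illustration of that layout, the following Python sketch splits one log line into its fields. The field order (level, timestamp, PID, thread ID, message) is inferred only from the sample above, and the names `LINE_RE`, `LogEntry`, and `parse_line` are hypothetical helpers, not part of MAS.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, inferred from the sample log above:
# LEVEL  YYYY-MM-DD HH:MM:SS.mmm  PID  THREAD_ID  message
LINE_RE = re.compile(
    r"^(?P<level>[A-Z]+)\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<pid>\d+)\s+(?P<tid>\d+)\s+(?P<msg>.*)$"
)


class LogEntry(NamedTuple):
    level: str
    timestamp: str
    pid: int      # first number after the timestamp
    tid: int      # second number after the timestamp
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Return the parsed fields, or None if the line does not fit the layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return LogEntry(m["level"], m["ts"], int(m["pid"]), int(m["tid"]), m["msg"])


entry = parse_line(
    "INFO     2019-10-31 03:30:03.333  7448  4848  Created runner process id 5548"
)
print(entry.pid, entry.tid, entry.message)
```

Lines that do not match, such as the dashed separators, come back as `None` and can simply be skipped.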

The sequence of the logged activities:
1. The MASMonitorService process sends a task to the COM+ app in the dllhost process.

2. The dllhost process sends the task to the MASTaskRunner process.

3. To look for a particular task in a taskset, find the entry from the Runner (MASTaskRunner) that names the Task (Monitor), e.g. "workspace1", and the taskset being run. The PID on that entry is the runner process, 5548 in this case.

4. Search for 5548 to track down the log entries that relate to that task runner. The "Created runner process id 5548" entry shows that the dllhost process 7448 is the one that created the runner process 5548.

5. The timeout error is reported by the dllhost process 7448. This shows how to tie the relevant entries together, given that other unrelated log entries are interspersed throughout.
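The tracing steps above can be sketched in Python as follows. This is a minimal illustration that assumes the whitespace-separated layout of the sample log; `split_fields` and `trace_task` are hypothetical names, and the sample lines are copied from the excerpt above. Since Windows can reuse PIDs over time, a real search would likely also need to be limited to a time window.

```python
import re

SAMPLE = """\
INFO     2019-10-31 03:30:03.333  7448  4848  Created runner process id 5548
INFO     2019-10-31 03:30:06.146  5548  6836  Runner: Running Monitor workspace1 Process Unit Data in Taskset taskset1 Process Unit Data
INFO     2019-10-31 04:30:00.161  7448  4848  ROS: Shutting down runner process 5548
INFO     2019-10-31 04:30:06.161  7448  4848  ROS: runner process 5548 did not shut down in a timely fashion, killing the process
ERROR    2019-10-31 04:30:06.161  7448  4848  Exception detected in CPooledObj::ExecuteRunningObj. HRESULT = 0x80004005 Description = [Task Execution timed out; most likely exceeded the configured Taskset maximum run time]
"""


def split_fields(line):
    """Return (pid, message) for a log line, or None if it does not fit."""
    parts = line.split(None, 5)  # level, date, time, pid, tid, message
    if len(parts) < 6 or not (parts[3].isdigit() and parts[4].isdigit()):
        return None
    return int(parts[3]), parts[5]


def trace_task(lines, monitor, taskset):
    """Find the runner PID for a task, the PID that created it, and related lines."""
    runner_pid = None
    for line in lines:
        fields = split_fields(line)
        if fields is None:
            continue
        pid, msg = fields
        if "Runner: Running Monitor" in msg and monitor in msg and taskset in msg:
            runner_pid = pid  # step 3: the PID on the Runner entry
            break
    if runner_pid is None:
        return None, None, []

    parent_pid = None
    related = []
    mentions_runner = re.compile(rf"\b{runner_pid}\b")
    for line in lines:
        fields = split_fields(line)
        if fields is None:
            continue
        pid, msg = fields
        if msg.strip() == f"Created runner process id {runner_pid}":
            parent_pid = pid  # step 4: the process that created the runner
        if pid == runner_pid or mentions_runner.search(msg):
            related.append(line)
    return runner_pid, parent_pid, related


runner, parent, related = trace_task(SAMPLE.splitlines(), "workspace1", "taskset1")
print(runner, parent, len(related))
```

The timeout error itself does not mention the runner PID, so it is not in `related`; it is found by looking at the errors logged by `parent` (step 5).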

Similarly, after something has timed out, the following steps are recommended for collecting the relevant dump files from Task Manager/Process Explorer:
1. Start from the "Task Execution timed out" error in the MAS log/Windows event log and get its PID, e.g. '1234'.

2. Search backwards in the MAS log for the runner process that was created by PID '1234'.

3. Obtain the runner process PID, e.g. '5678', from that log entry.

4. Collect a dump of the relevant MASTaskRunner.exe process (PID '5678') for the long-running task that times out.
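The backward search in steps 1 to 3 could be sketched as below. This assumes the same log layout as the sample earlier in the article; `runners_for_timeouts` is a hypothetical helper. It picks the most recent "Created runner process id" entry logged by the reporting PID, which is only a reasonable guess when that process has created a single runner since its last task; otherwise the candidates need to be checked by hand.

```python
import re

TIMEOUT_MARKER = "Task Execution timed out"
CREATED_RE = re.compile(r"Created runner process id (\d+)")


def runners_for_timeouts(lines):
    """Return (line_index, reporting_pid, runner_pid) for each timeout error.

    runner_pid is None when no matching creation entry is found.
    """
    results = []
    for i, line in enumerate(lines):
        if TIMEOUT_MARKER not in line:
            continue
        parts = line.split(None, 5)  # level, date, time, pid, tid, message
        if len(parts) < 6 or not parts[3].isdigit():
            continue
        reporting_pid = parts[3]  # step 1: PID on the timeout error

        runner_pid = None
        # Step 2: walk backwards through earlier entries from the same PID.
        for prev in reversed(lines[:i]):
            prev_parts = prev.split(None, 5)
            if len(prev_parts) == 6 and prev_parts[3] == reporting_pid:
                m = CREATED_RE.search(prev_parts[5])
                if m:
                    runner_pid = int(m.group(1))  # step 3: runner PID
                    break
        results.append((i, int(reporting_pid), runner_pid))
    return results
```

Calling `runners_for_timeouts(open(path).read().splitlines())` on a MAS log file, where `path` is whichever log file is being examined, returns one tuple per timeout error.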

Suggestions for how to capture dumps of a long running task that appears likely to time out and may be hung:

1. Submit the concerned taskset and wait for it to show as running in the dashboard.

2. Look in the log file for the task(s) and taskset name.

3. Trace each task that you expect might time out to its MASTaskRunner PID as described above.

4. If it hasn't finished in the expected time, start collecting dumps for those task runner processes.
Note: DebugDiag can help here once you know the MASTaskRunner process IDs. On its Processes tab:

4.a. Locate the desired MASTaskRunner.exe (the list can be sorted by process name), right-click it, and choose Create Userdump Series. Select mini dumps only and uncheck the options for full dumps.

4.b. Set the interval to every 10 minutes (600 seconds). A maximum of 10 dumps is fine, since the task will time out in 60 minutes, after at most 6 dumps.

4.c. Go back to the logs and see whether any tasks timed out. Delete the dumps for tasks that did not time out.
 

Note that if using DebugDiag as described, there is no need to wait for the expected run time; simply get the task runner PIDs and set up the user dump series and leave it. Come back later and see if any of those actually timed out, and keep the dumps for those that did time out.
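The dump-count reasoning in step 4.b can be checked with a small calculation. The sketch below assumes the 60-minute maximum run time and 600-second interval used as examples in this article; `dumps_before_timeout` is a hypothetical helper, and the actual configured taskset maximum run time should be substituted.

```python
def dumps_before_timeout(timeout_minutes: int, interval_seconds: int, max_dumps: int) -> int:
    """Number of dumps a series can produce before the task is timed out."""
    possible = (timeout_minutes * 60) // interval_seconds
    # The series stops at max_dumps even if the task is still running.
    return min(possible, max_dumps)


# 60-minute maximum run time, one dump every 600 seconds, series capped at 10.
print(dumps_before_timeout(60, 600, 10))  # 6
```

If the configured maximum run time is longer, the cap on the series may need to be raised so that dumps continue up to the timeout.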

Finally, note that a log entry such as "runner process 5548 did not shut down in a timely fashion, killing the process" followed by a "Task Execution timed out" error is expected. When the timeout error is raised, but before it is logged, MAS attempts to shut down the relevant MASTaskRunner. Since the runner is still busy running the task, it does not respond to the shutdown request, so it is killed, and then the timeout error is logged.

Other MAS tips:

1. Frequent "Activation Timeout" errors typically occur when all pool instances are busy running other tasks and more tasks are trying to run. This can indicate that the MAS server is heavily loaded with tasksets.

2. A timeout error followed by "[taskset/task] was skipped because MAS failed to create an instance of the running object server. Error code: 0x8004E024." can also indicate that MAS is overloaded, with every pool instance in use. Increasing the pool size and spacing out the schedule is recommended. See the relevant KB article about increasing the MAS pool size.

3. MAS logs may indicate that the "Activation Limit" for the COM+ app is set to 1. This would mean that after running a single task, the dllhost and MASTaskRunner processes are recycled and new ones have to be started. Throughput and performance could be improved by setting this to 0 (unlimited) so that these processes can be reused. These settings are described here: https://docs.microsoft.com/en-us/windows/win32/cossdk/configuring-com--application-recycling-values

4. "Activation Limit" sets the number of times each task runner object is activated. For example, with a limit of 2, after any one task runner has run 2 tasks, that object is recycled and a new one is started. Recycling and starting a new instance can affect performance, so it is typically best to have a higher activation limit.
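To judge how often the overload symptoms in tips 1 and 2 occur, the matching log entries can be counted per hour. The sketch below is only an illustration: it assumes the log layout of the sample earlier in the article, and it assumes that the relevant entries contain the text "Activation Timeout" or the error code 0x8004E024 somewhere in the message (the exact wording of the "Activation Timeout" entries is not shown in this article). `overload_hits_per_hour` is a hypothetical helper.

```python
from collections import Counter

# Assumed substrings, compared case-insensitively against the message text.
MARKERS = ("activation timeout", "0x8004e024")


def overload_hits_per_hour(lines):
    """Count entries that match an overload marker, bucketed by date and hour."""
    counts = Counter()
    for line in lines:
        parts = line.split(None, 5)  # level, date, time, pid, tid, message
        if len(parts) < 6:
            continue
        message = parts[5].lower()
        if any(marker in message for marker in MARKERS):
            hour_bucket = f"{parts[1]} {parts[2][:2]}:00"
            counts[hour_bucket] += 1
    return counts
```

Hours with many hits are candidates for spacing out the schedule or increasing the pool size.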

Issue/Introduction

This article documents tips for reading MAS logs and the likely causes of some errors.

Additional Information