Resolution:
Description:
= = = = = = =
Identifying poor EMS server performance
Environment:
= = = = = = =
ALL
Symptoms:
= = = = = =
Slow response to EMS client/admin requests. Low server outbound message rate.
Causes:
= = = = = = = = =
Poor EMS server performance can have several causes, including high system or EMS server process CPU usage, and high system or EMS server process disk read/write rates or other disk problems. These can occur, for example, when consumers using message selectors are started while messages are pending in the EMS server. The following are known issues regarding high CPU usage and high disk read/write rates.
1). For EMS versions later than 5.1.0, KB 30513 discusses this issue, which centers on the trade-off between higher disk usage rates and lower memory requirements for the EMS server.
2). A known defect, 1-AB5HEJ, entitled "EMS server had high CPU usage and a high read rate when using a queue browser while messages were being consumed from that queue," was fixed in EMS 4.4.3 hotfix 12 and EMS 5.1.4. If you run into this defect, you should notice that the disk read rate is much higher compared to the outbound message rate, and something similar to the following will be seen in the server stack trace:
Thread 15 (Thread 0x40146940 (LWP 28158)):
#0 0x0000003660a0d2cb in read () from /lib64/libpthread.so.0
#1 0x000000000049982f in _tibemsFile_Read ()
#2 0x000000000049674d in _tibemsDb_GetRecord ()
#3 0x00000000005436b0 in _emsdFileStore_ReadMsg ()
#4 0x00000000004e7030 in _emsdMsg_SwapIn ()
#5 0x00000000005222db in ?? () _handleBrowserRetrieve
#6 0x00000000004c74c9 in ?? ()
#7 0x000000000049bcd6 in ?? ()
#8 0x000000000049be92 in ?? ()
#9 0x0000000000481234 in _tibemsEventQueue_Dispatch ()
#10 0x0000000000480224 in _tibemsEvm_IOThread ()
#11 0x0000003660a06367 in start_thread () from /lib64/libpthread.so.0
#12 0x000000365fed30ad in clone () from /lib64/libc.so.6
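For reference, below is a minimal sketch of the client pattern that triggers this defect: a QueueBrowser walking a queue while a consumer is receiving from the same queue. It uses the standard JMS API; the queue name, credentials, and server URL are assumptions for illustration only.

import java.util.Enumeration;
import javax.jms.*;

public class BrowseWhileConsuming {
    public static void main(String[] args) throws JMSException {
        // Assumed connection details, for illustration only.
        ConnectionFactory factory =
            new com.tibco.tibjms.TibjmsConnectionFactory("tcp://emshost:7222");
        Connection connection = factory.createConnection("user", "password");
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("sample.queue");

        MessageConsumer consumer = session.createConsumer(queue);
        connection.start();

        // Messages are being consumed from the queue ...
        Message consumed = consumer.receiveNoWait();

        // ... while a browser walks the same queue. On affected EMS versions this
        // combination caused high CPU usage and a high disk read rate on the server.
        QueueBrowser browser = session.createBrowser(queue);
        Enumeration<?> pending = browser.getEnumeration();
        while (pending.hasMoreElements()) {
            Message m = (Message) pending.nextElement();
            // inspect m ...
        }
        browser.close();
        connection.close();
    }
}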
Additional reasons for poor performance include:
- A large memory footprint, resulting in increased memory paging, which impacts performance.
- Network-related issues accompanying large message transfers. EMS is a multi-threaded application, but generally only one or two threads are busy: the busiest thread is normally the network I/O thread that handles client network I/O, and the next busiest is normally the thread that handles disk I/O.
- Expiration review cycles, in which the EMS server evaluates all pending messages, can have a noticeable performance impact on servers with large numbers of pending messages. If messages have been swapped to disk, the disk reads can be expensive and the datastore can become fragmented.
Troubleshooting:
= = = = = = = =
1). To determine whether the problem is caused by high disk read/write rates or other disk problems, look at the "disk read rate"/"disk write rate" values in the tibemsadmin "info" output and in the OS "iostat -x" output.
2). You should be able to obtain CPU/memory usage information from top/prstat/ps on UNIX and from perfmon logs on Microsoft Windows platforms. Verify whether a single EMS server thread has high CPU usage. On multi-CPU hosts, remember that the reported process CPU usage is measured against a total capacity of 100% multiplied by the number of CPUs, so it can exceed 100%.
3). A tcpdump/windump raw packet capture provides the best information for troubleshooting network issues. Running "netstat -s" several times and examining the output can also be helpful.
4). Take an overall look at the server status to see if anything is abnormal. If possible, send the results of the following tibemsadmin commands to TIBCO Support for review.
#################################################
- time on
- timeout 120
- info
- show connections full
- show topics
- show queues
- show durables
- show routes
- show consumers full
- show stat producers
- show bridges
- show db
#################################################
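For example, the commands above can be saved to a plain-text file and run non-interactively, assuming the tibemsadmin in your EMS version supports the -script option (check the tibemsadmin command-line help); the file names below are placeholders. Redirect the output to a file so it can be attached to the support case:
tibemsadmin -server tcp://<host>:7222 -user admin -script ems_status_commands.txt > ems_status_output.txt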
Solution:
= = = = =
1). Check the EMS Release Notes to see whether any defects matching the symptoms you are seeing have already been addressed.
2). If you notice that the EMS server has a large memory footprint, refer to KB 25181, entitled "Why is my EMS server footprint very large?".
3). If you notice high CPU utilization, capturing several thread traces (stack dumps) may help identify which threads are busy and what they are doing, and may help narrow down the problem.
4). If you notice a high disk read/write rate or high CPU usage, check whether any of the following are occurring:
- Is there a high message load? If so, try the following (a sketch follows this item):
a). Increase the maximum message memory size and turn message swapping off to prevent disk reads.
b). Expire messages more quickly to avoid a large backlog of pending messages.
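For point b), a minimal sketch using the standard JMS API follows; the connection factory URL, credentials, queue name, and the 60-second time-to-live are assumptions for illustration and should be tuned for your application. For point a), message memory limits and swapping are typically controlled in tibemsd.conf (for example, the max_msg_memory and msg_swapping parameters; check the EMS documentation for your version).

import javax.jms.*;

public class ExpiringProducer {
    public static void main(String[] args) throws JMSException {
        // Assumed connection details, for illustration only.
        ConnectionFactory factory =
            new com.tibco.tibjms.TibjmsConnectionFactory("tcp://emshost:7222");
        Connection connection = factory.createConnection("user", "password");
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageProducer producer =
            session.createProducer(session.createQueue("sample.queue"));

        // Expire messages 60 seconds after they are sent so that slow or stopped
        // consumers cannot build up an unbounded pending-message backlog.
        producer.setTimeToLive(60 * 1000);

        producer.send(session.createTextMessage("hello"));
        connection.close();
    }
}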
5). If you notice a large number of publishers publishing messages at a high rate, consider the following:
a). Reduce the number of publishers, rate-limit them in their code (a sketch follows this item), or enable flow control or destination limits.
b). Move some of the clients to a different EMS server.
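A minimal sketch of the application-side rate limiting mentioned in 5a, using the standard JMS API; the target rate, message count, queue name, and connection details are assumptions for illustration. Flow control and destination limits, by contrast, are configured on the EMS server rather than in client code.

import javax.jms.*;

public class ThrottledPublisher {
    public static void main(String[] args) throws Exception {
        final int targetMsgsPerSec = 500;                       // assumed target rate
        final long budgetNanos = 1_000_000_000L / targetMsgsPerSec;

        // Assumed connection details, for illustration only.
        ConnectionFactory factory =
            new com.tibco.tibjms.TibjmsConnectionFactory("tcp://emshost:7222");
        Connection connection = factory.createConnection("user", "password");
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageProducer producer =
            session.createProducer(session.createQueue("sample.queue"));

        for (int i = 0; i < 10_000; i++) {
            long start = System.nanoTime();
            producer.send(session.createTextMessage("message " + i));
            // Sleep off the remainder of this message's time budget so the
            // publisher does not exceed the target rate.
            long remaining = budgetNanos - (System.nanoTime() - start);
            if (remaining > 0) {
                Thread.sleep(remaining / 1_000_000, (int) (remaining % 1_000_000));
            }
        }
        connection.close();
    }
}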
6). Are selectors being used?
Selectors can be expensive for queue consumers. If messages are pending on a queue and you start a message consumer with a selector, the server may have to evaluate all pending messages to determine whether the consumer can receive them. Possible mitigations (see the sketch after this list):
a). Create a bridge that applies the selector, and create consumers without selectors on the bridged destination.
b). Use long-lived consumers.
c). Disable swapping.
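A minimal sketch contrasting the two consumer patterns from point a), using the standard JMS API. The destination names, selector expression, and connection details are assumptions for illustration; the bridge itself would be defined on the server side (for example in bridges.conf), not in client code.

import javax.jms.*;

public class SelectorExample {
    public static void main(String[] args) throws JMSException {
        // Assumed connection details, for illustration only.
        ConnectionFactory factory =
            new com.tibco.tibjms.TibjmsConnectionFactory("tcp://emshost:7222");
        Connection connection = factory.createConnection("user", "password");
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);

        // Expensive pattern: the server may have to evaluate every pending message
        // on orders.queue against the selector when this consumer starts.
        MessageConsumer selective = session.createConsumer(
            session.createQueue("orders.queue"), "region = 'EMEA'");

        // Cheaper pattern (point a): consume without a selector from a destination
        // that a server-side bridge has already filtered with the same selector.
        MessageConsumer plain = session.createConsumer(
            session.createQueue("orders.emea.queue"));

        connection.start();
        // ... receive messages ...
        connection.close();
    }
}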
7). Check whether there are a large number of pending and/or expired messages.
If the client connects to the EMS server across a WAN, or the network round-trip time (RTT) is relatively long, verify whether the client uses transacted sessions, or non-transacted sessions with the AUTO, CLIENT, or EXPLICIT_CLIENT acknowledgement mode, for durable subscribers, queue receivers, or consumers using a fault-tolerant (FT) connection string. If so, try the DUPS_OK acknowledgement mode to see whether this resolves the problem (see the sketch below).
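A minimal sketch of a consumer session created with DUPS_OK acknowledgement mode, using the standard JMS API; the fault-tolerant URL, credentials, and queue name are assumptions for illustration. Note that DUPS_OK permits duplicate delivery, so the consuming application must tolerate receiving a message more than once.

import javax.jms.*;

public class DupsOkConsumer {
    public static void main(String[] args) throws JMSException {
        // Assumed fault-tolerant connection URL, for illustration only.
        ConnectionFactory factory =
            new com.tibco.tibjms.TibjmsConnectionFactory(
                "tcp://emshost1:7222,tcp://emshost2:7222");
        Connection connection = factory.createConnection("user", "password");

        // DUPS_OK lets the client acknowledge lazily, reducing the number of
        // acknowledgement round trips over a high-latency WAN link.
        Session session = connection.createSession(false, Session.DUPS_OK_ACKNOWLEDGE);
        MessageConsumer consumer =
            session.createConsumer(session.createQueue("sample.queue"));

        connection.start();
        Message msg = consumer.receive(5000);
        connection.close();
    }
}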
8). If you cannot identify the root cause of the problem, collect the following data from the EMS server host:
- pstack output
- tcpdump data
- strace/dtrace output
Try using the attached utility (Filename: Slow_Clock_Collector.zip) to collect the above data.
Keywords
= = = = = =
EMS Performance