The typical EMS Admin commands are of little help in determining the cause of a slow clock tick. A stack trace of the EMS process (via the pstack command or equivalent) shows most directly what the EMS process is doing. However, the information, whether from pstack, strace or tcpdump, must be captured during the slow clock tick period; otherwise it is of no use for analysis. In most cases it is difficult to anticipate when the next slow clock tick will occur, and it is not practical to run those commands continuously in the hope of catching one. It is better to have a probe mechanism that actively pings the EMS server and collects the information in real time when the response time of the EMS process exceeds a certain threshold (for example, 2 seconds).
The attached Java program (Filename: Slow_Clock_Collector.zip) uses the EMS Java Admin API to periodically run the "info" command and time the response. Before each "info" call, the program locks an internal object, and it unlocks the object after the call returns. Meanwhile, a second thread periodically wakes up and tries to lock the same internal object. If this lock attempt does not succeed within a certain time, the program calls an external shell script to collect the pstack and tcpdump data.
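The lock-based detection described above can be sketched roughly as follows. This is a minimal illustration, not the code from the attached zip file: the class name SlowProbeWatchdog, its methods, and the simulated probe are all hypothetical, and a real probe would presumably call the EMS Admin API "info" request instead of sleeping.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch of the probe/watchdog pairing; not the attached program. */
class SlowProbeWatchdog {
    private final ReentrantLock probeLock = new ReentrantLock();
    private final long thresholdMillis;
    private final Runnable onSlow; // e.g. launch the collection script (assumption)

    SlowProbeWatchdog(long thresholdMillis, Runnable onSlow) {
        this.thresholdMillis = thresholdMillis;
        this.onSlow = onSlow;
    }

    /** Probe thread: hold the lock for exactly as long as the probe call takes. */
    void runProbe(Runnable probe) {
        probeLock.lock();
        try {
            probe.run();
        } finally {
            probeLock.unlock();
        }
    }

    /**
     * Watchdog thread: try to take the same lock within the threshold.
     * Returns true (and fires onSlow) if the probe still holds the lock.
     */
    boolean checkOnce() throws InterruptedException {
        if (probeLock.tryLock(thresholdMillis, TimeUnit.MILLISECONDS)) {
            probeLock.unlock();
            return false;
        }
        onSlow.run();
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        SlowProbeWatchdog watchdog = new SlowProbeWatchdog(2000,
                () -> System.out.println("probe is slow: collect diagnostics here"));
        // Simulated probe that takes three seconds (stand-in for a slow "info" call).
        Thread probeThread = new Thread(() -> watchdog.runProbe(() -> sleepQuietly(3000)));
        probeThread.start();
        Thread.sleep(200); // crude head start so the probe takes the lock first
        System.out.println("slow = " + watchdog.checkOnce());
        probeThread.join();
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The timed tryLock keeps the watchdog from blocking indefinitely: it either gets the lock quickly (the probe has returned) or gives up at the threshold and triggers collection while the slow call is still in progress.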
Instructions for Linux:
Extract the content of the attached zip file (Filename: Slow_Clock_Collector.zip) to the host where EMS is running. The program must be run on the same host.
1). Switch to the root account, as tcpdump requires root privilege. If you feel pstack is sufficient, you can comment out tcpdump in the shell script; in that case there is no need to run as root.
2). Run "source setup.sh".
3). Open col_linux.sh and change the following line:
tcpdump -i any -s 0 -w $tcpdump_log$curtime tcp port 7222 &
Replace 7222 with your EMS listening port.
4). Run the following command to start the process:
nohup java SlowClockDetectorLinux -server <server_url> -user <admin_usr> -password <password> -script col_linux.sh -pid <ems_pid> >>EMSMonitor.log 2>&1 &
Replace <server_url>, <admin_usr>, <password> and <ems_pid> with your own values.
Note:
The threshold defaults to two seconds. This means that if the "info" command does not return within two seconds, the shell script is called. You can change the threshold as needed.
The attached script (Filename: Slow_Clock_Collector.zip) will do the following:
- Get five consecutive pstack outputs of the EMS process, one second apart.
You can add additional OS commands to check CPU, memory and disk IO utilization before the pstack. For example, you can add the following commands:
top -H -n 1 -p $pid
vmstat 1 5
iostat -d -x 1 5
iostat -c 1 5
- Get 15 seconds of tcpdump capture on any interface. The tcpdump capture will be saved in the current working directory of the Java program.
It is possible to run the same on other *NIX systems with slight changes.
For Solaris:
The equivalent command of tcpdump on Solaris is snoop. Snoop cannot capture on "ANY" interface so an interface name needs to be specified explicitly:
snoop -d e1000g0 -s 0 -o $tcpdump_log$curtime tcp port 7222 &
Replace "e1000g0" with the interface name on your host. Similarly, change 7222 to the EMS listening port.
The way the Java code calls an external shell script differs between Linux and Solaris; specifically, the "sh" command is in a different path.
For this reason, a separate program, SlowClockDetectorSolaris, is provided for Solaris.
The command to run would be:
java SlowClockDetectorSolaris -server <server_url> -user <admin_usr> -password <password> -script col_sol.sh -pid <ems_pid> >>EMSMonitor.log 2>&1
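If you adapt the Java code yourself, one way to avoid hard-coding the location of "sh" is to probe a few candidate paths at startup. The sketch below is only an illustration: the class ScriptLauncher, the candidate paths /bin/sh and /usr/bin/sh, and the idea of passing the EMS pid as the script's first argument are all assumptions, not details taken from the attached programs.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper for launching the collection script on different *NIX hosts. */
class ScriptLauncher {

    /** Returns the first candidate path that exists and is executable, if any. */
    static Optional<String> firstExecutable(List<String> candidates) {
        for (String path : candidates) {
            File f = new File(path);
            if (f.isFile() && f.canExecute()) {
                return Optional.of(path);
            }
        }
        return Optional.empty();
    }

    /** Builds "<shell> <script> <pid>"; passing the pid as an argument is an assumption. */
    static List<String> buildCommand(String shell, String script, String pid) {
        return Arrays.asList(shell, script, pid);
    }

    /** Starts the script with whichever shell was found; output goes to this process's stdout/stderr. */
    static Process launch(String script, String pid) throws IOException {
        // Candidate locations are assumptions; adjust them for your platform.
        String shell = firstExecutable(Arrays.asList("/bin/sh", "/usr/bin/sh"))
                .orElseThrow(() -> new IOException("no usable sh found"));
        return new ProcessBuilder(buildCommand(shell, script, pid))
                .inheritIO()
                .start();
    }
}
```

A call such as ScriptLauncher.launch("col_sol.sh", "<ems_pid>") would then work unchanged on hosts where the shell lives in either location.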
You can change the Java program and the shell script as needed if there is a requirement to run this on HP-UX or AIX. Keep in mind that the pstack command is not available on either HP-UX or AIX.
Revision: [2015-09-28] Added timestamp for each probe | Added optional threshold for "slow" determination, which previously defaulted to two seconds.