Spotfire Server in a cluster crashes due to long garbage collection (GC) pauses

Article ID: KB0070105

Product: Spotfire Server
Versions: 7.5 and higher

Description

In a cluster, the Spotfire Server may go offline due to long garbage collection (GC) pauses. In the catalina.log you may see a warning like:
WARNING [jvm-pause-detector-worker] org.apache.ignite.logger.java.JavaLogger.warning Possible too long JVM pause: 816 milliseconds.
In the server.log you would see warnings and errors like the following:
WARN 2019-03-22T18:08:56,678-0400 [] discovery.tcp.TcpDiscoverySpi: Timed out waiting for message delivery receipt (most probably, the reason is in long GC pauses on remote node; consider tuning GC and increasing 'ackTimeout' configuration property). Will retry to send message with increased timeout [currentTimeout=10000, rmtAddr=TIBCO.Spotfire.Server/xx.xx.xx.xx:xxxx, rmtPort=xxxx]
WARN 2019-03-22T18:08:56,691-0400 [] discovery.tcp.TcpDiscoverySpi: Failed to send message to next node [msg=TcpDiscoveryConnectionCheckMessage [super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=160cc2f6961-299208f8-08f8-44d8-a99d-a4a3a5df7537, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=TcpDiscoveryNode [id=b80724c4-1a62-4149-b616-56284fe4a6f8, addrs=[xx.xx.xx.xx], sockAddrs=[TIBCO.Spotfire.Server/xx.xx.xx.xx:xxxx], discPort=5702, order=6, intOrder=4, lastExchangeTime=1552305360552, loc=false, ver=2.5.0#20180523-sha1:86e110c7, isClient=false], errMsg=Failed to send message to next node [msg=TcpDiscoveryConnectionCheckMessage [super=TcpDiscoveryAbstractMessage [sndNodeId=null, id=160cc2f6961-299208f8-08f8-44d8-a99d-a4a3a5df7537, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]], next=ClusterNode [id=b80724c4-1a62-4149-b616-56284fe4a6f8, order=6, addr=[10.209.129.158], daemon=false]]]
WARN 2019-03-22T18:08:56,693-0400 [] discovery.tcp.TcpDiscoverySpi: Local node has detected failed nodes and started cluster-wide procedure. To speed up failure detection please see 'Failure Detection' section under javadoc for 'TcpDiscoverySpi'
...
WARN 2019-03-22T18:08:56,977-0400 [] discovery.tcp.TcpDiscoverySpi: Node is out of topology (probably, due to short-time network problems).
INFO 2019-03-22T18:08:56,982-0400 [] discovery.tcp.TcpDiscoverySpi: Finished serving remote node connection [rmtAddr=/xx.xx.xx.xx:xxxx, rmtPort=xxxx
WARN 2019-03-22T18:08:56,994-0400 [] managers.discovery.GridDiscoveryManager: Local node SEGMENTED: TcpDiscoveryNode [id=b80724c4-1a62-4149-b616-56284fe4a6f8, addrs=[10.209.129.158], sockAddrs=[TIBCO.Spotfire.Server/xx.xx.xx.xx:xxxx], discPort=xxxx, order=6, intOrder=4, lastExchangeTime=1553292536994, loc=true, ver=2.5.0#20180523-sha1:86e110c7, isClient=false]
ERROR 2019-03-22T18:08:57,014-0400 [] : Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread tcp-disco-srvr-#3%TIBCO-Spotfire% is terminated unexpectedly.]]
java.lang.IllegalStateException: Thread tcp-disco-srvr-#3%TIBCO-Spotfire% is terminated unexpectedly.
	at org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServer.body(ServerImpl.java:5686) ~[ignite-core.jar:2.5.0]
	at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62) ~[ignite-core.jar:2.5.0]
ERROR 2019-03-22T18:08:57,014-0400 [] : JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread tcp-disco-srvr-#3%TIBCO-Spotfire% is terminated unexpectedly.]]

Apache Ignite is sensitive to long GC pauses (i.e., pauses of a few seconds). Long GC pauses, high CPU utilization, high memory utilization, or network communication issues can cause cluster segmentation and cluster disconnects, which in turn cause the Spotfire Server to shut down.
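To check whether a server shows these symptoms, the log messages quoted above can be searched for directly. The following is a minimal sketch, assuming a POSIX shell and log files that contain the message texts shown above. The function names and the 500 ms default threshold are illustrative assumptions, not part of Spotfire Server.

```shell
#!/bin/sh
# Sketch: summarise GC-pause and segmentation symptoms from Spotfire Server logs.
# Function names and the 500 ms default threshold are illustrative assumptions.

# Print each reported JVM pause (in ms) that is at or above the threshold.
# $1 = path to a catalina.log-style file, $2 = threshold in ms (optional).
long_pauses() {
  log_file="$1"
  threshold_ms="${2:-500}"
  sed -n 's/.*Possible too long JVM pause: \([0-9][0-9]*\) milliseconds.*/\1/p' "$log_file" |
    awk -v t="$threshold_ms" '$1 + 0 >= t + 0'
}

# Count log lines that report the local node dropping out of the cluster.
# $1 = path to a server.log-style file.
segmentation_events() {
  log_file="$1"
  grep -c -e 'Local node SEGMENTED' -e 'Node is out of topology' "$log_file"
}
```

For example, `long_pauses catalina.log 1000` lists only the pauses of one second or longer, and `segmentation_events server.log` prints how many matching lines the file contains.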

Resolution

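To help avoid this, the following two changes are recommended: set the minimum and maximum heap size to the same value, and increase the cluster failure detection timeout.

To avoid the GC overhead of resizing the heap, set the minimum heap size and the maximum heap size to the same value. Depending on the version and on how the server runs, this is done in service.bat or in the setenv file (JvmMs/JvmMx or the CATALINA_OPTS settings). Instructions: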

For Spotfire Server 10.5 and lower versions running as a Windows service:
  1. Stop the Spotfire Server service.
  2. On the command line, go to the <Spotfire Server installation directory>/tomcat/bin directory.
  3. Enter the following command:
    service.bat remove
  4. Open the <Spotfire Server installation directory>/tomcat/bin/service.bat file in a text editor.
  5. Locate the '--JvmMs' and '--JvmMx' entries and change the '--JvmMs' value to match the current '--JvmMx' value. For example, if --JvmMx is 4096, change --JvmMs to 4096:
    if "%JvmMs%" == "" set JvmMs=4096
    if "%JvmMx%" == "" set JvmMx=4096
  6. Save and close the file.
  7. Enter the following command:
    service.bat install
  8. Start the Spotfire Server service.
For Spotfire Server 10.6 and higher versions running as a Windows service:
  1. Stop the Spotfire Server service.
  2. On the command line, go to the <installation dir>/tomcat/bin directory.
  3. Enter the following command:
    service.bat remove
  4. Open the <installation dir>/tomcat/bin/setenv.bat file in a text editor.
  5. Locate the following entries (values are in MB):
    JvmMs=512
    JvmMx=4096
     Change the JvmMs value to match the JvmMx value, so that in this example both are 4096.
  6. Save and close the file.
  7. Enter the following command:
    service.bat install
  8. Start the Spotfire Server service.

For Spotfire Server not running as a Windows service:
  1. Go to the <Spotfire Server installation Directory>\tomcat\bin folder.
  2. Open the setenv.sh file if Spotfire Server is installed on a Linux machine, or the setenv.bat file if it is installed on a Windows machine.
  3. Change the heap size values:
    • For 10.2 and lower versions: At the end of the CATALINA_OPTS attribute, add -Xms and -Xmx and set them both to the same value, matching the current JAVA_OPTS -Xmx value. For example, if JAVA_OPTS contains -Xmx4096M, add -Xms4096M -Xmx4096M to CATALINA_OPTS:
      set JAVA_HOME=C:\tibco\tss\7.11.0\jdk
      set JRE_HOME=C:\tibco\tss\7.11.0\jdk\jre
      set JAVA_OPTS=-server -XX:+DisableExplicitGC -Xms4096M -Xmx4096M
      set CATALINA_OPTS=-Dcom.sun.management.jmxremote -Dorg.apache.catalina.session.StandardSession.ACTIVITY_CHECK=true -DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector -Xms4096M -Xmx4096M
    • For 10.3, 10.4, and 10.5 versions: In the CATALINA_OPTS attribute, change the -Xms value to match the current -Xmx value. For example, if -Xmx is 4096M, change -Xms to 4096M:
      set JAVA_HOME=C:\tibco\tss\10.5.0\jdk
      set JRE_HOME=C:\tibco\tss\10.5.0\jdk\jre
      
      rem Uncomment the line below to enable GC logging
      set GC_LOG=-XX:+PrintGCDetails -XX:+PrintAdaptiveSizePolicy -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=25M -Xloggc:%CATALINA_HOME%\logs\gc-%%t.log
      
      set JAVA_OPTS=-server -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC %GC_LOG%
      set CATALINA_OPTS=-Xms4096M -Xmx4096M -Dcom.sun.management.jmxremote -Dorg.apache.catalina.session.StandardSession.ACTIVITY_CHECK=true -DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector -Djava.library.path="%PATH%;C:\tibco\tss\10.5.0\tomcat\spotfire-lib;C:\tibco\tss\10.5.0\tomcat\custom-ext"
    • For 10.6 and higher versions: Change the JvmMs value to match the current JvmMx value. For example, if JvmMx is 4096, change JvmMs to 4096:
      set JAVA_HOME=C:\tibco\tss\10.7.0\jdk
      set JRE_HOME=C:\tibco\tss\10.7.0\jdk\jre
      set JvmMs=4096
      set JvmMx=4096
      rem Uncomment the line below to enable GC logging
      rem set GC_LOG=-XX:+PrintGCDetails -XX:+PrintAdaptiveSizePolicy -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=25M -Xloggc:%CATALINA_HOME%\logs\gc-%%t.log
      
      set JAVA_OPTS=-server -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC %GC_LOG%
      set CATALINA_OPTS=-Xms%JvmMs%M -Xmx%JvmMx%M -Dcom.sun.management.jmxremote -Dorg.apache.catalina.session.StandardSession.ACTIVITY_CHECK=true -DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector -Djava.library.path="%PATH%;C:\tibco\tss\10.7.0\tomcat\spotfire-lib;C:\tibco\tss\10.7.0\tomcat\custom-ext"
  4. Restart the Spotfire Server.
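After editing the file, it can be worth confirming that the options string really carries equal -Xms and -Xmx values. The following is a minimal sketch, assuming a POSIX shell. The function name heap_sizes_match is hypothetical, and the comparison is purely textual, so 4096M and 4G would be reported as different even though they describe the same size.

```shell
#!/bin/sh
# Sketch: check that an options string sets -Xms and -Xmx to the same value.
# heap_sizes_match is a hypothetical helper, not a Spotfire or Tomcat tool.

heap_sizes_match() {
  opts="$1"
  # Split the string into one token per line and keep the value after the flag.
  # If a flag appears more than once, the last occurrence is compared
  # (an assumption about which one takes effect).
  xms=$(printf '%s\n' "$opts" | tr ' ' '\n' | sed -n 's/^-Xms//p' | tail -n 1)
  xmx=$(printf '%s\n' "$opts" | tr ' ' '\n' | sed -n 's/^-Xmx//p' | tail -n 1)
  # Succeed only when both flags are present and textually identical.
  [ -n "$xms" ] && [ "$xms" = "$xmx" ]
}
```

For example, `heap_sizes_match "-Xms4096M -Xmx4096M -Dcom.sun.management.jmxremote"` succeeds, while `heap_sizes_match "-Xms512M -Xmx4096M"` fails.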

To resolve cluster segmentation caused by short-term communication issues, increase the value of the server configuration property "clustering.apacheignite.timeouts.failure-detection-timeout" to 60000 (milliseconds). Instructions:

For Spotfire Server 10.3.0 and higher:
  1. Open a command prompt and go to <Spotfire Server installation Directory>\tomcat\spotfire-bin
  2. Export the Spotfire Server configuration using the command "config export-config":
    <Spotfire Server installation Directory>\tomcat\spotfire-bin> config export-config
  3. Increase the failure detection timeout using the command "config set-config-prop":
    <Spotfire Server installation Directory>\tomcat\spotfire-bin> config set-config-prop --name="clustering.apacheignite.timeouts.failure-detection-timeout" --value=60000
  4. Import the configuration using the command "config import-config":
    <Spotfire Server installation Directory>\tomcat\spotfire-bin> config import-config -c "increased cluster failure detection timeout"
  5. Restart the Spotfire Server service.

For Spotfire Server versions 7.11.0, 7.12.0, 7.13.0, and 7.14.0:
  1. Open a command prompt and go to <Spotfire Server installation Directory>\tomcat\bin
  2. Export the Spotfire Server configuration using the command "config export-config":
    <Spotfire Server installation Directory>\tomcat\bin> config export-config
  3. Increase the failure detection timeout using the command "config set-config-prop":
    <Spotfire Server installation Directory>\tomcat\bin> config set-config-prop --name="clustering.apacheignite.timeouts.failure-detection-timeout" --value=60000
  4. Import the configuration using the command "config import-config":
    <Spotfire Server installation Directory>\tomcat\bin> config import-config -c "increased cluster failure detection timeout"
  5. Restart the Spotfire Server service.
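The export, set, and import steps above can also be chained so that a failure in one step stops the rest. The following is a minimal sketch, assuming a POSIX shell on a Linux installation. The function name, the CONFIG_CMD variable, and the ./config.sh default are assumptions for illustration; point CONFIG_CMD at the config command of your installation. Only the subcommands and options shown in the steps above are used.

```shell
#!/bin/sh
# Sketch: run the export / set / import sequence as one unit.
# set_failure_detection_timeout and CONFIG_CMD are hypothetical names;
# ./config.sh is an assumed default for the config command on Linux.

set_failure_detection_timeout() {
  config_cmd="${CONFIG_CMD:-./config.sh}"
  timeout_ms="${1:-60000}"

  # Accept only a non-empty string of digits as the timeout value.
  case "$timeout_ms" in
    ''|*[!0-9]*)
      echo "timeout must be a whole number of milliseconds" >&2
      return 2
      ;;
  esac

  # Each step runs only if the previous one succeeded.
  "$config_cmd" export-config &&
    "$config_cmd" set-config-prop \
      --name="clustering.apacheignite.timeouts.failure-detection-timeout" \
      --value="$timeout_ms" &&
    "$config_cmd" import-config -c "increased cluster failure detection timeout"
}
```

Run it from the directory that contains the config command, for example `set_failure_detection_timeout 60000`, and then restart the Spotfire Server as in the last step above.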

Additional Information

Doc: export-config command (exporting a Spotfire server configuration file from database), Spotfire Server and Environment - Installation and Administration manual
Doc: set-config-prop command (setting the value of a specific configuration property), Spotfire Server and Environment - Installation and Administration manual
Doc: import-config command (importing a Spotfire server configuration file to database), Spotfire Server and Environment - Installation and Administration manual