A high-throughput application running on a JVM with HotSpot just-in-time (JIT) compilation passes through three stages:
1. Before compilation, Java bytecode is interpreted every time it runs.
2. After a Java method has been called 10,000 times (the default threshold), the JIT compiler compiles and optimizes that method, caching a machine-code version.
3. Once JIT compilation of a method is complete, the cached machine code runs exclusively.
The cached machine code is typically faster than interpreted bytecode, so performance readings taken before JIT compilation do not reflect later performance and should be discarded. For an application that handles 1,000 messages per second, JIT compilation will trigger after about ten seconds at that rate, and the rest of the run will proceed at the highest possible performance (using default settings).
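As a minimal illustration (plain Java, not StreamBase-specific; the method, batch size, and call counts are illustrative choices, not values from this article), the following sketch times successive batches of calls to one hot method. Run it with -XX:+PrintCompilation and the per-batch times typically drop sharply once the method crosses the compile threshold:

public class JitWarmupDemo {

    // A small "hot" method for HotSpot to profile and compile.
    static double work(int n) {
        double sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += Math.sqrt(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        final int batches = 20;
        final int callsPerBatch = 2_000; // 20 x 2,000 calls crosses the 10,000-call default
        double sink = 0;                 // consume results so the JIT cannot eliminate the work

        for (int b = 1; b <= batches; b++) {
            long start = System.nanoTime();
            for (int c = 0; c < callsPerBatch; c++) {
                sink += work(1_000);
            }
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.printf("batch %2d: %d us%n", b, micros);
        }
        System.out.println("ignore: " + sink);
    }
}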
At startup, interpreted Java bytecode may be fast enough to process the message volume. Later, the JIT compiler may briefly take up CPU resources performing compilation and optimization, reducing the CPU available to the application and possibly causing messages to buffer. Once the JIT compiler has finished compiling all frequently exercised processing paths, there is again sufficient CPU for the service to catch up and work through any brief backlog of messages. This behavior is common to all Java applications and is a characteristic of the Java platform.
There are three ways to make sure the application is running at its best possible performance for a test, which can be used individually or in combination:
A. Start the application early and provide dummy data which exercises the main code paths (see the sketch after this list). Take care to use fictitious identifiers so real data is not affected during later processing. One way to do this is to provide a way to clear old cached data within the application without stopping and restarting it.
B. Do not stop and restart the application from day to day so you can make use of the optimizations created by the JIT compiler during the previous day.
C. Make sure a CPU core remains free so that CPU use by JIT compilation does not block normal processing.
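A hedged sketch of option A in plain Java follows. The sendMessage hook and the "PRIME-" identifier prefix are hypothetical stand-ins for your application's real input API and naming convention; the point is to push representative dummy messages through the main code paths until they cross the JIT threshold, using identifiers that later cleanup can recognize and purge:

import java.util.UUID;

public class Primer {

    // Hypothetical hook into the application's input path; substitute the
    // real enqueue or adapter call for your application.
    static void sendMessage(String id, String symbol, double price) {
        /* deliver the dummy tuple to the application's input stream */
    }

    public static void main(String[] args) {
        // Comfortably past the 10,000-call default compile threshold.
        for (int i = 0; i < 15_000; i++) {
            // Fictitious identifiers (a "PRIME-" prefix) so cleanup logic can
            // later find and remove this data without touching real records.
            sendMessage("PRIME-" + UUID.randomUUID(), "TEST", 1.0 + (i % 100));
        }
    }
}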
To see which parts of the application are being compiled and when, add -XX:+PrintCompilation to the JVM arguments in the server configuration file:
StreamBase 7.x sbd.sbconf (example):
<param name="jvm-args" value="-Xms128m -Xmx900m -XX:+PrintCompilation"/>
TIBCO Streaming 10.x and later (example):
StreamBaseEngine = {
  jvmArgs = [
    "-Xmx2048m"
    "-Xms512m"
    "-XX:+UseG1GC"
    "-XX:MaxGCPauseMillis=500"
    "-XX:ConcGCThreads=1"
    "-XX:+PrintCompilation"
  ]
}
Then run the application, capturing stdout and stderr to a file, as follows:
StreamBase 7.x:
sbd -f sbd.sbconf app.sbapp >sbd.log 2>&1
TIBCO Streaming 10.x:
epadmin servicename=A.X start node
The output is written to the node's logs/ directory, in the default-engine-for-*.log file.
Also look for "COMPILE SKIPPED" messages as an indication of code which, although exercised frequently, could not be optimized effectively and so will not be cached. StreamBase Support can help interpret these results. If the sbprofile application does not also indicate that the affected operators are bottlenecks in your app, then this message is typically not a concern.
Monitor CPU use to determine how much CPU the application normally uses when it is keeping up with the data, and compare that to its CPU use when it is falling behind, to estimate how much additional CPU should be provided. On Windows, use perfmon.msc; on Linux, use the sysstat utilities.
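In addition to the operating-system tools, the JVM can report its own CPU load through the com.sun.management extension of OperatingSystemMXBean (present in HotSpot JVMs since Java 7). A minimal sampler sketch, with an illustrative five-second interval:

import java.lang.management.ManagementFactory;
import com.sun.management.OperatingSystemMXBean;

public class CpuSampler {
    public static void main(String[] args) throws InterruptedException {
        OperatingSystemMXBean os = (OperatingSystemMXBean)
                ManagementFactory.getOperatingSystemMXBean();
        while (true) {
            // Values are fractions in [0,1]; negative means "not yet available".
            System.out.printf("process CPU: %.1f%%  system CPU: %.1f%%%n",
                    os.getProcessCpuLoad() * 100,
                    os.getSystemCpuLoad() * 100);
            Thread.sleep(5_000);
        }
    }
}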
Also watch the application using sbmonitor to see how much backlog is present and what the tuples-per-second throughput is when you see the application slowing down and catching up.
You might think that changing -XX:CompileThreshold=10000 to something small (say "1") will cause the application to be optimized much earlier, avoiding later slowdowns. CompileThreshold sets the number of times a method must be called to trigger compilation to machine code. A setting of "1" is bad because every method would then be compiled with no profiling data, and the first time a rarely used code path is hit during later processing, it would trigger the JIT compiler instead of simply being interpreted. The lower the threshold, the earlier (and the more) code in the application is compiled, but the generated code will be of poorer quality, since the compiler has much less profiling data with which to choose optimization strategies (the optimization benefit tends to level off at 10,000 calls, which is why this is the default). Note that in recent (Java 6 and later) HotSpot implementations, profiling continues after the initial compilation and further improves performance over the life of the JVM, so a lower initial setting (for example, -XX:CompileThreshold=3000) is reasonable for shorter priming periods on low-throughput code paths.
Note that in a simple application with one stream that is exercised continuously, JIT compilation will tend to occur at about the same time for all the operators that make up that path, regardless of the value CompileThreshold is set to. Whenever compilation triggers, it is best to have extra CPU available so that the compilation thread does not impact normal data processing.
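To confirm when these bursts of compilation occur, the standard CompilationMXBean reports the JVM's cumulative JIT compilation time; a spike in the delta between samples marks a burst. A minimal sketch (the five-second interval is an illustrative choice):

import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

public class JitActivityWatcher {
    public static void main(String[] args) throws InterruptedException {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit == null || !jit.isCompilationTimeMonitoringSupported()) {
            System.out.println("compilation time monitoring not supported");
            return;
        }
        long last = jit.getTotalCompilationTime(); // milliseconds, cumulative
        while (true) {
            Thread.sleep(5_000);
            long now = jit.getTotalCompilationTime();
            System.out.printf("JIT compilation time in last 5s: %d ms%n", now - last);
            last = now;
        }
    }
}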
To add CPU for a StreamBase app, run the application on a system with more and/or faster CPU cores. Also monitor CPU use by core: if one or more cores reach 100% while others sit idle, investigate whether making the busiest operators concurrent in your design lets the workload be distributed better without harming your algorithm. Note that when you make an operator concurrent, you abandon deterministic tuple ordering; one tuple can pass another if it takes a different stream when one of the streams uses concurrency. The sketch below illustrates the effect.
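A minimal illustration in plain Java (not StreamBase; the pool size and simulated processing times are illustrative) of why concurrent processing gives up deterministic ordering: tuples submitted in order 0 through 9 are emitted in whatever order their variable-length work happens to finish:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class OrderingDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4); // 4 concurrent "operators"
        for (int i = 0; i < 10; i++) {
            final int tuple = i;
            pool.submit(() -> {
                try { // simulate variable per-tuple processing time
                    Thread.sleep(ThreadLocalRandom.current().nextInt(50));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                System.out.println("emitted tuple " + tuple); // order varies run to run
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}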