For this discussion we'll define a CPU as a processing unit which can run one thread at a time. For example, a system with a single CPU chip may have 4 CPU cores with hyper-threading, so 8 concurrent running threads are possible. We'll say this system has 8 CPUs.
The number of CPUs which a TIBCO Streaming node uses is based on design choices made by the programmer.
Contents:
- TIBCO Streaming Thread Architecture
- Using Concurrency and Multiplicity
- Troubleshooting Over-Subscribed Applications
- ParallelSequence Queues
- Operator CPU Time
TIBCO Streaming Thread Architecture
A node may consist of one or more engines, each of which is a separate Java Virtual Machine (JVM). Most of the many threads that make up an engine are lightweight, run briefly, and are otherwise idle. A StreamBase or LiveView application may consist of hundreds of mostly idle threads waiting for input data to trigger activity. In contrast, the main processing thread, which is responsible for all data flow into and out of an application, is often very busy, as are any input adapter threads, since these maintain live streaming data connections to external services.
At a minimum, expect to need:
- 1 CPU remaining free for Operating System tasks
- 1 CPU for the StreamBase main thread
- 1 CPU for each very busy adapter
For the simplest case, assume 3 CPUs worth of load when operating at capacity.
For each active LiveView table, the additional CPU demand is:
- 1 CPU for the main table thread (snapshot-parallelism = 1, snapshot-concurrency=0)
or
- N CPUs equal to snapshot-parallelism * (snapshot-concurrency + 1)
This is described in the product documentation here:
TIBCO Streaming > LiveView Admin Guide > Advanced Tasks in LiveView, "Using Parallelism and Concurrency"

LiveView applications typically have a StreamBase data source or publisher, which adds a StreamBase main thread and an adapter thread.
For a minimal one-table LiveView instance with the default snapshot-parallelism=1 and snapshot-concurrency=0, plus its StreamBase data source, the machine needs 4 CPUs to operate at maximum load. If the system provides fewer CPUs, some operations will block waiting for others to complete, and overall throughput will be lower than the application could otherwise deliver. Of course, if the application does not need to operate at capacity, the moment-by-moment demand may be lower.
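The sizing rules above can be sketched as a small calculation. This is a minimal illustration of the arithmetic, not a product API; the function and parameter names are invented for the example.

```python
# Sketch of the CPU sizing rules described above (illustrative names only).

def table_cpus(snapshot_parallelism=1, snapshot_concurrency=0):
    """CPUs for one LiveView table: snapshot-parallelism * (snapshot-concurrency + 1)."""
    return snapshot_parallelism * (snapshot_concurrency + 1)

def minimum_cpus(tables, busy_adapters=1):
    """1 CPU reserved for the OS, 1 for the StreamBase main thread,
    1 per very busy adapter, plus the per-table totals."""
    return 1 + 1 + busy_adapters + sum(table_cpus(p, c) for p, c in tables)

# Minimal one-table instance with defaults: 1 (OS) + 1 (main) + 1 (adapter) + 1 (table)
print(minimum_cpus(tables=[(1, 0)]))  # 4

# One table with snapshot-parallelism=2, snapshot-concurrency=1 contributes 2*(1+1)=4
print(minimum_cpus(tables=[(2, 1)]))  # 7
```

The same formula extends to multiple tables by summing each table's contribution.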
Using Concurrency and Multiplicity
The maximum demand for CPU time increases if the application designer has explicitly added Concurrency (sections of EventFlow that each run on their own thread), Multiplicity (sections of EventFlow that are instantiated more than once to split the load across multiple threads and avoid blocking), snapshot-parallelism greater than one, or snapshot-concurrency greater than zero.
Concurrency and Multiplicity add:
- 1 CPU for each additional concurrent region
- 1 CPU for each additional multiple above "1"
Adapter libraries may also introduce additional threads that are not managed by the platform runtime and are independent of the EventFlow concurrency settings; this varies by adapter and library.
The reason to add Concurrency and Multiplicity is to split the work across more than the default single thread, so that the application always has processing capacity to keep up with input data rates and avoid queuing (which adds latency). This is only beneficial while the number of busy threads remains no greater than the number of available CPUs. If the number of threads that could be busy ever exceeds the number of available CPUs, blocking will occur, which leads to queuing or back-pressure on the provider of the data.
If, when running under business loads, there are more potentially busy threads than available CPUs, counting the Streaming demand as well as other processes running on the same system, then that system is said to be over-subscribed. A briefly over-subscribed system needs data rates to slow down for a period so it can catch up. A continuously over-subscribed system accumulates latency throughout the over-subscribed period and may show secondary effects such as:
- memory exhaustion due to queuing
- disconnections from external services due to timeouts
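A rough over-subscription check follows directly from the thread-counting rules above. This is a sketch under the assumption that you can enumerate the potentially busy threads of a deployment yourself; the function names are illustrative, not part of any TIBCO API.

```python
# Sketch: compare potentially busy threads to available CPUs (illustrative only).
import os

def potential_busy_threads(busy_adapters, concurrent_regions, multiplicities):
    """Main thread + busy adapters + one thread per concurrent region,
    where a region with multiplicity N contributes N threads."""
    return 1 + busy_adapters + sum(
        multiplicities.get(region, 1) for region in concurrent_regions
    )

def is_oversubscribed(busy_threads, reserved_for_os=1):
    """True when busy threads exceed the CPUs left after the OS reservation."""
    cpus = os.cpu_count() or 1
    return busy_threads > cpus - reserved_for_os

# One busy adapter, two concurrent regions, one of which has multiplicity 2:
print(potential_busy_threads(1, ["regionA", "regionB"], {"regionB": 2}))  # 5
```

On a machine with fewer CPUs than this count (plus the OS reservation), expect queuing or back-pressure during sustained load.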
Troubleshooting Over-Subscribed Applications
General understanding of the 'sbprofile' log output is described in the product documentation here:
TIBCO Streaming > StreamBase Admin Guide > Monitoring and Profiling Applications > Profiling
ParallelSequence Queues
A ParallelSequence queue represents the point where a tuple leaves one thread (or concurrent region) and enters the next. The queue preserves the order in which tuples arrive from any input into the region and the order in which they exit the region (the same queue is used for all inputs and all outputs).
Every group of one or more operators in the same region has a ParallelSequence queue which handles all tuple input and output for that thread. The 'sbprofile' log reports the ParallelSequence queues on lines that begin with "Q," and have a name ending in "ParallelSequence".
For example:
Q, geoMapTable.QueryInRef1:ParallelSequence, 0, 0, 2019-11-12 09:45:50.749
where the columns "0, 0" are the "Queue Max Size" and "Queue Current Size".
In the profile log created by 'sbprofile', pay attention to the "Queue Max Size" and "Queue Current Size" columns. "Queue Max Size" indicates how backlogged the queue became at some point in the session. "Queue Current Size" shows the progression of queuing (and therefore latency) over time. In a module which is keeping up with input data rates these values will be small, typically one or two digits. A module which at times cannot keep up will show larger numbers in these columns. The log indicates when queuing began and provides second-by-second reports of how it increases or decreases over time. A queue that climbs consistently over the course of the session indicates a module which needs more CPU to keep up, and that can be provided by giving the module Concurrency and Multiplicity values greater than "1".
Operator CPU Time
The 'sbprofile' log output can help identify the specific operators which are taking the most CPU time. These are candidates both for making them concurrent, so the work they do is isolated from and does not block other threads, and for adding multiplicity to split the work between two or more instances of the operator.
For example, from sbprofile output:
O, MyContainer.myModule.myOperator:1, 1, 1, 1.1, 10943, N/A, 2019-11-12 09:45:50.749
This operator (instance ":1" of a multiplicity greater than 1) received one tuple, emitted one tuple, took 1.1% of the time this tuple spent in this region, and took 10.9 milliseconds (10943 microseconds) to do its work.
If the microseconds value is consistently 1000000 (or even greater, due to profiling measurement limitations), then this operator used all of one CPU for that second and did not share, so other operators in this region were blocked.
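The saturation check described above can be sketched in the same style as the queue parser. This assumes the column order shown in the "O," example (name, tuples in, tuples out, percent, microseconds); the helper names are invented for illustration.

```python
# Sketch: flag CPU-saturated operators in sbprofile "O," lines, assuming the
# column order shown in the example above.

def operator_cpu_usec(line):
    """Return (operator_name, cpu_microseconds) for an "O," profile line."""
    fields = [f.strip() for f in line.split(",")]
    if fields[0] != "O":
        raise ValueError("not an operator line")
    return fields[1], int(fields[5])

def is_saturated(usec, interval_usec=1_000_000):
    """An operator reporting ~1,000,000 us in a one-second interval used a
    whole CPU that second and blocked the rest of its region."""
    return usec >= interval_usec

name, usec = operator_cpu_usec(
    "O, MyContainer.myModule.myOperator:1, 1, 1, 1.1, 10943, N/A, 2019-11-12 09:45:50.749"
)
print(name, usec, is_saturated(usec))
# MyContainer.myModule.myOperator:1 10943 False
```

Operators that come back saturated across many consecutive intervals are the first candidates for Concurrency and Multiplicity.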