| Products | Versions |
| --- | --- |
| Spotfire Data Science | 6.x |
When running a workflow on Hadoop, Chorus caches interim results to optimize interactive workflow creation. For instance, when an operator is added or modified, there is no need to wait for the entire upstream flow (the computational DAG) to be recomputed. Instead, when the user runs the modified flow, Chorus parses the computational DAG to determine which operators have changed (and must be recomputed) and restarts computation from a valid checkpoint as close to the first modified operator as possible. For example, in a DAG containing 20 operators, if the user modifies the last operator in the chain, Chorus does not need to rerun the entire upstream DAG of 19 operators; it leverages its HDFS cache to restart computation as close to the last operator as possible. This caching delivers significant performance improvements.
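The checkpoint-restart logic described above can be sketched as follows. This is an illustrative simplification (a linear chain rather than a general DAG), and the function name is hypothetical, not Chorus's actual implementation:

```python
def operators_to_rerun(chain, modified, checkpointed):
    """Return the suffix of a linear operator chain that must be recomputed.

    chain:        operators in upstream-to-downstream order
    modified:     set of operators the user has changed
    checkpointed: set of operators with a valid cached result in HDFS
    """
    # Everything from the first modified operator onward is stale.
    first_modified = min(chain.index(op) for op in modified)
    # Walk upstream from the first change to find the nearest valid checkpoint;
    # computation can resume immediately after it.
    restart = 0  # no checkpoint found -> rerun the whole chain
    for i in range(first_modified - 1, -1, -1):
        if chain[i] in checkpointed:
            restart = i + 1
            break
    return chain[restart:]

# A 20-operator chain where only the last operator changed: with a cached
# checkpoint after operator 19, only operator 20 needs to rerun.
chain = [f"op{i}" for i in range(1, 21)]
print(operators_to_rerun(chain, modified={"op20"}, checkpointed={"op19"}))
# -> ['op20']
```

With no valid checkpoints at all, the same call would return the full chain, i.e. a complete recomputation.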
While caching delivers these performance gains, retaining the checkpoints carries an HDFS storage overhead. This article describes simple configuration options that significantly reduce the HDFS footprint of these checkpoints, allowing users to benefit from Chorus caching while being judicious with their HDFS resources.
The Chorus cache can be viewed as comprising two components: checkpoints maintained internally by the Alpine runtime, and user-visible checkpoints that can be managed from the workflow editor.
For each workflow, the user-visible checkpoints can be viewed via the visual workflow editor's action menu: from this dropdown, select "Clear Temporary Data". The resulting popup window displays the current checkpoints for the workflow and allows the user to selectively delete unwanted ones.
See Attached_Screenshot_1.
Reducing the cache size
There are a couple of simple ways to reduce the HDFS overhead of maintaining the Chorus cache, and both can be configured per data source or even per workflow:

- Reduce HDFS replication for cached checkpoint data (checkpoints can be regenerated by rerunning the flow, so a single replica is typically acceptable): `dfs.replication=1`
- Enable compressed output for checkpoint files: `mapreduce.output.fileoutputformat.compress=true`, `mapreduce.output.fileoutputformat.compress.type=BLOCK`, `mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec`
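A rough back-of-the-envelope calculation shows why these two settings matter. The 4x gzip compression ratio below is an assumption for illustration; actual ratios depend heavily on the data:

```python
def estimated_hdfs_footprint(raw_bytes, replication=3, compression_ratio=1.0):
    """Rough HDFS usage for a checkpoint: compressed size times replica count."""
    return raw_bytes / compression_ratio * replication

ten_gib = 10 * 2**30
default = estimated_hdfs_footprint(ten_gib)  # default 3x replication, uncompressed
tuned = estimated_hdfs_footprint(ten_gib, replication=1, compression_ratio=4)
print(default / tuned)  # -> 12.0, i.e. a ~12x smaller footprint under these assumptions
```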
It's also possible to create explicit checkpoints using the Chorus convert operator, which generates a compressed Parquet checkpoint that can be used by downstream operators and maintained independently of the Alpine runtime checkpoints. This provides a cost-effective way to place checkpoints at strategic points in the DAG, e.g. after feature engineering is completed and before model training. A simple example of this approach is illustrated below.
See Attached_Screenshot_2.