Managing the Chorus HDFS cache

Products	Versions
Spotfire Data Science	6.x

Description

Resolution

When running a workflow on Hadoop, Chorus caches interim results to speed up interactive workflow creation. For instance, after adding or modifying an operator, there is no need to wait for the entire upstream flow (the computational DAG) to be recomputed. Instead, when the user runs the modified flow, Chorus parses the computational DAG to determine which operators have changed (and need to be recomputed) and restarts computation at a valid checkpoint that is as close to the first modified operator as possible. For example, in a DAG containing 20 operators, if the user modifies the last operator in the chain, Chorus does not need to rerun the upstream DAG of 19 operators; it uses its HDFS cache to restart the computation as close to the last operator as possible. This caching can deliver significant performance improvements.
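The restart logic described above can be sketched for the simple case of a linear chain of operators. This is only an illustration of the idea, not Chorus code: the function name `restart_index`, the operator names, and the set of cached checkpoints are all hypothetical.

```python
from typing import Sequence, Set


def restart_index(operators: Sequence[str],
                  modified: Set[str],
                  checkpointed: Set[str]) -> int:
    """Return the index of the first operator that must be recomputed.

    operators    -- operator names in execution order (a linear chain)
    modified     -- operators changed since the last run
    checkpointed -- operators whose output is still available in the cache
    """
    # Find the first modified operator; everything upstream of it is unchanged.
    first_modified = next(
        (i for i, op in enumerate(operators) if op in modified),
        len(operators),
    )
    if first_modified == len(operators):
        return len(operators)  # nothing changed, nothing to recompute

    # Walk upstream to the nearest cached output and restart just after it.
    for i in range(first_modified - 1, -1, -1):
        if operators[i] in checkpointed:
            return i + 1
    return 0  # no usable checkpoint: recompute from the start


ops = [f"op{i}" for i in range(1, 21)]  # a 20-operator chain
start = restart_index(ops, modified={"op20"}, checkpointed={"op5", "op19"})
print(f"recompute {len(ops) - start} operator(s), starting at {ops[start]}")
```

In this sketch, modifying only the last operator means a single operator is recomputed, provided the output of the operator before it is still cached. If that checkpoint were missing, the restart point would fall back to the next cached output further upstream.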

This caching comes at a cost: retaining the checkpoints consumes HDFS storage. This KB article describes simple configuration options that significantly reduce the HDFS footprint of these checkpoints, so that users can benefit from Chorus caching while being judicious with their HDFS resources.

Chorus Cache structure

The Chorus cache can be viewed as being composed of two components:

alpine_out: this is the user-visible component of the cache and should be used for results that users want to retain and potentially export. This component is used when the user has configured an operator with "Store Results" set to true. The user is then free to select the location in which to persist the checkpoint, and can decide whether a checkpoint is generated for every operator or only at strategic points in the computational DAG.
alpine_runtime: this is the Alpine-managed component of the cache, where Alpine automatically determines when to generate and persist workflow checkpoints.

View cache contents

For each workflow, the user-visible checkpoints can be viewed from the action menu of the visual workflow editor. Select "Clear Temporary Data" from this dropdown. The resulting popup window displays the current checkpoints for the workflow and allows the user to selectively delete unwanted checkpoints.

See Attached_Screenshot_1.

Reducing the cache size

There are a couple of simple ways to reduce the HDFS overhead of maintaining the Chorus cache. They can be configured at the data source level, or even at the workflow level:

Control HDFS cache replication: HDFS typically retains multiple replicas of each file (the stock default replication factor is 3), using redundancy to provide resilience to disk failure. Because these results are temporary cache files, the default HDFS replication strategy is frequently excessive. The replication of the cache files generated by Chorus can be configured at the data source level by adding the following key-value pair to the connection parameters:

dfs.replication=1
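To get a feel for the saving, the raw HDFS space consumed by a file is its logical size multiplied by its replication factor. The short sketch below works through that arithmetic; the 50 GB checkpoint volume is a made-up figure, and the factor of 3 assumes the cluster uses the stock HDFS default.

```python
def raw_footprint_gb(logical_gb: float, replication: int) -> float:
    """Raw HDFS space consumed = logical file size x replication factor."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return logical_gb * replication


checkpoints_gb = 50.0  # hypothetical total size of one workflow's checkpoints
for factor in (3, 1):
    raw = raw_footprint_gb(checkpoints_gb, factor)
    print(f"dfs.replication={factor}: {raw:.0f} GB of raw HDFS space")
```

Under these assumptions, dropping the replication factor from 3 to 1 cuts the raw space used by the cache to a third.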

Compress checkpoint files: By default, checkpoints are stored uncompressed to maximize performance. However, it is also possible to configure Chorus to compress its checkpoints. This can be configured per data source, per workflow, or per operator.

Per data source: Checkpoint compression can be enabled for the Chorus ETL operators by adding the following key-value pairs to the data source configuration:

mapreduce.output.fileoutputformat.compress=true
mapreduce.output.fileoutputformat.compress.type=BLOCK
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
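As a rough local illustration of what gzip compression can do to tabular data, the sketch below compresses some synthetic CSV rows with Python's standard gzip module, which writes the same gzip format that GzipCodec produces. It does not reproduce how Chorus or MapReduce writes checkpoints, the rows are invented, and the ratio achieved on real checkpoints will depend on the data.

```python
import gzip

# Hypothetical checkpoint contents: 10,000 CSV rows with repetitive values.
rows = [
    f"{i},customer_{i % 100},segment_{i % 5},{i % 7 * 1.5:.2f}\n"
    for i in range(10_000)
]
raw = "".join(rows).encode("utf-8")

# Compress in memory and compare sizes.
compressed = gzip.compress(raw)
ratio = len(raw) / len(compressed)
print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes, ratio: {ratio:.1f}x")
```

The trade-off is the one noted above: compressed checkpoints take less space but cost CPU time to write and read.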

Per workflow/operator: The same configuration options can be set in the workflow variables rather than at the data source level. This allows compression to be enabled (or disabled, overriding the data source configuration) for a specific workflow, or for a specific operator or class of operators in the flow. Details about how to apply permissions to the workflow variables are explained here and here.
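The override behaviour can be pictured as a layered lookup in which workflow variables take precedence over the data source configuration. The sketch below models only that precedence with Python's ChainMap. How Chorus merges the two layers internally is not described in this article, so the dictionaries and the variable names `data_source_config` and `workflow_variables` are assumptions made for illustration.

```python
from collections import ChainMap

# Hypothetical data source configuration: compression enabled for all workflows.
data_source_config = {
    "dfs.replication": "1",
    "mapreduce.output.fileoutputformat.compress": "true",
    "mapreduce.output.fileoutputformat.compress.type": "BLOCK",
    "mapreduce.output.fileoutputformat.compress.codec":
        "org.apache.hadoop.io.compress.GzipCodec",
}

# Hypothetical workflow variables: this workflow turns compression off.
workflow_variables = {
    "mapreduce.output.fileoutputformat.compress": "false",
}

# Earlier maps win, so workflow variables shadow the data source settings.
effective = ChainMap(workflow_variables, data_source_config)

for key in sorted(effective):
    print(f"{key}={effective[key]}")
```

Keys that the workflow does not set, such as dfs.replication here, fall through to the data source value.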

Using explicit checkpoints

It is also possible to create explicit checkpoints using the Chorus convert operator. This operator generates a compressed Parquet checkpoint that can be used by downstream operators and can be maintained independently of the Alpine runtime checkpoints. This provides a cost-effective way to place checkpoints at strategic points in the DAG, e.g. after feature engineering is completed and before model training. A simple example of this approach is illustrated below.

See Attached_Screenshot_2
