Enable Spark Dynamic Allocation on CDH 5.9


Article ID: KB0082624


Products: Spotfire Data Science
Versions: 6.2+

Description

Enable Spark Dynamic Allocation on CDH 5.9

Resolution

Applicable to Spotfire Data Science version 6.2 and later.

This article describes the Cloudera Manager changes required to get Spark Dynamic Allocation working with Spotfire Data Science.

TO ENABLE DYNAMIC ALLOCATION ON CDH:
These are the changes that need to be made in order to enable Spark Dynamic Allocation on a CDH cluster, as described by spark.apache.org.

Configuration and Setup

In YARN mode, the Spark shuffle service must be available on each NodeManager; enable it in yarn-site.xml as follows:

In Cloudera Manager, search for:
YARN Service Advanced Configuration Snippet (Safety Valve) for yarn-site.xml

Add the following properties:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
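
Note that most clusters already register MapReduce's shuffle handler as an auxiliary service. If yarn.nodemanager.aux-services is already set on your cluster, add spark_shuffle to the existing comma-separated list rather than replacing it. A sketch, assuming mapreduce_shuffle is already configured (check your cluster's actual value first):

```xml
<!-- Sketch: preserve any existing aux-services (mapreduce_shuffle assumed
     here) and append spark_shuffle to the comma-separated list. -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
```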

(see the attached screenshot)

In a terminal on a CDH node, locate the YARN shuffle jar and copy its file path.
You can use find to search for *yarn-shuffle.jar (quote the glob so the shell passes it through to find):

    # find / -name "*yarn-shuffle.jar"
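
As a quick illustration of the find pattern, here is the same search run against a scratch directory rather than / (the directory and jar name below are made up for the demo):

```shell
# Demo of locating a yarn-shuffle jar with find.
# The directory and jar file are created here purely for illustration.
mkdir -p /tmp/shuffle-demo/lib
touch /tmp/shuffle-demo/lib/spark-1.6.0-cdh5.9.1-yarn-shuffle.jar

# Quote the glob so the shell does not expand it before find sees it.
find /tmp/shuffle-demo -name "*yarn-shuffle.jar"
# prints /tmp/shuffle-demo/lib/spark-1.6.0-cdh5.9.1-yarn-shuffle.jar
```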

The following is an example file path; it may differ on your cluster:
/opt/cloudera/parcels/CDH-5.9.1-1.cdh5.9.1.p0.4/lib/hadoop-yarn/lib/spark-1.6.0-cdh5.9.1-yarn-shuffle.jar

In Cloudera Manager, search for:
yarn_application_classpath
Append the full path of the yarn-shuffle jar located above.

(see the attached screenshot)

Save the changes and restart the affected services from Cloudera Manager.

Link for reference:
https://spark.apache.org/docs/1.6.1/job-scheduling.html#configuration-and-setup

For Spotfire Data Science, you can place these parameters in the Data Source under Additional Parameters:
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.maxExecutors=10
spark.dynamicAllocation.minExecutors=2
spark.dynamicAllocation.initialExecutors=2
spark.dynamicAllocation.executorIdleTimeout=60
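
Equivalently, the same settings can be applied cluster-wide via Spark's spark-defaults.conf instead of per data source. A sketch (values copied from the data-source example above); note that the Spark job-scheduling documentation linked above also requires spark.shuffle.service.enabled=true on the application side for dynamic allocation to work:

```properties
# Sketch: the same dynamic-allocation settings as cluster-wide defaults
# in spark-defaults.conf (per-application settings still override these).
spark.shuffle.service.enabled                true
spark.dynamicAllocation.enabled              true
spark.dynamicAllocation.minExecutors         2
spark.dynamicAllocation.initialExecutors     2
spark.dynamicAllocation.maxExecutors         10
spark.dynamicAllocation.executorIdleTimeout  60s
```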

(see the attached screenshot)

***This can also be done on a per-operator basis.***

Spark auto-tuning can detect whether dynamic allocation is enabled on the cluster and, if so, uses it to choose the maximum number of executors (via spark.dynamic.allocation.enabled and spark.dynamic.allocation.max.executors).

Attachments

Enable Spark Dynamic Allocation on CDH 5.9