| Products | Versions |
|---|---|
| Spotfire Data Science | 6.2+ |
Enable Spark Dynamic Allocation on CDH 5.9
Applicable to Spotfire Data Science version 6.2 and later.
This article describes the changes to make in Cloudera Manager to get Spark Dynamic Allocation working with Spotfire Data Science.
TO ENABLE DYNAMIC ALLOCATION ON CDH:
These are the changes that need to be made to enable Spark Dynamic Allocation on a CDH cluster, as described by spark.apache.org.
Configuration and Setup
In YARN mode, ensure the shuffle service is available on each NodeManager in yarn-site.xml as follows:
In Cloudera Manager, search for:
YARN Service Advanced Configuration Snippet (Safety Valve) for yarn-site.xml
Add the following properties:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>spark_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
(see the attached screenshot)
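If the NodeManagers already define auxiliary services (mapreduce_shuffle is the usual YARN default), append spark_shuffle to the existing comma-separated list rather than replacing it. The merged yarn-site.xml entries would then look like this sketch:

```xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
```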
In a terminal on a CDH node, locate the YARN shuffle jar and copy its file path.
You can use find to search for *yarn-shuffle.jar; quote the pattern so the shell does not expand it:
# find / -name "*yarn-shuffle.jar"
The following is an example file path; it may be different on your cluster.
/opt/cloudera/parcels/CDH-5.9.1-1.cdh5.9.1.p0.4/lib/hadoop-yarn/lib/spark-1.6.0-cdh5.9.1-yarn-shuffle.jar
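As a quick sanity check, the lookup can be sketched as below. The parcel path is only an example and varies by CDH version; the optional jar tf step confirms the shuffle service class is actually packaged in the jar:

```shell
# Quote the glob so the shell passes it to find unexpanded
find / -name "*yarn-shuffle.jar" 2>/dev/null

# Optional: confirm the jar contains the shuffle service class
# (substitute the path reported by find on your cluster)
jar tf /opt/cloudera/parcels/CDH-5.9.1-1.cdh5.9.1.p0.4/lib/hadoop-yarn/lib/spark-1.6.0-cdh5.9.1-yarn-shuffle.jar \
  | grep YarnShuffleService
```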
In Cloudera Manager, search for:
yarn_application_classpath
Add the full path of the yarn-shuffle jar located above.
(see the attached screenshot)
Save the changes and restart the YARN service so the NodeManagers pick up the new configuration.
Link for reference:
https://spark.apache.org/docs/1.6.1/job-scheduling.html#configuration-and-setup
For Spotfire Data Science, you can place these parameters in the Data Source under Additional Parameters:
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.maxExecutors=10
spark.dynamicAllocation.minExecutors=2
spark.dynamicAllocation.initialExecutors=2
spark.dynamicAllocation.executorIdleTimeout=60
(see the attached screenshot)
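Outside of Spotfire Data Science, the same settings can be sketched as --conf flags on a spark-submit test run. The application class and jar below are placeholders; note that Spark also requires the external shuffle service (spark.shuffle.service.enabled=true) whenever dynamic allocation is on:

```shell
# Placeholder application class and jar; substitute your own.
spark-submit \
  --master yarn \
  --class com.example.MyApp \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.initialExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60 \
  my-app.jar
```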
***This can also be done on a per-operator basis.***
Spark auto-tuning can detect whether dynamic allocation is enabled on the cluster and, if so, uses it to choose the maximum number of executors (spark.dynamic.allocation.max.executors and spark.dynamic.allocation.enabled).