List of additional parameters when Kerberos and High Availability are enabled on the Hadoop cluster

Article ID: KB0077664

Products: Spotfire Data Science
Versions: All supported versions

Description

What additional parameters are required when Kerberos and High Availability are enabled on the Hadoop cluster and you want to connect that data source to TIBCO Spotfire Data Science?

Issue/Introduction

This article lists the additional parameters required when Kerberos and High Availability are enabled on the Hadoop cluster.

Environment

Linux

Resolution

Here is the list of additional parameters that need to be configured when you add a data source to TIBCO Spotfire Data Science.

Kerberos related:
  1. alpine.principal=alpine/chorus.alpinenow.local@ALPINENOW.LOCAL

  2. alpine.keytab=/home/chorus/keytab/alpine.keytab

  3. dfs.datanode.kerberos.principal=hdfs/_HOST@TDS.LOCAL 

  4. dfs.namenode.kerberos.principal=hdfs/_HOST@TDS.LOCAL

  5. yarn.resourcemanager.principal=yarn/_HOST@TDS.LOCAL (The Kerberos principal for the resource manager.)

  6. mapreduce.jobhistory.principal=mapred/_HOST@TDS.LOCAL

Note: _HOST is a placeholder that Hadoop replaces at runtime with the fully qualified hostname of the node running the service, so the same principal pattern works on every host.
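Before adding the data source, you can verify that the keytab and principal work together by obtaining a ticket from the command line of the Spotfire Data Science server (a quick sanity check using the example keytab path and principal above; substitute your own values):

  kinit -kt /home/chorus/keytab/alpine.keytab alpine/chorus.alpinenow.local@ALPINENOW.LOCAL
  klist

klist should show a valid ticket-granting ticket for the ALPINENOW.LOCAL realm.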

Protections:  

  1. spark.hadoop.hadoop.rpc.protection=privacy

  2. hadoop.security.authentication=kerberos (only when Kerberos is enabled)

When Data in Transit Encryption is enabled on the CDH cluster:

  1. hadoop.rpc.protection=privacy (the default value is authentication)

  2. dfs.data.transfer.protection=privacy 
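You can confirm the values the cluster actually uses with hdfs getconf (a quick check; run it on a node where the Hadoop client configuration is deployed):

  hdfs getconf -confKey hadoop.rpc.protection
  hdfs getconf -confKey dfs.data.transfer.protection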

Yarn Parameters: 

  You can get the following parameters from yarn-site.xml on the Hadoop server.  
  1. yarn.app.mapreduce.am.staging-dir=/tmp

  2. yarn.resourcemanager.admin.address=cdh516dare.tds.local:8033 (The address of the ResourceManager admin interface.)

  3. yarn.resourcemanager.resource-tracker.address=cdh516dare.tds.local:8031 (The address of the ResourceManager's resource tracker interface.)

  4. yarn.resourcemanager.scheduler.address=cdh516dare.tds.local:8030 (The address of the scheduler interface.)

  5. yarn.resourcemanager.webapp.address=cdh516dare.tds.local:8088 (The HTTP address of the ResourceManager web application.)

  6. yarn.resourcemanager.webapp.https.address=cdh516dare.tds.local:8090 (The HTTPS address of the ResourceManager web application.)

  7. yarn.application.classpath= (Obtain this value by running the yarn classpath command on the CDH server, as shown below.)
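Each of these values maps to a <property> element in yarn-site.xml, so you can look one up directly on the CDH server (a sketch; /etc/hadoop/conf is the usual CDH client configuration directory, but your path may differ):

  grep -A1 'yarn.resourcemanager.scheduler.address' /etc/hadoop/conf/yarn-site.xml
  yarn classpath

The yarn classpath command prints the colon-separated list of paths to paste into yarn.application.classpath.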

High availability:  

You can get the following parameters from hdfs-site.xml on the Hadoop server. The dfs.nameservices parameter assigns a logical name to the HDFS nameservice.

  • dfs.nameservices=nameservice1

dfs.ha.namenodes.[nameservice ID] - unique identifiers for each NameNode in the nameservice. Configure a comma-separated list of NameNode IDs; DataNodes use it to determine all the NameNodes in the cluster. For example, with nameservice1 as the nameservice ID and namenode64 and namenode72 as the individual NameNode IDs, you would configure this as follows:

  • dfs.ha.namenodes.nameservice1=namenode64,namenode72

Nodes communicate using the RPC protocol, and the following parameters establish that communication.

For both of the previously configured NameNode IDs, set the full address and RPC port of the NameNode process.
  • dfs.namenode.rpc-address.nameservice1.namenode64=nn1.alpinenow.local:8020

  • dfs.namenode.rpc-address.nameservice1.namenode72=nn2.alpinenow.local:8020

dfs.client.failover.proxy.provider.[nameservice ID] - the Java class that HDFS clients use to contact the Active NameNode. Configure the name of the Java class which the DFS client will use to determine which NameNode is the current active, and therefore which NameNode is currently serving client requests. The only implementation which currently ships with Hadoop is the ConfiguredFailoverProxyProvider, so use this unless you are using a custom one.

  • dfs.client.failover.proxy.provider.nameservice1=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

dfs.namenode.http-address.[nameservice ID].[name node ID] - the fully-qualified HTTP address for each NameNode to listen on. Similarly to rpc-address above, set the addresses for both NameNodes' HTTP servers to listen on. 

  • dfs.namenode.http-address.nameservice1.namenode64=nn1.alpinenow.local:50070

  • dfs.namenode.http-address.nameservice1.namenode72=nn2.alpinenow.local:50070

If HTTPS is enabled on the CDH cluster:

  • dfs.namenode.https-address.nameservice1.namenode64=nn1.alpinenow.local:50470

  • dfs.namenode.https-address.nameservice1.namenode72=nn2.alpinenow.local:50470

  • dfs.namenode.servicerpc-address.nameservice1.namenode64=nn1.alpinenow.local:8022

  • dfs.namenode.servicerpc-address.nameservice1.namenode72=nn2.alpinenow.local:8022 

  • dfs.ha.automatic-failover.enabled.nameservice1=true 
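Once these parameters are in place, you can confirm which NameNode is currently active from the CDH command line (assuming the NameNode IDs configured above):

  hdfs haadmin -getServiceState namenode64
  hdfs haadmin -getServiceState namenode72

Each command prints either active or standby.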

If Resource Manager is configured for High Availability:  

You can get the following parameters from yarn-site.xml on the Hadoop server. 

  1. yarn.resourcemanager.ha.rm-ids=rm60,rm70

  2. yarn.resourcemanager.webapp.https.address.rm70=nn2.alpinenow.local:8090

  3. yarn.resourcemanager.webapp.address.rm70=nn2.alpinenow.local:8088

  4. yarn.resourcemanager.admin.address.rm70=nn2.alpinenow.local:8033

  5. yarn.resourcemanager.resource-tracker.address.rm70=nn2.alpinenow.local:8031

  6. yarn.resourcemanager.scheduler.address.rm70=nn2.alpinenow.local:8030

  7. yarn.resourcemanager.address.rm70=nn2.alpinenow.local:8032

  8. yarn.resourcemanager.webapp.https.address.rm60=nn1.alpinenow.local:8090

  9. yarn.resourcemanager.webapp.address.rm60=nn1.alpinenow.local:8088

  10. yarn.resourcemanager.admin.address.rm60=nn1.alpinenow.local:8033

  11. yarn.resourcemanager.resource-tracker.address.rm60=nn1.alpinenow.local:8031

  12. yarn.resourcemanager.scheduler.address.rm60=nn1.alpinenow.local:8030

  13. yarn.resourcemanager.address.rm60=nn1.alpinenow.local:8032

  14. yarn.resourcemanager.zk-address=cm.alpinenow.local:2181,nn1.alpinenow.local:2181,nn2.alpinenow.local:2181

  15. yarn.resourcemanager.recovery.enabled=true

  16. yarn.resourcemanager.ha.automatic-failover.embedded=true

  17. yarn.resourcemanager.ha.automatic-failover.enabled=true

  18. yarn.resourcemanager.ha.enabled=true

  19. failover_resource_manager_hosts=cdh516node1.tds.local,cdh516node2.tds.local
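As with the NameNodes, you can check which ResourceManager is currently active (assuming the rm IDs configured above):

  yarn rmadmin -getServiceState rm60
  yarn rmadmin -getServiceState rm70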

Mapreduce: 

You can get the following parameters from mapred-site.xml on the Hadoop server.

  1. mapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.MapTask$MapOutputBuffer (The MapOutputCollector implementation(s) to use. This may be a comma-separated list of class names; the map task tries to initialize each collector in turn and uses the first one that initializes successfully.)

  2. mapreduce.job.reduce.shuffle.consumer.plugin.class=org.apache.hadoop.mapreduce.task.reduce.Shuffle (Name of the class whose instance is used to send shuffle requests by the reduce tasks of this job. The class must implement org.apache.hadoop.mapred.ShuffleConsumerPlugin.)

  3. mapreduce.jobhistory.address=cdh6dite.tds.local:10020 (MapReduce JobHistory Server IPC host:port)

  4. mapreduce.jobhistory.webapp.address=cdh6dite.tds.local:19888 (MapReduce JobHistory Server web UI host:port)

  5. mapreduce.application.classpath= (Obtain this value by running the hadoop classpath command on the CDH server, as shown below.)
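The classpath value is simply the output of the hadoop classpath command, pasted verbatim (run it on the CDH server whose configuration you are copying):

  hadoop classpath

The command prints a colon-separated list of configuration directories and jar paths.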

If using Hive: 

  1. hive.metastore.client.connect.retry.delay=1

  2. hive.metastore.client.socket.timeout=600

Hive with Kerberos:  

  1. hive.hiveserver2.uris=jdbc:hive2://cm.alpinenow.local:10000/default

  2. hive.metastore.kerberos.principal=hive/_HOST@ALPINENOW.LOCAL (ALPINENOW.LOCAL is the Kerberos realm name.)

  3. hive.server2.authentication.kerberos.principal=hive/_HOST@ALPINENOW.LOCAL
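To test the Kerberized HiveServer2 endpoint outside of Spotfire Data Science, you can connect with beeline using the same URI and principal (a sketch, assuming you already hold a valid Kerberos ticket from kinit; the Hive JDBC driver replaces _HOST with the server hostname from the URL):

  beeline -u "jdbc:hive2://cm.alpinenow.local:10000/default;principal=hive/_HOST@ALPINENOW.LOCAL"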

Spark History Server:  

The following parameters need to be added when the Spark service is running on the Hadoop cluster.

  1. spark.yarn.historyServer.address=http://172.27.0.3:18088

  2. spark.eventLog.dir=hdfs://172.27.0.3:8020/user/spark/applicationHistory

  3. spark.eventLog.enabled=true
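To verify the history server address, you can query its REST API (assuming the address above; the endpoint is standard for the Spark History Server):

  curl http://172.27.0.3:18088/api/v1/applications

This returns a JSON list of the applications whose event logs were written to spark.eventLog.dir.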