List of additional parameters when Kerberos and High Availability are enabled on the Hadoop cluster

Article ID: KB0077664

Products: Spotfire Data Science
Versions: All supported versions

Description

What additional parameters are required when Kerberos and High Availability are enabled on the Hadoop cluster and you want to connect that data source to TIBCO Spotfire Data Science?

Issue/Introduction

This article lists the additional parameters required when Kerberos and High Availability are enabled on the Hadoop cluster.

Environment

Linux

Resolution

Here is the list of additional parameters that need to be configured when you add a data source to TIBCO Spotfire Data Science.

Kerberos related:
  1. alpine.principal=alpine/chorus.alpinenow.local@ALPINENOW.LOCAL

  2. alpine.keytab=/home/chorus/keytab/alpine.keytab

  3. dfs.datanode.kerberos.principal=hdfs/_HOST@TDS.LOCAL 

  4. dfs.namenode.kerberos.principal=hdfs/_HOST@TDS.LOCAL

  5. yarn.resourcemanager.principal=yarn/_HOST@TDS.LOCAL (The Kerberos principal for the resource manager.)

  6. mapreduce.jobhistory.principal=mapred/_HOST@TDS.LOCAL

Note: _HOST is a placeholder that Hadoop replaces at runtime with the fully qualified hostname of the node running the service, so the same principal pattern works on every host.
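Before adding the data source, you can verify that the keytab and principal work together by obtaining a ticket from the command line of the Spotfire Data Science server (a quick sanity check using the example keytab path and principal above; substitute your own values):

  kinit -kt /home/chorus/keytab/alpine.keytab alpine/chorus.alpinenow.local@ALPINENOW.LOCAL
  klist

klist should show a valid ticket-granting ticket for the ALPINENOW.LOCAL realm.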

Protections:  

  1. spark.hadoop.hadoop.rpc.protection=privacy

  2. hadoop.security.authentication=kerberos (only when Kerberos is enabled)

When Data in Transit Encryption is enabled on the CDH cluster:

  1. hadoop.rpc.protection=privacy (the default value is authentication)

  2. dfs.data.transfer.protection=privacy 
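You can confirm the values the cluster actually uses with hdfs getconf (a quick check; run it on a node where the Hadoop client configuration is deployed):

  hdfs getconf -confKey hadoop.rpc.protection
  hdfs getconf -confKey dfs.data.transfer.protection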

Yarn Parameters: 

  You can get the following parameters from yarn-site.xml on the Hadoop server.  
  1. yarn.app.mapreduce.am.staging-dir=/tmp

  2. yarn.resourcemanager.admin.address=cdh516dare.tds.local:8033 (The address of the ResourceManager admin interface.)

  3. yarn.resourcemanager.resource-tracker.address=cdh516dare.tds.local:8031 (The address of the ResourceManager's resource tracker interface.)

  4. yarn.resourcemanager.scheduler.address=cdh516dare.tds.local:8030 (The address of the scheduler interface.)

  5. yarn.resourcemanager.webapp.address=cdh516dare.tds.local:8088 (The HTTP address of the ResourceManager web application.)

  6. yarn.resourcemanager.webapp.https.address=cdh516dare.tds.local:8090 (The HTTPS address of the ResourceManager web application.)

  7. yarn.application.classpath= (Obtain this value by running the yarn classpath command on the CDH server, as shown below.)
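Each of these values maps to a <property> element in yarn-site.xml, so you can look one up directly on the CDH server (a sketch; /etc/hadoop/conf is the usual CDH client configuration directory, but your path may differ):

  grep -A1 'yarn.resourcemanager.scheduler.address' /etc/hadoop/conf/yarn-site.xml
  yarn classpath

The yarn classpath command prints the colon-separated list of paths to paste into yarn.application.classpath.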

High availability:  

You can get the following parameters from hdfs-site.xml on the Hadoop server. The dfs.nameservices parameter assigns a logical name to the HDFS nameservice.

  • dfs.nameservices=nameservice1

dfs.ha.namenodes.[nameservice ID] - unique identifiers for each NameNode in the nameservice. Configure a comma-separated list of NameNode IDs; DataNodes use it to determine all the NameNodes in the cluster. For example, with nameservice1 as the nameservice ID and namenode64 and namenode72 as the individual NameNode IDs, you would configure this as follows:

  • dfs.ha.namenodes.nameservice1=namenode64,namenode72

Nodes communicate using the RPC protocol, and the following parameters establish that communication.

For both of the previously configured NameNode IDs, set the full address and RPC port of the NameNode process.
  • dfs.namenode.rpc-address.nameservice1.namenode64=nn1.alpinenow.local:8020

  • dfs.namenode.rpc-address.nameservice1.namenode72=nn2.alpinenow.local:8020

dfs.client.failover.proxy.provider.[nameservice ID] - the Java class that HDFS clients use to contact the Active NameNode. Configure the name of the Java class which the DFS client will use to determine which NameNode is the current active, and therefore which NameNode is currently serving client requests. The only implementation which currently ships with Hadoop is the ConfiguredFailoverProxyProvider, so use this unless you are using a custom one.

  • dfs.client.failover.proxy.provider.nameservice1=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

dfs.namenode.http-address.[nameservice ID].[name node ID] - the fully-qualified HTTP address for each NameNode to listen on. Similarly to rpc-address above, set the addresses for both NameNodes' HTTP servers to listen on. 

  • dfs.namenode.http-address.nameservice1.namenode64=nn1.alpinenow.local:50070

  • dfs.namenode.http-address.nameservice1.namenode72=nn2.alpinenow.local:50070

If HTTPS is enabled on the CDH cluster:

  • dfs.namenode.https-address.nameservice1.namenode64=nn1.alpinenow.local:50470

  • dfs.namenode.https-address.nameservice1.namenode72=nn2.alpinenow.local:50470

  • dfs.namenode.servicerpc-address.nameservice1.namenode64=nn1.alpinenow.local:8022

  • dfs.namenode.servicerpc-address.nameservice1.namenode72=nn2.alpinenow.local:8022 

  • dfs.ha.automatic-failover.enabled.nameservice1=true 
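Once these parameters are in place, you can confirm which NameNode is currently active from the CDH command line (assuming the NameNode IDs configured above):

  hdfs haadmin -getServiceState namenode64
  hdfs haadmin -getServiceState namenode72

Each command prints either active or standby.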

If Resource Manager is configured for High Availability:  

You can get the following parameters from yarn-site.xml on the Hadoop server. 

  1. yarn.resourcemanager.ha.rm-ids=rm60,rm70

  2. yarn.resourcemanager.webapp.https.address.rm70=nn2.alpinenow.local:8090

  3. yarn.resourcemanager.webapp.address.rm70=nn2.alpinenow.local:8088

  4. yarn.resourcemanager.admin.address.rm70=nn2.alpinenow.local:8033

  5. yarn.resourcemanager.resource-tracker.address.rm70=nn2.alpinenow.local:8031

  6. yarn.resourcemanager.scheduler.address.rm70=nn2.alpinenow.local:8030

  7. yarn.resourcemanager.address.rm70=nn2.alpinenow.local:8032

  8. yarn.resourcemanager.webapp.https.address.rm60=nn1.alpinenow.local:8090

  9. yarn.resourcemanager.webapp.address.rm60=nn1.alpinenow.local:8088

  10. yarn.resourcemanager.admin.address.rm60=nn1.alpinenow.local:8033

  11. yarn.resourcemanager.resource-tracker.address.rm60=nn1.alpinenow.local:8031

  12. yarn.resourcemanager.scheduler.address.rm60=nn1.alpinenow.local:8030

  13. yarn.resourcemanager.address.rm60=nn1.alpinenow.local:8032

  14. yarn.resourcemanager.zk-address=cm.alpinenow.local:2181,nn1.alpinenow.local:2181,nn2.alpinenow.local:2181

  15. yarn.resourcemanager.recovery.enabled=true

  16. yarn.resourcemanager.ha.automatic-failover.embedded=true

  17. yarn.resourcemanager.ha.automatic-failover.enabled=true

  18. yarn.resourcemanager.ha.enabled=true

  19. failover_resource_manager_hosts=cdh516node1.tds.local,cdh516node2.tds.local
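As with the NameNodes, you can check which ResourceManager is currently active (assuming the rm IDs configured above):

  yarn rmadmin -getServiceState rm60
  yarn rmadmin -getServiceState rm70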

Mapreduce: 

You can get the following parameters from mapred-site.xml on the Hadoop server.

  1. mapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.MapTask$MapOutputBuffer (The MapOutputCollector implementation(s) to use. This may be a comma-separated list of class names; the map task tries to initialize each collector in turn and uses the first one that initializes successfully.)

  2. mapreduce.job.reduce.shuffle.consumer.plugin.class=org.apache.hadoop.mapreduce.task.reduce.Shuffle (Name of the class whose instance is used to send shuffle requests by the reduce tasks of this job. The class must implement org.apache.hadoop.mapred.ShuffleConsumerPlugin.)

  3. mapreduce.jobhistory.address=cdh6dite.tds.local:10020 (MapReduce JobHistory Server IPC host:port)

  4. mapreduce.jobhistory.webapp.address=cdh6dite.tds.local:19888 (MapReduce JobHistory Server web UI host:port)

  5. mapreduce.application.classpath= (Obtain this value by running the hadoop classpath command on the CDH server, as shown below.)
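The classpath value is simply the output of the hadoop classpath command, pasted verbatim (run it on the CDH server whose configuration you are copying):

  hadoop classpath

The command prints a colon-separated list of configuration directories and jar paths.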

If using Hive: 

  1. hive.metastore.client.connect.retry.delay=1

  2. hive.metastore.client.socket.timeout=600

Hive with Kerberos:  

  1. hive.hiveserver2.uris=jdbc:hive2://cm.alpinenow.local:10000/default

  2. hive.metastore.kerberos.principal=hive/_HOST@ALPINENOW.LOCAL (ALPINENOW.LOCAL is the Kerberos realm name.)

  3. hive.server2.authentication.kerberos.principal=hive/_HOST@ALPINENOW.LOCAL
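To test the Kerberized HiveServer2 endpoint outside of Spotfire Data Science, you can connect with beeline using the same URI and principal (a sketch, assuming you already hold a valid Kerberos ticket from kinit; the Hive JDBC driver replaces _HOST with the server hostname from the URL):

  beeline -u "jdbc:hive2://cm.alpinenow.local:10000/default;principal=hive/_HOST@ALPINENOW.LOCAL"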

Spark History Server:  

The following parameters need to be added when the Spark service is running on the Hadoop cluster.

  1. spark.yarn.historyServer.address=http://172.27.0.3:18088

  2. spark.eventLog.dir=hdfs://172.27.0.3:8020/user/spark/applicationHistory

  3. spark.eventLog.enabled=true
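To verify the history server address, you can query its REST API (assuming the address above; the endpoint is standard for the Spark History Server):

  curl http://172.27.0.3:18088/api/v1/applications

This returns a JSON list of the applications whose event logs were written to spark.eventLog.dir.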