Configure Hive HDFS Permissions

Article ID: KB0082627

Product: Spotfire Data Science, Version: 6.2.2

Resolution

Note: As of 6.2.2, we provide updated Hive support. This Knowledge Base article walks you through the steps to set up your system and describes how result files are stored on HDFS.
 

HDFS Directory and Permissions Configuration

Chorus User HDFS Directory

Create a /user/chorus directory with the owner:group as chorus:supergroup.

hdfs dfs -mkdir -p /user/chorus

This directory will be used to cache the uploaded JAR files such as spark-assembly.jar.

The /user/chorus directory should have read, write, and execute permissions set for the chorus user.

hdfs dfs -chown chorus:supergroup /user/chorus
hdfs dfs -chmod 777 /user/chorus

The staging directory is typically set as /user. If your cluster uses a different staging directory, create the directory at /<stagingdirectory>/chorus instead.
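As a sketch, the non-default case can be handled by parameterizing the path. The /app/staging value below is purely a hypothetical staging directory; substitute your cluster's actual value.

```shell
# Hypothetical example: derive the chorus directory from a non-default
# staging directory. /app/staging is illustrative only.
STAGING_DIR="/app/staging"
CHORUS_DIR="${STAGING_DIR}/chorus"
echo "$CHORUS_DIR"

# Then create it with the same ownership and permissions as /user/chorus
# (commented out here because it requires a live HDFS cluster):
# hdfs dfs -mkdir -p "$CHORUS_DIR"
# hdfs dfs -chown chorus:supergroup "$CHORUS_DIR"
# hdfs dfs -chmod 777 "$CHORUS_DIR"
```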
 

Active Directory (AD) Permissions

In order to run Pig jobs, the Spotfire Data Science application attempts to create a folder named /user/<username> as the AD user. By default, the permissions on /user are set to hdfs:supergroup:drwxr-xr-x, which prevents Spotfire Data Science from creating that folder. Change the permissions on /user to grant write access to the AD users who run the Spotfire Data Science application (use drwxrwxr-x or drwxrwxrwx).
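The chmod mode that corresponds to a symbolic string such as drwxrwxr-x may not be obvious. The following POSIX-shell sketch (illustrative only, not part of the product) converts a symbolic string into the octal mode that `hdfs dfs -chmod` expects:

```shell
# Convert a symbolic permission string (as shown by `hdfs dfs -ls`)
# into an octal chmod mode.
sym="drwxrwxr-x"                        # desired permissions on /user
octal=""
for triad in "$(echo "$sym" | cut -c2-4)" \
             "$(echo "$sym" | cut -c5-7)" \
             "$(echo "$sym" | cut -c8-10)"; do
  d=0
  case "$triad" in r??) d=$((d + 4));; esac   # read bit
  case "$triad" in ?w?) d=$((d + 2));; esac   # write bit
  case "$triad" in ??x) d=$((d + 1));; esac   # execute bit
  octal="${octal}${d}"
done
echo "$octal"    # 775
```

So granting drwxrwxr-x on /user would be, for example, `hdfs dfs -chmod 775 /user`.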

Temp Directory Permissions

In order to run YARN, Pig, and similar jobs, each individual user may need to write temporary files to the temporary directories.

There are many Hadoop temp directories, such as hadoop.tmp.dir, pig.tmp.dir, etc. By default, all of them are based off the /tmp directory.

Therefore, the /tmp directory must be writable by everyone so that all users can run their jobs.

Additionally, the /tmp directory must be executable by everyone so that all users can traverse the directory tree.

Set the /tmp permissions by using the following command:

hdfs dfs -chmod 777 /tmp

Spotfire Data Science Related HDFS Configuration

Spotfire Data Science Directory Structure

Spotfire Data Science uses several temporary directories on HDFS. These directories and files are created by the hdfs, yarn, mapred, and other users.

The temporary directories must be made accessible at the base level to the alpine user and other relevant users.

Note: Only individual directories for the specified user can be viewed by that user.

These directories are:

  • Standard output for operators: @default_tmpdir/tsds_out/<user_name>/<workflow_name>/
  • Spotfire Data Science temporary output: @default_tmpdir/tsds_runtime/<user_name>/<workflow_name>/
  • Spotfire Data Science model location: @default_tmpdir/tsds_model/<user_name>/<workflow_name>/
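For illustration, with @default_tmpdir left at its initial value of /tmp, a hypothetical user jdoe running a workflow named churn_model would write to the following paths (the user and workflow names here are made up for the example):

```shell
TMP_BASE="/tmp"            # @default_tmpdir
USER_NAME="jdoe"           # hypothetical user
WORKFLOW="churn_model"     # hypothetical workflow name
OUT_DIR="${TMP_BASE}/tsds_out/${USER_NAME}/${WORKFLOW}/"
RUNTIME_DIR="${TMP_BASE}/tsds_runtime/${USER_NAME}/${WORKFLOW}/"
MODEL_DIR="${TMP_BASE}/tsds_model/${USER_NAME}/${WORKFLOW}/"
printf '%s\n' "$OUT_DIR" "$RUNTIME_DIR" "$MODEL_DIR"
```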

Temp Directory Ownership For Spotfire Data Science Folders

The /tmp directory should be readable and writable by all users.

The /tmp/hadoop-yarn directory should also be readable and writable by all users so that Spark jobs can run.

Create the Spotfire Data Science folders and assign permissions to them to avoid permission failures.

hdfs dfs -mkdir /tmp/tsds_out /tmp/tsds_runtime /tmp/tsds_model
hdfs dfs -chown chorus /tmp/tsds_out /tmp/tsds_runtime /tmp/tsds_model
hdfs dfs -chmod 777 /tmp/tsds_out /tmp/tsds_runtime /tmp/tsds_model

Hive ACL (Access Control List) Configuration

In order to run Hive operators and jobs, we need to set up an Access Control List (ACL) for the hive user.

The hive user should have read, write, and execute access to /tmp and to all Spotfire Data Science folders. Note that a default ACL entry (default:user:...) is inherited by newly created files and subdirectories, while an access ACL entry (user:...) applies to the directory itself, which is why both are set.

hdfs dfs -setfacl -m default:user:hive:rwx /tmp
hdfs dfs -setfacl -m user:hive:rwx /tmp
hdfs dfs -setfacl -R -m default:user:hive:rwx /tmp/tsds_*
hdfs dfs -setfacl -R -m user:hive:rwx /tmp

Upgrade Options

If you're upgrading Spotfire Data Science from a previous version to 6.2 or later, you'll need to perform these actions as well:

Change /tmp/alpine_* directories to have full permissions so that everyone can read, write, and execute.

hdfs dfs -chmod -R 777 /tmp/alpine_out /tmp/alpine_runtime /tmp/alpine_model
hdfs dfs -setfacl -R -m default:user:hive:rwx /tmp
hdfs dfs -setfacl -R -m user:hive:rwx /tmp

Customizing Your Permission Settings

With the following settings, users can customize their permissions for the Spotfire Data Science user folders, workflow folders, operator folders, and output files.

There are three configuration options you can set in alpine.conf.

  • alpine.hdfs.userDirPerms – sets permissions for the user folders @default_tmpdir/alpine_*/<user>
  • alpine.hdfs.dirPerms – sets permissions for the workflow folders and the operator folders in @default_tmpdir/alpine_*/<user>
  • alpine.hdfs.filePerms – sets permissions for Spotfire Data Science output files. 

Each of these must be set to a 10-character permission string. Here are the default settings:

alpine.hdfs.userDirPerms = "-rwxrwxrwx"
alpine.hdfs.dirPerms = "-rwxrwxrwx"
alpine.hdfs.filePerms = "-rwxr-x---"
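To unpack the format: the first character is the file-type flag, and the remaining nine characters form three triads for owner, group, and other. As a sketch (not product code), the default filePerms string breaks down like this:

```shell
s="-rwxr-x---"                   # default alpine.hdfs.filePerms
len=${#s}                        # the string must be exactly 10 characters
owner=$(echo "$s" | cut -c2-4)   # rwx: owner may read, write, and execute
group=$(echo "$s" | cut -c5-7)   # r-x: group may read and execute
other=$(echo "$s" | cut -c8-10)  # ---: others have no access
echo "$len $owner $group $other"     # 10 rwx r-x ---
```

This string is equivalent to octal mode 750: full access for the owner, read/execute for the group, and nothing for others.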

Frequently Asked Questions

How do I change @default_tmpdir?

@default_tmpdir is set to /tmp initially. You can change it for individual workflows using Workflow Variables, or for all newly created workflows using Workflow Preferences.

Which files can I safely clear from @default_tmpdir?

Spotfire Data Science overwrites @default_tmpdir/alpine_* files when users re-run workflows.

Spotfire Data Science users can clear selected @default_tmpdir/alpine_out files using Clear Temporary Data.

Hadoop administrators can safely clear @default_tmpdir/alpine_runtime from HDFS, as this directory stores intermediate data for workflows where Spotfire Data Science users have chosen the option "Store Results = False".

Please handle @default_tmpdir/alpine_model with caution, as Spotfire Data Science users may need to export models from this directory.