How to fetch data from Hive into TIBCO Statistica?

How to fetch data from Hive into TIBCO Statistica?

book

Article ID: KB0074429

calendar_today

Updated On:

Products Versions
Spotfire Statistica 12.7 and higher

Description

The Apache Hive provides a data warehouse layer on top of Hadoop Distributed File System (HDFS). It is designed to enforce and project structure onto the data and provides SQL-like language for querying. Documentation and developers community are available through the official website here

This document is based on Cloudera distribution of HDFS and its ecosystem (v4.3). It is expected, however, that the general workflow is applicable to other distributions.

Issue/Introduction

How to fetch data from Hive into TIBCO Statistica?

Resolution

Working with Hive:

  • You can login into the Hue Web UI which is available on port 8888 (e.g. x.x.x.x:8888, where x.x.x.x is the IP address of the server). In the tools Hue has a dedicated Hive UI.

              

  • Once the data is available in Hive, it can be accessed from Statistica using ODBC driver. 
  • Getting ODBC Driver for Hive
  • Variety of different versions of Hive ODBC Connectors and Version Compatibility info is available e.g. from Cloudera
  • It is very important to install the version of the ODBC connector that is compatible with the HDFS distribution.

Creating Connection to Hive from Statistica

Configuration of the Hive connection includes 1.configure a system DSN connection 2. create a new database connection. Below are more details:


Step 1. Open ODBC Data Source Administrator (run a command odbcad32.exe). Switch to System DSN tab and click on “Add…” button.

Step 2. Select a appropriate driver should be available for selection.

In this example we select “Cloudera ODBC Driver for Apache Hive” and click “Finish”.

Step 3. Define parameters of the connection.
Data Source – is the name of the ODBC connection as it is stored in the system.
Description – is information field.
Host – is the hostname or IP address of the Hive server
Port – specific port used by Hive. Typically, it is 10000, which is default setting.
Database – default database selection.
Hive Server Type – “Hive Server 2” should be selected for the ODBC connections.
Provide necessary Authentication information following your Hive configuration.

Detailed instructions are available e.g. here

Once the settings are defined, click the “Test” button to check connection. 

Once you click OK on the “Cloudera ODBC Driver for Apache Hive DSN setup” a new connection becomes available under System Data Sources.

Step 4. Open Statistica and select “Create Query”

On the next window, select “New…” to create a new connection. On the “Data Link Properties” select Microsoft OLE DB Provider for ODBC Drivers, click next and select a data source from a drop down menu.

Click Test Connection to check that everything is properly configured.

On the next screen specify the name of this connection.

Now everything is ready to creating HiveQL queries and extracting data to Statistica.
Step 5. Write a query. Hive uses own dialect of SQL, which is called HiveQL. It is somewhat constrained version as compared with robust implementations for Oracle and SQL Server. Please, refer to the language manual for details
For example, to extract all data, one can write

Click “Return data to Statistica” button (green arrow) to get data from Hive imported into Statistica.

Note: to create a database connection in Statistica enterprise, one needs to complete similar steps (1-4).

Additional Information

Legacy Article ID: 149959