Computer Science Ilab Hadoop Support: Hive

This page documents how to use Hive on our cluster. Most people access Hive through Spark; if that's how you use it, see the Spark documentation instead. This page covers using Hive from a command-line client, and also gives the JDBC URL in case you want to access a Hive database from a program of your own.

There are two command-line tools: hive and beeline. Hive is the older tool and is considered out of date; it runs a copy of Hive locally. Beeline connects to a Hive server using JDBC.

This document goes into detail about "load data inpath," because someone raised that issue. Similar concerns likely apply to any Hive operation that reads or writes files in the file system.

By default, Hive works with databases stored as normal sequential files. "load data inpath" puts the data into an HDFS file located in /apps/hive/warehouse/TABLE. If the file was in HDFS to begin with, it is moved into that location. If it's in a local file system, it's copied. Hive can also access data stored in various database systems.
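
For example, if you have a table named mytable (a hypothetical name), you can see the files behind it with:

hdfs dfs -ls /apps/hive/warehouse/mytable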

We recommend that you keep a copy of all files in your normal ilab home directory or in /common/users. HDFS is not well backed up, and may be cleared over the summer.
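
For example, to copy a file out of HDFS into your home directory (a sketch; the HDFS path and file name are placeholders):

hdfs dfs -get /user/NETID/data.txt ~/data.txt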

Hive

To use the hive command, simply type "hive." This program is considered out of date, and may not be present in the future. Consider using beeline.
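
A minimal session looks like this (the table name is a placeholder):

hive
hive> show tables;
hive> select count(*) from mytable;
hive> quit;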

We've run into one complexity with Hive. If you want to import a file, the usual command is "load data inpath '....' into table ...;" By default, the path names a file stored in HDFS. Note that Hive will move the file into the HDFS directory /apps/hive/warehouse/TABLE, where TABLE is the name of your table. It is supposed to put the file back in its original location, but that fails with a Java backtrace. The import still works; your file just ends up in a different place. If you want a copy in your own directory, you can use "hdfs dfs -cp" to copy it back.
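
For example (a sketch; "mytable" and the HDFS paths are placeholders):

load data inpath '/user/NETID/data.txt' into table mytable;

and then, from an ordinary shell, copy the file back out of the warehouse directory if you want to keep it (use "hdfs dfs -ls /apps/hive/warehouse/mytable" first, since the exact file name there may differ):

hdfs dfs -cp /apps/hive/warehouse/mytable/data.txt /user/NETID/data.txt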

You can also load files from the normal Linux file system, using "load data local inpath '...' into table ...;" In this case specify a normal file name. If your file is too big for your home directory, you can put it in /common/users/NETID. Everyone should have a directory there.
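
For example (a sketch; the path and table name are placeholders):

load data local inpath '/common/users/NETID/data.txt' into table mytable;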

Beeline

To use beeline on our cluster, type

beeline -u "jdbc:hive2://data-services2.cs.rutgers.edu:2181,data-services3.cs.rutgers.edu:2181,data1.cs.rutgers.edu:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"

(The string in quotes is the JDBC URL. You'll find it useful if you want to write your own Java applications that use Hive. Note, however, that you'll need Kerberos security to use this URL; a Google search for "hive jdbc kerberos" will turn up examples and more information.)
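
For command-line use, the main requirement is a valid Kerberos ticket before you run beeline. A minimal sketch (on the ilab machines you normally already have a ticket from logging in):

klist    # check whether you already have a ticket
kinit    # get one if you don't, or if it has expired
beeline -u "jdbc:hive2://data-services2.cs.rutgers.edu:2181,data-services3.cs.rutgers.edu:2181,data1.cs.rutgers.edu:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"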

Beeline works mostly the same way as the hive client, except that it talks to a server running on data-services3. HDFS loads work the same. But to load a file from the normal Linux file system, the file has to be readable by the server. For that to work, you need to copy it to /common/clusterdata/NETID. /common/clusterdata is similar to /common/users, and everyone should already have a directory there, created when you log in to data1, data2, or data3.cs.rutgers.edu. The difference between /common/clusterdata and our other file systems is that it doesn't use Kerberos security; at the moment the Hive server can't read from a Kerberized file system. Also, in order to let Hive read your file, you have to open your directory to world read. We'd rather restrict that to a special file system than have you open up your home directory or your directory on /common/users.

Suppose you want to import data.txt. Do the following:
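
Copy the file into your /common/clusterdata directory and make it readable by the Hive server (a sketch; substitute your own NetID and file name, and adjust permissions to match how your directory was created):

cp data.txt /common/clusterdata/NETID/
chmod o+rx /common/clusterdata/NETID
chmod o+r /common/clusterdata/NETID/data.txt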

Now you can use "load data local inpath '/common/clusterdata/NETID/data.txt' into table ...;"