Jupyterhub

In case your server is unusable, here's a link to stop it and restart it: Reset Jupyter

Jupyterhub is part of the computer science instructional Hadoop cluster.

NOTE: You may also want to look at the Jupyter project's own documentation. This document focuses on how to do Hadoop programming from the notebook.

Jupyter is a "notebook," a web interface that makes it easier to run python. It also lets you use Spark and Spark SQL with Python, Scala, and R.

We actually have two notebooks, Jupyterhub and Zeppelin. Zeppelin is newer and might have issues, but you may prefer its design, particularly its support for graphical output.

After you've logged into https://jupyter.cs.rutgers.edu, you'll see a file browser for your home directory, which shows only notebook files. To get the interesting functionality, you need to open a notebook.

If you're using Jupyterhub as a way to run Python, and have no interest in Spark or the Hadoop cluster, consider running Jupyter Notebook on another system. See Using Python on CS Linux Machines for specifics on how to choose a Python environment on any of our systems, and then how to start Jupyter Notebook. If you do want to use Jupyterhub for non-Hadoop Python, pick the "Python 3" notebook type. In that case, you don't really need the rest of this page. This documentation is aimed at people who want to use Python for Spark and for Hadoop jobs.

Here's what the various types of notebook are; each of the main types ("Python 3", Spark, and PySpark) is described in its own section below.

Note that there's overlap in functionality between the "Python 3" and Spark / PySpark notebook types.

So as you can see, it's mostly a matter of personal preference whether to use Python 3 or Spark / PySpark.

NOTE: Once you've looked at the summary here, we strongly suggest that you look at the Sparkmagic examples from the main Sparkmagic site, which contains a number of sample notebooks with detailed explanations.

NOTE: Many of the tools in Jupyter use a special file system, HDFS. Files on HDFS are backed up nightly to a second HDFS file system in a separate building. Snapshots are taken nightly, so it is possible to restore deleted files within 60 days. However, we are not yet sure what our policy on retention of files is going to be. It is possible that we might reset the file system each summer. Please contact help@cs.rutgers.edu if you need to keep files in HDFS on an ongoing basis, and we'll arrange to preserve them if we decide to clean the file system.

COMPLETION: You can type part of a variable name and hit the TAB key. That will show you all of the possibilities beginning with what you typed. In some cases (depending upon context) you can hit TAB right after "." to show the available properties and methods.

HDFS

HDFS is a distributed file system, used for the Hadoop cluster. When you are running code on the cluster, it can only read files in the HDFS file system. You'll need to copy files from your home directory into your HDFS directory.

If you log in to data1.cs.rutgers.edu, data2.cs.rutgers.edu, or data3.cs.rutgers.edu, you can use this command:

hdfs dfs -put FILE /user/NETID/FILE
HDFS doesn't have any concept of a current directory, so you can't do the equivalent of "cd" to a different directory. If you omit the directory name, you'll always get /user/NETID.

HDFS has commands much like normal Linux commands, e.g. "hdfs dfs -ls" and "hdfs dfs -rm" are equivalent to "ls" and "rm". Use "hdfs dfs -help" for a list of commands.

There's also a web interface to HDFS, which will let you upload and download files. Log in to Ambari, at https://data-services1.cs.rutgers.edu. There's a tic-tac-toe icon in the upper right. If you hover over it you'll get a list of Web tools. "Files view" shows you HDFS.

The web interface starts out at the root of HDFS, so to get to your files you'll need to pick "user" and then your NetID.

If you want to issue hdfs commands from within Python, you can use subprocess.check_output. E.g. to do "hdfs dfs -put data.txt /user/USER/data.txt", do

%%local
import subprocess
# The "; exit 0" keeps check_output from raising an exception if hdfs reports an error,
# so you see the error message rather than a Python backtrace.
subprocess.check_output("hdfs dfs -put data.txt /user/USER/data.txt; exit 0", shell=True, stderr=subprocess.STDOUT)

Starting and restarting Jupyter; if things go wrong

Normally when you log in, you'll see a window showing your directories and any files ending in .ipynb. These files represent notebooks you've already created. If you haven't created any, you may get a fairly blank display. You can create a new notebook with the "New" pulldown at the upper right.

The notebooks will time out after 8 hours of non-use. You can reopen them by clicking the name of the notebook file in the main window.

Jupyter itself will time out after 2 weeks of non-use. However, you will need to log in again after 2 weeks, whether you're using it or not.

We think these settings won't cause you any problems. However, it's possible that due to a timeout or something else your session could become unusable. If so, there are two different things you may need to do:

If your code gets into an infinite loop, you can interrupt it. There's an icon just to the right of the run icon (at the top of the screen). It's labelled "Interrupt the kernel." It's like typing ^C in the terminal session: it interrupts the current program.

It's possible to do something that will cause your kernel to crash or hang. Maybe it gives a system error every time. Or nothing shows at all. Of course this might be because there's a problem with the program. But it could also be that the program has done something to the interpreter to make it crash or hang.

To deal with a system that has hung or crashed, you can restart the server. At the upper right, there's a button "Control Panel." It brings up a window with two buttons: "Stop My Server" and "My Server." "My Server" just takes you back to the main page. "Stop My Server" stops the server process handling your notebooks. Once you've done that, you'll have just one button, "Start My Server." Click it, and it will take you back to your main page.

Note on python for this cluster

For the Hadoop cluster, we have three versions of Python. Versions of python2 and python3 come with the operating system; we haven't removed them. But the one you probably want is a more recent Python 3, which we have installed using Anaconda.

When you use Python from Jupyterhub you automatically get the new Python 3. For data1, data2, and data3, we have set the default environment to use the new Python 3 as well.

If you set PATH yourself in .bashrc, make sure you include /usr/lib/anaconda3/bin before /bin and /usr/bin. We also set PYSPARK_PYTHON=/usr/lib/anaconda3/bin/python3. This will make sure that when you run Python interactively you get the same version that you get with Jupyterhub.
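
To confirm which interpreter you're getting, you can check from inside Python itself (in a notebook cell, or in an interactive python3 on data1, data2, or data3). This is just a quick sanity check, not an official procedure:

import sys
print(sys.executable)   # should point somewhere under /usr/lib/anaconda3
print(sys.version)      # should report Python 3.6.x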

If you need to install your own python packages, we suggest that you use the command

pip install --user PACKAGE
Jupyter, data1, data2 and data3 (but not other ilab systems) are set up so that "install --user" automatically installs packages to /common/clusterdata/USER/local. That makes sure that python will use the packages whether you call it interactively, via Jupyterhub, or for jobs submitted to the cluster.

Normally, "install --user" installs in ~/.local. That location won't work for jobs running on the cluster. So we are using a special location that is available in all Hadoop contexts. It is not available outside the Hadoop system, i.e. outside jupyterhub, data1, data2 and data3.

WARNING: When you use pip, it may suggest that you upgrade pip itself to a newer version. Do NOT try to do this. You can't actually upgrade pip, because it is installed in a system directory, and in attempting to do so you will end up with an inconsistent set of packages.

Software versions

Jupyterhub and the Python software it uses were installed using Anaconda 5.2.0. The Python used is version 3.6.5. The Spark software is from Hortonworks 2.6.3; the Spark version is 2, built with Scala 2.11. (Spark version 1 is also available, but we set up configuration files for you that specify Spark 2.)

The cluster has python 2.7.5, python 3.4.8, and the same Anaconda python 3.6.5. By default Python jobs submitted to the cluster use Anaconda's python, so the python version you get locally in jupyterhub is the same you get on the cluster.

You can change the version of Python used for jobs on the cluster, either using %%configure -f in a PySpark notebook or when you create a new session in a Python 3 notebook using %manage_spark. You can make a permanent change by editing .sparkmagic/config.json: look for "session_configs" and change the value of PYSPARK_PYTHON.
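
For example, from a PySpark or Spark notebook you could point cluster jobs at the Anaconda interpreter like this (the same property appears in the larger example under "Advanced options with Spark" below; close any existing session with "%%cleanup -f" first):

%%configure -f
{"conf":{"spark.yarn.appMasterEnv.PYSPARK_PYTHON":"/usr/lib/anaconda3/bin/python3"}}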

Python 3 Notebook Type

You can use this to run any Python 3 code you want. Just type the code into a cell and hit "run." However, if you are primarily going to be using Spark from Python, you might prefer the PySpark notebook type. See below.

You can also use it to run Spark code on the cluster, using either Python or Scala.

To use Spark, first execute

%load_ext sparkmagic.magics
then
%manage_spark

before you run anything on the cluster. Setting up takes almost a minute; a lot of work has to be done to create a session.

To run code on the cluster, you need a "session." %manage_spark is used to create and manage sessions. In the "manage endpoint" tab, if no endpoint is shown, use "add endpoint" and specify the URL as "http://data-services2.cs.rutgers.edu:8999", with Kerberos authentication. This should be the default, so normally you don't have to type the URL or select Kerberos.

After making sure there's an appropriate endpoint, see if there's a session. If not, go to "create session", choose Scala or Python, then click "create session."

Note that when you start a session, it will take quite a while. It is creating a software environment for your session on the cluster nodes, installing some software, and starting it all up. Once the session is started you'll see some session information.

Spark cluster sessions expire, currently after an hour. Unfortunately the notebook doesn't know your session has expired, so the next time you try to do something on the cluster you'll get an error. To fix it, use %manage_spark. Find the current session, kill it, and create a new one.

You can now execute Spark code in Python or Scala (whichever you chose). Put "%%spark" in a cell, followed by your Python or Scala Spark code on the lines after it.
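
For example, a minimal cell looks like this; it works whether the session language is Scala or Python, and simply reports the Spark version (the same snippet appears in the examples further down):

%%spark
sc.version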

For more information on what you can do, pull down "Help" from the menu at the top, and choose iPython.

If you want to run Python on the cluster, in %manage_spark, when you create the session, look for the Language field. It lets you choose Scala or Python. Choose Python.

Normally when you run python locally, i.e. without any %spark or other magic, you can't use Spark, because you don't have a SparkContext. %manage_spark creates SparkContexts, but for use on the cluster. When you're trying things out, you may prefer to run locally rather than on the cluster. It should be significantly faster. To get a SparkContext you can use locally, do the following in a cell:

import pyspark
sc = pyspark.SparkContext(master="local",appName="count")
(The appName can be anything.) You should only do this once in a session. If you try it again, it will fail, probably with "Cannot run multiple SparkContexts at once." Once you've done it the first time, the variable sc can be used for local code, just as for code run on the cluster.

If for some reason you need to recreate the SparkContext, you can do

sc.stop()
and then reinitialize it as above.

Spark Notebook Type

A Spark notebook is fairly similar to an iPython notebook, though session management is a bit different, and the default language is Scala. To get the same thing with Python, use a PySpark notebook. (See below.)

To run code on the cluster, you need a "session." A session is created automatically the first time you run something on the cluster. It takes almost a minute to set up. A lot of work has to be done to create a session.

You can see whether there's currently a session by using

%%info
in a cell. To run spark code, use %%spark followed by the spark code, e.g.
%%spark
sc.version

Spark cluster sessions expire, currently after an hour. Unfortunately the notebook doesn't know your session has expired, so the next time you try to do something on the cluster you'll get an error. %%info will show you the sessions that Jupyter thinks are active. However, if a session has expired it will still look normal; it just won't work. To fix this, do "%%cleanup -f". The next time you do a cluster operation a new session will be created.

I ran into a case where %%spark failed, claiming there was no session. I manually created one using

%%configure -f
{}

The %%cleanup command can be used to close your session. You might do this if your session times out on the cluster, but the notebook still thinks it's alive.

%%cleanup -f

You can run Python code locally (in the VM that's running Jupyterhub, not the cluster) using %%local:

%%local
1 + 1

To see all the things you can do, use

%%help

As with iPython, it takes a while to create a session. It's creating a container for you on every cluster node and adding software. That's why we ask you not to create new notebooks and sessions unnecessarily.

Spark Notebook Examples

Here are a couple of examples to get you started. These were done in a Spark notebook, but it should be fairly easy to adapt them to the other notebook types.

This is just about the shortest possible Spark program. It simply returns the version of Spark.

%%spark
sc.version

This loads data into a Hive SQL table from a URL at Amazon, using Spark in Scala.

%%spark
import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset

// The Spark session provides sc (SparkContext) and sqlContext (HiveContext or SQLContext),
// so you don't need to create them manually

// load bank data
val bankText = sc.parallelize(
    IOUtils.toString(
        new URL("https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv"),
        Charset.forName("utf8")).split("\n"))

case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
    s => Bank(s(0).toInt, 
            s(1).replaceAll("\"", ""),
            s(2).replaceAll("\"", ""),
            s(3).replaceAll("\"", ""),
            s(5).replaceAll("\"", "").toInt
        )
).toDF()
bank.registerTempTable("bank")
Note that this creates a temporary table, meaning that it won't be there after the end of the session.

This is SQL code that retrieves data from the table

%%sql
select age, count(1) value
from bank 
where age < 30 
group by age 
order by age
The initial output will be text, exactly as you'd expect from this SQL query. However, you'll also see options that let you select various types of visualization: pie chart, scatter diagram, etc.

Advanced options with Spark

This section applies to all three types of notebook, although the specific properties used as the example here apply mostly to Java and Scala (i.e. the "Spark" notebook type).

There are times when you want to specify options for your Spark session. E.g. if you want to use packages that we haven't installed, you can specify packages, and if necessary the URL of the repository they come from. You can also specify the number of cores to be used, the amount of memory, etc. To specify options, close your session if necessary using "%%cleanup -f". Then configure it with "%%configure -f", e.g.

%%configure -f
{"conf":{"spark.yarn.appMasterEnv.PYSPARK_PYTHON":"/usr/lib/anaconda3/bin/python3",
         "spark.yarn.appMasterEnv.PYTHONUSERBASE": "/common/clusterdata/NETID/local",
         "spark.jars.packages":"graphframes:graphframes:0.5.0-spark2.1-s_2.11",
         "spark.jars.repositories":"https://dl.bintray.com/spark-packages/maven"}}
where NETID is your NetID. This example specifies python3 for Python jobs (which we recommend), sets Python up so it can access packages installed with "pip install --user", and, for Scala and Java, adds the graphframes package from the spark-packages repository.

Options like this can be made the default by editing your .sparkmagic/config.json file. Add a dictionary "session_configs" if it isn't there, or modify it if it is. Here's the way to set the above configuration as default:

  "session_configs": {
      "conf": {
          "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "/usr/lib/anaconda3/bin/python3",
          "spark.yarn.appMasterEnv.PYTHONUSERBASE": "/common/clusterdata/NETID/local",
          "spark.jars.packages": "graphframes:graphframes:0.5.0-spark2.1-s_2.11",
          "spark.jars.repositories": "https://dl.bintray.com/spark-packages/maven"
      }
  },

Note that you can look at the current configuration with %%info. However, %%info displays it using single quotes, and "%%configure -f" won't recognize single quotes; you must use double quotes.

For a list of all of the options available, see the Spark documentation.

For the "Python 3" notebook type, if you want to supply configuration for one session, rather than putting it on config.json, the configuration is supplied with the "%manage_spark" command when starting a session.

Pyspark notebook type

The PySpark notebook type is intended to run Python/Spark code on the cluster, although you can explicitly request that code be run locally.

When you type something into the cell and hit "run", by default it runs on the cluster. You can override this by using the "%%local" magic as the first line in the cell.

To run code on the cluster, you need a "session." A session is created automatically the first time you run something on the cluster. It takes almost a minute to set up. A lot of work has to be done to create a session.

Spark cluster sessions expire, currently after an hour. Unfortunately the notebook doesn't know your session has expired, so the next time you try to do something on the cluster you'll get an error. %%info will show you the sessions that Jupyter thinks are active. However, if a session has expired it will still look normal; it just won't work. To fix this, do "%%cleanup -f". The next time you do a cluster operation a new session will be created.

Commands to the notebook are done by putting "magics" into a cell and running it. The magics all begin with %. To see a list of all of them run

%help

Normally when you run python locally, i.e. with %%local, you can't use Spark, because you don't have a SparkContext. Without %%local, you're working on the cluster, and the system creates a Sparkcontext for you, but only for use on the cluster. When you're trying things out, you may prefer to run locally rather than on the cluster. It should be significantly faster. To get a SparkContext for local use, do the following in a cell:

%%local
import pyspark
sc = pyspark.SparkContext(master="local",appName="count")
You should only do this once in a session. If you try it again, it will fail, probably with "Cannot run multiple SparkContexts at once." Once you've done it the first time, the variable sc can be used for local code, just as for code run on the cluster. If for some reason you need to recreate the SparkContext, you can do
%%local
sc.stop()
and then reinitialize it as above.

To use Spark SQL in %%local, you will have to do additional imports and initialization; documentation for using Python with Spark describes this. Note that pyspark on the cluster, and the pyspark shell in a command-line session, will set up both the SparkContext and an SQL context for you.
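
The details are in the standard pyspark documentation, but here is a rough sketch of the kind of initialization involved, assuming you've already created the local SparkContext sc as shown above (the little DataFrame is just made-up data to show that the SQL context works):

%%local
from pyspark.sql import SQLContext

# wrap the existing local SparkContext in an SQL context
sqlContext = SQLContext(sc)

# build a tiny throwaway DataFrame and display it
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()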

Sample Pyspark code

Here's a simple example. It assumes that you have loaded a text file into HDFS as /user/USER/data.txt, where USER is your NetID.

Look at the HDFS section above for how to copy a file from your home directory into HDFS. This is complicated by the need to show you any error message. The "; exit 0" forces python to think the command worked. Otherwise it will give you a backtrace rather than showing the error message.

Once you have the data file, put this into a cell and hit Run. After it starts a Spark session (if one isn't already started), you'll see a count of the various words in the file.

text_file = sc.textFile("/user/USER/data.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
print(counts)                 # counts is an RDD; this just prints its description
for x in counts.collect():    # collect() brings the (word, count) pairs back to the notebook
    print(x)

Technical details

Jupyter.cs.rutgers.edu is running Jupyterhub and Jupyter that came with Anaconda. In order to support Hadoop, Sparkmagic was added. This adds the Spark kernels.

The system is a client node in our Hadoop cluster. That is, it doesn't run any cluster services, but it can access HDFS, and has a copy of Spark loaded. That allows Spark to be run locally.

To submit jobs to the cluster, we automatically create .sparkmagic/config.json in the user's home directory the first time they log in. It points the Hadoop client to the cluster. Access is done through Livy, which is a proxy that lets systems outside the cluster submit jobs to it. The jobs are scheduled by Yarn.

The cluster is Kerberized. Rutgers code has been added to Jupyterhub to make sure that when the user starts a notebook it points to the user's Kerberos credentials.

The default .sparkmagic/config.json specifies the Anaconda version of Python 3 for jobs submitted to the cluster. It also points PYTHONUSERBASE to /common/clusterdata/NETID/local to make sure that cluster jobs can access modules installed using "pip install --user".