Jupyterhub

Sessions time out. However, the browser doesn't always realize it, so you can end up with a "zombie session" that doesn't actually work. In this case, or in other situations where a session becomes unusable, here's how to reset your session:

You must do two things: use the Control Panel (upper right) to stop and then restart your server, as described under "Starting and restarting Jupyter" below, and then log in again. If you don't do a full login, which requires you to type a username and password, things probably won't work.

Jupyterhub is part of the computer science instructional Hadoop cluster.

NOTE: You may also want to look at the Jupyter project's own documentation. This document focuses on how to do Hadoop programming from the notebook.

Jupyter is a "notebook," a web interface that makes it easier to run python. It also lets you use Spark and Spark SQL with Python, Scala, and R.

This tool is useful primarily if you want to use Hadoop or Spark. If you just want to use Python, you can run Jupyter notebooks on any of the student machines, or you can install Python on your own computer. This web page describes how to run Jupyter on any of our student systems: Using Python on CS Linux machines. The Project Jupyter site will show you how to install it on your own computer. Note that on most of our systems (but not this one) Jupyter has access to GPUs. For serious GPU work, please use ilab.cs.rutgers.edu, as it has the fastest GPUs. If you do want to use Jupyterhub for non-Hadoop Python, pick the "Python 3" notebook type; in that case, you don't really need the rest of this page. This documentation is aimed at people who want to use Python for Spark and for Hadoop jobs.

We actually have two notebooks, Jupyterhub and Zeppelin. Zeppelin is newer and may still have some issues, but you may prefer its design, particularly its support for graphical output.

NOTE: In May, 2019, we will be moving to a new version of Hadoop, which includes Jupyterhub. The new version of Jupyterhub will look very similar. You can try it, at https://data8.cs.rutgers.edu. (The name jupyter.cs.rutgers.edu will be moved to that system when we change over.) Comments in this document relating to the new version will be flagged with [HDP3].

After you've logged into https://jupyter.cs.rutgers.edu (HDP3: https://data8.cs.rutgers.edu), you'll see a file browser for your home directory, which shows only notebook files. To get the interesting functionality, you need to open a notebook.

Here's what the various types of notebook are:

Spark running in parallel on the cluster:

Non-programming

Note that there's overlap in functionality between the "Python 3" and the Spark / Pyspark notebook types.

So as you can see, it's a matter of personal preference whether to use Python 3 or Spark / Pyspark.

NOTE: Once you've looked at the summary here, we strongly suggest that you look at the Sparkmagic examples from the main Sparkmagic site. It contains a number of sample notebooks with detailed explanations.

NOTE: Many of the tools in Jupyter use a special file system, HDFS. Files on HDFS are backed up nightly to a second HDFS file system in a separate building. Snapshots are taken nightly, so it is possible to restore deleted files within 60 days. However, we are not yet sure what our policy on retention of files is going to be. It is possible that we might reset the file system each summer. Please contact help@cs.rutgers.edu if you need to keep files in HDFS on an ongoing basis, and we'll arrange to preserve them if we decide to clean the file system.

COMPLETION: You can type part of a variable name and hit the TAB key. That will show you all of the possibilities beginning with what you typed. In some cases (depending upon context) you can hit TAB right after a "." to show the available properties and methods.

Table of contents

HDFS

HDFS is a distributed file system, used for the Hadoop cluster. When you are running code on the cluster, it can only read files in the HDFS file system. You'll need to copy files from your home directory into your HDFS directory.

[HDP3] There's a new HDFS file system for the new version. To look at files, it's probably most convenient to login via ssh or X2Go to data4.cs.rutgers.edu, data5, or data6. From there, you can copy files from the old system to the new system using a command like

hdfs dfs -cp hdfs://data-services2/user/NETID/file /user/NETID/file
Note that if you copy a directory, all the files in it are also copied.

To look at your old directory from the new systems, you can use a command like

hdfs dfs -ls hdfs://data-services2/user/NETID

If you log in to data1.cs.rutgers.edu, data2.cs.rutgers.edu or data3.cs.rutgers.edu (HDP3: data4, data5 or data6), you can copy a file into HDFS with this command

hdfs dfs -put FILE /user/NETID/FILE
HDFS doesn't have any concept of a current directory, so you can't do the equivalent of "cd" to a different directory. If you omit the directory name, you'll always get /user/NETID.

HDFS has commands much like normal Linux commands, e.g. "hdfs dfs -ls" and "hdfs dfs -rm" are equivalent to "ls" and "rm". Use "hdfs dfs -help" for a list of commands.
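
For example (NETID is a placeholder for your own NetID; these are all standard hdfs dfs subcommands):

hdfs dfs -ls /user/NETID                 # list your HDFS directory
hdfs dfs -mkdir /user/NETID/mydata       # create a subdirectory
hdfs dfs -cat /user/NETID/data.txt       # print a file
hdfs dfs -get /user/NETID/data.txt .     # copy a file from HDFS to the local current directory
hdfs dfs -rm /user/NETID/data.txt        # remove a file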

There's also a web interface to HDFS, which will let you upload and download files. Log in to Ambari, at https://data-services1.cs.rutgers.edu. There's a tic-tac-toe icon in the upper right. If you hover over it you'll get a list of Web tools. "Files View" shows you HDFS.

The web interface starts out at root, so to get to your files you'll need to pick "user" and then your Netid.

If you want to issue hdfs commands from within Python, you can use subprocess.check_output. E.g. to do "hdfs dfs -put data.txt /user/USER/data.txt", do

%%local
import subprocess
# The "; exit 0" keeps check_output from raising an exception if the hdfs
# command fails, so any error message (stderr is merged into the output)
# is returned instead of producing a Python backtrace.
subprocess.check_output("hdfs dfs -put data.txt /user/USER/data.txt; exit 0", shell=True, stderr=subprocess.STDOUT)

Starting and restarting Jupyter; if things go wrong

Normally when you login, you'll see a window showing your directories and any files ending in .ipynb. These files represent notebooks you've already created. If you haven't created any, you may get a fairly blank display. You can create a new notebook with the "New" pulldown at the upper right.

The notebooks will time out after 8 hours of non-use. You can reopen them by clicking the name of the notebook file in the main window.

Jupyter itself will time out after 2 weeks of non-use. In addition, you will need to log in again every 2 weeks, whether you're using it or not.

We think these settings won't cause you any problems. However, it's possible that due to a timeout or something else your session could become unusable. If so, there are two different things you can do:

If your code gets into an infinite loop, you can interrupt it. There's an icon just to the right of the run icon (at the top of the screen). It's labelled "Interrupt the kernel." It's like typing ^C in the terminal session: it interrupts the current program.

It's possible to do something that will cause your kernel to crash or hang. Maybe it gives a system error every time. Or nothing shows at all. Of course this might be because there's a problem with the program. But it could also be that the program has done something to the interpreter to make it crash or hang.

To deal with a system that has hung or crashed, you can restart the system code. At the upper right, there's a button "Control Panel." It brings up a window with two buttons: "Stop My Server" and "My Server." "My Server" just takes you back to the main page. "Stop My Server" stops the system process dealing with you. Once you've done that, you'll have just one button, "Start My Server." Click it. That will take you back to your main page.

Note on python for this cluster

For the Hadoop cluster, we have three versions of Python. Versions of python2 and python3 come with the operating system. We haven't removed them. But the one you probably want is a more recent Python 3, which we have installed using Anaconda.

When you use Python from Jupyterhub you automatically get the new Python 3. For data1, data2, and data3 (HDP3: data4, data5 and data6), we have set the default environment to use the new Python 3 as well.

If you set PATH yourself in .bashrc, make sure you include /usr/lib/anaconda3/bin before /bin and /usr/bin. We also set PYSPARK_PYTHON=/usr/lib/anaconda3/bin/python3. This will make sure that when you run Python interactively you get the same version that you get with Jupyterhub.
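
For example, the relevant lines in your .bashrc might look something like this (a sketch; the second line is only needed if you override the environment yourself, since we already set it by default):

export PATH=/usr/lib/anaconda3/bin:$PATH
export PYSPARK_PYTHON=/usr/lib/anaconda3/bin/python3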

If you need to install your own python packages, we suggest that you use the command

pip install --user PACKAGE
Jupyter, data1, data2 and data3 (HDP3: also data4, data5, data6) (but not other ilab systems) are set up so that "install --user" automatically installs packages to /common/clusterdata/USER/local. That makes sure that python will use the packages whether you call it interactively, via Jupyterhub, or for jobs submitted to the cluster.

Normally, "install --user" installs in ~/.local. That location won't work for jobs running on the cluster. So we are using a special location that is available in all Hadoop contexts. It is not available outside the Hadoop system, i.e. outside jupyterhub, data1, data2 and data3.

WARNING: When you use pip, it will suggest that you upgrade it to a new version. Do NOT try to do this. You can't actually upgrade PIP, because it is installed in a system directory. In attempting to do so, you will end up with an inconsistent set of packages.

Software versions

Jupyterhub and the python software it uses was installed using Anaconda 5.2.0. The Python used is version 3.6.5. The Spark software is from Hortonworks 2.6.3. Spark is version 2.11. (Spark version 1 is also available, but we set up configuration files for you that specify Spark 2.)

The cluster has python 2.7.5, python 3.4.8, and the same Anaconda python 3.6.5. By default Python jobs submitted to the cluster use Anaconda's python, so the python version you get locally in jupyterhub is the same you get on the cluster.

You can change the version of Python used for jobs on the cluster, either using %%configure -f in a Pyspark notebook (as shown below) or when you create a new session in a Python 3 notebook using %manage_spark. You can make a permanent change by editing .sparkmagic/config.json. Look for "session_configs" and change the value of PYSPARK_PYTHON. (HDP3: configuration is slightly different, so we place the new version of config.json in .sparkmagic2)
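
For example, to override the interpreter for a single session in a Spark or Pyspark notebook, you could run something like the following before the session is created (the path shown is just the Anaconda interpreter mentioned above; substitute whichever interpreter you want):

%%configure -f
{"conf": {"spark.yarn.appMasterEnv.PYSPARK_PYTHON": "/usr/lib/anaconda3/bin/python3"}}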

HDP3:

Jupyterhub and the python software it uses was installed using Anaconda 2019.3. The Python used is version 3.7.1. The Spark software is from Hortonworks 3.1.1. Spark is version 2.3.2. (Spark version 1 is also available, but we set up configuration files for you that specify Spark 2.)

The cluster has python 2.7.5, python 3.4.9, and the same Anaconda python 3.7.1. By default Python jobs submitted to the cluster use Anaconda's python, so the python version you get locally in jupyterhub is the same you get on the cluster.

You can change the version of Python used for jobs on the cluster in the same way as described above; the only difference is that for HDP3 the config file is .sparkmagic2/config.json rather than .sparkmagic/config.json.

Python 3 And PySpark local Notebook Types

You can use this to run any python 3 code you want. Just type the code into a cell and hit "run." Documentation for Python and Jupyter is readily available elsewhere, so this section is going to concentrate on using Spark and Hive from a Python 3 notebook. Also see Using Matplotlib for information on doing graphics.

PySpark local is the same Python as Python 3, but it is set up to do Spark operations, with the relevant Spark variables predefined for you.

[Not HDP3. This section will be removed when we change over. Use a notebook of type "PySpark" to do Python on the cluster]

There are two ways to use Spark: locally on the Jupyterhub machine, or in parallel on the cluster.

Because jupyter.cs.rutgers.edu is actually part of the cluster, the same version of Spark is available both ways. The local copy can also access the cluster's HDFS file storage. We recommend that you start out running Spark locally, and only use the cluster when the program is debugged and you want to try it out using parallel processing. Graphical output is only available when running locally.

Running Spark on the cluster

[Not HDP3. This section will be removed when we change over. Use a notebook of type "PySpark" to do Python on the cluster]

If you are going to be primarily using Spark on the cluster, you may be better off using a Pyspark (Python) or Spark (Scala) notebook type. This section tells you how to connect to the cluster from a Python 3 notebook. However, this doesn't currently work in the new version (HDP3). We suggest using a notebook of type "PySpark" to run on the cluster. That's what it is for.

But you can run Spark code on the cluster from a Python 3 notebook using special magics.

To use Spark, first execute

%load_ext sparkmagic.magics
then
%manage_spark

This brings up a panel that is used to create and manage the sessions you need in order to run something on the cluster. Setting up a session takes almost a minute, because a lot of work has to be done to create it.

To run code on the cluster, you need a "session." First check the endpoint: under "Manage Endpoints", see whether one is shown. If not, use "Add Endpoint" and specify the URL "http://data-services2.cs.rutgers.edu:8999" with Kerberos authentication. These should be the defaults, so normally you don't have to type the URL or choose Kerberos.

After making sure there's an appropriate endpoint, see if there's a session. If not, go to "Create Session", choose Scala or Python, then click "Create Session."

Note that when you start a session, it will take quite a while. It is creating a software environment for your session on all of the cluster nodes, installing some software, and starting them. Once the session is started you'll see some session information.

Spark cluster sessions expire, currently after an hour. Unfortunately the notebook doesn't know your session has expired, so the next time you try to do something on the cluster you'll get an error. To fix it, use %manage_spark. Find the current session, kill it, and create a new one.

You can now execute Spark in python or scala (whichever you chose). Put "%%spark" in a cell, and then your python or scala Spark code on lines after it.

For more information on what you can do, pull down "Help" from the menu at the top, and choose iPython.

If you want to run Python on the cluster, in %manage_spark, when you create the session, look for the Language field. It lets you choose Scala or Python. Choose Python.

Spark Notebook Type

A Spark notebook is fairly similar to an iPython notebook, though session management is a bit different, and the language is Scala rather than Python. To get the same thing with Python, use a PySpark notebook. (See below.)

To run code on the cluster, you need a "session." A session is created automatically the first time you run something on the cluster. It takes almost a minute to set up. A lot of work has to be done to create a session.

You can see whether there's currently a session by using

%%info
in a cell. To run spark code, use %%spark followed by the spark code, e.g.
%%spark
sc.version

Spark cluster sessions expire, currently after an hour. Unfortunately the notebook doesn't know your session has expired, so the next time you try to do something on the cluster you'll get an error. %%info will show you the sessions that Jupyter thinks are active. However, if a session has expired it will look normal; it just won't work. To fix this do "%%cleanup -f". The next time you do a cluster operation a new session will be created.

I ran into a case where %%spark failed, claiming there was no session. I manually created one using

%%configure -f
{}

The %%cleanup command can be used to close your session. You might do this if your session times out on the cluster, but the notebook still thinks it's alive.

%%cleanup -f

You can run Python code locally (in the VM that's running Jupyterhub, not the cluster) using %%local

%%local
1 + 1

To see all the things you can do, use

%%help

As with iPython, it takes a while to create a session. It's creating a container for you on every cluster node and adding software. That's why we ask you not to create new notebooks and sessions unnecessarily.

Spark Notebook Examples

Here are a couple of examples to get you started. These were done in a Spark notebook, but it should be fairly easy to adapt them to the other notebook types.

This is just about the shortest possible Spark program. It simply returns the version of Spark.

%%spark
sc.version

This loads data into a Hive SQL table from a URL at Amazon, using Spark in Scala.

%%spark
import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset

// The notebook session creates and injects sc (SparkContext) and sqlContext (HiveContext or SQLContext),
// so you don't need to create them manually

// load bank data
val bankText = sc.parallelize(
    IOUtils.toString(
        new URL("https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv"),
        Charset.forName("utf8")).split("\n"))

case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
    s => Bank(s(0).toInt, 
            s(1).replaceAll("\"", ""),
            s(2).replaceAll("\"", ""),
            s(3).replaceAll("\"", ""),
            s(5).replaceAll("\"", "").toInt
        )
).toDF()
bank.registerTempTable("bank")
Note that this creates a temporary table, meaning that it won't be there after the end of the session.

This is SQL code that retrieves data from the table

%%sql
select age, count(1) value
from bank 
where age < 30 
group by age 
order by age
The initial output will be text, exactly as you'd expect from this SQL query. However you'll see options that let you select various types of visualization: pie chart, scatter diagram, etc.

Advanced options with Spark

This section applies to all three types of notebook, although the specific properties used as the example here apply mostly to Java and Scala (i.e. the "Spark" notebook type).

There are times when you want to specify options for your Spark session. E.g. if you want to use packages that we haven't installed, you can specify packages, and if necessary the URL of the repository they come from. You can also specify the number of cores to be used, the amount of memory, etc. To specify options, close your session if necessary using "%%cleanup -f". Then configure it with "%%configure -f", e.g.

%%configure -f
{"conf":{"spark.yarn.appMasterEnv.PYSPARK_PYTHON":"/usr/lib/anaconda3/bin/python3",
         "spark.yarn.appMasterEnv.PYTHONUSERBASE": "/common/clusterdata/NETID/local",
         "spark.jars.packages":"graphframes:graphframes:0.5.0-spark2.1-s_2.11",
         "spark.jars.repositories":"https://dl.bintray.com/spark-packages/maven"}}
Where NETID is your netid. This example specifies python3 for python jobs (which we recommend), sets Python to be able to access packages installed with "pip install --user", and, for Scala and Java, adds the graphframes package from the spark-packages repository.

Options like this can be made the default by editing your .sparkmagic/config.json file. Add a dictionary "session_configs" if it isn't there, or modify it if it is. (HDP3: configuration is slightly different, so we place the new version of config.json in .sparkmagic2) Here's the way to set the above configuration as default:

  "session_configs": {
      "conf": {
          "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "/usr/lib/anaconda3/bin/python3",
          "spark.yarn.appMasterEnv.PYTHONUSERBASE": "/common/clusterdata/NETID/local",
          "spark.jars.packages": "graphframes:graphframes:0.5.0-spark2.1-s_2.11",
          "spark.jars.repositories": "https://dl.bintray.com/spark-packages/maven"
      }
  },

Note that you can look at the current configuration with %%info. However, %%info displays it using single quotes, and "%%configure -f" won't recognize single quotes. You must use double quotes.

For a list of all of the options available, see the Spark documentation.

For the "Python 3" notebook type, if you want to supply configuration for one session, rather than putting it on config.json, the configuration is supplied with the "%manage_spark" command when starting a session.

Pyspark notebook type

The PySpark notebook type is intended to run Python/Spark code on the cluster, although you can explicitly require running code locally.

When you type something into the cell and hit "run", by default it runs on the cluster. You can override this by using the "%%local" magic as the first line in the cell.

To run code on the cluster, you need a "session." A session is created automatically the first time you run something on the cluster. It takes almost a minute to set up. A lot of work has to be done to create a session.

Spark cluster sessions expire, currently after an hour. Unfortunately the notebook doesn't know your session has expired, so the next time you try to do something on the cluster you'll get an error. %%info will show you the sessions that Jupyter thinks are active. However, if a session has expired it will look normal; it just won't work. To fix this do "%%cleanup -f". The next time you do a cluster operation a new session will be created.

Commands to the notebook are done by putting "magics" into a cell and running it. The magics all begin with %. To see a list of all of them run

%help

Normally when you run python locally, i.e. with %%local, you can't use Spark, because you don't have a SparkContext. Without %%local, you're working on the cluster, and the system creates a Sparkcontext for you, but only for use on the cluster. When you're trying things out, you may prefer to run locally rather than on the cluster. It should be significantly faster. To get a SparkContext for local use, do the following in a cell:

%%local
import pyspark
sc = pyspark.SparkContext(master="local",appName="count")
You should only do this once in a session. If you try it again, it will fail, probably with "Cannot run multiple SparkContexts at once." Once you've done it the first time, the variable sc can be used for local code, just as for code run on the cluster. If for some reason you need to recreate the SparkContext, you can do
%%local
sc.stop()
and then reinitialize it as above.

To use Spark SQL in %%local, you will have to do additional imports and initialization. Documentation for using Python with Spark will describe this. Note that using pyspark on the cluster, and the pyspark shell in a command-line process will set up both the SparkContext and an SQL context for you.
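
For example, a minimal local setup with Spark 2's SparkSession might look like this (a sketch; the appName is arbitrary, and getOrCreate() will reuse a local SparkContext if you've already created one as shown above):

%%local
from pyspark.sql import SparkSession

# Build (or reuse) a local SparkSession; spark.sql(...) and DataFrame
# operations then work locally, without the cluster.
spark = SparkSession.builder \
    .master("local") \
    .appName("local-spark-sql") \
    .getOrCreate()
sc = spark.sparkContext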

Sample Pyspark code

Here's a simple example. It assumes that you have loaded a text file into HDFS as /user/USER/data.txt, where USER is your Netid.

Look at the HDFS section above for how to copy a file from your home directory into HDFS. This is complicated by the need to show you any error message. The "; exit 0" forces python to think the command worked. Otherwise it will give you a backtrace rather than showing the error message.

Once you have the data file, put this into a cell and hit Run. After it starts a Spark session (if one isn't already started), you'll see a count of the various words in the file.

# Split each line into words, count each word once, and add up the counts.
text_file = sc.textFile("/user/USER/data.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
print(counts)                  # prints the RDD object, not the data
for x in counts.collect():     # collect() brings the results back from the cluster
     print(x)

Using Matplotlib

The primary way of getting graphics for data analysis is a Python package Matplotlib. Another common approach is pandas, but it uses matplotlib. Using matplotlib in Jupyter requires configuration.

In local Python instances (i.e. Python 3 notebooks, and %%local from Spark and Pyspark notebooks), add the line

%matplotlib inline
near the beginning of your code. This is sufficient for output. If you want to do interactive graphics, try
%matplotlib notebook
With %matplotlib notebook, in one case I found that show() wasn't generating output, and had to do draw(). (show() is the normal way to output a plot. draw() is used in interactive mode to update the output when something has changed.)
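
As a quick check that plotting works, here's a minimal inline example (for a Python 3 notebook, or a %%local cell in a Spark / Pyspark notebook):

%matplotlib inline
import matplotlib.pyplot as plt

# Draw a simple line plot; show() renders it into the notebook output.
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.xlabel("x")
plt.ylabel("x squared")
plt.show()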

By default, notebook produces high-resolution vector graphics, and inline, low-resolution bit-map graphics. That makes inline faster than notebook. However this can be changed. To get inline to use higher resolution bitmaps, use

%config InlineBackend.figure_format='retina'
%matplotlib inline
To get vector graphics, use
%config InlineBackend.figure_format='svg'
%matplotlib inline
To see all options for the inline backend, try %config InlineBackend

To switch between inline and notebook you will need to restart your server using the control panel link at the upper right of the window.

Matplotlib with Spark

If you want to use Spark with graphics, you are probably better off running it locally in a Python 3 notebook. That's because the protocol used to run Spark on the cluster doesn't allow graphics. Here's an example of setting up Spark and Hive SQL contexts locally:

from pyspark.sql import HiveContext
import pyspark
sc = pyspark.SparkContext()
sqlContext = HiveContext(sc)
Of course you would need additional imports and configuration for the Matplotlib part.
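
As a hypothetical illustration (the data here is made up), one common pattern is to convert a Spark DataFrame to pandas and plot it with matplotlib, using the sc and sqlContext created above:

%matplotlib inline
import matplotlib.pyplot as plt
from pyspark.sql import Row

# Build a small DataFrame with the local sqlContext, convert it to pandas,
# and let pandas/matplotlib draw a bar chart.
df = sqlContext.createDataFrame([Row(age=25, n=10), Row(age=30, n=7), Row(age=35, n=3)])
pdf = df.toPandas()
pdf.plot(x="age", y="n", kind="bar")
plt.show()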

If you want to try running Spark on the cluster, you need to send data back and plot it in a local copy of python, e.g. using %%local. See https://github.com/jupyter-incubator/sparkmagic/issues/322 for more information on how to do this.
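
One way to do that, assuming your version of Sparkmagic supports the -o option and that %matplotlib inline is already active locally, is to pull the results of a query into a local pandas DataFrame. For example, using the bank table from the earlier example, run these as two separate cells:

%%sql -o ages
select age, count(1) as n from bank group by age order by age

%%local
import matplotlib.pyplot as plt

# "ages" is now a local pandas DataFrame created by the -o option above.
ages.plot(x="age", y="n", kind="bar")
plt.show()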

If you want to use graphics with Spark or the cluster, you may be better off using Zeppelin.

If you want to use inline or interactive mode a lot, you can make it the default. From a command line on any ilab machine do this:

echo "c.InteractiveShellApp.matplotlib = 'notebook'" > ~/.ipython/profile_default/ipython_config.py
Use 'inline' rather than 'notebook' for non-interactive output. Note that this will affect any copy of ipython you start, not just copies running in Jupyter. That's why we're not doing it by default.

Technical details

Jupyter.cs.rutgers.edu is running Jupyterhub and Jupyter that came with Anaconda. In order to support Hadoop, Sparkmagic was added. This adds the Spark kernels.

The system is a client node in our Hadoop cluster. That is, it doesn't run any cluster services, but it can access HDFS, and has a copy of Spark loaded. That allows Spark to be run locally.

To submit jobs to the cluster, we automatically create .sparkmagic/config.json in the user's home directory the first time they login. It points the Hadoop client to the cluster. Access is done through Livy, which is a proxy that lets systems outside the cluster submit jobs to it. The jobs are scheduled by Yarn. (HDP3: configuration is slightly different, so we place the new version of config.json in .sparkmagic2)
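
For reference, here is a stripped-down sketch of what such a config.json might contain (the exact keys vary with the Sparkmagic version; see Sparkmagic's example_config.json for the authoritative layout):

{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://data-services2.cs.rutgers.edu:8999",
    "auth": "Kerberos"
  },
  "session_configs": {
    "conf": {
      "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "/usr/lib/anaconda3/bin/python3",
      "spark.yarn.appMasterEnv.PYTHONUSERBASE": "/common/clusterdata/NETID/local"
    }
  }
}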

The cluster is Kerberized. Rutgers code has been added to Jupyterhub to make sure that when the user starts a notebook it points to the user's Kerberos credentials.

The default .sparkmagic/config.json specifies the Anaconda version of Python 3 for jobs submitted to the cluster. It also points PYTHONUSERBASE to /common/clusterdata/NETID/local to make sure that cluster jobs can access modules installed using "pip install --user". (HDP3: configuration is slightly different, so we place the new version of config.json in .sparkmagic2)