Jupyterhub

In case your server is unusable, here's a link to stop it and restart it: Reset Jupyter

Jupyterhub is part of the computer science instructional Hadoop cluster.

NOTE: You may also want to look at the Jupyter project's own documentation. This document focuses on how to do Hadoop programming from the notebook.

Jupyter is a "notebook," a web interface that makes it easier to run python. It also lets you use Spark and Spark SQL with Python, Scala, and R.

We actually have two notebooks, Jupyterhub and Zeppelin. Zeppelin is newer and might have issues, but you may prefer its design, particularly its support for graphical output.

After you've logged into https://jupyter.cs.rutgers.edu, you'll see a file browser for your home directory, which shows only notebook files. To get the interesting functionality, you need to open a notebook.

If you're using Jupyterhub as a way to run Python, and have no interest in Spark or the Hadoop cluster, consider running Jupyter Notebook on another system. See Using Python on CS Linux Machines for specifics on how to choose a Python environment on any of our systems, and then how to start Jupyter Notebook. If you do want to use Jupyterhub for non-Hadoop Python, pick the "Python 3" notebook type. In that case, you don't really need the rest of this page. This documentation is aimed at people who want to use Python for Spark and for Hadoop jobs.

Here's what the various types of notebook are; each of the main types ("Python 3", Spark, and PySpark) is described in its own section below.

Note that there's overlap in functionality between the "Python 3" and Spark / PySpark notebook types.

So as you can see, it's mostly a matter of personal preference whether to use Python 3 or Spark / PySpark.

NOTE: Once you've looked at the summary here, we strongly suggest that you look at the Sparkmagic examples from the main Sparkmagic site, which contains a number of sample notebooks with detailed explanations.

NOTE: Many of the tools in Jupyter use a special file system, HDFS. Files on HDFS are backed up nightly to a second HDFS file system in a separate building. Snapshots are taken nightly, so it is possible to restore deleted files within 60 days. However, we are not yet sure what our policy on retention of files is going to be. It is possible that we might reset the file system each summer. Please contact help@cs.rutgers.edu if you need to keep files in HDFS on an ongoing basis, and we'll arrange to preserve them if we decide to clean the file system.

COMPLETION: You can type part of a variable name and hit the TAB key. That will show you all of the possibilities beginning with what you typed. In some cases (depending upon context) you can hit TAB right after "." to show the available properties and methods.

HDFS

HDFS is a distributed file system, used for the Hadoop cluster. When you are running code on the cluster, it can only read files in the HDFS file system. You'll need to copy files from your home directory into your HDFS directory.

If you log in to data1.cs.rutgers.edu, data2.cs.rutgers.edu, or data3.cs.rutgers.edu, you can use this command:

hdfs dfs -put FILE /user/NETID/FILE
HDFS doesn't have any concept of a current directory, so you can't do the equivalent of "cd" to a different directory. If you omit the directory name, you'll always get /user/NETID.

HDFS has commands much like normal Linux commands, e.g. "hdfs dfs -ls" and "hdfs dfs -rm" are equivalent to "ls" and "rm". Use "hdfs dfs -help" for a list of commands.

There's also a web interface to HDFS, which will let you upload and download files. Log in to Ambari, at https://data-services1.cs.rutgers.edu. There's a tic-tac-toe icon in the upper right. If you hover over it you'll get a list of Web tools. "Files view" shows you HDFS.

The web interface starts out at the root of HDFS, so to get to your files you'll need to pick "user" and then your NetID.

If you want to issue hdfs commands from within Python, you can use subprocess.check_output. E.g. to do "hdfs dfs -put data.txt /user/USER/data.txt", do

%%local
import subprocess
# The "; exit 0" keeps check_output from raising an exception if hdfs reports an error,
# so you see the error message rather than a Python backtrace.
subprocess.check_output("hdfs dfs -put data.txt /user/USER/data.txt; exit 0", shell=True, stderr=subprocess.STDOUT)

Starting and restarting Jupyter; if things go wrong

Normally when you log in, you'll see a window showing your directories and any files ending in .ipynb. These files represent notebooks you've already created. If you haven't created any, you may get a fairly blank display. You can create a new notebook with the "New" pulldown at the upper right.

The notebooks will time out after 8 hours of non-use. You can reopen them by clicking the name of the notebook file in the main window.

Jupyter itself will time out after 2 weeks of non-use. However, you will need to log in again after 2 weeks, whether you're using it or not.

We think these settings won't cause you any problems. However, it's possible that due to a timeout or something else your session could become unusable. If so, there are two different things you may need to do:

If your code gets into an infinite loop, you can interrupt it. There's an icon just to the right of the run icon (at the top of the screen). It's labelled "Interrupt the kernel." It's like typing ^C in the terminal session: it interrupts the current program.

It's possible to do something that will cause your kernel to crash or hang. Maybe it gives a system error every time. Or nothing shows at all. Of course this might be because there's a problem with the program. But it could also be that the program has done something to the interpreter to make it crash or hang.

To deal with a system that has hung or crashed, you can restart the server. At the upper right, there's a button "Control Panel." It brings up a window with two buttons: "Stop My Server" and "My Server." "My Server" just takes you back to the main page. "Stop My Server" stops the server process handling your notebooks. Once you've done that, you'll have just one button, "Start My Server." Click it, and it will take you back to your main page.

Note on python for this cluster

For the Hadoop cluster, we have three versions of Python. Versions of python2 and python3 come with the operating system; we haven't removed them. But the one you probably want is a more recent Python 3, which we have installed using Anaconda.

When you use Python from Jupyterhub you automatically get the new Python 3. For data1, data2, and data3, we have set the default environment to use the new Python 3 as well.

If you set PATH yourself in .bashrc, make sure you include /usr/lib/anaconda3/bin before /bin and /usr/bin. We also set PYSPARK_PYTHON=/usr/lib/anaconda3/bin/python3. This will make sure that when you run Python interactively you get the same version that you get with Jupyterhub.
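
To confirm which interpreter you're getting, you can check from inside Python itself (in a notebook cell, or in an interactive python3 on data1, data2, or data3). This is just a quick sanity check, not an official procedure:

import sys
print(sys.executable)   # should point somewhere under /usr/lib/anaconda3
print(sys.version)      # should report Python 3.6.x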

If you need to install your own python packages, we suggest that you use the command

pip install --user PACKAGE
Jupyter, data1, data2 and data3 (but not other ilab systems) are set up so that "install --user" automatically installs packages to /common/clusterdata/USER/local. That makes sure that python will use the packages whether you call it interactively, via Jupyterhub, or for jobs submitted to the cluster.

Normally, "install --user" installs in ~/.local. That location won't work for jobs running on the cluster. So we are using a special location that is available in all Hadoop contexts. It is not available outside the Hadoop system, i.e. outside jupyterhub, data1, data2 and data3.

WARNING: When you use pip, it may suggest that you upgrade pip itself to a newer version. Do NOT try to do this. You can't actually upgrade pip, because it is installed in a system directory, and in attempting to do so you will end up with an inconsistent set of packages.

Software versions

Jupyterhub and the Python software it uses were installed using Anaconda 5.2.0. The Python used is version 3.6.5. The Spark software is from Hortonworks 2.6.3; the Spark version is 2, built with Scala 2.11. (Spark version 1 is also available, but we set up configuration files for you that specify Spark 2.)

The cluster has python 2.7.5, python 3.4.8, and the same Anaconda python 3.6.5. By default Python jobs submitted to the cluster use Anaconda's python, so the python version you get locally in jupyterhub is the same you get on the cluster.

You can change the version of Python used for jobs on the cluster, either using %%configure -f in a PySpark notebook or when you create a new session in a Python 3 notebook using %manage_spark. You can make a permanent change by editing .sparkmagic/config.json: look for "session_configs" and change the value of PYSPARK_PYTHON.
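
For example, from a PySpark or Spark notebook you could point cluster jobs at the Anaconda interpreter like this (the same property appears in the larger example under "Advanced options with Spark" below; close any existing session with "%%cleanup -f" first):

%%configure -f
{"conf":{"spark.yarn.appMasterEnv.PYSPARK_PYTHON":"/usr/lib/anaconda3/bin/python3"}}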

Python 3 Notebook Type

You can use this to run any Python 3 code you want. Just type the code into a cell and hit "run." However, if you are primarily going to be using Spark from Python, you might prefer the PySpark notebook type. See below.

You can also use it to run Spark code on the cluster, using either Python or Scala.

To use Spark, first execute

%load_ext sparkmagic.magics
then
%manage_spark

before you run anything on the cluster. Setting up takes almost a minute; a lot of work has to be done to create a session.

To run code on the cluster, you need a "session." %manage_spark is used to create and manage sessions. In the "manage endpoint" tab, if no endpoint is shown, use "add endpoint" and specify the URL as "http://data-services2.cs.rutgers.edu:8999", with Kerberos authentication. This should be the default, so normally you don't have to type the URL or select Kerberos.

After making sure there's an appropriate endpoint, see if there's a session. If not, go to "create session", choose Scala or Python, then click "create session."

Note that when you start a session, it will take quite a while. It is creating a software environment for your session on the cluster nodes, installing some software, and starting it all up. Once the session is started you'll see some session information.

Spark cluster sessions expire, currently after an hour. Unfortunately the notebook doesn't know your session has expired, so the next time you try to do something on the cluster you'll get an error. To fix it, use %manage_spark. Find the current session, kill it, and create a new one.

You can now execute Spark code in Python or Scala (whichever you chose). Put "%%spark" in a cell, followed by your Python or Scala Spark code on the lines after it.
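
For example, a minimal cell looks like this; it works whether the session language is Scala or Python, and simply reports the Spark version (the same snippet appears in the examples further down):

%%spark
sc.version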

For more information on what you can do, pull down "Help" from the menu at the top, and choose iPython.

If you want to run Python on the cluster, in %manage_spark, when you create the session, look for the Language field. It lets you choose Scala or Python. Choose Python.

Normally when you run python locally, i.e. without any %spark or other magic, you can't use Spark, because you don't have a SparkContext. %manage_spark creates SparkContexts, but for use on the cluster. When you're trying things out, you may prefer to run locally rather than on the cluster. It should be significantly faster. To get a SparkContext you can use locally, do the following in a cell:

import pyspark
sc = pyspark.SparkContext(master="local",appName="count")
(The appName can be anything.) You should only do this once in a session. If you try it again, it will fail, probably with "Cannot run multiple SparkContexts at once." Once you've done it the first time, the variable sc can be used for local code, just as for code run on the cluster.

If for some reason you need to recreate the SparkContext, you can do

sc.stop()
and then reinitialize it as above.

Spark Notebook Type

A Spark notebook is fairly similar to an iPython notebook, though session management is a bit different, and the default language is Scala. To get the same thing with Python, use a PySpark notebook. (See below.)

To run code on the cluster, you need a "session." A session is created automatically the first time you run something on the cluster. It takes almost a minute to set up. A lot of work has to be done to create a session.

You can see whether there's currently a session by using

%%info
in a cell. To run spark code, use %%spark followed by the spark code, e.g.
%%spark
sc.version

Spark cluster sessions expire, currently after an hour. Unfortunately the notebook doesn't know your session has expired, so the next time you try to do something on the cluster you'll get an error. %%info will show you the sessions that Jupyter thinks are active. However, if a session has expired it will still look normal; it just won't work. To fix this, do "%%cleanup -f". The next time you do a cluster operation a new session will be created.

I ran into a case where %%spark failed, claiming there was no session. I manually created one using

%%configure -f
{}

The %%cleanup command can be used to close your session. You might do this if your session times out on the cluster, but the notebook still thinks it's alive.

%%cleanup -f

You can run Python code locally (in the VM that's running Jupyterhub, not the cluster) using %%local:

%%local
1 + 1

To see all the things you can do, use

%%help

As with iPython, it takes a while to create a session. It's creating a container for you on every cluster node and adding software. That's why we ask you not to create new notebooks and sessions unnecessarily.

Spark Notebook Examples

Here are a couple of examples to get you started. These were done in a Spark notebook, but it should be fairly easy to adapt them to the other notebook types.

This is just about the shortest possible Spark program. It simply returns the version of Spark.

%%spark
sc.version

This loads data into a Hive SQL table from a URL at Amazon, using Spark in Scala.

%%spark
import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset

// The Spark session provides sc (SparkContext) and sqlContext (HiveContext or SQLContext),
// so you don't need to create them manually

// load bank data
val bankText = sc.parallelize(
    IOUtils.toString(
        new URL("https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv"),
        Charset.forName("utf8")).split("\n"))

case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
    s => Bank(s(0).toInt, 
            s(1).replaceAll("\"", ""),
            s(2).replaceAll("\"", ""),
            s(3).replaceAll("\"", ""),
            s(5).replaceAll("\"", "").toInt
        )
).toDF()
bank.registerTempTable("bank")
Note that this creates a temporary table, meaning that it won't be there after the end of the session.

This is SQL code that retrieves data from the table

%%sql
select age, count(1) value
from bank 
where age < 30 
group by age 
order by age
The initial output will be text, exactly as you'd expect from this SQL query. However, you'll also see options that let you select various types of visualization: pie chart, scatter diagram, etc.

Advanced options with Spark

This section applies to all three types of notebook, although the specific properties used as the example here apply mostly to Java and Scala (i.e. the "Spark" notebook type).

There are times when you want to specify options for your Spark session. E.g. if you want to use packages that we haven't installed, you can specify packages, and if necessary the URL of the repository they come from. You can also specify the number of cores to be used, the amount of memory, etc. To specify options, close your session if necessary using "%%cleanup -f". Then configure it with "%%configure -f", e.g.

%%configure -f
{"conf":{"spark.yarn.appMasterEnv.PYSPARK_PYTHON":"/usr/lib/anaconda3/bin/python3",
         "spark.yarn.appMasterEnv.PYTHONUSERBASE": "/common/clusterdata/NETID/local",
         "spark.jars.packages":"graphframes:graphframes:0.5.0-spark2.1-s_2.11",
         "spark.jars.repositories":"https://dl.bintray.com/spark-packages/maven"}}
where NETID is your NetID. This example specifies python3 for Python jobs (which we recommend), sets Python up so it can access packages installed with "pip install --user", and, for Scala and Java, adds the graphframes package from the spark-packages repository.

Options like this can be made the default by editing your .sparkmagic/config.json file. Add a dictionary "session_configs" if it isn't there, or modify it if it is. Here's the way to set the above configuration as default:

  "session_configs": {
      "conf": {
          "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "/usr/lib/anaconda3/bin/python3",
          "spark.yarn.appMasterEnv.PYTHONUSERBASE": "/common/clusterdata/NETID/local",
          "spark.jars.packages": "graphframes:graphframes:0.5.0-spark2.1-s_2.11",
          "spark.jars.repositories": "https://dl.bintray.com/spark-packages/maven"
      }
  },

Note that you can look at the current configuration with %%info. However, %%info displays it using single quotes, and "%%configure -f" won't recognize single quotes; you must use double quotes.

For a list of all of the options available, see the Spark documentation.

For the "Python 3" notebook type, if you want to supply configuration for one session, rather than putting it on config.json, the configuration is supplied with the "%manage_spark" command when starting a session.

Pyspark notebook type

The PySpark notebook type is intended to run Python/Spark code on the cluster, although you can explicitly request that code be run locally.

When you type something into the cell and hit "run", by default it runs on the cluster. You can override this by using the "%%local" magic as the first line in the cell.

To run code on the cluster, you need a "session." A session is created automatically the first time you run something on the cluster. It takes almost a minute to set up. A lot of work has to be done to create a session.

Spark cluster sessions expire, currently after an hour. Unfortunately the notebook doesn't know your session has expired, so the next time you try to do something on the cluster you'll get an error. %%info will show you the sessions that Jupyter thinks are active. However, if a session has expired it will still look normal; it just won't work. To fix this, do "%%cleanup -f". The next time you do a cluster operation a new session will be created.

Commands to the notebook are done by putting "magics" into a cell and running it. The magics all begin with %. To see a list of all of them run

%help

Normally when you run python locally, i.e. with %%local, you can't use Spark, because you don't have a SparkContext. Without %%local, you're working on the cluster, and the system creates a Sparkcontext for you, but only for use on the cluster. When you're trying things out, you may prefer to run locally rather than on the cluster. It should be significantly faster. To get a SparkContext for local use, do the following in a cell:

%%local
import pyspark
sc = pyspark.SparkContext(master="local",appName="count")
You should only do this once in a session. If you try it again, it will fail, probably with "Cannot run multiple SparkContexts at once." Once you've done it the first time, the variable sc can be used for local code, just as for code run on the cluster. If for some reason you need to recreate the SparkContext, you can do
%%local
sc.stop()
and then reinitialize it as above.

To use Spark SQL in %%local, you will have to do additional imports and initialization; documentation for using Python with Spark describes this. Note that pyspark on the cluster, and the pyspark shell in a command-line session, will set up both the SparkContext and an SQL context for you.
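
The details are in the standard pyspark documentation, but here is a rough sketch of the kind of initialization involved, assuming you've already created the local SparkContext sc as shown above (the little DataFrame is just made-up data to show that the SQL context works):

%%local
from pyspark.sql import SQLContext

# wrap the existing local SparkContext in an SQL context
sqlContext = SQLContext(sc)

# build a tiny throwaway DataFrame and display it
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()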

Sample Pyspark code

Here's a simple example. It assumes that you have loaded a text file into HDFS as /user/USER/data.txt, where USER is your NetID.

Look at the HDFS section above for how to copy a file from your home directory into HDFS. This is complicated by the need to show you any error message. The "; exit 0" forces python to think the command worked. Otherwise it will give you a backtrace rather than showing the error message.

Once you have the data file, put this into a cell and hit Run. After it starts a Spark session (if one isn't already started), you'll see a count of the various words in the file.

text_file = sc.textFile("/user/USER/data.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
print(counts)                 # counts is an RDD; this just prints its description
for x in counts.collect():    # collect() brings the (word, count) pairs back to the notebook
    print(x)

Technical details

Jupyter.cs.rutgers.edu is running Jupyterhub and Jupyter that came with Anaconda. In order to support Hadoop, Sparkmagic was added. This adds the Spark kernels.

The system is a client node in our Hadoop cluster. That is, it doesn't run any cluster services, but it can access HDFS, and has a copy of Spark loaded. That allows Spark to be run locally.

To submit jobs to the cluster, we automatically create .sparkmagic/config.json in the user's home directory the first time they log in. It points the Hadoop client to the cluster. Access is done through Livy, which is a proxy that lets systems outside the cluster submit jobs to it. The jobs are scheduled by Yarn.

The cluster is Kerberized. Rutgers code has been added to Jupyterhub to make sure that when the user starts a notebook it points to the user's Kerberos credentials.

The default .sparkmagic/config.json specifies the Anaconda version of Python 3 for jobs submitted to the cluster. It also points PYTHONUSERBASE to /common/clusterdata/NETID/local to make sure that cluster jobs can access modules installed using "pip install --user".