Jupyter

This page documents Jupyter Notebook and Jupyter Lab, both the versions that you run yourself from the command line and the versions you get from the web server at jupyter.cs.rutgers.edu. The software is the same.

Warning for people using jupyter.cs.rutgers.edu.

Sessions time out. In the newest version, the system seems to handle this reasonably. You may, however, be confronted with a page that tells you to restart your server. If that happens, we strongly recommend that you log out and log in first.

There are two links you may need to restart.

Jupyter is a "notebook." It's a graphical tool that makes it easy to run Python or Scala commands and display results graphically. It's particularly useful for exploring data. However, for substantial Python or Scala programs you're better off using Python or Scala on the command line or from an IDE.

There are two ways to run Jupyter: Jupyter Lab and Jupyter Notebook. It's the same software with the same capabilities, but Jupyter Lab is a newer user interface, and seems slightly faster. They use the same notebook files. We recommend using Jupyter Lab.

You can run jupyter lab/notebook on any of our systems, using a command like

/koko/system/anaconda/envs/python38/bin/jupyter lab --ip=`hostname` --browser="none"
This will display a URL. Use your browser to connect to it. It actually gives three versions of the URL. Copy the one that starts with http://HOSTNAME. Don't use http://127.0.0.1. (If you are sitting at the computer, or using X2Go / Microsoft Remote Desktop, you can omit the --ip and --browser arguments. It will then start up a browser for you on the same system.)
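
If you are scripting this, the rule "take the URL that isn't the loopback address" can be sketched in Python as follows. This is only an illustration: the function name is made up, and the sample URLs (host name, port, and token) are invented rather than real output from our systems.

```python
from urllib.parse import urlparse

# Host names that only work from the machine Jupyter is running on.
LOOPBACK = {"127.0.0.1", "localhost", "::1"}


def pick_remote_url(urls):
    """Return the first http(s) URL whose host is not a loopback address."""
    for url in urls:
        parsed = urlparse(url)
        if parsed.scheme in ("http", "https") and parsed.hostname not in LOOPBACK:
            return url
    return None


# Hypothetical startup output; the host name, port and token are invented.
printed = [
    "file:///home/user/.local/share/jupyter/runtime/jpserver-1234-open.html",
    "http://myhost.example.edu:8888/lab?token=abc123",
    "http://127.0.0.1:8888/lab?token=abc123",
]
print(pick_remote_url(printed))
```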

For Jupyter Notebook, use "notebook" instead of "lab."

If you prefer, you can point your browser to http://jupyter.cs.rutgers.edu. That will start a copy of Jupyter Lab that is identical to what you'd get if you run it yourself. The only difference is that jupyter.cs.rutgers.edu currently does not have access to GPUs. If you prefer the original Notebook interface, under "Help" you'll see an option "Launch Classic Notebook".

NOTE: You may also want to look at the Jupyter project's own documentation. This document focuses on how to do Spark programming from the notebook.

We actually have two notebooks, Jupyterhub and Zeppelin. Zeppelin is newer, and potentially might have issues, but you may prefer its design, particularly its support for graphical output.

After you've logged into https://jupyter.cs.rutgers.edu, or started Jupyter from the command line, you'll see a file browser for your home directory, which shows only notebook files. (In Jupyter Lab, the file browser is on the left.) To get the interesting functionality, you need to open a notebook.

Notebook types

Here's what the various types of notebook are, for the copy of Jupyter in Anaconda python 3.8, i.e.

/koko/system/anaconda/envs/python38/bin/jupyter lab --ip=`hostname` --browser=none
Older python environments will have Spark 2. We use Spark 3 with Java 11, and Spark 2 with Java 8.

If you start Jupyter from the command line, it will use the version of Python you currently have activated. If you connect to jupyter.cs.rutgers.edu, you'll have a choice of the most recent version of Python, or Python 3.6, as well as Spark 2 and 3.

We recommend that you use the most recent version, currently Spark 3 and Python 3.8.

Non-programming

For Python 3, there's excellent documentation at the main Jupyter site: The Jupyter Notebook.

COMPLETION: you can type part of a variable name and use the TAB key. That will show you all of the possibilities beginning with what you typed. In some cases (depending upon context) you can hit TAB right after ., to show the available properties and methods.

The rest of this page gives instructions for using Spark in Jupyter, and also an introduction to graphics from Jupyter. If you're not interested in these things, you can stop now.


Software versions

Currently Spark is 3.1.1, which was released in summer of 2021. (Python 3.6 and 3.7 have older versions.)

Tensorflow is

The Tensorflow in Python 3.8 will change during summer of 2021

We have Anaconda environments for Python 2.7 and 3.5 through 3.8. Spark 3 is in Python 3.8. Spark 2 is in Python 3.7. It may also work in Python 2, but we no longer support Python 2.

On our systems, Java is normally OpenJDK 11. However Spark 2 only works with Java 8, so Spark 2 kernels in Jupyter are configured to use OpenJDK 8.

If you run jupyter notebook or jupyterlab yourself, you get whatever version of python you run it from.
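
To confirm which Python a notebook is actually using, you can run something like the following in a cell. This is a minimal sketch using only the standard library; the helper name describe_python is made up for illustration.

```python
import sys


def describe_python():
    """Return the interpreter path and version of the running Python."""
    version = "{}.{}.{}".format(*sys.version_info[:3])
    return {"executable": sys.executable, "version": version}


info = describe_python()
print("Interpreter:", info["executable"])
print("Version:", info["version"])
```

If the interpreter path is not the one you expected, you are probably running Jupyter from a different environment than you intended.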

If you need to install your own python packages, we suggest that you use the command

pip install --user PACKAGE
You must use the same version of pip as your notebook uses. E.g. if you are using a python3.8 notebook, you should do
/koko/system/anaconda/envs/python38/bin/pip install --user PACKAGE

WARNING: When you use pip, it will suggest that you upgrade it to a new version. Do NOT try to do this. You can't actually upgrade pip, because it is installed in a system directory. If you attempt to do so, you will end up with an inconsistent set of packages.
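
One way to be sure that pip matches the notebook's Python is to invoke pip through the interpreter the notebook is running ("python -m pip"), instead of relying on whichever pip comes first on your PATH. The sketch below only builds and prints the command. The function name is made up, and PACKAGE is a placeholder, as in the command above.

```python
import sys


def pip_install_command(package):
    """Build a 'pip install --user' command tied to the running interpreter."""
    # sys.executable is the Python running this notebook kernel, so
    # "-m pip" selects the pip that belongs to that same Python.
    return [sys.executable, "-m", "pip", "install", "--user", package]


cmd = pip_install_command("PACKAGE")  # placeholder package name
print(" ".join(cmd))

# To actually run it from a notebook cell, something like this should work:
#   import subprocess
#   subprocess.check_call(cmd)
```

After installing a package this way, you may need to restart the kernel before the new package can be imported.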

Spark 3 is currently version 3.1.1. Spark 2 is version 2.4.5.

Python 3 Notebook Types

You can use this to run any python 3 code you want. Just type the code into a cell and hit "run." Documentation for Python and Jupyter is readily available elsewhere, so this section is going to concentrate on using Spark and Hive from a Python 3 notebook. Also see Using Matplotlib for information on doing graphics.

PySpark is the same python as Python 3, but it is set up to do Spark operations. It has predefined the following variables:

Spark in Python3 notebook type

The Spark in Python3 notebook type is intended to run Python/Spark code.

Here's a simple example. It assumes that you have a file "data.txt" in your directory. It should have text in it. It doesn't matter what. (The program will count words.)

Once you have the data file, put this into a cell and hit Run. After it starts a Spark session (if one isn't already started) you'll see a count of the various words in the file.

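# Read the file as an RDD with one element per line
text_file = sc.textFile("data.txt")
# Split each line into words, pair each word with 1, then add up the 1s per word
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
# This prints a description of the RDD, not the counts themselves
print(counts)
# collect() brings the (word, count) pairs back to the notebook
for x in counts.collect():
    print(x)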
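
If the flatMap / map / reduceByKey chain is unfamiliar, it may help to see the same computation in plain Python with no Spark involved. This is only a sketch of what the pipeline computes, not how Spark executes it. The function name and the sample lines are made up.

```python
from collections import Counter


def count_words(lines):
    """Count words the way the Spark example does, but without Spark."""
    counts = Counter()
    for line in lines:
        # Same split as the Spark example: split on single spaces.
        for word in line.split(" "):
            counts[word] += 1
    return counts


# Made-up sample lines standing in for the contents of data.txt.
sample = ["to be or not to be", "that is the question"]
for word, n in sorted(count_words(sample).items()):
    print((word, n))
```

Note that splitting on a single space means that two spaces in a row produce an empty "word". If that matters for your data, line.split() with no argument splits on any run of whitespace instead.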

Sometimes you may need to use packages that aren't built into the system. The easiest way to do this is with the command-line tool "pip". You can log into one of our ilab systems and use it from the command line, or you can create a "terminal" session from jupyterhub.

The command looks like

pip install --user PACKAGE
Pip will offer to update itself. Please don't do that. You must use the same version of pip as your notebook uses. E.g. if you are using a python3.8 notebook, you should do
/koko/system/anaconda/envs/python38/bin/pip install --user PACKAGE
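
To check from a notebook whether a package is visible to the Python you are running, and where it would be loaded from, something like the following should work. It uses only the standard library. The helper name is made up, and the second package name is deliberately one that should not exist.

```python
import importlib.util
import site


def find_package(name):
    """Return the file a top-level module would be loaded from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Packages installed with "pip install --user" normally end up here.
print("User packages directory:", site.getusersitepackages())

for name in ["json", "package_that_is_not_installed"]:  # second name is made up
    location = find_package(name)
    print(name, "->", location if location else "not found")
```

If a package you just installed shows up as not found, the most likely cause is that it was installed with a pip belonging to a different version of Python.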

Using Matplotlib

The primary way of getting graphics for data analysis is the Python package Matplotlib. Another common approach is pandas, but its plotting uses Matplotlib. Using Matplotlib in Jupyter requires configuration.

In Python 3 and Pyspark notebook types, add the line

%matplotlib inline
near the beginning of your code. This is sufficient for output. If you want to do interactive graphics, try
%matplotlib notebook
With %matplotlib notebook, in one case I found that show() wasn't generating output, and had to do draw(). (show() is the normal way to output a plot. draw() is used in interactive mode to update the output when something has changed.)

Here's a sample:

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

# Data for plotting
t = np.arange(0.0, 2.0, 0.01)
s = 1 + np.sin(2 * np.pi * t)

fig, ax = plt.subplots()
ax.plot(t, s)

ax.set(xlabel='time (s)', ylabel='voltage (mV)',
       title='About as simple as it gets, folks')
ax.grid()

fig.savefig("test.png")
plt.show()
It shows a sine function, and also saves the output to a file "test.png".

By default, notebook produces high-resolution vector graphics, and inline produces low-resolution bitmap graphics. That makes inline faster than notebook. However, this can be changed. To get inline to use higher-resolution bitmaps, use

%config InlineBackend.figure_format='retina'
%matplotlib inline
To get vector graphics, use
%config InlineBackend.figure_format='svg'
%matplotlib inline
To see all options for the inline backend, try %config InlineBackend

To switch between inline and notebook you will need to restart your server using the "Control Panel" link at the upper right of the window.

If you want to use inline or interactive mode a lot, you can make it the default. From a command line on any ilab machine do this:

echo "c.InteractiveShellApp.matplotlib = 'notebook'" > ~/.ipython/profile_default/ipython_config.py
Use 'inline' rather than 'notebook' for non-interactive output. Note that this will affect any copy of ipython you start, not just copies running in Jupyter. That's why we're not doing it by default.
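
Be aware that the echo command above replaces any existing ipython_config.py, and it assumes the ~/.ipython/profile_default directory already exists. If you already have settings in that file, a more cautious approach is to append the line only when it is missing. The following is a sketch of that idea; the function name is made up, and the example call is commented out so that running the code changes nothing.

```python
from pathlib import Path


def set_matplotlib_default(config_path, mode="notebook"):
    """Append the matplotlib default to an IPython config file if it is missing.

    Returns True if the file was changed, False if the line was already there.
    """
    line = "c.InteractiveShellApp.matplotlib = '{}'".format(mode)
    path = Path(config_path).expanduser()
    path.parent.mkdir(parents=True, exist_ok=True)  # create the profile directory if needed
    existing = path.read_text() if path.exists() else ""
    if line in existing.splitlines():
        return False  # already configured, nothing to do
    with path.open("a") as f:
        if existing and not existing.endswith("\n"):
            f.write("\n")
        f.write(line + "\n")
    return True


# Example call (commented out so that running this cell changes nothing):
# set_matplotlib_default("~/.ipython/profile_default/ipython_config.py", "inline")
```

If the file already sets a different mode, this appends a second assignment. Since the config file is ordinary Python, the later assignment should be the one that takes effect.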

Spark in Scala Notebook Type

The Spark in Scala notebook type is intended to run Spark in the Scala language. It could be used for any Scala code, but it sets up a Spark context, with the following variables:

This is just about the shortest possible Spark program. It simply returns the version of Spark.

sc.version

This loads data into a Hive SQL table from a URL at Amazon, using Spark in Scala.

import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset

// load bank data
val bankText = sc.parallelize(
    IOUtils.toString(
        new URL("https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv"),
        Charset.forName("utf8")).split("\n"))

case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
    s => Bank(s(0).toInt, 
            s(1).replaceAll("\"", ""),
            s(2).replaceAll("\"", ""),
            s(3).replaceAll("\"", ""),
            s(5).replaceAll("\"", "").toInt
        )
).toDF()
bank.registerTempTable("bank")

This has to be done in two separate cells because a class can't be defined and used in the same cell. The Bank class is defined in the first cell (which ends with the case class line) and used in the second (which starts with val bank).

This creates a temporary table, meaning that it won't be there after the end of the session.
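
If you read Python more easily than Scala, the row parsing in the Scala code above amounts to roughly the following. This is a plain-Python sketch of the parsing logic only, with no Spark and no table. The sample lines are made up, in the layout the Scala code assumes: semicolon-separated fields, quoted strings, and the balance in column 5.

```python
from collections import namedtuple

Bank = namedtuple("Bank", ["age", "job", "marital", "education", "balance"])


def parse_bank_lines(lines):
    """Mirror the Scala map/filter: skip the header, strip quotes, convert integers."""
    rows = []
    for line in lines:
        s = line.split(";")
        if s[0] == '"age"':  # header line
            continue
        rows.append(Bank(
            age=int(s[0]),
            job=s[1].replace('"', ""),
            marital=s[2].replace('"', ""),
            education=s[3].replace('"', ""),
            balance=int(s[5].replace('"', "")),  # column 4 is skipped, as in the Scala code
        ))
    return rows


# Made-up lines in the layout the Scala code assumes.
sample = [
    '"age";"job";"marital";"education";"default";"balance"',
    '30;"unemployed";"married";"primary";"no";1787',
    '25;"services";"single";"secondary";"no";-120',
]
for row in parse_bank_lines(sample):
    print(row)
```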

Here is SQL code that retrieves data from the table

%%sql
select age, count(1) value
from bank 
where age < 30 
group by age 
order by age
The initial output will be text, exactly as you'd expect from this SQL query. However, you'll see options that let you select various types of visualization: pie chart, scatter diagram, etc.
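
To see what this query returns without starting Spark, you can try the same SQL against a tiny in-memory table. Note that this sketch uses SQLite from the Python standard library as a stand-in, not Spark or Hive, and the rows are made up. It is only meant to show the shape of the result: one row per age under 30, with the number of people of that age.

```python
import sqlite3

# SQLite stands in for Spark SQL here; the rows below are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bank (age INTEGER, job TEXT)")
conn.executemany(
    "INSERT INTO bank VALUES (?, ?)",
    [(25, "services"), (25, "student"), (28, "admin"), (35, "management")],
)

result = conn.execute(
    "SELECT age, count(1) AS value FROM bank "
    "WHERE age < 30 GROUP BY age ORDER BY age"
).fetchall()
conn.close()

print(result)  # [(25, 2), (28, 1)]
```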

If you need to use classes that aren't already part of the system, you can use %AddDeps. It can load any module from the Maven 2 repository. Here's an example:

%AddDeps org.vegas-viz vegas_2.11 0.3.11 --transitive
If you go to https://mvnrepository.com/, you can search for specific packages. In this case if you look for vegas, you'll find a group org.vegas-viz, with the artifact vegas_2.11. In the end, you'll find a maven declaration

<dependency>
    <groupId>org.vegas-viz</groupId>
    <artifactId>vegas_2.11</artifactId>
    <version>0.3.11</version>
</dependency>

In the %AddDeps declaration, you list the groupId, artifactId, and version. The --transitive means that it should pull in any other packages needed by this one. You normally want to use --transitive.
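
If you copy Maven declarations often, the conversion to an %AddDeps line is mechanical. Here is a sketch in Python, assuming the declaration is in the usual XML dependency form with groupId, artifactId, and version elements. The function name is made up.

```python
import xml.etree.ElementTree as ET


def add_deps_line(dependency_xml, transitive=True):
    """Turn a Maven dependency snippet into an %AddDeps line."""
    dep = ET.fromstring(dependency_xml)
    # %AddDeps takes the three coordinates in this order.
    parts = [dep.findtext(tag).strip() for tag in ("groupId", "artifactId", "version")]
    line = "%AddDeps " + " ".join(parts)
    return line + " --transitive" if transitive else line


declaration = """
<dependency>
    <groupId>org.vegas-viz</groupId>
    <artifactId>vegas_2.11</artifactId>
    <version>0.3.11</version>
</dependency>
"""
print(add_deps_line(declaration))
```

The sketch assumes all three elements are present. A declaration that leaves out the version would need to be handled separately.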

Here's their first demo display, which imports a package to display graphics, Vegas.

import vegas._
import vegas.render.WindowRenderer._

val plot = Vegas("Country Pop").
  withData(
    Seq(
      Map("country" -> "USA", "population" -> 314),
      Map("country" -> "UK", "population" -> 64),
      Map("country" -> "DK", "population" -> 80)
    )
  ).
  encodeX("country", Nom).
  encodeY("population", Quant).
  mark(Bar)

plot.show
(Note that Vegas won't actually work in the current version of Spark. This is given as an example of how to include an extra module.)