Jupyter

This page documents Jupyter Notebook and Jupyter Lab, both the versions that you run yourself from the command line and the versions you get from the web server at jupyter.cs.rutgers.edu. The software is the same.

Jupyter is a "notebook." It's a graphical tool that makes it easy to run Python, Java, Scala, Mathematica, or R code and display the results graphically. It's particularly useful for exploring data. However, if you're writing substantial programs you're better off using Python, Java, or Scala from the command line or from an IDE.

There are two ways to run Jupyter: Jupyter Lab and Jupyter Notebook. It's the same software with the same capabilities, but Jupyter Lab is a newer user interface, and seems slightly faster. They use the same notebook files. We recommend using Jupyter Lab.

You can type a command like jupyter lab to run Jupyter Lab. In practice you'll need to activate a Python environment first, or type /common/system/anaconda3/envs/python310/bin/jupyter instead of just jupyter.

Jupyter starts a web server. There are two ways to connect to it:

  1. Use weblogin, rdp, or x2go to create a graphical session to one of our computers. When you run Jupyter, it will automatically open a browser for you, pointed at Jupyter. If for some reason that doesn't work, you can run Chrome or Firefox yourself. Copy and paste either of the URLs that Jupyter prints into the browser's URL bar.
  2. Use a command line session, i.e. ssh. In this case, type jupyter lab --ip=`hostname` --browser="none". It will print a list of 3 URLs, e.g.
          To access the server, open this file in a browser:
            file:///common/home/hedrick/...
          Or copy and paste one of these URLs:
            http://ilab1.cs.rutgers.edu:8889/...
          or http://127.0.0.1:8889/...
    Copy and paste the middle one (the one with the hostname, in this case ilab1.cs.rutgers.edu) into your browser.

For Jupyter notebook, use "notebook" instead of "lab."

If you prefer, you can point your browser to http://jupyter.cs.rutgers.edu. That will start a copy of Jupyter Lab that is identical to what you'd get if you run it yourself. The only difference is that jupyter.cs.rutgers.edu currently does not have access to GPUs. If you prefer the original Notebook interface, under "Help" you'll see an option "Launch Classic Notebook".

NOTE: You may also want to look at the Jupyter project's own documentation. This document focuses on how to do Spark programming from the notebook.

After you've logged into https://jupyter.cs.rutgers.edu, or started Jupyter from the command line, you'll see a file browser for your home directory, which shows only notebook files. (In Jupyter Lab, the file browser is on the left.) To get the interesting functionality, you need to open a notebook.

Notebook types

The notebook types described in the sections below are those for the copy of Jupyter in Anaconda Python 3.10, i.e.

/koko/system/anaconda3/envs/python310/bin/jupyter lab --ip=`hostname` --browser=none
In addition, jupyter.cs.rutgers.edu has Mathematica and R, and icons for older versions of Python. Older python environments will have Spark 2. We use Spark 3 with Java 17, and Spark 2 with Java 8.

If you start Jupyter from the command line, it will use the version of Python you currently have activated. If you connect to jupyter.cs.rutgers.edu, the default is Python 3.9. You can also choose Python 3.7 or 3.8, in case you have software that hasn't been updated to 3.9 yet.
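
If you're not sure which version of Python a given notebook is actually running, you can check from a notebook cell. A minimal sketch:

import sys
print(sys.version)   # version of the interpreter this notebook is running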

Non-programming

For Python 3, there's excellent documentation at the main Jupyter site: The Jupyter Notebook.

COMPLETION: you can type part of a variable name and use the TAB key. That will show you all of the possibilities beginning with what you typed. In some cases (depending upon context) you can hit TAB right after ".", to show the available properties and methods.

The rest of this page gives instructions for using Spark in Jupyter, and also an introduction to graphics from Jupyter. If you're not interested in these things, you can stop now.

Software versions

Currently Spark is 3.4.0, which was released in April, 2023.

Tensorflow is

Pytorch is 2.0.1.

We have Anaconda environments for 3.9 through 3.11. (There may be older environments, but they will eventually be removed.)

On our systems, Java is normally OpenJDK 17. However Spark 2 only works with Java 8, so Spark 2 kernels in Jupyter are configured to use OpenJDK 8. Spark 3 only works with Java 11 in Jupyter, though it works fine with Scala in the Java 17 JVM.

If you run Jupyter Notebook or Jupyter Lab yourself, you get whatever version of Python you run it from.

If you need to install your own python packages, we suggest that you use the command

pip install --user PACKAGE
You must use the same version of pip as your notebook uses. E.g. if you are using a Python 3.10 notebook, you should do
/common/system/anaconda3/envs/python310/bin/pip install --user PACKAGE
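
If you're not sure which environment a notebook is using, you can find its interpreter (and therefore the matching pip) from a notebook cell. A minimal sketch:

import sys
print(sys.executable)   # path to this notebook's Python interpreter
# the matching pip lives in the same bin directory as the interpreter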

WARNING: When you use pip, it will suggest that you upgrade it to a new version. Do NOT try to do this. You can't actually upgrade pip, because it is installed in a system directory. In attempting to do so, you will end up with an inconsistent set of packages.


Python 3 Notebook Types

You can use the Python 3 notebook type to run any Python 3 code you want. Just type the code into a cell and hit "run." Documentation for Python and Jupyter is readily available elsewhere, so this section is going to concentrate on using Spark and Hive from a Python 3 notebook. Also see Using Matplotlib for information on doing graphics.

PySpark is the same Python as Python 3, but it is set up to do Spark operations. It predefines Spark variables for you, typically sc (the SparkContext) and spark (the SparkSession); the examples below use sc.
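
As a quick check that the Spark setup is live, you can print the Spark version from a cell. A minimal sketch, assuming the standard sc and spark variable names:

print(sc.version)      # Spark version, via the SparkContext
print(spark.version)   # the same, via the SparkSession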

Java notebook type

This is ijava. It uses jshell, so it should support any Java from 9 on. We use the system-wide Java, which is currently Java 17. See the ijava project's documentation for examples of how to display graphics. You can use xchart, though that's not necessarily the only package that would work.

Spark in Python3 notebook type

The Spark in Python3 notebook type is intended to run Python/Spark code.

Here's a simple example. It assumes that you have a file "data.txt" in your directory. It should have text in it; it doesn't matter what. (The program will count words.)

Once you have the data file, put this into a cell and hit Run. After it starts a Spark session (if one isn't already started), you'll see a count of the various words in the file.

text_file = sc.textFile("data.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
print(counts)
for x in counts.collect():
     print(x)
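
Since counts is an RDD, you can keep transforming it before collecting the results. For example, here's a sketch (reusing the counts variable from above) that prints the ten most frequent words first:

# sort the (word, count) pairs by count, descending, and take the top 10
top = counts.sortBy(lambda pair: pair[1], ascending=False).take(10)
for word, n in top:
    print(word, n)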

Sometimes you may need to use packages that aren't built in to the system. The easiest way to do this is with the command-line tool "pip". You can log into one of our ilab systems and use it from the command line, or you can create a "terminal" session from JupyterHub.

The command looks like

pip install --user PACKAGE
Pip will offer to update itself. Please don't do that. You must use the same version of pip as your notebook uses. E.g. if you are using a Python 3.10 notebook, you should do
/common/system/anaconda3/envs/python310/bin/pip install --user PACKAGE

Using Matplotlib

The primary way of getting graphics for data analysis is the Python package Matplotlib. Another common approach is pandas, but its plotting is built on Matplotlib. Using Matplotlib in Jupyter requires a bit of configuration.

In Python 3 and Pyspark notebook types, add the line

%matplotlib inline
near the beginning of your code. This is sufficient for output. If you want to do interactive graphics, try
%matplotlib notebook
With %matplotlib notebook, in one case I found that show() wasn't generating output, and I had to call draw(). (show() is the normal way to output a plot. draw() is used in interactive mode to update the output when something has changed.)
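
Here's a minimal sketch of that interactive pattern, assuming the notebook backend works in your session: plot some data, then change it and redraw.

%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()
line, = ax.plot(x, np.sin(x))
plt.show()                   # display the initial plot

line.set_ydata(np.cos(x))    # change the plotted data
plt.draw()                   # redraw the existing figure to show the change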

Here's a fuller sample, using the inline backend:

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

# Data for plotting
t = np.arange(0.0, 2.0, 0.01)
s = 1 + np.sin(2 * np.pi * t)

fig, ax = plt.subplots()
ax.plot(t, s)

ax.set(xlabel='time (s)', ylabel='voltage (mV)',
       title='About as simple as it gets, folks')
ax.grid()

fig.savefig("test.png")
plt.show()
It shows a sine function, and also saves the output to a file "test.png".

By default, notebook produces high-resolution vector graphics, and inline produces low-resolution bitmap graphics. That makes inline faster than notebook. However, this can be changed. To get inline to use higher-resolution bitmaps, use

%config InlineBackend.figure_format='retina'
%matplotlib inline
To get vector graphics, use
%config InlineBackend.figure_format='svg'
%matplotlib inline
To see all options for the inline backend, try %config InlineBackend

To switch between inline and notebook you will need to restart your server using the "Control Panel" link at the upper right of the window.

If you want to use inline or interactive mode a lot, you can make it the default. From a command line on any ilab machine do this:

echo "c.InteractiveShellApp.matplotlib = 'notebook'" > ~/.ipython/profile_default/ipython_config.py
Use 'inline' rather than 'notebook' for non-interactive output. Note that this will affect any copy of ipython you start, not just copies running in Jupyter. That's why we're not doing it by default.

Spark in Scala Notebook Type

We still have Scala in our copies of Jupyter, because it's one of the 3 major languages for Spark. However we don't recommend it. There are several implementations for Jupyter, but they haven't been touched for years, or they are difficult to use. We haven't been able to find a graphics package that supports Scala 2.12 or later, and 2.12 is required for Spark. Of the Java-like languages, Kotlin seems to have the best support. Please send email to help@cs.rutgers.edu if you're interested.

The Spark in Scala notebook type is intended to run Spark in the Scala language. It could be used for any Scala code, but it sets up a Spark context for you; the examples below use the predefined variable sc (the SparkContext).

This is just about the shortest possible Spark program. It simply returns the version of Spark.

sc.version

This loads data into a Hive SQL table from a URL at Amazon, using Spark in Scala.

import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset

// load bank data
val bankText = sc.parallelize(
    IOUtils.toString(
        new URL("https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv"),
        Charset.forName("utf8")).split("\n"))

case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
    s => Bank(s(0).toInt, 
            s(1).replaceAll("\"", ""),
            s(2).replaceAll("\"", ""),
            s(3).replaceAll("\"", ""),
            s(5).replaceAll("\"", "").toInt
        )
).toDF()
bank.registerTempTable("bank")

The reason this has to be done in two separate cells is that a class can't be defined and used in the same cell. The Bank class is defined in the first cell and used in the second.

This creates a temporary table, meaning that it won't be there after the end of the session.

Here is SQL code that retrieves data from the table

%%sql
select age, count(1) value
from bank 
where age < 30 
group by age 
order by age
The initial output will be text, exactly as you'd expect from this SQL query. However you'll see options that let you select various types of visualization: pie chart, scatter diagram, etc.

If you need to use classes that aren't already part of the system, you can use %AddDeps. It can load any module from the Maven 2 repository. Here's an example:

%AddDeps org.vegas-viz vegas_2.11 0.3.11 --transitive
If you go to https://mvnrepository.com/, you can search for specific packages. In this case if you look for vegas, you'll find a group org.vegas-viz, with the artifact vegas_2.11. In the end, you'll find a Maven declaration like this:

    <dependency>
        <groupId>org.vegas-viz</groupId>
        <artifactId>vegas_2.11</artifactId>
        <version>0.3.11</version>
    </dependency>

In the %AddDeps declaration, you list the groupId, artifactId, and version. The --transitive means that it should pull in any other packages needed by this one. You normally want to use --transitive.

Here's their first demo display, which imports a package to display graphics, Vegas. Unfortunately it no longer works. We're still trying to find a graphics package that works with Scala.

import vegas._
import vegas.render.WindowRenderer._

val plot = Vegas("Country Pop").
  withData(
    Seq(
      Map("country" -> "USA", "population" -> 314),
      Map("country" -> "UK", "population" -> 64),
      Map("country" -> "DK", "population" -> 80)
    )
  ).
  encodeX("country", Nom).
  encodeY("population", Quant).
  mark(Bar)

plot.show
(Note that Vegas won't actually work in the current version of Spark. This is given as an example of how to include an extra module.)

Spark in Java

This is ijava. It uses jshell, so it should support any Java from 9 on. We are using Java 11 for this kernel, because Java 17 doesn't work with Spark 3.4.0 in Jupyter.

See the ijava project's documentation for examples of how to display graphics. It uses xchart, though other packages should be possible.

Spark will initialize when you run the first command. It predefines the usual Spark variables, used in the snippets below: spark (the SparkSession) and sc (the SparkContext).

If you need an sqlContext, try

var sqlContext = spark.sqlContext()
However this is considered a backward compatibility feature. You can do SQL operations directly from the SparkSession. To get a Hive-based SQLContext, use
var sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
But this is deprecated as well. Use SparkSession.builder with enableHiveSupport instead, which creates a new SparkSession with Hive support enabled.

This has not been tested beyond verifying that those objects are properly generated.