Jupyter
This page documents Jupyter Notebook and Jupyter Lab, both the versions that you run yourself from the command line and the versions you get from the web server at jupyter.cs.rutgers.edu. The software is the same.
Jupyter is a "notebook": a graphical tool that makes it easy to run Python, Java, Scala, Mathematica, or R code and display results graphically. It's particularly useful for exploring data. However, if you're writing substantial programs you're better off using Python, Java, or Scala from the command line or from an IDE.
There are two ways to run Jupyter: Jupyter Lab and Jupyter Notebook. They are the same software with the same capabilities, but Jupyter Lab is a newer user interface and seems slightly faster. They use the same notebook files. We recommend using Jupyter Lab.
You can type a command like jupyter lab to run Jupyter Lab. In practice you'll need to activate a Python environment first, or type /common/system/venv/python312/bin/jupyter instead of just jupyter.
Jupyter starts a web server. There are two ways to connect to it:
- Use weblogin, rdp, or x2go to create a graphical session on one of our computers. When you run Jupyter, it will automatically open a browser for you, pointed at Jupyter. If for some reason that doesn't work, you can run Chrome or Firefox yourself and copy and paste either of the URLs that Jupyter prints into the browser's URL bar.
- Use a command line session, i.e. ssh.
In this case, type jupyter lab --ip=`hostname` --browser="none".
It will print something like this:

To access the server, open this file in a browser:
    file:///common/home/hedrick/...
Or copy and paste one of these URLs:
    http://ilab1.cs.rutgers.edu:8889/...
or  http://127.0.0.1:8889/...
Copy and paste the middle one (the one with the hostname, in this case ilab1.cs.rutgers.edu) into your browser.
For Jupyter notebook, use "notebook" instead of "lab."
If you prefer, you can point your browser to http://jupyter.cs.rutgers.edu. That will start a copy of Jupyter Lab that is identical to what you'd get if you run it yourself. The only difference is that jupyter.cs.rutgers.edu currently does not have access to GPUs. If you prefer the original Notebook interface, under "Help" you'll see an option "Launch Classic Notebook".
NOTE: You may also want to look at the Jupyter project's own documentation. This document focuses on how to do Spark programming from the notebook.
After you've logged into https://jupyter.cs.rutgers.edu, or started Jupyter from the command line, you'll see a file browser for your home directory, which shows only notebook files. (In Jupyter Lab, the file browser is on the left.) To get the interesting functionality, you need to open a notebook.
- If you've already created a notebook, click on it in the file browser.
NOTE: The work you do is automatically saved in a notebook file. When you start, you should consider opening an existing notebook file, rather than creating a new notebook.
- If you need a new notebook: in Notebook, use the "New" pulldown in the upper right; in Lab, click one of the icons for the various types in the main panel. When you create a new one, I suggest changing the title at the top to a name you'll remember.
Notebook types
Here's what the various types of notebook are, for the copy of Jupyter in the 3.12 Python environment, i.e.
/common/system/venv/python312/bin/jupyter lab --ip=`hostname` --browser=none
- Python 3: Python without Spark
- Spark 3 in Python3: Python with Spark 3
- Spark 3 in Scala 2.12: Apache Toree
- Java 17: Java without Spark
- Spark 3 in Java 11: Java with Spark 3 (Java 17 doesn't work)
If you start Jupyter from the command line, it will use the version of Python you currently have activated. If you connect to jupyter.cs.rutgers.edu, the default is Python 3.9. You can also choose Python 3.7 or 3.8, in case you have software that hasn't been updated to 3.9 yet.
Non-programming
- Text File - lets you create a file that is just text
- Terminal - gives you a shell on the system running JupyterHub. Note that this system is not set up for you to run Spark / Hadoop jobs from a shell. We don't recommend using this except for simple commands; the container running JupyterHub doesn't have our full set of applications.
For Python3, there's excellent documentation at the main Jupyter site: The Jupyter Notebook.
COMPLETION: You can type part of a variable name and press the TAB key. That will show you all of the possibilities beginning with what you typed. In some cases (depending upon context) you can hit TAB right after "." to show the available properties and methods.
The rest of this page gives instructions for using Spark in Jupyter, and also an introduction to graphics from Jupyter. If you're not interested in these things, you can stop now.
Table of contents
- Software versions
- Python3 Notebook Types
- Special note on using GPUs
- Java Notebook Types
- Spark in Python3 Notebook Type
- Using Matplotlib
- Spark in Scala Notebook Type
- Spark in Java Notebook Type
Software versions
Currently Spark is 3.4.0, which was released in April, 2023.
Tensorflow is
- 2.4.1 in Python 3.9
- 2.9.1 in Python 3.10
- 2.12 in Python 3.11
Pytorch is 2.0.1.
We have Python virtual environments for 3.9 through 3.12. (There may be older environments, but they will eventually be removed.)
On our systems, Java is normally OpenJDK 17. However, Spark 2 only works with Java 8, so Spark 2 kernels in Jupyter are configured to use OpenJDK 8. Spark 3 only works with Java 11 in Jupyter, though it works fine with Scala under the Java 17 JVM.
If you run jupyter notebook or jupyter lab yourself, you get whatever version of Python you ran it from.
If you need to install your own python packages, we suggest creating your own virtual environment. For information on how to do that, see https://resources.cs.rutgers.edu/docs/using-python-on-cs-linux-machines/
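As a minimal sketch of what that page describes, creating a virtual environment looks like this. The path used here is just an example; in practice you'd pick a directory in your home directory, such as ~/venvs/myenv.

```shell
# Sketch: create your own virtual environment (the path is an example).
VENV_DIR="${TMPDIR:-/tmp}/myenv"
python3 -m venv "$VENV_DIR"

# After activating, "pip install <package>" installs into your environment,
# not the system one.
source "$VENV_DIR/bin/activate"
python -c 'import sys; print(sys.prefix)'
```

Once the environment is activated, running jupyter installed into it (pip install jupyterlab) gives you a Jupyter with your own packages available.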
WARNING: When you use pip, it will suggest that you upgrade it to a new version. Do NOT try to do this. You can't actually upgrade pip, because it is installed in a system directory; in attempting to do so, you will end up with an inconsistent set of packages.
Python 3 Notebook Types
You can use this to run any python 3 code you want. Just type the code into a cell and hit "run." Documentation for Python and Jupyter is readily available elsewhere, so this section is going to concentrate on using Spark and Hive from a Python 3 notebook. Also see Using Matplotlib for information on doing graphics.
The PySpark notebook type uses the same Python as the Python 3 type, but it is set up to do Spark operations. It has predefined the following variables:
- spark - a pyspark.sql.SparkSession (using Hive)
- sc - a SparkContext
- sql - a bound method, SparkSession.sql, for the session
- sqlContext - an SQLContext [for compatibility]
- sqlCtx - an old name for sqlContext
Special Note On Using GPUs
GPUs are a special problem, because our normal virtual environments may not work with all common GPU software, although they do have PyTorch installed. If you want to use GPUs, we recommend taking a look at the containers supplied by Nvidia. See /common/system/nvidia-containers/INDEX-pytorch or INDEX-tensorflow for the available containers. They have CUDA, PyTorch (or TensorFlow), Python, etc. They also have a copy of Jupyter.
To use Jupyter Lab, once you've started your Singularity container, simply type "jupyter lab". It will print a message with a URL such as "http://hostname:NNNN?token=XXXX". Connect to that URL with a browser, but use the actual hostname; e.g. if you logged into ilab2, use "http://ilab2.cs.rutgers.edu:NNNN?token=XXXX".
If you want to use Spark, once you've started the Singularity container, type "source /common/system/spark-setup.sh". At that point, if you run python, you'll get Python with Spark in it (i.e. "sc" will be a SparkContext, etc.). If you run "jupyter lab" and start a Python notebook, you'll get Spark in it.
Java notebook type
This is ijava. It uses jshell, so it should support any Java from 9 on. We use the system-wide Java, which is currently Java 17. See the web link for examples of how to display graphics. You can use xchart, though that's not necessarily the only package that would work.
Spark in Python3 notebook type
The Spark in Python3 notebook type is intended to run Python/Spark code.
Here's a simple example. It assumes that you have a file "data.txt" in your directory containing some text; it doesn't matter what. (The program will count words.)
Once you have the data file, put this into a cell and hit Run. After it starts a Spark session (if one isn't already started) you'll see a count of the various words in the file.
text_file = sc.textFile("data.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
print(counts)
for x in counts.collect():
    print(x)
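If you want a sanity check of what that pipeline computes, here is a plain-Python sketch of the same word count, using collections.Counter and an in-memory string standing in for data.txt. This is only an illustration; it doesn't use Spark at all.

```python
from collections import Counter

# Plain-Python equivalent of the flatMap/map/reduceByKey pipeline above,
# with a hard-coded string instead of the data.txt file.
text = "the quick brown fox jumps over the lazy fox"
counts = Counter(text.split(" "))   # same totals reduceByKey would produce
for word, n in counts.items():
    print(word, n)
```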
Sometimes you may need to use packages that aren't built into the system. The easiest way to do this is to create your own virtual environment. See https://resources.cs.rutgers.edu/docs/using-python-on-cs-linux-machines/ for more information.
Using Matplotlib
The primary way of getting graphics for data analysis is the Python package Matplotlib. Another common approach is pandas, but it uses Matplotlib underneath. Using Matplotlib in Jupyter requires some configuration.
In Python 3 and Pyspark notebook types, add the line
%matplotlib inline
near the beginning of your code. This is sufficient for output. If you want to do interactive graphics, try

%matplotlib notebook

With %matplotlib notebook, in one case I found that show() wasn't generating output, and had to use draw(). (show() is the normal way to output a plot. draw() is used in interactive mode to update the output when something has changed.) Here's a sample:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

# Data for plotting
t = np.arange(0.0, 2.0, 0.01)
s = 1 + np.sin(2 * np.pi * t)

fig, ax = plt.subplots()
ax.plot(t, s)
ax.set(xlabel='time (s)', ylabel='voltage (mV)',
       title='About as simple as it gets, folks')
ax.grid()

fig.savefig("test.png")
plt.show()
It shows a sine function, and also saves the output to a file "test.png". By default, notebook produces high-resolution vector graphics, and inline low-resolution bitmap graphics. That makes inline faster than notebook. However, this can be changed. To get inline to use higher-resolution bitmaps, use
%config InlineBackend.figure_format='retina'
%matplotlib inline
To get vector graphics, use

%config InlineBackend.figure_format='svg'
%matplotlib inline
To see all options for the inline backend, try %config InlineBackend. To switch between inline and notebook you will need to restart your server using the "Control Panel" link at the upper right of the window.
If you want to use inline or interactive mode a lot, you can make it the default. From a command line on any ilab machine do this:
echo "c.InteractiveShellApp.matplotlib = 'notebook'" > ~/.ipython/profile_default/ipython_config.py

(If the ~/.ipython/profile_default directory doesn't exist yet, create it first.)
Use 'inline' rather than 'notebook' for non-interactive output. Note that this will affect any copy of ipython you start, not just copies running in Jupyter. That's why we're not doing it by default.

Spark in Scala Notebook Type
We still have Scala in our copies of Jupyter, because it's one of the 3 major languages for Spark. However, we don't recommend it. There are several implementations for Jupyter, but they haven't been touched for years, or they are difficult to use. We haven't been able to find a graphics package that supports Scala 2.12 or later, but 2.12 is required for Spark. Of the Java-like languages, Kotlin seems to have the best support. Please send email to help@cs.rutgers.edu if you're interested.
The Spark in Scala notebook type is intended to run Spark in the Scala language. It could be used for any Scala code, but it sets up a Spark context, with the following variables:
- spark - a SparkSession (using Hive)
- sc - a SparkContext
This is just about the shortest possible Spark program. It simply returns the version of Spark.
sc.version
This loads data into a Hive SQL table from a URL at Amazon, using Spark in Scala.
import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset

// load bank data
val bankText = sc.parallelize(
    IOUtils.toString(
        new URL("https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv"),
        Charset.forName("utf8")).split("\n"))

case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
    s => Bank(s(0).toInt,
        s(1).replaceAll("\"", ""),
        s(2).replaceAll("\"", ""),
        s(3).replaceAll("\"", ""),
        s(5).replaceAll("\"", "").toInt
    )
).toDF()
bank.registerTempTable("bank")
The reason this has to be done in two separate cells is that a class can't be defined and used in the same cell. The Bank class is defined in the first cell and used in the second.
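If the Scala is hard to follow, the parsing step (split each line on ";", drop the header row, strip the quotes) can be sketched in plain Python. The sample lines below are made up and include only a few of the CSV's columns; this is an illustration of the logic, not the actual data.

```python
# Hypothetical sample lines in the same shape as bank.csv:
# semicolon-separated, quoted strings, header row first.
lines = [
    '"age";"job";"marital";"education";"balance"',
    '30;"unemployed";"married";"primary";1787',
    '33;"services";"married";"secondary";4789',
]

rows = []
for line in lines:
    fields = line.split(";")
    if fields[0] == '"age"':   # skip the header row, like the filter() above
        continue
    rows.append({
        "age": int(fields[0]),
        "job": fields[1].replace('"', ""),
        "balance": int(fields[4].replace('"', "")),
    })
print(rows)
```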
This creates a temporary table, meaning that it won't be there after the end of the session.
Here is SQL code that retrieves data from the table:
%%sql
select age, count(1) value
from bank
where age < 30
group by age
order by age
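To see what that query computes, here's the same aggregation run against a small in-memory SQLite table with made-up rows (SQLite rather than Hive; illustration only).

```python
import sqlite3

# Same select / where / group by / order by as the %%sql cell above,
# but against SQLite with a few made-up rows instead of the Hive "bank" table.
con = sqlite3.connect(":memory:")
con.execute("create table bank (age integer, balance integer)")
con.executemany("insert into bank values (?, ?)",
                [(25, 100), (25, 200), (29, 50), (41, 900)])
for age, value in con.execute(
        "select age, count(1) value from bank "
        "where age < 30 group by age order by age"):
    print(age, value)
```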
The initial output will be text, exactly as you'd expect from this SQL query. However, you'll see options that let you select various types of visualization: pie chart, scatter diagram, etc.

If you need to use classes that aren't already part of the system, you can use %AddDeps. It can load any module from the Maven 2 repository. Here's an example:
%AddDeps org.vegas-viz vegas_2.11 0.3.11 --transitive
If you go to https://mvnrepository.com/, you can search for specific packages. In this case, if you look for vegas, you'll find the group org.vegas-viz with the artifact vegas_2.11. In the end, you'll find a Maven declaration for org.vegas-viz vegas_2.11 0.3.11. Here's their first demo display, which imports a package to display graphics, Vegas. Unfortunately it no longer works; we're still trying to find a graphics package that works with Scala.
import vegas._
import vegas.render.WindowRenderer._

val plot = Vegas("Country Pop").
  withData(
    Seq(
      Map("country" -> "USA", "population" -> 314),
      Map("country" -> "UK", "population" -> 64),
      Map("country" -> "DK", "population" -> 80)
    )
  ).
  encodeX("country", Nom).
  encodeY("population", Quant).
  mark(Bar)

plot.show
(Note that Vegas won't actually work in the current version of Spark. This is given as an example of how to include an extra module.)

Spark in Java Notebook Type
This is ijava. It uses jshell, so it should support any Java from 9 on. We are using Java 11 for this kernel, because Java 17 doesn't work with Spark 3.4.0 in Jupyter.
See the web link for examples of how to display graphics. It uses xchart, though other packages should be possible.
Spark will initialize when you run the first command. It predefines
- jsc - a JavaSparkContext
- spark - a SparkSession
- sc - a SparkContext taken from the Spark Session
If you need an sqlContext, try
var sqlContext = spark.sqlContext()
However, this is considered a backward-compatibility feature; you can do SQL operations directly from the SparkSession. To get a Hive-based SQLContext, use

var sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
But this is considered deprecated. Use SparkSession.builder.enableHiveSupport instead; you'll use that to create a new SparkSession with Hive enabled, from sc. This has not been tested beyond verifying that those objects are properly generated.