Jupyter
This page documents Jupyter Notebook and Jupyter Lab, both the versions you run yourself from the command line and the versions you get from the web server at jupyter.cs.rutgers.edu. The software is the same.
Jupyter is a "notebook": a graphical tool that makes it easy to write Python, Scala, Mathematica, or R code and display the results graphically. It's particularly useful for exploring data. However, if you're writing substantial programs you're better off using Python or Scala from the command line or an IDE.
There are two ways to run Jupyter: Jupyter Lab and Jupyter Notebook. They are the same software with the same capabilities and use the same notebook files, but Jupyter Lab is a newer user interface and seems slightly faster. We recommend Jupyter Lab with Python 3.8 and 3.9. However, Jupyter Lab does not appear to work well with Python 3.7, and not at all with 3.6, so for those versions we recommend Jupyter Notebook.
You can type a command like jupyter lab to run Jupyter Lab. In practice you'll need to activate a Python environment first, or type /common/system/anaconda/envs/python39/bin/jupyter instead of just jupyter.
Jupyter starts a web server. There are two ways to connect to it:
- Use weblogin, rdp, or x2go to create a graphical session on one of our computers. When you run Jupyter, it will automatically open a browser for you, pointed at Jupyter. If for some reason that doesn't work, you can run Chrome or Firefox yourself and copy and paste one of the URLs that Jupyter prints into the browser's URL bar.
- Use a command line session, i.e. ssh.
In this case, type jupyter lab --ip=`hostname` --browser="none".
It will print three lines, a file path and two URLs, e.g.
To access the server, open this file in a browser:
    file:///common/home/hedrick/...
Or copy and paste one of these URLs:
    http://ilab1.cs.rutgers.edu:8889/...
 or http://127.0.0.1:8889/...
Copy and paste the middle one (the one with the hostname, in this case ilab1.cs.rutgers.edu) into your browser.
For Jupyter Notebook, use "notebook" instead of "lab."
If you prefer, you can point your browser to http://jupyter.cs.rutgers.edu. That will start a copy of Jupyter Lab that is identical to what you'd get if you run it yourself. The only difference is that jupyter.cs.rutgers.edu currently does not have access to GPUs. If you prefer the original Notebook interface, under "Help" you'll see an option "Launch Classic Notebook".
NOTE: You may also want to look at the Jupyter project's own documentation. This document focuses on how to do Spark programming from the notebook.
After you've logged into https://jupyter.cs.rutgers.edu, or started Jupyter from the command line, you'll see a file browser for your home directory, which shows only notebook files. (In Jupyter Lab, the file browser is on the left.) To get to the interesting functionality, you need to open a notebook.
- If you've already created a notebook, click on it in the file browser.
NOTE: Your work is automatically saved in the notebook file. When you start, consider opening an existing notebook file rather than creating a new one.
- If you need a new notebook: in Notebook, use the "New" pulldown in the upper right; in Lab, click one of the icons for the various notebook types in the main panel. When you create a new one, I suggest changing the title at the top to a name you'll remember.
Notebook types
Here's what the various notebook types are, for the copy of Jupyter in Anaconda Python 3.9, i.e.
/koko/system/anaconda/envs/python39/bin/jupyter lab --ip=`hostname` --browser=none
- Python 3: Python without Spark
- Spark 3 in Python3: Python with Spark 3
- Spark 3 in Scala 2.12: Apache Toree
If you start Jupyter from the command line, it will use the version of Python you currently have activated. If you connect to jupyter.cs.rutgers.edu, the default is Python 3.9. You can also choose Python 3.7 or 3.8, in case you have software that hasn't been updated to 3.9 yet.
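If you're not sure which Python version a notebook is actually running, you can check from a cell. A minimal sketch, using only the standard library:
import sys
print(sys.version)    # e.g. a 3.9.x string in the python39 environment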
Non-programming
- Text File - lets you create a file that is just text
- Terminal - gives you a shell on the system running JupyterHub. Note that this system is not set up for you to run Spark / Hadoop jobs from a shell, and the container that runs JupyterHub doesn't have our full set of applications, so we don't recommend using this except for simple commands.
For Python 3, there's excellent documentation at the main Jupyter site: The Jupyter Notebook.
COMPLETION: you can type part of a variable name and hit the TAB key. That will show you all of the possibilities beginning with what you typed. In some cases (depending upon context) you can hit TAB right after "." to show the available properties and methods.
The rest of this page gives instructions for using Spark in Jupyter, and also an introduction to graphics from Jupyter. If you're not interested in these things, you can stop now.
Table of contents
- Software versions
- Python 3 Notebook Types
- Spark in Python 3 Notebook Type
- Using Matplotlib
- Spark in Scala Notebook Type
Software versions
Currently Spark is 3.4.0, which was released in April 2023. (Python 3.6 has an older version.)
Tensorflow is (see the version check after this list):
- 1.12 in Python 3.6
- 1.15 in Python 3.7
- 2.2 in Python 3.8
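To confirm which Tensorflow a given kernel provides, you can check its version from a cell. A quick sketch, assuming Tensorflow is installed in that environment:
import tensorflow as tf
print(tf.__version__)    # e.g. a 2.2.x string in the Python 3.8 environment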
We have Anaconda environments for Python 2.7 and 3.6 through 3.9. Spark 3 is in Python 3.7 - 3.9. Spark 2 is in Python 2.7.
On our systems, Java is normally OpenJDK 17. However Spark 2 only works with Java 8, so Spark 2 kernels in Jupyter are configured to use OpenJDK 8.
If you run Jupyter Notebook or Jupyter Lab yourself, you get whatever version of Python you run it from.
If you need to install your own python packages, we suggest that you use the command
pip install --user PACKAGE
You must use the same version of pip as your notebook uses. E.g., if you are using a Python 3.9 notebook, you should do
/common/system/anaconda/envs/python39/bin/pip install --user PACKAGE
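Similarly, if you're not sure which interpreter (and therefore which pip) your notebook uses, you can check from a cell. A minimal sketch, standard library only:
import sys
print(sys.executable)    # e.g. /common/system/anaconda/envs/python39/bin/python
Use the pip that sits next to that interpreter.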
WARNING: When you use pip, it will suggest that you upgrade it to a new version. Do NOT try to do this. You can't actually upgrade pip, because it is installed in a system directory; in attempting to do so, you will end up with an inconsistent set of packages.
Spark 3 is currently version 3.4.0. Spark 2 is version 2.4.5.
Python 3 Notebook Types
You can use this to run any python 3 code you want. Just type the code into a cell and hit "run." Documentation for Python and Jupyter is readily available elsewhere, so this section is going to concentrate on using Spark and Hive from a Python 3 notebook. Also see Using Matplotlib for information on doing graphics.
PySpark is the same Python as Python 3, but it is set up to do Spark operations. It has predefined the following variables (see the sketch after this list):
- spark - a pyspark.sql.SparkSession (using Hive)
- sc - a SparkContext
- sql - the bound method SparkSession.sql for the session
- sqlContext - an SQLContext [for compatibility]
- sqlCtx - an old name for sqlContext
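As a quick sanity check that these variables are wired up, you can run something like this in a PySpark cell. A minimal sketch; the table list is just whatever your Hive metastore happens to contain:
# These names are predefined by the PySpark notebook type; no imports are needed.
print(sc.version)            # Spark version, via the SparkContext
df = spark.range(5)          # a tiny DataFrame built from the SparkSession
df.show()
sql("show tables").show()    # the bound SparkSession.sql method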
Spark in Python 3 Notebook Type
The Spark in Python 3 notebook type is intended for running Python/Spark code.
Here's a simple example. It assumes that you have a file "data.txt" in your directory. It should contain some text; it doesn't matter what. (The program will count words.)
Once you have the data file, put this into a cell and hit Run. After it starts a Spark session (if one isn't already running), you'll see a count of the various words in the file.
text_file = sc.textFile("data.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
print(counts)
for x in counts.collect():
    print(x)
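For example, if data.txt contains just the line "to be or not to be", the loop prints pairs like ('to', 2), ('be', 2), ('or', 1) and ('not', 1), though not necessarily in that order.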
Sometimes you may need to use packages that aren't built into the system. The easiest way to do this is with the command-line tool "pip". You can log into one of the ilab systems and use it from the command line, or you can create a "terminal" session from JupyterHub.
The command looks like
pip install --user PACKAGE
Pip will offer to update itself. Please don't do that. You must use the same version of pip as your notebook uses. E.g., if you are using a Python 3.9 notebook, you should do
/common/system/anaconda/envs/python39/bin/pip install --user PACKAGE
Using Matplotlib
The primary way of getting graphics for data analysis is the Python package Matplotlib. Another common approach is pandas, but it uses Matplotlib underneath. Using Matplotlib in Jupyter requires some configuration.
In Python 3 and Pyspark notebook types, add the line
%matplotlib inline
near the beginning of your code. This is sufficient for static output. If you want to do interactive graphics, try
%matplotlib notebook
With %matplotlib notebook, in one case I found that show() wasn't generating output, and I had to call draw() instead. (show() is the normal way to output a plot; draw() is used in interactive mode to update the output when something has changed.)
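For instance, here's a minimal sketch of the interactive pattern, assuming the %matplotlib notebook backend is active:
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
line, = ax.plot(np.arange(10))       # the initial plot appears immediately
line.set_ydata(np.arange(10) ** 2)   # change the plotted data...
ax.relim()                           # ...recompute the data limits...
ax.autoscale_view()                  # ...rescale the axes...
plt.draw()                           # ...and redraw the existing figure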
Here's a sample:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

# Data for plotting
t = np.arange(0.0, 2.0, 0.01)
s = 1 + np.sin(2 * np.pi * t)

fig, ax = plt.subplots()
ax.plot(t, s)
ax.set(xlabel='time (s)', ylabel='voltage (mV)',
       title='About as simple as it gets, folks')
ax.grid()
fig.savefig("test.png")
plt.show()
It shows a sine function, and also saves the output to a file "test.png". By default, notebook produces high-resolution vector graphics, and inline produces low-resolution bitmap graphics; that makes inline faster than notebook. However, this can be changed. To get inline to use higher-resolution bitmaps, use
%config InlineBackend.figure_format='retina'
%matplotlib inline
To get vector graphics, use
%config InlineBackend.figure_format='svg'
%matplotlib inline
To see all options for the inline backend, try %config InlineBackend. To switch between inline and notebook you will need to restart your server, using the "Control Panel" link at the upper right of the window.
If you want to use inline or interactive mode a lot, you can make it the default. From a command line on any ilab machine do this:
echo "c.InteractiveShellApp.matplotlib = 'notebook'" > ~/.ipython/profile_default/ipython_config.py
Use 'inline' rather than 'notebook' for non-interactive output. Note that this will affect any copy of ipython you start, not just copies running in Jupyter. That's why we're not doing it by default.
Spark in Scala Notebook Type
We still have Scala in our copies of Jupyter, because it's one of the three major languages for Spark. However, we don't recommend it. There are several implementations for Jupyter, but they either haven't been touched for years or are difficult to use. We haven't been able to find a graphics package that supports Scala 2.12 or later, but 2.12 is required for Spark. Of the Java-like languages, Kotlin seems to have the best support. Please send email to help@cs.rutgers.edu if you're interested.
The Spark in Scala notebook type is intended to run Spark in the Scala language. It could be used for any Scala code, but it sets up a Spark context, with the following variables:
- spark - an org.apache.spark.sql.SparkSession (using Hive)
- sc - a SparkContext
This is just about the shortest possible Spark program. It simply returns the version of Spark.
sc.version
This example loads data from a URL at Amazon into a Hive SQL table, using Spark in Scala.
import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset

// load bank data
val bankText = sc.parallelize(
    IOUtils.toString(
        new URL("https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv"),
        Charset.forName("utf8")).split("\n"))

case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
    s => Bank(s(0).toInt,
              s(1).replaceAll("\"", ""),
              s(2).replaceAll("\"", ""),
              s(3).replaceAll("\"", ""),
              s(5).replaceAll("\"", "").toInt
         )
).toDF()
bank.registerTempTable("bank")
The reason this has to be done in two separate cells is that a class can't be defined and used in the same cell. The Bank class is defined in the first cell and used in the second.
This creates a temporary table, meaning that it won't be there after the end of the session.
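For comparison, the equivalent in a Spark 3 Python notebook is a temporary view. A minimal sketch with made-up data and a hypothetical view name bank_py (createOrReplaceTempView is the newer name for registerTempTable):
# Register a DataFrame as a session-scoped view; it disappears when the session ends.
df = spark.createDataFrame([(25, "student"), (41, "services")], ["age", "job"])
df.createOrReplaceTempView("bank_py")
spark.sql("select age, job from bank_py where age < 30").show()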
Here is SQL code that retrieves data from the table
%%sql
select age, count(1) value
from bank
where age < 30
group by age
order by age
The initial output will be text, exactly as you'd expect from this SQL query. However, you'll see options that let you select various types of visualization: pie chart, scatter diagram, etc. If you need to use classes that aren't already part of the system, you can use %AddDeps. It can load any module from the Maven 2 repository. Here's an example:
%AddDeps org.vegas-viz vegas_2.11 0.3.11 --transitive
If you go to https://mvnrepository.com/, you can search for specific packages. In this case, if you look for vegas, you'll find the group org.vegas-viz with the artifact vegas_2.11. In the end, you'll find a Maven declaration for org.vegas-viz vegas_2.11 0.3.11. Here's their first demo display, which imports a package to display graphics, Vegas. Unfortunately it no longer works. We're still trying to find a graphics package that works with Scala.
import vegas._
import vegas.render.WindowRenderer._

val plot = Vegas("Country Pop").
    withData(
        Seq(
            Map("country" -> "USA", "population" -> 314),
            Map("country" -> "UK", "population" -> 64),
            Map("country" -> "DK", "population" -> 80)
        )
    ).
    encodeX("country", Nom).
    encodeY("population", Quant).
    mark(Bar)

plot.show
(Note that Vegas won't actually work in the current version of Spark. This is given as an example of how to include an extra module.)