Computer Science Data Science Facilities

This page describes primary software and systems for general use within the department. Individuals and research groups may, of course, have their own.

The major Hadoop-related tools are available on all of our systems, though we don't have an actual Hadoop cluster. We believe you can get at least as good performance from running the tools on a single local system as on a small cluster, particularly if you use one of the large systems such as ilab1, ilab2 and ilab3.

The tools discussed on this page are Python, Jupyter, Spark, and Map/Reduce.

(As of summer 2021, we are dropping support for Zeppelin, though we can put it back if there's serious need for it.)

If you need additional tools, please contact help@cs.rutgers.edu. We're willing to install more tools, but will want your help in verifying that they work.

Python

Most data science within computer science is done in Python. We have Anaconda-based environments available on our Linux systems. They have the major packages used for data science already loaded. If you need additional packages, please contact help@cs.rutgers.edu.

Jupyter

Jupyter is a "notebook." It's a web interface designed to make it easy to do quick analysis, primarily in Python. (We have also installed a kernel for Scala.) We don't recommend it for large programs, but many people use it for data analysis.

As of May 2022, our primary version of Python is 3.9, and Spark is 3.2.1, although we have older versions of Python. We recommend using the newest one that will work with your software. Note that Spark 2 works with Python 3.7 but not 3.8, while Spark 3 only works with 3.8 and later.
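As a quick sanity check before picking an environment, the compatibility rule above can be written down as a small function. This is only a sketch of the rule as stated on this page; the function name spark_supports_python is made up for illustration, and it knows nothing about minor Spark releases.

```python
import sys


def spark_supports_python(spark_major, python_version=None):
    """Return True if the rule quoted on this page allows the pairing.

    Only the stated rule is encoded: Spark 2 with Python 3.7 (not 3.8),
    Spark 3 with Python 3.8 and later. Other pairings that the page does
    not vouch for are reported as unsupported.
    """
    if python_version is None:
        # Default to the interpreter running this code.
        python_version = sys.version_info[:2]
    major_minor = tuple(python_version[:2])
    if spark_major == 2:
        return major_minor == (3, 7)
    if spark_major == 3:
        return major_minor >= (3, 8)
    raise ValueError(f"no rule recorded for Spark {spark_major}")


# Our primary versions as of May 2022: Python 3.9 with Spark 3.2.1.
print(spark_supports_python(3, (3, 9)))
```

Calling it with no second argument checks whichever Python you are currently running.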

For more information on our copy of Jupyter, see Jupyter.

Spark

Spark is available from all of our systems. Spark will run standalone on the node where you run it. On systems such as ilab.cs.rutgers.edu there are enough cores to get reasonable parallelism.
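Because Spark runs on a single node here, the amount of parallelism is set by Spark's local master URL: local[N] uses N worker threads and local[*] uses one per core. The helper below is a hypothetical sketch (the name local_master is not part of Spark) for building that string while staying within the cores the machine actually has, which can be considerate on a shared system.

```python
import os


def local_master(max_cores=None):
    """Build a Spark local-mode master URL.

    With no limit, return "local[*]" (one thread per core). Otherwise
    clamp the request to between 1 and the number of cores present.
    """
    if max_cores is None:
        return "local[*]"
    available = os.cpu_count() or 1
    return f"local[{max(1, min(max_cores, available))}]"


print(local_master())   # use every core
print(local_master(8))  # use at most 8 cores
```

The resulting string can be passed to the --master option of pyspark or spark-shell.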

If you want to write significant Spark programs, you'll probably be using the command line (and maybe an IDE). See Spark Programming for specifics on writing Spark programs here. The rest of this section describes the use of Spark from Jupyter and the interactive commands.

The pyspark command-line program, and the Spark in Python3 session in Jupyter, set up a Spark context for you in python3. The following variables are defined:

The spark-shell command-line program, and the Spark in Scala session in Jupyter, set up a Spark context in Scala, with the following variables:

Graphics are available within Jupyter / IPython using matplotlib, e.g. "%matplotlib inline".

Python support for Jupyter is well documented. See The Jupyter Notebook.

The rest of this section has information on the Scala kernel for Jupyter and on spark-shell, since these are less well documented elsewhere. Here's the official documentation: Apache Toree Quick Start.

Graphics in Scala

We are currently unable to support graphics in Scala from Jupyter. We suggest that you prepare data in Scala, write it out, and then use Python to graph it. While there are graphics packages for Scala, they either haven't been updated for the current version of Scala or won't install. (If you find one that works, please tell us. We'd be happy to install it.)
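The Python half of that workflow might look like the sketch below. It assumes the Scala side wrote plain, uncompressed CSV without a header row, which Spark stores as a directory of part-* files, and that the first two columns are numeric. The function names and the paths in the usage note are made up for illustration.

```python
import csv
import glob
import os


def load_spark_csv(output_dir):
    """Collect rows from every part-* file in a Spark CSV output directory."""
    rows = []
    # Spark writes one part file per partition; read them in name order.
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path, newline="") as handle:
            rows.extend(csv.reader(handle))
    return rows


def plot_xy(rows, image_path):
    """Plot column 0 against column 1 and save the figure to image_path."""
    # matplotlib is imported here so the loader works without it.
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    xs = [float(row[0]) for row in rows]
    ys = [float(row[1]) for row in rows]
    fig, ax = plt.subplots()
    ax.plot(xs, ys, marker="o")
    fig.savefig(image_path)
    plt.close(fig)
```

For example, plot_xy(load_spark_csv("results"), "results.png") would graph a hypothetical output directory named results. Inside a notebook you would more likely use "%matplotlib inline" and skip the Agg backend line.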

Add classes to Scala

The Scala environment has access to a large set of Spark-related libraries, as well as other standard libraries such as Apache Commons. Try "ls /koko/system/spark/jars/" to see them all. If you need more, in Jupyter, you can use "%classpath" to load them. See the FAQ for more information. It says to use

%AddDeps group-id artifact-id version
to load libraries. This command searches the Maven collection of libraries. Just about any library you could want is in Maven. Here's a search tool: Maven search. That search will display the group ID, artifact ID, and latest version. For generic libraries you should probably use the most recent version. With Spark-related libraries you may want to use version 3.2.1, so that it matches the version of Spark we have installed.

In spark-shell, the --jars and --packages options perform the same function. For --packages, arguments look like groupid:artifactid:version.
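To make the groupid:artifactid:version format concrete, here is a small sketch that assembles a spark-shell command line from Maven coordinates. The helper names are hypothetical, and the commons-math3 coordinate is only an illustrative choice, not something this page recommends. Both --packages and --jars accept comma-separated lists.

```python
import shlex


def maven_coordinate(group_id, artifact_id, version):
    """Join the three Maven fields into the form --packages expects."""
    return f"{group_id}:{artifact_id}:{version}"


def spark_shell_command(packages=(), jars=()):
    """Return a spark-shell command line as a shell-safe string."""
    cmd = ["spark-shell"]
    if packages:
        # Maven coordinates, comma-separated.
        cmd += ["--packages", ",".join(packages)]
    if jars:
        # Paths to local jar files, comma-separated.
        cmd += ["--jars", ",".join(jars)]
    return shlex.join(cmd)


coord = maven_coordinate("org.apache.commons", "commons-math3", "3.6.1")
print(spark_shell_command(packages=[coord]))
```

The same three fields, separated by spaces instead of colons, are what %AddDeps takes in Jupyter.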

Oddities in Scala

Map/Reduce

The Hadoop software is installed in standalone mode on all of our systems. The only real use would be for Map/Reduce jobs. Note that jobs will run in local mode. However, on our larger systems, e.g. ilab1, ilab2 and ilab3, you can get a reasonable amount of parallelism if you adjust the number of tasks. (By default only 2 map tasks are run.)

See Map/Reduce on CS Systems for details.