Computer Science Data Science Facilities
This page describes primary software and systems for general use within the department. Individuals and research groups may, of course, have their own.
The major Hadoop-related tools are available on all of our systems, though we don't have an actual Hadoop cluster. We believe you can get at least as good performance from running tools on the local system as on a small cluster, particularly if you use one of the large systems such as ilab1, ilab2, and ilab3.
The tools discussed on this page are Python, Jupyter, Zeppelin, Spark, and Map/Reduce.
(Zeppelin has had almost no use. It is currently not available, but can easily be restarted if anyone needs it.)
If you need additional tools, please contact help@cs.rutgers.edu. We're willing to install more tools, but will want your help in verifying that they work.
Python
Most data science within computer science is done in Python. We have virtual environments available on our Linux systems. They have the major packages used for data science already loaded. If you need additional packages, please contact help@cs.rutgers.edu.
- Do not simply type "python." That will give you a copy of Python 2. Support for Python 2 was dropped in January 2020, and our default python2 doesn't have the data science packages loaded. Even though no one should be using Python 2, the Python project wants the command "python" to run Python 2, so you should always use "python3".
- Use one of the Python virtual environments. See Using Python on our systems. (A quick check that you're in the environment you expect appears after this list.)
- You can install packages yourself, using "pip install --user". This will put the additional packages in your home directory.
- If our environments don't have what you need, you can download a copy of Python and create your own environment. Since these environments are large, consider putting yours in /common/users/NETID, where you have a larger quota than in your home directory.
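If you're not sure which environment you've actually ended up in, a quick check from python3 will tell you. This is just a sanity-check sketch; the paths it prints depend on which environment you activated.

    import sys
    print(sys.executable)   # path of the python3 binary, which shows which environment it came from
    print(sys.prefix)       # root directory of the active virtual environment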
Jupyter
Jupyter is a "notebook." It's a web interface designed to make it easy to do quick analysis, primarily in python. (We have also installed kernels for Java, Scala, and R.) We don't recommend it for large programs, but many people use it for data analysis.
- Jupyter is installed in all of our Python 3 virtual environments. See Using Python on our systems for how to activate an environment.
- Once you've activated one, you can type a command like jupyter lab to run jupyter.
- Jupyter starts a web server. There are two ways to connect to it:
- Use weblogin, rdp, or x2go to create a graphical session on one of our computers. When you run Jupyter, it will automatically open a browser for you, pointed at the Jupyter session. If for some reason that doesn't work, you can run Chrome or Firefox yourself and copy and paste either of the URLs that Jupyter prints into the browser's URL bar.
- Use ssh, and connect to your Jupyter session using your browser at home. In this case, type jupyter lab --ip=`hostname` --no-browser. It will print a list of three URLs, e.g.
To access the server, open this file in a browser: file:///common/home/hedrick/... Or copy and paste one of these URLs: http://ilab1.cs.rutgers.edu:8889/... or http://127.0.0.1:8889/...
Copy and paste the middle one (the one with the hostname, in this case ilab1.cs.rutgers.edu) into your browser.
- If all you want to do is run Jupyter, you don't actually have to activate the Python environment. You can run the copy of Jupyter directly from the environment, e.g. /common/system/venv/python312/bin/jupyter lab ...
- We have installed Jupyter kernels for use with Spark. See the next section. If you want to use Scala interactively but aren't interested in Spark, you can still use the Spark in Scala kernel; just ignore the Spark features.
- GPUs: GPUs are a special problem, because our normal Python environments probably won't work with common GPU software such as PyTorch. If you want to use that, we recommend taking a look at the containers supplied by Nvidia. See /common/system/nvidia-containers/INDEX-pytorch or INDEX-tensorflow for the available containers. They have CUDA, PyTorch (or TensorFlow), Python, etc., as well as a copy of Jupyter. To use JupyterLab, once you've started your Singularity container, simply type "jupyter lab". It will print a message with a URL such as "http://hostname:NNNN?token=XXXX". Connect to that URL with a browser, but use the actual hostname; e.g. if you logged into ilab2, use "http://ilab2.cs.rutgers.edu:NNNN?token=XXXX". If you want to use Spark, once you've started the Singularity container, type "source /common/system/spark-setup.sh". At that point, if you run python you'll get Python with Spark in it (i.e. "sc" will be a Spark context, etc.), and if you run "jupyter lab" and start a Python notebook, you'll get Spark in it. (A quick check that the GPU is actually visible appears after this list.)
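Once you've started one of the Nvidia containers, a quick check like the following confirms that the GPU is actually visible. This is only a sketch and assumes one of the pytorch containers; the tensorflow containers would need the equivalent TensorFlow calls.

    import torch
    print(torch.cuda.is_available())           # True if PyTorch can see a GPU
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))   # name of the first visible GPU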
As of July 2024, our primary version of Python is 3.11, and Spark is 3.5.1, although we also have Python 3.9 and 3.10. We recommend using the newest one that will work with your software. Note that Spark 2 works with Python 3.7 but not 3.8, while Spark 3 only works with 3.8 and later.
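If you aren't sure which combination a particular session gives you, you can check from inside pyspark or a Spark in Python3 notebook (where sc is predefined, as described below). This is only a quick sanity check:

    import sys
    print(sys.version.split()[0])   # Python version for this session
    print(sc.version)               # Spark version (sc is the predefined SparkContext)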
While our copy of Jupyter has support for Scala, we strongly recommend using Zeppelin for Scala, particularly for classes (see the Zeppelin section below about its current availability): the Jupyter Scala kernel uses more CPU than it needs to, which can cause performance problems in a class. The Java kernel for Jupyter looks better, though it hasn't had much testing. We have both a Java 17 kernel and Spark with Java 11. (There were issues getting Java 17 to work with Spark. We have a workaround, so Jupyter will shortly be fixed; there's a test version currently on jupyter.cs.rutgers.edu.)
For more information on our copy of Jupyter, see Jupyter.
Zeppelin
NOTE: We have supported Zeppelin in the past. However, since there's been virtually no use, it is not currently running. If anyone needs it, it would be easy to bring it back.
Zeppelin is a different notebook system, from the Apache project. Its capabilities are similar to Jupyter's, but there is arguably better integration between the parts.
Unlike Jupyter, which is available both from the command line and the jupyterhub web site, Zeppelin is only available in a web version. Ours is at https://zeppelin.cs.rutgers.edu.
See Zeppelin Documentation for more information.
Spark
Spark is available from all of our systems. Spark will run standalone on the node where you run it. On systems such as ilab.cs.rutgers.edu there are enough cores to get reasonable parallelism.
- The usual commands, e.g. pyspark, spark-shell, sparkR, and spark-submit, are available on our systems.
- If you run Jupyter on one of our systems, you'll see that there are kernels for Spark in Python and Scala. These effectively run the pyspark or spark-shell commands, respectively, in a Jupyter session. If you want to use GPUs, there are special considerations; see the note at the end of the Jupyter section.
- Spark is configured to use 8 GB of memory by default, which should be enough for all classwork. With the command-line pyspark and spark-shell, you can add "--driver-memory NNNg" to override this. With the Jupyter Scala kernel, you can set the environment variable SPARK_OPTS="--conf spark.driver.memory=NNNg" before starting the notebook. (A quick way to verify that the setting took effect appears after this list.)
- Spark 3 (which is our current version of Spark) is configured to use Java 17. If you just type "java" on one of our systems, you get Java 17. Spark 2 won't work with anything more recent than Java 8, so the various Spark 2-related commands set an explicit JAVA_HOME to point to Java 8. Note however that Jupyter's Java kernel for Spark uses Java 11 rather than 17, because the Java 17 jshell doesn't work with Spark.
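If you want to verify that a memory override actually took effect, you can ask the running Spark context. This is only a sketch, assuming you're in pyspark or a Spark in Python3 notebook where sc is already defined (see below):

    # Driver memory the current Spark context was started with ("not set" if unspecified).
    print(sc.getConf().get("spark.driver.memory", "not set"))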
If you want to write significant Spark programs, you'll probably be using the command line (and maybe an IDE). See Spark Programming for specifics on writing Spark programs here. The rest of this section describes the use of Spark from Jupyter and the interactive commands.
The pyspark command-line program, and the Spark in Python3 session in Jupyter, set up a Spark context for you in python3. The following variables are defined (a short example of using them follows the list):
- spark - a pyspark.sql.SparkSession (using Hive)
- sc - a SparkContext
- sql - a bound method SparkSession.sql for the session
- sqlContext - an SQLContext [for compatibility]
- sqlCtx - an old name for sqlContext
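Here's a minimal sketch of using these predefined variables. It builds a tiny DataFrame inline rather than reading real data, just to show that no SparkSession.builder boilerplate is needed:

    # spark, sc, and sql are already defined in pyspark / the Spark in Python3 kernel.
    rdd = sc.parallelize(range(10))
    print(rdd.sum())                                  # simple RDD action
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    df.createOrReplaceTempView("demo")
    sql("SELECT count(*) AS n FROM demo").show()      # sql is bound to spark.sql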
The spark-shell command-line program, and the Spark in Scala session in Jupyter, set up a Spark context in Scala, with the following variables:
- spark - an org.apache.spark.sql.SparkSession (using Hive)
- sc - a SparkContext
The Spark in Java kernel in Jupyter defines
- jsc - a JavaSparkContext
- spark - a SparkSession
- sc - a SparkContext
Graphics are available within Jupyter / IPython using matplotlib, e.g. "%matplotlib inline".
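As a rough sketch, in a Jupyter cell you might plot a small Spark result like this (it assumes the environment's matplotlib and pandas, and the predefined spark variable from the lists above):

    %matplotlib inline
    import matplotlib.pyplot as plt

    # Tiny inline DataFrame instead of real data, converted to pandas for plotting.
    pdf = spark.createDataFrame([("a", 3), ("b", 5), ("c", 2)], ["label", "n"]).toPandas()
    plt.bar(pdf["label"], pdf["n"])
    plt.show()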
Python support for Jupyter is well documented. See The Jupyter Notebook.
The rest of this section has information on the Scala kernel for Jupyter, and spark-shell, since they're less well documented elsewhere. Here's the official documentation: Apache Toree Quick Start.
Graphics in Scala
We are currently unable to support graphics in Scala from Jupyter. We suggest that you prepare data in Scala, write it out, and then use Python to graph it. While there are graphics packages for Scala, they either haven't been updated for the current version of Scala or won't install. (If you find one that works, please tell us. We'd be happy to install it.)
Add classes to Scala
The Scala environment has access to a large set of Spark-related libraries, as well as other standard libraries such as Apache Commons. Try "ls /common/system/spark/jars/" to see them all. If you need more, in Jupyter you can use "%classpath" to load them. See the FAQ for more information. It tells you to use
%AddDeps group-id artifact-id version
to load libraries. This command searches the Maven collection of libraries. Just about any library you could want is in Maven. Here's a search tool: Maven search. That search will display the group ID, artifact ID, and latest version. For generic libraries you should probably use the most recent. With Spark-related libraries you may want to use version 3.5.1, so it matches the version of Spark we have installed.
In spark-shell, the --jars and --packages options perform the same function. For --packages, arguments look like groupid:artifactid:version.
Oddities in Scala
- In Jupyter, if you define a class and use it in the same paragraph, you may get an "undefined" error. You may have to execute the definition before the use. You probably won't see this in spark-shell, because it executes every line as soon as it is complete.
- The Jupyter Scala kernel doesn't predefine the SQL context. Try "val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)"
- If you try to create a Hive table, you may get a permissions error from Derby. If so, you'll have to do "System.setSecurityManager(null)" before creating the SQL context.
Map/Reduce
The Hadoop software is installed in standalone mode on all of our systems. The only real use would be for Map/Reduce jobs. Note that jobs will run in local mode. However, on our larger systems, e.g. ilab1, ilab2, and ilab3, you can get a reasonable amount of parallelism if you adjust the number of tasks. (By default only 2 map tasks are run.)
See Map/Reduce on CS Systems for details.