Computer Science Data Science Facilities
This page describes primary software and systems for general use within the department. Individuals and research groups may, of course, have their own.
We currently have a 3-node Hadoop cluster. However, the major tools are available on all of our systems. The primary use of the cluster is for courses where you want to show students what a Hadoop cluster looks like. For serious work, you can get at least as good performance from running tools on the local system, particularly if you use one of the large systems such as ilab1, ilab2 and ilab3.
Here are the tools discussed on this page. Except for the last entry, these are all the versions outside the Hadoop cluster, i.e. those available on all of our systems.
If you need additional tools, please contact firstname.lastname@example.org.
Most data science within computer science is done in Python. We have Anaconda-based environments available on our CentOS systems. They have the major packages used for data science already loaded. If you need additional packages, please contact email@example.com.
- Do not simply type "python". That will give you a copy of python2. Support for Python 2 is being dropped in January 2020. Also, our default python2 doesn't have the data science packages loaded.
- Use one of the Anaconda-based environments. See Using Python on our systems.
- You can install packages yourself, using "pip install --user". This will put the additional packages in your home directory.
- If our environments don't have what you need, you can download a copy of Python and create your own environment. We suggest putting it in /common/users/NETID, since the 3 GB quota on your main home directory may not be enough.
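As a quick sanity check for the points above, the following sketch (standard library only) reports which interpreter you are actually running and the directory that "pip install --user" would install into for that interpreter. The function name describe_python is our own for illustration; it is not something installed on the systems.

```python
import site
import sys


def describe_python():
    """Summarize the running interpreter and its per-user package directory."""
    return {
        "executable": sys.executable,
        "major_version": sys.version_info.major,
        # Directory that "pip install --user" installs into for this interpreter.
        "user_site": site.getusersitepackages(),
    }


if __name__ == "__main__":
    info = describe_python()
    if info["major_version"] < 3:
        print("This is python2; use one of the Anaconda-based environments instead.")
    for key in sorted(info):
        print("%s = %s" % (key, info[key]))
```

If the reported user_site is under your main home directory, packages installed with --user count against the quota on that directory.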
Jupyter is a "notebook": a web interface designed to make it easy to do quick analysis, primarily in Python. (We have also installed a kernel for Scala.) We don't recommend it for large programs, but many people use it for data analysis.
- Jupyter is installed in all of our Anaconda-based environments. See Using Python on our systems for how to activate an environment.
- Once you've activated one, you can type a command like: jupyter notebook --ip=`hostname` --browser="none". That will display a URL to which you can point your browser.
- If all you want to do is run jupyter, you don't actually have to activate the environment. You can run the copy of jupyter directly from it, e.g. for the python37 environment, /koko/system/anaconda/envs/python37/bin/jupyter notebook --ip=`hostname` --browser="none".
- We have installed kernels for use with Spark. See the next section. If you want to use Scala interactively, but aren't interested in Spark, you can still use the Spark in Scala kernel and simply ignore the Spark features.
- We haven't installed a kernel for Java. Java isn't as well suited for interactive use as Scala. We suggest that Java programmers spend a few minutes to learn enough about Scala to use it. The languages are very similar.
Spark Outside Hadoop
Spark is available from all of our systems. Except within Hadoop, Spark will run standalone on the node where you run it. On systems such as ilab.cs.rutgers.edu there are enough cores to get reasonable parallelism.
- The usual commands, e.g. pyspark, spark-shell, sparkR, spark-submit, are available on the systems.
- If you run Jupyter on one of our systems, you'll see that there are kernels for Spark in Python and Scala. These effectively run the pyspark or spark-shell commands, respectively, in a Jupyter session.
- Spark is configured to use 8 GB of memory by default. This should be enough for all classwork. With the command-line pyspark and spark-shell, you can add "--driver-memory NNNg" to override this. With the Jupyter Scala kernel, you can set an environment variable SPARK_OPTS="--conf spark.driver.memory=NNNg" before starting the notebook.
- Spark is configured to use Java 8. We plan to move our system default to Java 11. However, Spark 2.4.3 won't work with anything more recent than Java 8.
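The two ways of overriding driver memory described above can be sketched as a pair of small helpers. This is only an illustration: the helper names are ours, and 16 is an arbitrary example size standing in for NNN. The helpers only build the strings; they do not start Spark.

```python
import os


def driver_memory_flag(gigabytes):
    """Command-line form, for pyspark and spark-shell."""
    return ["--driver-memory", "%dg" % gigabytes]


def spark_opts(gigabytes):
    """Value for SPARK_OPTS, set before starting the Jupyter Scala kernel."""
    return "--conf spark.driver.memory=%dg" % gigabytes


if __name__ == "__main__":
    # 16 is only an example size; substitute the amount you need.
    print(" ".join(["pyspark"] + driver_memory_flag(16)))
    # Copy of the current environment with SPARK_OPTS added.
    env = dict(os.environ, SPARK_OPTS=spark_opts(16))
    print("SPARK_OPTS=" + env["SPARK_OPTS"])
```

The env dictionary could be passed to subprocess when launching jupyter from a script. In an interactive shell, exporting SPARK_OPTS before running jupyter has the same effect.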
If you want to write significant Spark programs, you'll probably be using the command line (and maybe an IDE). See Spark Programming for specifics in doing Spark programs here. The rest of this section describes use of Spark from Jupyter and the interactive commands.
NOTE: Spark is currently available in Jupyter when run from the base Anaconda environment, the python36 environment, and the python37 environment. It's not in python35, because we're about to remove it, or in python2.
The pyspark command-line program, and the Spark in Python3 session in Jupyter, set up a Spark context for you in python3. The following variables are defined:
- spark - a pyspark.sql.SparkSession (using Hive)
- sc - a SparkContext
- sql - a bound method SparkSession.sql for the session
- sqlContext - an SQLContext [for compatibility]
- sqlCtx - an old name for sqlContext
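A minimal way to confirm that the predefined variables work is to run one tiny job through each of them. The sketch below assumes it is pasted into a pyspark session or a Spark in Python3 notebook, where spark and sc already exist. The function name session_smoke_test is hypothetical.

```python
def session_smoke_test(spark, sc):
    """Exercise the predefined `sc` and `spark` objects with two tiny jobs."""
    # RDD API through the SparkContext: sum of the integers 1..100.
    total = sc.parallelize(range(1, 101)).sum()
    # SQL API through the SparkSession: a one-row, one-column DataFrame.
    rows = spark.sql("SELECT 1 + 1 AS two").collect()
    return total, rows[0]["two"]
```

Calling session_smoke_test(spark, sc) in such a session should return (5050, 2). An error instead usually means the session did not start correctly.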
The spark-shell command-line program, and the Spark in Scala session in Jupyter, set up a Spark context, with the following variables:
- spark - an org.apache.spark.sql.SparkSession (using Hive)
- sc - a SparkContext
Graphics is available within Jupyter / ipython using matplotlib, e.g. "%matplotlib inline".
Python support for Jupyter is well documented. See The Jupyter Notebook.
The rest of this section has information on the Scala kernel for Jupyter, and spark-shell. Here's the official documentation: Apache Toree Quick Start.
Graphics in Scala
Graphics is available within Toree, but it's not built in. There's no one dominant package. The following two can be loaded with one or two lines. Currently they probably only work in Jupyter, not spark-shell. This will be fixed when we get Spark 3.
Add classes to Scala
The Scala environment has access to a large set of Spark-related libraries, as well as other standard libraries such as Apache Commons. Try "ls /koko/system/spark/jars/" to see them all. If you need more, in Jupyter, you can use "%classpath" to load them. See the FAQ for more information. They tell you to use
%AddDeps group-id artifact-id version
to load libraries. This command searches the Maven collection of libraries. Here's a search tool: Maven search. That search will display the group ID, artifact ID, and latest version. For generic libraries you probably should use the most recent. With Spark-related libraries you may want to use version 2.4.3, so it matches the version of Spark we have installed.
In spark-shell, the --jars and --packages options perform the same function. For --packages, arguments look like groupid:artifactid:version.
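The same Maven coordinate is written differently in the two settings: space-separated for %AddDeps in Jupyter, colon-separated for --packages in spark-shell. The sketch below formats one coordinate both ways. The helper names are ours, and the library used (org.apache.commons, commons-math3, 3.6.1) is only an illustrative choice, not a recommendation.

```python
from collections import namedtuple

# The three pieces of information shown by a Maven search.
MavenCoordinate = namedtuple(
    "MavenCoordinate", ["group_id", "artifact_id", "version"]
)


def as_add_deps(coord):
    """Line to type in a Jupyter Scala cell."""
    return "%AddDeps {} {} {}".format(*coord)


def as_packages_arg(coord):
    """Argument to pass after --packages on the spark-shell command line."""
    return "{}:{}:{}".format(*coord)


if __name__ == "__main__":
    # Illustrative library only; substitute the one you need.
    coord = MavenCoordinate("org.apache.commons", "commons-math3", "3.6.1")
    print(as_add_deps(coord))
    print("spark-shell --packages " + as_packages_arg(coord))
```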
Oddities in Scala
- If you define a data class and use it in the same paragraph it may give an undefined error. You may have to execute the definition before the use. You probably won't see this in spark-shell, because it executes every line as it is complete.
- The Jupyter Scala kernel doesn't predefine the SQL context. Try "val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)".
- If you try to create a Hive table, you may get a permissions error from Derby. If so, you'll have to do "System.setSecurityManager(null)" before creating the SQL context.
Hadoop is installed in standalone mode on all of our systems. The only real use would be for Map/Reduce jobs. Note that outside the cluster, jobs will run in local mode. However, on our larger systems, e.g. ilab1, ilab2 and ilab3, you can get a reasonable amount of parallelism if you adjust the number of tasks. (By default only 2 map tasks are run.)
See Map/Reduce on CS Systems for details.
We have a small Hadoop cluster, with 3 nodes. It has the typical tools: HDFS, YARN, ZooKeeper, MapReduce, Hive, HBase, Pig, Kafka. In addition, it has two web-based notebooks: JupyterHub and Zeppelin. It's intended primarily for coursework, so it has enough memory to survive reasonable size classes.
See Computer Science Hadoop Cluster for specifics.