
This documentation is out of date but retained for historical purposes. The Sigma cluster is now in production.

Sigma cluster


In survey responses from Newton users last winter, OIT received many suggestions for increasing computational capacity, availability, and ease of use. In response, OIT has formulated an upgrade plan for the Newton HPC computational resources (compute nodes) that will provide the backbone of the Newton HPC Program services for the next few years.

Cluster design objectives

The Newton Program currently operates a number of computational clusters, each with different features (CPU type, hardware age, GPGPU support). While this variety of hardware can be useful, it makes it more difficult to optimize computational tasks for maximum efficiency and to submit jobs correctly. In addition, a wide variety of hardware is harder for system administrators to manage effectively. To improve this situation, OIT has formulated a plan that will allow us to build a larger, more uniform computational resource by moving to a three-year compute cluster upgrade cycle. The design of the new cluster will allow us to expand or upgrade within the three-year window without fragmenting the resources into multiple clusters.

The first installed phase of this new cluster is now in place and ready for testing. The initial cluster consists of 70 compute nodes with a total of 1680 CPU cores and 9 TB of RAM. This cluster can be seamlessly expanded to 144 compute nodes and 3456 CPU cores over the next one to two years as needed.

Tech details

The phase-one cluster (named "Sigma") consists of 70 Lenovo NeXtScale nx360 compute nodes, each with two Intel Xeon E5-2680v3 CPUs (Haswell microarchitecture) providing a total of 24 CPU cores and 128 GB of RAM per compute node. The compute nodes are interconnected with a high-performance FDR InfiniBand network providing 56 Gbit/sec throughput with sub-microsecond latency.
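
For reference, these per-node specifications can be confirmed from a shell on any Sigma node (for example, inside a test job) using standard Linux utilities; this is a generic sketch, not Sigma-specific tooling:

# Report the number of available CPU cores (expected: 24 per Sigma node)
nproc
# Report total memory in gigabytes (expected: roughly 128 GB per node)
free -g
# Show the CPU model string (Intel Xeon E5-2680v3, Haswell)
grep -m1 "model name" /proc/cpuinfo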

Accessing the cluster

The Sigma cluster will be accessible through the login node "sigma.newton.utk.edu". You should directly log into this machine via SSH before compiling new code or recompiling existing applications. Jobs can be submitted to the Sigma cluster from any Newton login node.
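
For example, you might connect to the login node as follows; "username" is a placeholder for your own Newton account name:

# Connect to the Sigma login node via SSH (replace "username" with your account name)
ssh username@sigma.newton.utk.edu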

Running applications

The new hardware in the Sigma compute cluster requires an upgrade of the Newton cluster operating system from Red Hat 6.4 to 6.5. To ensure compatibility with the new operating system, we recommend recompiling all applications that will run on Sigma nodes. This also requires the reinstallation of all applications located under /data/apps/ and managed by the modules system. We have initiated an automated rebuild of this software, but it will not be available until all applications are rebuilt and verified. In the meantime, only the openmpi/1.6.5-intel module has been rebuilt and verified to work properly on the Sigma cluster. All applications using MPI should be rebuilt using this module (execute "module switch openmpi/1.6.5-intel") until other openmpi versions are made available.
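
As a rough sketch, rebuilding an MPI application on the Sigma login node might look like the following; the source and binary names (my_app.c, my_app) are placeholders for your own code:

# Replace the currently loaded Open MPI module with the rebuilt one
module switch openmpi/1.6.5-intel
# Recompile the MPI application against the new libraries
# (my_app.c and my_app are placeholder names for your own source and binary)
mpicc -O2 -o my_app my_app.c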

Submitting jobs

The Sigma cluster is currently accepting jobs for testing purposes only. These jobs must be submitted with the "-l testing" parameter and will execute in the short_sigma queue with a 2-hour time limit. The Sigma cluster will accept single-process jobs and exclusive-access parallel jobs. Parallel jobs must also use "-l cores_per_node=24" as described in the documentation on Using Dedicated Nodes. For example, a valid parallel job that would use 96 CPU cores on four compute nodes would look like this (a usage example follows the script):

#!/bin/bash
# First, request "testing" and dedicated (whole-node) access
#$ -l testing,cores_per_node=24
# Now request 4 nodes through the openmpi parallel environment
#$ -pe openmpi 4
# Submit to the Sigma testing queue
#$ -q short_sigma
# Run the job from the current working directory
#$ -cwd
mpirun application
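
Assuming the script above is saved as, for example, sigma_test.sh (a placeholder filename), it can be submitted from any Newton login node with the standard Grid Engine commands:

# Submit the test job script (sigma_test.sh is a placeholder filename)
qsub sigma_test.sh
# Check the status of your queued and running jobs
qstat -u $USER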

Feedback

Please send feedback on Sigma cluster testing to Newton_HPC_help@utk.edu.