Newton HPC Program - High Performance Computing
The Newton HPC Program is a joint effort between the Office of Research, the Office of Information Technology (OIT), and the departments of the University of Tennessee to establish a campus research computing environment in support of HPC and data-intensive computing applications. OIT operates a variety of computing systems which are accessed by researchers through a unified software environment. Newton membership is available to University of Tennessee researchers from all UT System campuses and institutes.
The Newton users' mailing list provides announcements to Newton users. For support requests, please contact the OIT help desk at 974-9900, visit http://help.utk.edu/, or email Newton_HPC_help@utk.edu.
Current utilization updated every 5 minutes.
The Newton Program has a student assistant position available starting Jan. 2017. Duties may include HPC system administration, software development, computational science application support, and other topics depending on applicant experience. Compensation may include a tuition waiver and hourly pay. If you are interested in this opportunity, please send a resume and cover letter to Newton_HPC_help@utk.edu.
-- Gerald Ragghianti - 2016-11-28
In anticipation of fewer jobs during the break, we will be relaxing the medium and short queue limits until Nov. 28th.
-- Gerald Ragghianti - 2016-11-23
The maintenance outage has been moved to Nov. 13th from 11AM to 5PM.
-- Gerald Ragghianti - 2016-11-08
A maintenance outage is scheduled for Nov. 12th from 11AM to 5PM for all Newton cluster systems. This outage is for regular preventive maintenance.
-- Gerald Ragghianti - 2016-10-20
All Newton clusters are now operational. Please report any problems to Newton_HPC_help@utk.edu.
-- Gerald Ragghianti - 2016-09-06
The Newton cluster compute nodes have been turned off and all jobs stopped due to a building chiller failure. The cluster will be turned back on once chiller service is restored.
-- Gerald Ragghianti - 2016-09-05
Due to the popularity of dedicated sigma nodes, we are reserving 10 more nodes to run only cores_per_node=24 jobs.
-- Gerald Ragghianti - 2016-07-21
The Newton HPC Program has made a number of exciting updates this summer.
The Sigma compute cluster has been expanded by 16 compute nodes. This increases the sigma cluster to 2592 CPU cores with a peak speed of 112 TFLOPS. These nodes were purchased through the Newton buy-in program by research groups in Chemistry, Physics, the Center for Business and Economic Research, and the Entomology and Plant Pathology depts.
A "Monster" 1TB RAM compute node was installed. This system is designed to facilitate jobs that require a very large in-RAM data set in a single shared memory space. Any Newton user can request use of this system. See Monster for details on using this system.
Nine sigma compute nodes are now reserved for whole-node jobs. If your jobs request cores_per_node=24 (a whole sigma node), the job will get priority on these reserved nodes and will likely experience a shorter queue wait time. We will reserve more nodes in this way as the use of whole-node jobs increases. Details on using whole-node jobs (dedicated nodes) are available at Using Dedicated Nodes.
-- Gerald Ragghianti - 2016-07-07
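The core counts quoted in these announcements can be cross-checked with a little arithmetic. The sketch below is illustrative only: it assumes every sigma node, including the 16 new ones, has 24 cores (the whole-node figure given above), and that the expansion started from the 2208-core size reported after the Feb. 2016 maintenance. The per-core figure is simply the quoted peak divided by the core count, not a measured value.

```python
# Figures quoted in the announcements; the uniform 24-core node size is an assumption.
CORES_PER_SIGMA_NODE = 24       # "cores_per_node=24 (a whole sigma node)"
CORES_BEFORE_EXPANSION = 2208   # sigma size reported after the Feb. 2016 maintenance
NODES_ADDED = 16                # nodes added through the buy-in program
PEAK_TFLOPS = 112               # quoted peak speed of the expanded cluster


def expanded_core_count(cores_before, nodes_added, cores_per_node):
    """Return the total core count after adding whole nodes."""
    return cores_before + nodes_added * cores_per_node


total_cores = expanded_core_count(
    CORES_BEFORE_EXPANSION, NODES_ADDED, CORES_PER_SIGMA_NODE
)
total_nodes = total_cores // CORES_PER_SIGMA_NODE
gflops_per_core = PEAK_TFLOPS * 1000 / total_cores

print(total_cores)                 # 2592, matching the announced size
print(total_nodes)                 # 108 nodes under the 24-core assumption
print(round(gflops_per_core, 1))   # 43.2 GFLOPS per core (peak / cores)
```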
The Newton cluster updates are finished. Here is some of the work that was completed:
- Moved cables for the sigma cluster Infiniband network to increase the cluster size to 2208 CPU cores
- Tested all Infiniband cables and switches for the sigma cluster (found one bad cable)
- Installed two new 10 Gbit/sec Ethernet fiber links to the sigma cluster
- Increased the RAM available on the main storage servers
- Applied security updates to all systems
- Moved all home directories to a faster server
- Applied home directory capacity limits to avoid running out of storage space
We will soon start enforcing home directory capacity limits. Information about this will be available in the Newton web documentation soon.
-- Gerald Ragghianti - 2016-02-22
The Newton clusters will be offline for maintenance on Sat, Feb. 20th from 8AM until 9PM. This work includes re-cabling of the new Sigma cluster Infiniband fabric to merge the two racks of sigma compute nodes into a single cluster.
-- Gerald Ragghianti - 2016-02-02
The Newton clusters will be offline for maintenance on Wed, Dec. 16th from 6PM until midnight.
-- Gerald Ragghianti - 2015-12-01
The Newton clusters will undergo maintenance on Sunday, Sept. 13th from noon until 5PM in order to upgrade the compute node operating system and to integrate the new Sigma cluster into production use.
-- Gerald Ragghianti - 2015-08-31
The Newton clusters are back online. The sigma cluster is currently offline for testing in preparation for bringing it into production use.
-- Gerald Ragghianti - 2015-08-28
Cluster login access is currently disabled while the main storage servers are restarted.
-- Gerald Ragghianti - 2015-08-27
The newest Newton cluster is now online and ready for user application testing. Please see New Cluster for more information.
-- Gerald Ragghianti - 2015-07-30
All Newton systems will be down for electrical maintenance on June 27 from 8:00AM until about 4:00PM (estimated). This maintenance is for upgrading electrical power capacity and to move the Newton network connectivity to the new high performance science network.
-- Gerald Ragghianti - 2015-05-13
The Newton job queues are now enabled. Please check any files that were being written at the time of the outage on Saturday, as there is a chance that they could be corrupt. This is also a good time to review the Newton data backup policy and to ensure that your critical files are located on storage allocations that are regularly backed up: https://newton.utk.edu/bin/view/Main/Cluster Storage.
Please report any problems to the OIT help desk at 974-9900 or directly to Newton system administrators at Newton_HPC_help@utk.edu.
-- Gerald Ragghianti - 2015-04-27
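One way to act on the advice above is to list the files whose modification time falls within the outage period. The following is a minimal sketch, not an official Newton tool: the function name `files_modified_between` is hypothetical, and the directory and time window in the commented example are placeholders to be replaced with your own storage location and the actual outage period.

```python
import os
from datetime import datetime


def files_modified_between(root, start, end):
    """Return sorted paths under root last modified between start and end (inclusive)."""
    start_ts, end_ts = start.timestamp(), end.timestamp()
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symlinks, so broken links are not an error.
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if start_ts <= mtime <= end_ts:
                hits.append(path)
    return sorted(hits)


# Example call (placeholder directory and window; substitute your own values):
# suspects = files_modified_between(
#     os.path.expanduser("~"),
#     datetime(2015, 4, 25, 0, 0),
#     datetime(2015, 4, 26, 0, 0),
# )
```

A file appearing in the result is only a candidate for inspection; a modification time inside the window does not by itself mean the file is corrupt.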
We have enabled interactive login access to the Newton clusters. The job queues will be closed until early Monday morning in order to allow the storage system time to rebuild the storage arrays and to give more time to ensure that the clusters are running properly.
-- Gerald Ragghianti - 2015-04-27
Our investigation determined that the main Lustre storage system experienced a simultaneous failure of 7 hard drives just as the storm hit campus last night. This caused one of the Lustre storage targets (RAID arrays) to go into a failure mode that halted all read and write operations for files on that device. Fortunately, the drives did not experience an actual failure, so no data was lost. We were able to revive the drives and restart the storage systems. As a precautionary measure, we are replacing three of the hard drives. This procedure will take up to 24 hours. We are currently performing tests on the storage system, and will restore login access to the cluster as soon as possible.
-- Gerald Ragghianti - 2015-04-26
The Newton clusters are currently offline due to storage system instability. We are currently investigating the problem. The cluster compute nodes, login nodes, and storage systems will need to be restarted and tested. We will update Newton users on the system status once the systems are back online.
-- Gerald Ragghianti - 2015-04-26
We discovered that in some cases, use of the "module" command was very slow on the Newton systems. This was caused by the module program searching for a startup file under /data/ and attempting to mount a remote filesystem in some cases. This issue has now been resolved, and use of the module command should be quicker. This will also speed up the interactive login process.
-- Gerald Ragghianti - 2015-03-04
Preventive maintenance is scheduled for Aug. 10th between 8AM and 5PM for all Newton HPC systems. All jobs will be stopped during this period, and login access will be disabled.
-- Gerald Ragghianti - 2014-08-10
We are currently working to restore service to the Newton clusters. Backup power failed early this morning due to the diesel generator running out of fuel during an extended power outage. We will provide updates once service is restored or we have an ETA for the service.
-- Gerald Ragghianti - 2014-07-28
All Newton systems are currently down due to a power outage.
-- Gerald Ragghianti - 2014-07-28
The scratch data location has been moved from /data/scratch to /lustre/scratch.
-- Gerald Ragghianti - 2014-04-21
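Scripts that still hard-code the old scratch location will need their paths updated. A minimal sketch of one way to translate such paths is shown below; the helper name `migrate_scratch_path` and the example sub-paths are hypothetical, and only the two scratch prefixes come from the announcement above.

```python
OLD_SCRATCH = "/data/scratch"    # previous scratch location
NEW_SCRATCH = "/lustre/scratch"  # current scratch location


def migrate_scratch_path(path):
    """Map a path under the old scratch prefix to the new one; leave other paths alone."""
    # Match the prefix only at a directory boundary, so "/data/scratchy" is untouched.
    if path == OLD_SCRATCH or path.startswith(OLD_SCRATCH + "/"):
        return NEW_SCRATCH + path[len(OLD_SCRATCH):]
    return path


# "someuser/run1" is a made-up example path, not a real Newton directory.
print(migrate_scratch_path("/data/scratch/someuser/run1"))  # /lustre/scratch/someuser/run1
print(migrate_scratch_path("/home/someuser/run1"))          # /home/someuser/run1
```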
We have finished upgrading the Newton clusters from Scientific Linux 5 to 6.4 with the migration of the Rho compute nodes (GPU nodes) to the new operating system. At this time, all compute nodes are accessible only through the new login nodes "newlogin.newton.utk.edu". The old login nodes will remain available under the name "oldlogin.newton.utk.edu" but no compute nodes will be available. Data under /lustre on the old login nodes will remain there until further notice. We will also be updating the login node name "login.newton.utk.edu" to now point to the new nodes.
-- Gerald Ragghianti - 2014-04-16
We have been having problems with one of the main storage servers on the Newton cluster. The last time the server stopped working, our diagnosis pointed to faulty RAM chips which we then replaced. However, the server went offline again a few days later. Since this server is scheduled to be retired soon, we have decided to move critical data off of this server ASAP to avoid further outages. We have finished the planning for this change, and an emergency maintenance outage of 3 hours will be needed to complete the work and testing. Here is the required work:
- Shut down all user-facing compute nodes on the Newton clusters (impact: no user or job access during the work).
- Transfer all home directories from old server to new storage system.
- Transfer compute node system images from old to new storage system.
- Change compute node images to use the new home directories and system images.
- Restart all compute nodes
This work is scheduled for tomorrow, March 9th, from 3PM to 6PM.
-- Gerald Ragghianti - 2014-03-09