Newton HPC Program - High Performance Computing


The Newton HPC Program is a joint effort of the Office of Research, the Office of Information Technology (OIT), and the departments of the University of Tennessee to establish a campus research computing environment in support of HPC and data-intensive computing applications. OIT operates a variety of computing systems, which researchers access through a unified software environment. Newton membership is available to University of Tennessee researchers from all UT System campuses and institutes.

The Newton mailing list provides announcements to Newton users. For support requests, please contact the OIT help desk at 974-9900, visit http://help.utk.edu/, or email Newton_HPC_help@utk.edu.


Current utilization is updated every 5 minutes.

Announcements

Please check here regularly for news regarding the status of Newton.

Newton compute capability to transition to ACF on October 6

Compute capability on Newton will no longer be available after October 6, coinciding with the power outage; after that date no compute resources will be operational on Newton. Please continue to transition your computational work to the ACF. All home directory, /lustre, and /gamma files will remain available until October 18, except during the power outage. On October 18 the Newton file system will be set to read-only, and it will be decommissioned on November 20. Please transition your Newton storage to the ACF; see the ACF data transfer documentation for more information on how to transfer files between Newton and the ACF.

-- George Butler - 2017-09-18

Power outage in KPB computing center

A planned power outage is scheduled from 3 PM Friday, October 6 to 10 AM Monday, October 9 and will affect all equipment in the Newton cluster. The outage is needed to upgrade the power in the KPB computing center, and all Newton equipment in KPB must be powered off for this work. Every precaution will be taken to shut the systems down in an orderly fashion, but powering electronic equipment off and back on carries a risk of thermal stress. Please make arrangements to back up any critical data from Newton storage before Friday, October 6. This would also be a good time to start moving Newton data to the ACF /lustre/haven file system, which is scheduled to go online at the end of the day on September 13, 2017. See the ACF data transfer documentation for details on how to move data from Newton to the ACF.

-- George Butler - 2017-09-12

Remaining Sigma and Rho nodes to be moved October 6

The Sigma cluster nodes will be moved from KPB to the ACF facility in the JICS building on October 6, coinciding with the KPB power outage on that date. The Sigma nodes will be made available in the ACF as soon as possible after the hardware move. Sigma will become unavailable to users at 8 AM on Wednesday, October 4 for ACF benchmarking, and the nodes will be powered off at 9 AM on Friday, October 6 to move the equipment.

-- George Butler - 2017-09-12

Data transfer nodes (DTN)

Two high-performance DTNs have been put in place and configured as part of the Newton cluster, and eight DTNs are available in the ACF. For information on how to use the DTNs to transfer data from Newton to the ACF, please see the ACF data transfer documentation; an example transfer is sketched below.
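
For illustration only, here is a minimal sketch of copying a directory from Newton to the ACF through a DTN using rsync. The hostname, username, and directory names below are placeholders rather than real DTN addresses; take the actual values from the ACF data transfer documentation.

    # Placeholder hostname, username, and paths -- substitute the values
    # given in the ACF data transfer documentation.
    # -a preserves permissions and timestamps; -P shows progress and
    # resumes partial transfers.
    rsync -avP /lustre/myproject/results/ \
        myusername@acf-dtn.example.edu:/lustre/haven/user/myusername/results/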

-- George Butler - 2017-09-12

Lustre Outage

Starting at roughly 4 PM today, we encountered issues with Lustre becoming unresponsive. Several attempts were made to restore connectivity without disrupting users, but ultimately we determined that correcting the issue required restarting the entire system. As a result, all jobs have been killed and the queue has been cleared; these jobs will need to be resubmitted.

-- George Butler - 2017-08-07

Rho and Sigma nodes moved

  • The Advanced Computing Facility (ACF) came online on July 26, 2017 with the 45-node Beacon cluster as the primary cluster, running CentOS 7.3 with the 1.2-petabyte Lustre Medusa file system. Information about the ACF, including the ACF timeline, is published at ACF Timeline.

  • Monster, the large-memory node, was moved to the ACF at JICS, and the target date for its integration into the ACF is August 11, 2017. If you need more memory than Newton provides, the ACF Beacon nodes have 256 GB of memory.

  • On July 28, a chassis of Rho (4 nodes) and a chassis of Sigma (12 nodes) were taken offline and moved to the ACF. This is the start of the integration of the Rho and Sigma nodes into the ACF while keeping as many resources as possible available to users.

  • A rack of Rho nodes will be taken offline on August 10 in preparation for powering those nodes off and moving them to the ACF on August 11. The queues will be adjusted to prevent, as far as possible, new jobs from running past August 10. Jobs still running on these Rho nodes on the morning of August 10 will be removed. Information about the ACF, including the ACF timeline, is published at ACF.

-- George Butler - 2017-08-07

ACF Transition Update

Currently, the JICS Beacon resource is down for upgrades to the latest versions of the operating system and other software in preparation for starting the Advanced Computing Facility (ACF). Users will not be able to log in (to what used to be Beacon and, going forward, will be the ACF) until the upgrades and testing are complete. Estimated completion is 1 PM Monday, July 10. See the ACF timeline.

-- George Butler - 2017-07-07

Newton Account Creation Disabled

To facilitate the ACF transition, no new accounts will be created on Newton starting July 1, 2017. The account request form will be disabled at that time, and thereafter any inquiries regarding new accounts should be directed to the ACF.

-- George Butler - 2017-06-29

Monster Node Taken Offline

As part of the continuing transition of Newton to the ACF, the Monster node will be taken offline on July 7, 2017. To facilitate this, the long queue for Monster will stop accepting jobs at 9:00 AM on July 3, 2017, the medium queue at 9:00 AM on July 5, and the short queue at 10:00 PM on July 6. The overall timeline for the transition can be found at ACF Initial Timeline; additional details will be posted here as they become available.

-- George Butler - 2017-06-28

ACF Transition Update

Starting July 1, 2017, the Newton HPC cluster will be managed by the Joint Institute for Computational Sciences (JICS) and become part of the University of Tennessee Advanced Computing Facility (ACF). Further details are available at the OIT Service Catalog.

-- George Butler - 2017-06-14

OIT will be performing a large upgrade to the data center that houses the Newton clusters from 5:00 PM Friday, July 14 to 8:00 AM Monday, July 17. All Newton systems will be powered off during this period.

-- Gerald Ragghianti - 2017-05-30

OIT will be performing a network upgrade on Sunday, March 12th from 8:00AM to 5:00PM that will affect the Newton clusters. Newton systems will be unavailable for up to two hours during this maintenance window. No running user jobs will be affected.

-- Gerald Ragghianti - 2017-03-07

The Newton clusters will be offline for maintenance on Sunday, March 19th from 10AM until 5PM.

-- Gerald Ragghianti - 2017-03-06

The long_chi job queue has been disabled in preparation for upcoming cluster upgrades. All other queues, including long_sigma, long_phi, and long_rho remain in service.

-- Gerald Ragghianti - 2017-01-04

The Newton Program has a student assistant position available starting January 2017. Duties may include HPC system administration, software development, computational science application support, and other areas depending on applicant experience. Compensation may include a tuition waiver and an hourly wage. If you are interested in this opportunity, please send a resume and cover letter to Newton_HPC_help@utk.edu.

-- Gerald Ragghianti - 2016-11-28

In anticipation of fewer jobs during the break, we will be relaxing the medium and short queue limits until Nov. 28th.

-- Gerald Ragghianti - 2016-11-23

The maintenance outage has been moved to Nov. 13th from 11AM to 5PM.

-- Gerald Ragghianti - 2016-11-08

A maintenance outage is scheduled for Nov. 12th from 11AM to 5PM for all Newton cluster systems. This outage is for regular preventive maintenance.

-- Gerald Ragghianti - 2016-10-20

All Newton clusters are now operational. Please report any problems to Newton_HPC_help@utk.edu.

-- Gerald Ragghianti - 2016-09-06

The Newton cluster compute nodes have been turned off and all jobs stopped due to a building chiller failure. The cluster will be turned back on once chiller service is restored.

-- Gerald Ragghianti - 2016-09-05

Due to the popularity of dedicated sigma nodes, we are reserving 10 more nodes to run only cores_per_node=24 jobs.

-- Gerald Ragghianti - 2016-07-21

The Newton HPC Program has made a number of exciting updates this summer.

The Sigma compute cluster has been expanded by 16 compute nodes, increasing it to 2592 CPU cores with a peak speed of 112 TFLOPS. These nodes were purchased through the Newton buy-in program by research groups in Chemistry, Physics, the Center for Business and Economic Research, and the Entomology and Plant Pathology departments.

A "Monster" 1TB RAM compute node was installed. This system is designed to facilitate jobs that require a very large in-RAM data set in a single shared memory space. Any Newton user can request use of this system. See Monster for details on using this system.

Nine sigma compute nodes are now reserved for whole-node jobs. If your job requests cores_per_node=24 (a whole sigma node), it will get priority on these reserved nodes and will likely experience a shorter queue wait time. We will reserve more nodes in this way as the use of whole-node jobs increases. Details on using whole-node jobs (dedicated nodes) are available at Using Dedicated Nodes; an example submission script is sketched below.
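
For illustration only, here is a minimal sketch of a whole-node submission script. Only the cores_per_node=24 request comes from this announcement; the directive syntax, job name, and program are assumptions, so take the exact form from the Using Dedicated Nodes documentation.

    #!/bin/bash
    # Illustrative whole-node job script. The directive syntax is assumed
    # (Grid Engine style); confirm it against the Using Dedicated Nodes page.
    #$ -N whole_node_job           # hypothetical job name
    #$ -cwd                        # run from the submission directory
    #$ -l cores_per_node=24        # request all 24 cores of one sigma node

    # Hypothetical MPI program using the full node
    mpirun -np 24 ./my_application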

-- Gerald Ragghianti - 2016-07-07

The Newton cluster updates are finished. Here is some of the work that was completed:

  • Moved cables for the sigma cluster Infiniband network to increase the cluster size to 2208 CPU cores
  • Tested all Infiniband cables and switches for the sigma cluster (found one bad cable)
  • Installed two new 10 Gbit/sec Ethernet fiber links to the sigma cluster
  • Increased the RAM available on the main storage servers
  • Applied security updates to all systems
  • Moved all home directories to a faster server
  • Applied home directory capacity limits to avoid running out of storage space

We will soon start enforcing home directory capacity limits; information about this will be available in the Newton web documentation. A quick way to check your current usage is sketched below.
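
As a quick illustration (assuming the standard Linux tools on the login nodes; the exact quota mechanism is not specified here), you can check how much space your home directory currently uses before the limits take effect:

    # Total size of your home directory
    du -sh $HOME

    # If the limits are implemented as filesystem quotas, this reports your
    # usage and limit (depends on how the limits are ultimately enforced)
    quota -s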

-- Gerald Ragghianti - 2016-02-22

The Newton clusters will be offline for maintenance on Sat, Feb. 20th from 8AM until 9PM. This work includes re-cabling of the new Sigma cluster Infiniband fabric to merge the two racks of sigma compute nodes into a single cluster.

-- Gerald Ragghianti - 2016-02-02

The Newton clusters will be offline for maintenance on Wed, Dec. 16th from 6PM until midnight.

-- Gerald Ragghianti - 2015-12-01

The Newton clusters will undergo maintenance on Sunday, Sept. 13th from noon until 5PM in order to upgrade the compute node operating system and to integrate the new Sigma cluster into production use.

-- Gerald Ragghianti - 2015-08-31

The Newton clusters are back online. The sigma cluster is currently offline for testing in preparation for bringing it into production use.

-- Gerald Ragghianti - 2015-08-28

Cluster login access is currently disabled while the main storage servers are restarted.

-- Gerald Ragghianti - 2015-08-27

The newest Newton cluster is now online and ready for user application testing. Please see New Cluster for more information.

-- Gerald Ragghianti - 2015-07-30

All Newton systems will be down for electrical maintenance on June 27 from 8:00 AM until about 4:00 PM (estimated). This maintenance will upgrade electrical power capacity and move the Newton network connectivity to the new high-performance science network.

-- Gerald Ragghianti - 2015-05-13

The Newton job queues are now enabled. Please check any files that were being written to at the time of the outage on Saturday, as there is a chance those files could be corrupted. This is also a good time to review the Newton data backup policy and to ensure that your critical files are located on storage allocations that are regularly backed up: https://newton.utk.edu/bin/view/Main/Cluster Storage.

Please report any problems to the OIT help desk at 974-9900 or directly to Newton system administrators at Newton_HPC_help@utk.edu.

-- Gerald Ragghianti - 2015-04-27

We have enabled interactive login access to the Newton clusters. The job queues will be closed until early Monday morning in order to allow the storage system time to rebuild the storage arrays and to give us more time to ensure that the clusters are running properly.

-- Gerald Ragghianti - 2015-04-27

Our investigation determined that the main Lustre storage system experienced a simultaneous failure of 7 hard drives just as the storm hit campus last night. This caused one of the Lustre storage targets (RAID arrays) to go into a failure mode that halted all read and write operations for files on that device. Fortunately, the drives did not experience an actual failure, so no data was lost. We were able to revive the drives and restart the storage systems. As a precautionary measure, we are replacing three of the hard drives. This procedure will take up to 24 hours. We are currently performing tests on the storage system and will restore login access to the cluster as soon as possible.

-- Gerald Ragghianti - 2015-04-26

The Newton clusters are currently offline due to storage system instability. We are currently investigating the problem. The cluster compute nodes, login nodes, and storage systems will need to be restarted and tested. We will update Newton users on the system status once the systems are back online.

-- Gerald Ragghianti - 2015-04-26

We discovered that in some cases, use of the "module" command was very slow on the Newton systems. This was caused by the module program searching for a startup file under /data/ and attempting to mount a remote filesystem in some cases. This issue has now been resolved, and use of the module command should be quicker. This will also speed up the interactive log-in process.
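
For reference, these are the typical commands the module system provides (the package name below is only an example; run module avail to see what is actually installed on Newton):

    # List the software packages available through the module system
    module avail

    # Load a package into your shell environment (package name illustrative)
    module load gcc

    # Show which modules are currently loaded
    module list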

-- Gerald Ragghianti - 2015-03-04

Preventive maintenance is scheduled for Aug. 10th between 8AM and 5PM for all Newton HPC systems. All jobs will be stopped during this period, and log in access will be disabled.

-- Gerald Ragghianti - 2014-08-10

We are currently working to restore service to the Newton clusters. Backup power failed early this morning due to the diesel generator running out of fuel during an extended power outage. We will provide updates once service is restored or we have an ETA for the service.

-- Gerald Ragghianti - 2014-07-28

All Newton systems are currently down due to a power outage.

-- Gerald Ragghianti - 2014-07-28

The scratch data location has been moved from /data/scratch to /lustre/scratch.
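
If existing job scripts still reference the old location, they can be updated in place. The script name below is hypothetical; only the two paths come from this announcement:

    # Replace the old scratch path with the new one, keeping a .bak backup
    sed -i.bak 's|/data/scratch|/lustre/scratch|g' myjob.sh

    # Verify the result
    grep scratch myjob.sh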

-- Gerald Ragghianti - 2014-04-21

We have finished upgrading the Newton clusters from Scientific Linux 5 to 6.4 with the migration of the Rho compute nodes (the GPU nodes) to the new operating system. At this time, all compute nodes are accessible only through the new login nodes "newlogin.newton.utk.edu". The old login nodes will remain available under the name "oldlogin.newton.utk.edu", but no compute nodes will be available from them. Data under /lustre on the old login nodes will remain there until further notice. We will also be updating the login node name "login.newton.utk.edu" to point to the new nodes.

-- Gerald Ragghianti - 2014-04-16

We have been having problems with one of the main storage servers on the Newton cluster. The last time the server stopped working, our diagnosis pointed to faulty RAM chips, which we then replaced; however, the server went offline again a few days later. Since this server is scheduled to be retired soon, we have decided to move critical data off of it as soon as possible to avoid further outages. We have finished planning this change, and an emergency maintenance outage of 3 hours will be needed to complete the work and testing. Here is the required work:

  1. Shut down all user-facing compute nodes on the Newton clusters (impact: no user or job access during the work).
  2. Transfer all home directories from old server to new storage system.
  3. Transfer compute node system images from old to new storage system.
  4. Change compute node images to use the new home directories and system images.
  5. Restart all compute nodes.
  6. Test all systems.

This work is scheduled for tomorrow, March 9th, from 3 PM to 6 PM.

-- Gerald Ragghianti - 2014-03-09

Older Announcements