
Grid Engine Configuration

The Newton system is composed of multiple groups, each allocated a share of the cluster's CPU resources. We enforce the allocation share with a combination of queue wait priority and job throughput throttling. There are four major factors that drive the configuration of SGE for Newton:

  1. The cluster has multiple separate interconnects for parallel jobs.
  2. We throttle throughput based on job runtime.
  3. We throttle throughput based on user or user group.
  4. We use Fairshare queue wait policy at either the user or group level.

Fairshare

Implementing Fairshare queue wait priority is straightforward because it does not affect the other three factors. Each Unix group on the cluster is assigned to an SGE "project", and each project has a default share-tree share based on consumed CPU time. When a Unix user is created, the user name is assigned to a default project so that all of that user's jobs count toward the group's total resource usage record. Membership in a project is limited to members of the corresponding Unix group. Fairshare "shares" can be assigned to the project as a whole or to individual members. Projects are necessary in the fairshare tree because Unix groups are not a valid resource consumer in that context.
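
As a rough illustration, the pieces involved look like the following sketch (object formats from the project, user, and share-tree configuration objects; the share values and ACL name are assumptions, not our production settings):

# Project for the cfd group ("qconf -aprj"); the acl restricts
# project membership to the userset built from the Unix group.
name    cfd
oticket 0
fshare  0
acl     cfd
xacl    NONE

# Per-user default project ("qconf -muser joe"), so joe's jobs
# accrue usage against the cfd project:
#   default_project  cfd

# Share-tree fragment ("qconf -mstree"); the shares shown here are
# illustrative only.
id=0
name=Root
type=0
shares=1
childnodes=1,2
id=1
name=cfd
type=1
shares=100
childnodes=NONE
id=2
name=mra
type=1
shares=100
childnodes=NONE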

Parallel Environments (PE)

The PE configuration is complicated by the fact that we have multiple IB switches and parallel jobs cannot span switches (there is no inter-switch communication). This requires a separate parallel environment (PE) and cluster queue for each IB switch. Each queue provides one of the PEs and contains only the nodes connected to the corresponding IB switch, which ensures that no job runs on nodes in more than one cluster queue. Because there are multiple PEs, users must submit parallel jobs with a wildcard PE selection: "-pe openmpi* 16".
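
For concreteness, here is a hedged sketch of one per-switch PE and the queue attributes that tie it to a single switch. The @smc hostgroup name is an assumption, and the PE settings shown are typical tight-integration values for Open MPI rather than a copy of our production configuration:

# PE for the first IB switch ("qconf -ap openmpi1")
pe_name            openmpi1
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min

# In each cluster queue serving this switch ("qconf -mq ..."),
# reference only this PE and only this switch's hosts:
#   pe_list   openmpi1
#   hostlist  @smc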

Job Run Time

We categorize jobs according to their expected run time using the s_cpu resource complex. The s_cpu complex is added to each queue with an appropriate upper limit, and the "CPU Time" hard limit is set to the same upper limit. This means that the limit is enforced whether or not the s_cpu complex is requested by the user. The user can request a specific queue without requesting the s_cpu resource, or the user can request an s_cpu value and the job will automatically be scheduled in the correct queue.
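
A minimal sketch of how that looks in the configuration, assuming the stock s_cpu complex definition from "qconf -mc" and the hard CPU-time queue limit h_cpu (values shown are for a one-hour queue):

#  name   shortcut  type  relop  requestable  consumable  default  urgency
#  s_cpu  s_cpu     TIME  <=     YES          NO          0:0:0    0

# In "qconf -mq short_1", set the requestable limit and the enforced
# hard CPU-time limit to the same value:
s_cpu   1:00:00
h_cpu   1:00:00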

Job Throttling

Job throttling based on job runtime is difficult because resource quotas (the SGE mechanism for throttling job throughput) operate on users, groups, projects, and cluster queues, not on resource complexes like s_cpu. This means we must create a cluster queue for each runtime range we want a resource quota to cover. Since we already need a queue for each IB switch, the total number of queues is the product of the number of runtime ranges and the number of IB switches. In our case we have three IB switches and three runtime ranges (short, medium, and long), for a total of nine cluster queues, defined as follows:

| Queue    | Run limit | PE       | Hosts | Sequence |
|----------|-----------|----------|-------|----------|
| short_1  | 1 hour    | openmpi1 | SMC   | 0 |
| short_2  | 1 hour    | openmpi2 | ORNL1 | 0 |
| short_3  | 1 hour    | openmpi3 | ORNL2 | 0 |
| medium_1 | 24 hours  | openmpi1 | SMC   | 1 |
| medium_2 | 24 hours  | openmpi2 | ORNL1 | 1 |
| medium_3 | 24 hours  | openmpi3 | ORNL2 | 1 |
| long_1   | none      | openmpi1 | SMC   | 2 |
| long_2   | none      | openmpi2 | ORNL1 | 2 |
| long_3   | none      | openmpi3 | ORNL2 | 2 |
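
Each row corresponds to one cluster queue definition. A hypothetical, abridged sketch of the first medium queue (hostgroup name assumed; run limits set via s_cpu/h_cpu as described above):

# "qconf -sq medium_1" (abridged)
qname      medium_1
hostlist   @smc
seq_no     1
pe_list    openmpi1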

The queue sequence number determines the order in which the scheduler searches the queue list when deciding where to execute a job. This queue naming scheme also makes job throttling easier, because resource quotas can select queues with wildcards:

{
   name         short
   description  NONE
   enabled      TRUE
   limit        projects cfd queues long*,medium*,short* to slots=10
   limit        projects mra queues long*,medium*,short* to slots=10
   limit        users {*}    queues long*,medium*,short* to slots=7
}
{
   name         medium
   description  NONE
   enabled      TRUE
   limit        projects cfd queues long*,medium* to slots=7
   limit        projects mra queues long*,medium* to slots=7
   limit        users {*}    queues long*,medium* to slots=4
}
{
   name         long
   description  NONE
   enabled      TRUE
   limit        users    joe   queues long* to slots=1
   limit        projects cfd   queues long* to slots=5
   limit        projects mra   queues long* to slots=5
   limit        users {*}      queues long* to slots=0
} 

In this configuration, the projects (Unix groups) cfd and mra are our priority groups, each with its own slot quota in the cluster. Within each quota rule set, we assign slot limits to the priority projects and follow with a default (catch-all) limit for all other users and projects. In the "long" rule set you can also see that one of the priority groups has chosen to further limit one of its members (joe) to a single slot. Rules within a rule set should be ordered from specific to general, because the first rule that matches a job is the one that is applied (a short-circuiting OR).
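
For reference, rule sets like these are managed with the standard resource-quota verbs of qconf (the file name below is hypothetical):

qconf -srqs              # show the configured rule sets
qconf -Arqs short.rqs    # add a rule set from a file
qconf -mrqs long         # edit the "long" rule set in an editor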

This configuration allows users to submit jobs easily, without worrying about which queue a job needs to execute in. The system determines the correct queue automatically:

qsub -l s_cpu=12:00:00 job.sge - Executes in any medium queue.

qsub -l s_cpu=30:00 job.sge - Executes in any short queue.

qsub -l s_cpu=2:00:00 -pe 'openmpi*' 16 job.sge - Executes in any medium queue having 16 free slots.

qsub -q 'short*' job.sge - Executes in any "short" queue (run time limit still applies).
