Common Grid Engine Tasks

Submitting Jobs

Here, we assume that you are using Job Definition Files when submitting your jobs. If the definition file is named "job.sge" and is located in your current working directory, you submit the job with qsub job.sge. When you submit a job, the Grid Engine returns the job ID number. Once the job starts, the Grid Engine creates output files that capture the job's output as it executes. By default these files are named [job name].o[job number] for standard output and [job name].e[job number] for standard error, and they are placed in your home directory (or in your current working directory if you use the "-cwd" option). The example Job Definition Files below cover several common scenarios.

Here is a file for a single processor job that will execute the program /bin/uptime. It requests a Job Run Time of 5 minutes:

#$ -N uptime_test
#$ -l s_cpu=5:00
#$ -cwd
/bin/uptime
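When many similar jobs are needed, a definition file like the one above can be generated from a script instead of being written by hand. The following is only a sketch: the helper name write_job_file and the file name uptime_test.sge are hypothetical, and the directives are the same ones used in the example above.

```shell
# Hypothetical helper: write a single-processor job definition file.
#   $1 = file to create, $2 = job name, $3 = s_cpu limit, $4 = command to run
write_job_file() {
    cat > "$1" <<EOF
#\$ -N $2
#\$ -l s_cpu=$3
#\$ -cwd
$4
EOF
}

# Recreate the uptime example in the current directory and display it.
write_job_file uptime_test.sge uptime_test 5:00 /bin/uptime
cat uptime_test.sge
```

The generated file can then be submitted in the usual way with qsub uptime_test.sge.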

The following file submits a parallel job that runs the binary "/home/username/myprogram", which was compiled with the Intel C++ compiler and uses OpenMPI. It uses 16 processors and takes no longer than 1 hour to execute.

#$ -N mpijob
#$ -pe openmpi* 16
#$ -l s_cpu=1:00:00
#$ -cwd
/usr/mpi/intel/openmpi-1.2.6/bin/mpirun /home/username/myprogram

Here is a more complex job file. It submits a parallel job that uses OpenMP to run 8 threads on a single compute node. We give the job a time limit of 20 hours and use the "-V" qsub option to forward all of our environment variables to the job when it executes. The "-q short*" option restricts the job to queues whose names begin with "short".

#$ -N jobname
#$ -pe openmp 8
#$ -l s_cpu=20:00:00
#$ -V
#$ -cwd
#$ -q short*
~username/myOpenMPprogram
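If you submit jobs from a script, it is often useful to capture the job ID that the Grid Engine returns. The sketch below assumes that the submission message has the form "Your job 4081579 ("QLOGIN") has been submitted", as in the interactive session transcript on this page. The helper name job_id_from_message is hypothetical, and array jobs, which report a different message, are not handled.

```shell
# Hypothetical helper: print the job ID found in a submission message
# of the form:  Your job 4081579 ("QLOGIN") has been submitted
# Prints nothing if the message does not match that form.
job_id_from_message() {
    printf '%s\n' "$1" | sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Only attempt a real submission when qsub and job.sge are both present.
if command -v qsub >/dev/null 2>&1 && [ -f job.sge ]; then
    message=$(qsub job.sge)
    job_id=$(job_id_from_message "$message")
    echo "Submitted job $job_id"
fi
```

The captured ID can then be passed to commands such as qstat -j to inspect that particular job.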

Running Interactive Jobs

You can also submit an "interactive" job to the Grid Engine by using the "qlogin" program. This allows you to run a job on a compute node while still interacting with the program as if it were running on the local machine. The program accepts input from you while it is executing and returns its output to your screen immediately (without writing it to a log file). All Grid Engine options for qlogin must be given on the command line (you cannot use a job definition file).

[username@newton ~]$ qlogin -q short_rho
local configuration zeta00.local not defined - using global configuration
Your job 4081579 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...timeout (4 s) expired while waiting on socket fd 4
.
Your interactive job 4081579 has been successfully scheduled.
Establishing /data/apps/site_utilities/sge_qlogin session to host rho27.local ...
/usr/bin/ssh -Y -p 53748 rho27.local
****************************************
 The University of Tennessee, Knoxville
    Newton HPC Program Linux Cluster
****************************************
    All use is subject to university 
 policies: http://oit.utk.edu/policies/
****************************************
   Old cluster login nodes are 
   accessible via the login node  
   "oldlogin.newton.utk.edu".
****************************************
[Newton:rho27 ~]$ 

Monitoring

To monitor jobs or queues, you should use the "qstat" command. By default, qstat will list only your own currently executing or pending jobs. You can optionally list all jobs in the system:

[username@newton ~]$ qstat -u \*
job-ID  prior   name       user         state submit/start at     queue             
------------------------------------------------------------------------------------
   1135 0.85714 wrapple2   user2        r     11/23/2008 10:12:03 all.q@sun12.local 
   1708 1.00000 STDIN      user1        qw    01/05/2009 15:58:42                   
   1709 0.50000 micro.lsf  user1        qw    01/05/2009 16:11:11                   
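On a busy system this listing can be long, so a short summary may be easier to read. The following sketch counts jobs by state, assuming the column layout shown above (two header lines, state in the fifth column). The helper name count_jobs_by_state is hypothetical, and the sample input is copied from the listing above.

```shell
# Hypothetical helper: read qstat output on standard input and print the
# number of jobs in each state (column 5), skipping the two header lines.
count_jobs_by_state() {
    awk 'NR > 2 && NF >= 5 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Sample input taken from the listing above. With the Grid Engine
# available you could instead run:  qstat -u '*' | count_jobs_by_state
count_jobs_by_state <<'EOF'
job-ID  prior   name       user         state submit/start at     queue
------------------------------------------------------------------------------------
   1135 0.85714 wrapple2   user2        r     11/23/2008 10:12:03 all.q@sun12.local
   1708 1.00000 STDIN      user1        qw    01/05/2009 15:58:42
   1709 0.50000 micro.lsf  user1        qw    01/05/2009 16:11:11
EOF
```

For the sample input this reports two jobs in state qw (pending) and one in state r (running).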

You can list all Cluster Queues:

[username@newton ~]$ qstat -g c
CLUSTER QUEUE                   CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE  
-------------------------------------------------------------------------------
admin                             -NA-      0      0      0      0      0 
all.q                             -NA-      0      0      0      0      0 
epsilon                           0.00      0     32     32      0      0 
huge_pages_test                   0.00      0     16     16      0      0 
long_UT                           0.90     94     26    128      0      8 
long_UT_2                         0.59    152    104    384      0    128 
long_chi                          0.64    326   1166   1728      0    240 
long_dao                          0.51    146    134    288      0      8 
long_kpb                          0.32    162    486    672      0     24 
long_phi                          0.89    709    155    864      0      0 
long_psi                          0.08     24    180    240      0     36 
medium_UT                         0.90     20    100    128      0      8 
medium_UT_2                       0.59      0    256    384      0    128 
medium_chi                        0.64    657    831   1728      0    240 
medium_dao                        0.51      0    280    288      0      8 
medium_kpb                        0.32     28    620    672      0     24 
medium_phi                        0.89     64    800    864      0      0 
medium_psi                        0.08      0    204    240      0     36 
short_UT                          0.90      0    120    128      0      8 
short_UT_2                        0.59      0    256    384      0    128 
short_chi                         0.64    128   1360   1728      0    240 
short_dao                         0.51      0    280    288      0      8 
short_kpb                         0.32      0    648    672      0     24 
short_phi                         0.89      0    864    864      0      0 
short_psi                         0.08      0    204    240      0     36 
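To pick a queue for a new job, it can help to filter this table down to the queues that currently have free slots. The following sketch assumes the column layout shown above (queue name in the first column, AVAIL in the fourth). The helper name queues_with_free_slots is hypothetical, and the sample rows are copied from the listing above.

```shell
# Hypothetical helper: read "qstat -g c" output on standard input and list
# the cluster queues whose name starts with the prefix given as $1 and
# that have at least one available slot, most available first.
queues_with_free_slots() {
    awk -v prefix="$1" \
        'NR > 2 && index($1, prefix) == 1 && $4 > 0 { print $1, $4 }' |
        sort -k2,2nr
}

# Sample rows taken from the listing above. With the Grid Engine
# available you could instead run:  qstat -g c | queues_with_free_slots short_
queues_with_free_slots short_ <<'EOF'
CLUSTER QUEUE                   CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
-------------------------------------------------------------------------------
admin                             -NA-      0      0      0      0      0
long_phi                          0.89    709    155    864      0      0
short_UT                          0.90      0    120    128      0      8
short_UT_2                        0.59      0    256    384      0    128
short_chi                         0.64    128   1360   1728      0    240
EOF
```

Whether you are permitted to use a listed queue is a separate question; the table only reports slot counts.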

If you have a currently executing or pending job, you can use qstat to get detailed information on its status. The last lines of the output indicate why the job has not been able to execute yet:

[user1@newton ~]$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
   1708 1.00000 STDIN      user1        qw    01/05/2009 15:58:42                   
   1709 0.50000 micro.lsf  user1        qw    01/05/2009 16:11:11         
[user1@newton ~]$ qstat -j 1709
==============================================================
job_number:                 1709
exec_file:                  job_scripts/1709
submission_time:            Mon Jan  5 16:11:11 2009
owner:                      user1
uid:                        222
group:                      group
gid:                        333
sge_o_home:                 /home/user1
sge_o_log_name:             user1
sge_o_path:                 /opt/sge/bin/lx24-amd64:/usr/lib64/qt-3.3/bin:/data/apps/pgi/linux86-64/6.0/bin:/usr/kerberos/bin:/opt/intel/fce/10.1.018/bin:/opt/intel/idbe/10.1.018/bin:/usr/local/bin:/bin:/usr/bin:/home/user1/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              /home/user1/workdir
sge_o_host:                 newton
account:                    sge
reserve:                    y
mail_list:                  user1@newton0.local
notify:                     FALSE
job_name:                   micro.lsf
jobshare:                   0
env_list:                   
script_file:                micro.lsf
verify_suitable_queues:     2
project:                    group
scheduling info:            queue instance "short_UT@sun7.local" dropped because it is temporarily not available
                            queue instance "medium_UT@sun7.local" dropped because it is temporarily not available
                            queue instance "admin@sun7.local" dropped because it is temporarily not available
                            queue instance "long_UT@sun7.local" dropped because it is temporarily not available
                            queue instance "all.q@sun12.local" dropped because it is disabled
                            has no permission for queue "short_UT@sun13.local"
                            has no permission for queue "short_UT@sun14.local"
                            has no permission for queue "short_UT@sun15.local"
                            has no permission for queue "short_UT@sun0.local"
                            has no permission for queue "short_UT@sun1.local"
                            has no permission for queue "short_UT@sun10.local"
                            has no permission for queue "short_UT@sun11.local"
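The scheduling information can run to many lines, one per queue instance. As a sketch, the following groups those lines by reason, assuming the two message forms that appear in the output above ("dropped because it is ..." and "has no permission for queue ..."). The helper name scheduling_reasons is hypothetical.

```shell
# Hypothetical helper: read "qstat -j" output on standard input and count
# how many queue instances were rejected for each reason.
scheduling_reasons() {
    awk '
        /dropped because it is/ {
            sub(/.*dropped because it is /, "")   # keep only the reason text
            sub(/[ \t]+$/, "")                    # trim trailing whitespace
            count[$0]++
            next
        }
        /has no permission for queue/ { count["no permission"]++ }
        END { for (r in count) print r ": " count[r] }
    ' | sort
}

# Sample lines taken from the output above. With the Grid Engine
# available you could instead run:  qstat -j 1709 | scheduling_reasons
scheduling_reasons <<'EOF'
scheduling info:            queue instance "short_UT@sun7.local" dropped because it is temporarily not available
                            queue instance "all.q@sun12.local" dropped because it is disabled
                            has no permission for queue "short_UT@sun13.local"
                            has no permission for queue "short_UT@sun14.local"
EOF
```

A large "no permission" count usually points at the choice of queue rather than at a shortage of free slots.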

Use qhost if you wish to see the status of individual compute nodes in the Grid Engine system. Please note that, as a user, you should not generally need to know anything about the compute nodes in order to use the system effectively.

HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
newton0                 lx24-amd64      4  0.00    3.9G  245.5M  992.0M     0.0
newtonhead              -               -     -       -       -       -       -
ornl0                   lx24-amd64      2  0.00    3.9G  217.5M  992.0M     0.0
ornl1                   lx24-amd64      2  0.00    3.9G  215.5M  992.0M     0.0
ornl2                   lx24-amd64      2  0.01    3.9G  215.2M  992.0M     0.0
ornl3                   lx24-amd64      2  0.00    3.9G  216.6M  992.0M     0.0
ornl4                   lx24-amd64      2  0.00    3.9G  213.5M  992.0M     0.0
ornl5                   lx24-amd64      2  0.00    3.9G  214.4M  992.0M     0.0
ornl6                   lx24-amd64      2  0.00    3.9G  215.4M  992.0M     0.0
ornl7                   lx24-amd64      2  0.00    3.9G  214.5M  992.0M     0.0
ornl8                   lx24-amd64      2  0.00    3.9G  215.7M  992.0M     0.0
ornl9                   lx24-amd64      2  0.00    3.9G  212.9M  992.0M     0.0
ornlhead                -               -     -       -       -       -       -
sun0                    lx24-amd64      8  0.02   15.7G  218.5M  992.0M     0.0
sun1                    lx24-amd64      8  0.00   15.7G  218.3M  992.0M     0.0
sun10                   lx24-amd64      8  0.00   15.7G  218.3M  992.0M     0.0
sun11                   lx24-amd64      8  0.00   15.7G  220.6M  992.0M     0.0
...
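If you only want a rough total rather than the per-node detail, the NCPU column can be summed. This sketch assumes the column layout shown above (NCPU in the third column, with "-" for entries that report no value). The helper name total_cpus is hypothetical, and the sample rows are copied from the listing above.

```shell
# Hypothetical helper: read qhost output on standard input and print the
# sum of the NCPU column, ignoring rows where it is not a number.
total_cpus() {
    awk 'NR > 2 && $3 ~ /^[0-9]+$/ { total += $3 } END { print total + 0 }'
}

# Sample rows taken from the listing above. With the Grid Engine
# available you could instead run:  qhost | total_cpus
total_cpus <<'EOF'
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
newton0                 lx24-amd64      4  0.00    3.9G  245.5M  992.0M     0.0
ornl0                   lx24-amd64      2  0.00    3.9G  217.5M  992.0M     0.0
sun0                    lx24-amd64      8  0.02   15.7G  218.5M  992.0M     0.0
EOF
```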

Graphical Interaction

The Grid Engine program "qmon" allows you to perform most grid tasks through a graphical interface. Using qmon requires that you log into the cluster with an SSH client that supports X11 forwarding (use the "-X" option for OpenSSH or turn on "X11 forwarding" within your Windows client). You also need to run an X server on your desktop. If you are using a Unix-like operating system (except Mac OS X) then you are probably already running an X server. If you are using Windows, you can use the free Xming server.
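A quick way to tell whether X11 forwarding is active in your login session is to check the DISPLAY environment variable before starting qmon. The following is a minimal sketch; the helper name x11_available is hypothetical, and a non-empty DISPLAY only suggests, rather than guarantees, that the X server can be reached.

```shell
# Hypothetical helper: succeed only when the DISPLAY variable is set
# and non-empty, which is what SSH X11 forwarding normally arranges.
x11_available() {
    [ -n "${DISPLAY:-}" ]
}

if ! x11_available; then
    echo "No X display found; reconnect with X11 forwarding (for example 'ssh -X')." >&2
elif command -v qmon >/dev/null 2>&1; then
    qmon &    # start the graphical interface in the background
fi
```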

Here is an example qmon session that shows the same information as the "qstat -g c" command:
