SGE: Jobs stuck in qw state

quarky · Mar 3, 2015

I'm trying to submit jobs to SGE. The same submission procedure has worked for me in the past, but now all jobs are stuck in the qw state.

"qstat -g c" output:

> CLUSTER QUEUE   CQLOAD   USED  AVAIL  TOTAL
> all.q           0.38      0    160   1920   
> gpu6.q          -NA-      0      0      4    
> par6.q          0.38    750    135   1800      
> seq6.q          0.41    103    170    416   
> smp3.q          1.01      0      0     96  
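
For anyone who wants to read this summary in a script, here is a minimal sketch that lists the cluster queues still reporting available slots. The function name `queues_with_free_slots` is made up for illustration, and the field positions assume the five-column layout shown above (other Grid Engine versions may print extra columns).

```shell
# Hypothetical helper: print "queue avail" for every cluster queue whose
# AVAIL column is greater than zero.
# Reads `qstat -g c` style output on stdin and assumes the column order
# CLUSTER QUEUE, CQLOAD, USED, AVAIL, TOTAL as in the question.
queues_with_free_slots() {
    awk '
        $1 == "CLUSTER" { next }   # skip the header row
        /^-+$/          { next }   # skip a dashed separator line, if present
        NF >= 5 && $4 ~ /^[0-9]+$/ && $4 > 0 { print $1, $4 }
    '
}

# Typical use on a submit host (not executed here):
#   qstat -g c | queues_with_free_slots
```

On the numbers above this would list all.q, par6.q and seq6.q, while gpu6.q and smp3.q report no available slots. So a lack of slots could explain waiting jobs in smp3.q, but probably not in seq6.q or par6.q.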

The "qstat" output looks the same as always.

Googling only turned up hints for people with root access, which I don't have. Any suggestions?

Thanks.

Edit: Jobs were submitted via "qsub -q seq6.q scriptname", or alternatively with smp3.q or par6.q as the queue.

"qstat -j jobid" gives nothing special as far as I can see:

job_number:                 2821318
exec_file:                  job_scripts/2821318
submission_time:            Wed Mar  4 12:07:15 2015
owner:                      username
uid:                        31519
group:                      dch
gid:                        1150
sge_o_home:                 /home/hudson/pg/username
sge_o_log_name:             username
sge_o_path:                 /gpfs/hamilton6/apps/intel_comp_2014/composer_xe_2013_sp1.2.144/bin/intel64:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin:/usr/local/Cluster-Apps/sge/6.1u6/bin/lx24-amd64:/panfs/panasas1.hpc.dur.ac.uk/apps/nag/fll6a21dpl/scripts
sge_o_shell:                /bin/tcsh
sge_o_workdir:              /panfs/panasas1.hpc.dur.ac.uk/username/path
sge_o_host:                 hamilton1
account:                    sge
mail_list:                  username@hamilton1
notify:                     FALSE
job_name:                   scriptname
jobshare:                   0
hard_queue_list:            seq6.q
env_list:                   
script_file:                scriptname
scheduling info:            (Collecting of scheduler job information is turned off)
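
The last line of this output is the important one: with scheduler job information turned off, "qstat -j" cannot say why a job is still pending. A small sketch that checks for this condition is below. The function name `sched_info_status` is hypothetical, and the match relies on the exact message text shown above.

```shell
# Hypothetical helper: classify the "scheduling info" line of `qstat -j` output.
# Reads the output on stdin and prints one of:
#   off        the scheduler is not collecting per-job information
#   available  a "scheduling info" line exists and is not the "turned off" notice
#   missing    no "scheduling info" line was found at all
sched_info_status() {
    awk '
        /^scheduling info:/ {
            found = 1
            if ($0 ~ /turned off/) print "off"
            else                   print "available"
            exit
        }
        END { if (!found) print "missing" }
    '
}

# Typical use on a submit host (not executed here):
#   qstat -j 2821318 | sched_info_status
```

For the job above this prints `off`. As far as I know, collecting this information is controlled by the `schedd_job_info` setting in the scheduler configuration, which only an administrator can change.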

Answer

epinoche · Mar 11, 2015

I had the same issue today. We run Univa Grid Engine for a customer. On the master host I configured some complexes for jobs that request a lot of memory (h_stack=64M, memory_free=4G, virtual_free=4G). After this change, jobs hang in the waiting queue. The same configuration worked for many years with 3G on all our execution hosts. I will test the new 4G configuration over the next few days. All servers have enough memory! Ingo
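
To make the 3G versus 4G comparison concrete, here is a minimal sketch that converts Grid Engine style memory values to bytes and checks whether a reported free value covers a request. The helper names `to_bytes` and `fits_request` are hypothetical, and the 3.5G host value in the usage comment is invented for illustration. The suffix rules (uppercase K/M/G as powers of 1024, lowercase as powers of 1000) follow the Grid Engine memory specifier convention as I understand it.

```shell
# Hypothetical helper: convert a memory value such as 64M, 3G or 4G to bytes.
# Assumes uppercase suffixes are powers of 1024 and lowercase powers of 1000.
to_bytes() {
    printf '%s\n' "$1" | awk '
        {
            n = $0 + 0                      # numeric prefix, e.g. 4 from "4G"
            s = substr($0, length($0), 1)   # last character (the suffix, if any)
            m = 1
            if (s == "K") m = 1024
            if (s == "k") m = 1000
            if (s == "M") m = 1024 ^ 2
            if (s == "m") m = 1000 ^ 2
            if (s == "G") m = 1024 ^ 3
            if (s == "g") m = 1000 ^ 3
            printf "%.0f\n", n * m
        }'
}

# Hypothetical helper: succeed (exit status 0) if the free value in $2
# is at least as large as the requested value in $1.
fits_request() {
    req=$(to_bytes "$1")
    free=$(to_bytes "$2")
    awk -v r="$req" -v f="$free" 'BEGIN { exit !(f >= r) }'
}

# Example with an invented host value of 3.5G free:
#   fits_request 3G 3.5G && echo "3G request fits"
#   fits_request 4G 3.5G || echo "4G request does not fit"
```

This does not diagnose the setup described above. It only illustrates how a request that is one step larger can stop matching a host whose reported free memory lies between the two values, which would leave jobs waiting.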