A while ago, I posted about a job submission verifier (JSV) for Univa Grid Engine to try to handle job submissions which had less than ideal resource requests by leveraging cgroups. It was based on Daniel Gruber's JSV Go API.
In the 3+ years since that post, we stopped using the JSV for one reason or another (including a Univa issue with cgroups and its interaction with a specific kernel version), and just dealt with issues manually as they came up by communicating with the users. Since then, Daniel has also updated the API to be more Go-like. And we have had a fairly bad round of multithreaded programs submitted as serial jobs, using up to 64 threads on our 64-core nodes.
So, I dusted off the old code, refreshed it, and reduced its scope to deal with just two cases: serial jobs and multithreaded jobs. These types of jobs are defined either by the lack of a PE (serial jobs), or by a finite set of PEs (multithreaded).
There still is a deficiency in that the JSV cannot really deal with slot ranges. In Grid Engine, it is possible to request a range of slots for jobs, e.g. “-pe multithread 4-12” which would allow a job to be assigned any number of slots from 4 to 12. This is useful for busy clusters and users who would rather their jobs run slower than wait for the full 12 slots to open up.
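To make the slot range limitation concrete, here is a minimal Python sketch (not part of the JSV) of reducing a range request to its minimum and maximum. The function name parse_slot_range is hypothetical, and the sketch assumes only plain "n" or "n-m" entries separated by commas; open-ended ranges and step sizes are not handled.

```python
def parse_slot_range(spec):
    """Return (min_slots, max_slots) for a PE slot request such as "4-12" or "8".

    Assumption: only closed entries of the form "n" or "n-m", comma-separated.
    """
    lows, highs = [], []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-", 1)
            lows.append(int(low))
            highs.append(int(high))
        else:
            lows.append(int(part))
            highs.append(int(part))
    return min(lows), max(highs)


print(parse_slot_range("4-12"))  # (4, 12)
print(parse_slot_range("8"))     # (8, 8)
```

A JSV that binds to the maximum (12 here) asks for more cores than the job may actually be granted, which is the deficiency described above.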
Anyway, the JSV code is pretty straightforward. Find it here: https://github.com/prehensilecode/pecheck_simple
Together with this, UGE must be configured to have cgroups enabled (see your documentation). Here is the setup on our cluster -- the freezer functionality is disabled as there may be an issue in the interaction with RHEL 6 kernels:
cgroups_params cgroup_path=/cgroup cpuset=true mount=true \
killing=true freezer=false freeze_pe_tasks=false \
forced_numa=true h_vmem_limit=true \
m_mem_free_hard=true m_mem_free_soft=true \
min_memory_limit=250M
The JSV code is short enough that I include it here:
/*
 * Requires https://github.com/dgruber/jsv
 */
package main

import (
	"strings"

	"github.com/dgruber/jsv"
)

func jsv_on_start_function() {
	//jsv_send_env()
}

func job_verification_function() {
	//
	// Set binding on serial jobs (i.e. no PE) to "linear:1"
	//
	var modified_p bool = false
	if !jsv.IsParam("pe_name") {
		jsv.SetParam("binding_strategy", "linear_automatic")
		jsv.SetParam("binding_type", "set")
		jsv.SetParam("binding_amount", "1")
		jsv.SetParam("binding_exp_n", "0")
		modified_p = true
	} else {
		pe_name, _ := jsv.GetParam("pe_name")
		/* XXX the "shm" PE is the single-node multicore PE
		 * change this to the equivalent for your site;
		 * the "matlab" PE is identically defined to the "shm" PE
		 * XXX note that this does not properly deal with a range of number of slots;
		 * it just takes the max value of the range
		 */
		if strings.EqualFold("shm", pe_name) || strings.EqualFold("matlab", pe_name) {
			pe_max, _ := jsv.GetParam("pe_max")
			jsv.SetParam("binding_strategy", "linear_automatic")
			jsv.SetParam("binding_type", "set")
			jsv.SetParam("binding_amount", pe_max)
			jsv.SetParam("binding_exp_n", "0")
			modified_p = true
		}
	}
	if modified_p {
		jsv.Correct("Job was modified")
	} else {
		jsv.Correct("Job was not modified")
	}
	return
}

func main() {
	jsv.Run(true, job_verification_function, jsv_on_start_function)
}
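For readers who do not read Go, the decision made by the JSV above can be sketched in a few lines of Python. This is only an illustration of the logic, not something Grid Engine can run as a JSV; the function name binding_params and the dictionary representation of a job are assumptions made for the example.

```python
# Site-specific single-node multicore PEs, as in the Go code above.
MULTITHREAD_PES = {"shm", "matlab"}


def binding_params(job):
    """Return the binding parameters the JSV would set, or {} if it sets none.

    `job` is a hypothetical dict of submission parameters with string values.
    """
    pe_name = job.get("pe_name")
    if pe_name is None:
        amount = "1"              # serial job: bind to a single core
    elif pe_name.lower() in MULTITHREAD_PES:
        amount = job["pe_max"]    # multithreaded job: bind to the max of the range
    else:
        return {}                 # any other PE is left alone
    return {
        "binding_strategy": "linear_automatic",
        "binding_type": "set",
        "binding_amount": amount,
        "binding_exp_n": "0",
    }


print(binding_params({})["binding_amount"])
print(binding_params({"pe_name": "shm", "pe_max": "12"})["binding_amount"])
```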
How-to's and technical news about Linux and open computing, with a sprinkling of Python.
2018-01-11
2017-10-04
Apache Spark integration with Grid Engine (update for Spark 2.2.0)
Apache Spark is a popular (because it is fast) big data engine. The speed comes from keeping data in memory. This is an update to my older post: it is still Spark in standalone mode, using the nodes assigned by GE as the worker nodes. I have an update for using Spark 2.2.0, with Java 1.8.0.
It is mostly the same, except only one file needs to be modified: sbin/slaves.sh. The Parallel Environment (PE) startup script update only adds an environment variable defining where the worker logs go (into a Grid Engine job-specific directory under the job directory). It also now specifies Java 1.8.0.
As before, the modifications to sbin/slaves.sh handle using the proper spark-env script based on the user's shell. Since that spark-env script is set up by the PE script to generate job-specific conf and log directories, everything job-specific is separated.
2015-11-11
SWIG is great - Python DRMAA2 interface in less than an hour
I have never used SWIG before, surprisingly. I figured creating a Python interface to DRMAA2 would be a good self-tutorial. Turns out to be almost trivial, following the directions here.
My Python (3.5) DRMAA2 interface code is on GitHub - https://github.com/prehensilecode/pydrmaa2. The hardest part, really, was writing the Makefile.
NOTE: no testing at all has been done. This is just a Q&D exercise to use SWIG.
2015-08-06
Apache Spark integration with Grid Engine
Apache Spark is a fast engine for big data. It can use Hadoop infrastructure (like HDFS), and provides its own map-reduce implementation. It can also be run in standalone mode, without Hadoop or the YARN resource manager.
I have been able to get Spark 1.4.1 running, with some integration into an existing Univa Grid Engine cluster. The integration is not "tight" in that the slave processes are still independently launched with ssh. I was unable to get Spark to work with qrsh. So, without tight integration, usage accounting is not exact.
I also had to make some modifications to the Spark standalone shell scripts in order to have job-specific configuration and log directories. Out of the box, Spark's shell scripts do not completely propagate the environment to the slaves. Job-specific configuration and log directories are needed because multiple users may want to run Spark jobs at the same time.
Additionally, I was not able to figure a way to constrain Spark slave instances to subsets of available processor cores. So, Spark jobs require exclusive use of compute nodes.
So, let's start there. Your GE installation needs to have the "exclusive" complex defined:
#name shortcut type relop requestable consumable default urgency
#------------------------------------------------------------------------------------------
exclusive excl BOOL EXCL YES YES 0 1000
The OS on Drexel's Proteus cluster is RHEL 6.4-ish. I use Red Hat's packaging of Oracle Java 1.7.0_85 by default. Running Spark requires the JAVA_HOME environment variable to be set, which I do in the global login script location /etc/profile.d/. I found that using /usr/lib/jvm/java did not work. It needed to be:
JAVA_HOME=/usr/lib/jvm/java-1.7.0-oracle.x86_64
Building Spark 1.4.1 was painless. I used the bundled script to generate a binary distribution tarball:
./make-distribution.sh --name myname --tgz
Untar it into some convenient location.
Next, the sbin/start-slaves.sh and sbin/stop-slaves.sh scripts need to be modified. You can look at my fork at GitHub. As they are, these two scripts just ssh to all the slave nodes to start the slave processes. However, ssh does not pass environment variables, so all the slave processes launch with the default SPARK_HOME. That means all the slave processes read the global Spark config and environment, and log to the global Spark installation log directory.
Because the remote shell is the user shell, we have to figure out the user shell in order to build the command to be executed on the slave hosts. Here is the snippet from sbin/start-slaves.sh:
# Launch the slaves
USERSHELL=$( getent passwd "$USER" | cut -f7 -d: )
if [ "$USERSHELL" = "/bin/bash" -o "$USERSHELL" = "/bin/zsh" -o "$USERSHELL" = "/bin/ksh" ] ; then
    "$sbin/slaves.sh" cd "$SPARK_HOME" \&\& "." "$SPARK_CONF_DIR/spark-env.sh" \&\& "$sbin/start-slave.sh" "spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT"
elif [ "$USERSHELL" = "/bin/tcsh" -o "$USERSHELL" = "/bin/csh" ] ; then
    "$sbin/slaves.sh" cd "$SPARK_HOME" \&\& "source" "$SPARK_CONF_DIR/spark-env.csh" \&\& "$sbin/start-slave.sh" "spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT"
fi
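The shell test above boils down to picking the right environment file and the right command to load it. Here is a rough Python sketch of that choice, for illustration only. The function name env_source_command is made up, and unlike the shell snippet it matches on the shell's base name rather than the full path.

```python
import os

SH_LIKE = {"bash", "zsh", "ksh"}
CSH_LIKE = {"tcsh", "csh"}


def env_source_command(user_shell, conf_dir):
    """Return the command that loads the job-specific Spark environment
    for a given login shell (hypothetical helper, not part of Spark)."""
    name = os.path.basename(user_shell)
    if name in SH_LIKE:
        return ". " + os.path.join(conf_dir, "spark-env.sh")
    if name in CSH_LIKE:
        return "source " + os.path.join(conf_dir, "spark-env.csh")
    raise ValueError("unsupported login shell: " + user_shell)


print(env_source_command("/bin/tcsh", "/home/juser/conf.123"))
```

On a real system the login shell could be looked up with pwd.getpwnam(user).pw_shell from the standard library, which returns the same field that the getent and cut pipeline extracts.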
The cluster here has two types of compute nodes: Dell C6145s with 64-core AMD CPUs, and Dell C6220s with 16-core Intel CPUs. So, I created a job class (JC) with two subclasses, and also separate parallel environments (PEs).
The job class is as follows -- all missing lines have the default "{+}UNSPECIFIED" config:
jcname spark
variant_list default intel amd
owner NONE
user_lists NONE
xuser_lists NONE
...
l_hard {+}exclusive=TRUE,h_vmem=4g,m_mem_free=3g, \
[{+}intel=vendor=intel,h_vmem=4g,m_mem_free=3g], \
[{+}amd=vendor=amd,h_vmem=4g,m_mem_free=3g]
...
pe_name {~}spark.intel,[intel=spark.intel],[amd=spark.amd]
...
The spark.intel PE is defined as follows (with the spark.amd PE defined similarly):
pe_name spark.intel
slots 99999
user_lists NONE
xuser_lists NONE
start_proc_args /cm/shared/apps/sge/var/default/common/pescripts/sparkstart.sh
stop_proc_args NONE
allocation_rule 16
control_slaves FALSE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
daemon_forks_slaves FALSE
master_forks_slaves FALSE
The PE start script writes the job-specific environment files, and log4j properties file:
#!/bin/bash
spark_conf_dir=${SGE_O_WORKDIR}/conf.${JOB_ID}
/bin/mkdir -p ${spark_conf_dir}
### for bash-like
sparkenvfile=${spark_conf_dir}/spark-env.sh
echo "#!/usr/bin/env bash" > $sparkenvfile
echo "export JAVA_HOME=/usr/lib/jvm/java-1.7.0-oracle.x86_64" >> $sparkenvfile
echo "export SPARK_CONF_DIR=${spark_conf_dir}" >> $sparkenvfile
echo "export SPARK_MASTER_WEBUI_PORT=8880" >> $sparkenvfile
echo "export SPARK_WORKER_WEBUI_PORT=8881" >> $sparkenvfile
echo "export SPARK_WORKER_INSTANCES=1" >> $sparkenvfile
spark_master_ip=$( cat ${PE_HOSTFILE} | head -1 | cut -f1 -d\ )
echo "export SPARK_MASTER_IP=${spark_master_ip}" >> $sparkenvfile
echo "export SPARK_MASTER_PORT=7077" >> $sparkenvfile
echo "export MASTER_URL=spark://${spark_master_ip}:7077" >> $sparkenvfile
spark_slaves=${SGE_O_WORKDIR}/slaves.${JOB_ID}
echo "export SPARK_SLAVES=${spark_slaves}" >> $sparkenvfile
spark_worker_cores=$( expr ${NSLOTS} / ${NHOSTS} )
echo "export SPARK_WORKER_CORES=${spark_worker_cores}" >> $sparkenvfile
spark_worker_dir=/lustre/scratch/${SGE_O_LOGNAME}/spark/work.${JOB_ID}
echo "export SPARK_WORKER_DIR=${spark_worker_dir}" >> $sparkenvfile
spark_log_dir=${SGE_O_WORKDIR}/logs.${JOB_ID}
echo "export SPARK_LOG_DIR=${spark_log_dir}" >> $sparkenvfile
echo "export SPARK_LOCAL_DIRS=${TMP}" >> $sparkenvfile
chmod +x $sparkenvfile
### for csh-like
sparkenvfile=${spark_conf_dir}/spark-env.csh
echo "#!/usr/bin/env tcsh" > $sparkenvfile
echo "setenv JAVA_HOME /usr/lib/jvm/java-1.7.0-oracle.x86_64" >> $sparkenvfile
echo "setenv SPARK_CONF_DIR ${spark_conf_dir}" >> $sparkenvfile
echo "setenv SPARK_MASTER_WEBUI_PORT 8880" >> $sparkenvfile
echo "setenv SPARK_WORKER_WEBUI_PORT 8881" >> $sparkenvfile
echo "setenv SPARK_WORKER_INSTANCES 1" >> $sparkenvfile
spark_master_ip=$( cat ${PE_HOSTFILE} | head -1 | cut -f1 -d\ )
echo "setenv SPARK_MASTER_IP ${spark_master_ip}" >> $sparkenvfile
echo "setenv SPARK_MASTER_PORT 7077" >> $sparkenvfile
echo "setenv MASTER_URL spark://${spark_master_ip}:7077" >> $sparkenvfile
spark_slaves=${SGE_O_WORKDIR}/slaves.${JOB_ID}
echo "setenv SPARK_SLAVES ${spark_slaves}" >> $sparkenvfile
spark_worker_cores=$( expr ${NSLOTS} / ${NHOSTS} )
echo "setenv SPARK_WORKER_CORES ${spark_worker_cores}" >> $sparkenvfile
spark_worker_dir=/lustre/scratch/${SGE_O_LOGNAME}/spark/work.${JOB_ID}
echo "setenv SPARK_WORKER_DIR ${spark_worker_dir}" >> $sparkenvfile
spark_log_dir=${SGE_O_WORKDIR}/logs.${JOB_ID}
echo "setenv SPARK_LOG_DIR ${spark_log_dir}" >> $sparkenvfile
echo "setenv SPARK_LOCAL_DIRS ${TMP}" >> $sparkenvfile
chmod +x $sparkenvfile
/bin/mkdir -p ${spark_log_dir}
/bin/mkdir -p ${spark_worker_dir}
cat ${PE_HOSTFILE} | cut -f1 -d \ > ${spark_slaves}
### defaults, sp. log directory
echo "spark.eventLog.dir ${spark_log_dir}" > ${spark_conf_dir}/spark-defaults.conf
### log4j defaults
log4j_props=${spark_conf_dir}/log4j.properties
echo '### Suggestion: use "WARN" or "ERROR"; use "INFO" when debugging' > $log4j_props
echo "# Set everything to be logged to the console" >> $log4j_props
echo "log4j.rootCategory=WARN, console" >> $log4j_props
echo "log4j.appender.console=org.apache.log4j.ConsoleAppender" >> $log4j_props
echo "log4j.appender.console.target=System.err" >> $log4j_props
echo "log4j.appender.console.layout=org.apache.log4j.PatternLayout" >> $log4j_props
echo "log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n" >> $log4j_props
echo "# Settings to quiet third party logs that are too verbose" >> $log4j_props
echo "log4j.logger.org.spark-project.jetty=WARN" >> $log4j_props
echo "log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR" >> $log4j_props
echo 'log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO' >> $log4j_props
echo 'log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO' >> $log4j_props
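Most of what the PE start script computes comes from the PE_HOSTFILE. As a rough illustration, here is a Python sketch of the same derivation: the first host becomes the master, every host goes into the slaves file, and the cores per worker is the total slot count divided by the number of hosts. The function name spark_layout and the sample host names are made up, and I am assuming the usual hostfile layout of one host per line with the slot count in the second column.

```python
def spark_layout(pe_hostfile_text):
    """Derive master host, worker hosts, and cores per worker from the
    contents of a PE_HOSTFILE (hypothetical helper for illustration)."""
    hosts, total_slots = [], 0
    for line in pe_hostfile_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        hosts.append(fields[0])        # column 1: host name
        total_slots += int(fields[1])  # column 2: slots granted on that host
    return {
        "master": hosts[0],
        "slaves": hosts,
        "worker_cores": total_slots // len(hosts),  # same as NSLOTS / NHOSTS
    }


# Made-up hostfile for a 32-slot job on two 16-core hosts
sample = """node01 16 all.q@node01 UNDEFINED
node02 16 all.q@node02 UNDEFINED
"""
print(spark_layout(sample))
```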
And then an example job script looks like:
#!/bin/bash
#$ -S /bin/bash
#$ -P myprj
#$ -M myemail@example.com
#$ -m ea
#$ -j y
#$ -cwd
#$ -jc spark.intel
#$ -l exclusive
#$ -pe spark.intel 32
#$ -l vendor=intel
#$ -l h_rt=0:30:00
#$ -l h_vmem=4g
#$ -l m_mem_free=3g
. /etc/profile.d/modules.sh
module load shared
module load proteus
module load gcc
module load sge/univa
module load python/2.7-current
module load apache/spark/1.4.1
###
### Set up environment for Spark
###
export SPARK_CONF_DIR=${SGE_O_WORKDIR}/conf.${JOB_ID}
. ${SPARK_CONF_DIR}/spark-env.sh
###
### The actual work is done below
###
### Start the cluster: master first, then slaves
echo "Starting master on ${SPARK_MASTER_IP} ..."
start-master.sh
echo "Done starting master."
echo "Starting slave..."
start-slaves.sh
echo "Done starting slave."
### the script which does the actual computation is submitted to the
### standalone Spark cluster
echo "Submitting job..."
spark-submit --master $MASTER_URL wordcount.py
echo "Done job."
### Stop the cluster: slaves first, then master
echo "Stopping slaves..."
stop-slaves.sh
echo "Done stopping slaves"
echo "Stopping master..."
stop-master.sh
echo "Done stopping master."
And, that's it. I have not done extensive testing or benchmarking, so I don't know what the performance is like relative to an installation that runs on Hadoop with HDFS.
2015-08-03
Abaqus integration for Univa Grid Engine (update)
My last post about Abaqus integration with Univa Grid Engine (UGE) had one disadvantage: it did not use qrsh to launch the slave MPI processes. As a result, job resource usage accounting was inaccurate. To fix this, certain parallel environment (PE) settings need to be corrected, and the rsh command that Abaqus uses for launching MPI slaves needs to be set to the wrapper rsh script.
The PE settings which worked for me (see also sge_pe(5)):
pe_name abaqus
slots 99999
user_lists NONE
xuser_lists NONE
start_proc_args /cm/shared/apps/sge/var/default/common/pescripts/abaqus.py
stop_proc_args NONE
allocation_rule $round_robin
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary TRUE
daemon_forks_slaves TRUE
master_forks_slaves FALSE
And the updated PE script (again, setting the mp_host_list is optional). The rsh command is actually the rsh wrapper shell script, which then calls qrsh.
#!/usr/bin/env python
import sys, os
### PE startup script aka prologue to set up Abaqus MPI "hostfile"
### Based on documented env file format
### http://www.simulia.com/support/v67/books/sgb67EF/default.htm?startat=ch04s01.html
machinefile = os.environ['PE_HOSTFILE']
abaqenvfile = "abaqus_v6.env"
machinelines = []
with open(machinefile, "r") as mf:
    for l in mf:
        lsplit = l.split()
        machinelines.append( [lsplit[0], int(lsplit[1])] )
with open(abaqenvfile, "w") as envfile:
    envfile.write("mp_mode=MPI\n")
    envfile.write("mp_rsh_command='/cm/shared/apps/sge/univa/mpi/rsh -n -l %U %H %C'\n")
    envfile.write("mp_host_list=%s\n" % (str(machinelines)))
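To see what ends up in the environment file, here is a small self-contained Python sketch that builds the same lines from hostfile text instead of reading PE_HOSTFILE and writing abaqus_v6.env. The function name abaqus_env_lines and the sample host names are made up for the example.

```python
def abaqus_env_lines(pe_hostfile_text, rsh_command=None):
    """Build the lines of an Abaqus environment file from PE_HOSTFILE text
    (hypothetical helper; mirrors what the PE script above writes)."""
    machinelines = []
    for line in pe_hostfile_text.splitlines():
        fields = line.split()
        if fields:
            machinelines.append([fields[0], int(fields[1])])
    lines = ["mp_mode=MPI"]
    if rsh_command:
        lines.append("mp_rsh_command='%s'" % rsh_command)
    lines.append("mp_host_list=%s" % str(machinelines))
    return lines


# Made-up hostfile for two 16-slot hosts
sample = "node01 16 all.q@node01 UNDEFINED\nnode02 16 all.q@node02 UNDEFINED\n"
for line in abaqus_env_lines(sample):
    print(line)
# mp_mode=MPI
# mp_host_list=[['node01', 16], ['node02', 16]]
```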
2015-03-17
Ganglia procstat.py fix to handle process names containing underscores
UPDATE: Accepted. Current version here.
This has bugged me for a while: Ganglia's Python module procstat.py which monitors process CPU and memory usage did not show any data for Grid Engine's qmaster, which has a process name of "sge_qmaster". Turns out, this is because it tries to parse out the process name by assuming it does not have underscores in it. This snippet is from the get_stat(name) function in procstat.py:
if name.startswith('procstat_'):
    fir = name.find('_')
    sec = name.find('_', fir + 1)
    proc = name[fir + 1:sec]
    label = name[sec + 1:]
I just submitted a pull request to change this to something which handles process names with some number of underscores. The snippet to replace the above:
if name.startswith('procstat_'):
    nsp = name.split('_')
    proc = '_'.join(nsp[1:-1])
    label = nsp[-1]
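Here is a quick side-by-side of the two parsing approaches, wrapped in functions so they can be run on their own. The function names parse_old and parse_new are mine, and the metric name is an assumed example of the "procstat_" plus process name plus label pattern.

```python
def parse_old(name):
    """Original parsing: assumes the process name has no underscores."""
    fir = name.find('_')
    sec = name.find('_', fir + 1)
    return name[fir + 1:sec], name[sec + 1:]


def parse_new(name):
    """Replacement parsing: everything between the prefix and the last
    underscore-separated field is the process name."""
    nsp = name.split('_')
    return '_'.join(nsp[1:-1]), nsp[-1]


metric = 'procstat_sge_qmaster_cpu'  # assumed example metric name
print(parse_old(metric))  # ('sge', 'qmaster_cpu')
print(parse_new(metric))  # ('sge_qmaster', 'cpu')
```

For a process name without underscores, both versions give the same answer.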
2015-02-09
Grid Engine PE script (prologue) for Abaqus
UPDATE: Well, a closer look at some of the files Abaqus generates during its run indicates that Abaqus (or, technically, Platform MPI) is aware of Grid Engine and can figure out the host list by itself.
Abaqus 6.13 uses Platform MPI, but also uses its own "environment file" for the MPI hostfile. (Search for "mp_host_list" at the official documentation.) So, I cooked up this PE script (aka prologue) to write the abaqus_v6.env file in the job directory:
#!/usr/bin/env python
import sys, os
### PE startup script to set up Abaqus MPI "hostfile"
### Based on documented env file format
machinefile = os.environ['PE_HOSTFILE']
abaqenvfile = "abaqus_v6.env"
machinelines = []
with open(machinefile, "r") as mf:
    for l in mf:
        lsplit = l.split()
        machinelines.append( [lsplit[0], int(lsplit[1])] )
with open(abaqenvfile, "w") as envfile:
    envfile.write("mp_mode=MPI\n")
    envfile.write("mp_host_list=%s\n" % (str(machinelines)))
2014-05-14
SSSD setup in a Bright Cluster
UPDATE 2015-05-28: Also, on the master node, the User Portal web application needs a setting in /etc/openldap/ldap.conf:
TLS_REQCERT never
At my current job, we use Bright Cluster Manager and Univa Grid Engine on RHEL 6.5. We were seeing issues where submitted jobs ended up in an "Error" state, especially if many jobs were submitted in a short period, either an array or a shell script loop running qsub iteratively. The error reason was:
can't get password entry for user "juser". Either the user does not exist or NIS error!
However, logging into the assigned compute node and running "id" or even some C code to do user lookups passed.
By default, our installation used nslcd for LDAP lookups. Univa suggested switching to SSSD (System Security Services Daemon) as Red Hat had phased out nslcd. The Fedora site has a good overview.
The switch to using SSSD turned out to be fairly easy, with some hidden hiccups. Running authconfig-tui, keeping the existing settings, and then hitting "OK" immediately turned off nslcd and started up sssd instead. All the attendant changes were made, too: chkconfig settings, /etc/nsswitch.conf. However, I found that users could not change passwords on the login nodes. They could log in, but the passwd command failed with "system offline".
Turns out, SSSD requires an encrypted connection to the LDAP server for password changes. This is a security requirement so that the new password is not sent in the clear from the client node to the LDAP server. (See this forum post by sgallagh.) This means an SSL certificate needs to be created. Self-signed will work if the following line is added to /etc/sssd/sssd.conf:
[domain/default]
...
ldap_tls_reqcert = never
To create the self-signed cert:
root # cd /etc/pki/tls/certs
certs # make slapd.pem
certs # chown ldap:ldap slapd.pem
Then, edit /cm/local/apps/openldap/etc/slapd.conf to add the following lines:
TLSCACertificateFile /etc/pki/tls/certs/ca-bundle.crt
TLSCertificateFile /etc/pki/tls/certs/slapd.pem
TLSCertificateKeyFile /etc/pki/tls/certs/slapd.pem
Also, make sure there is an access section like the one below. My config did not grant access to shadowLastChange:
access to attrs=loginShell,shadowLastChange
by group.exact="cn=rogroup,dc=cm,dc=cluster" read
by self write
by * read
Then, restart the ldap service.
UPDATE Adding some links to official Red Hat documentation: https://access.redhat.com/solutions/42746
2014-04-23
Grid Engine Job Submission Verifier (JSV) using Go
UPDATE 2018-01-11 Please see the updated post here.
UPDATE Well, this does not quite work the way I want, due to my sketchy understanding of core binding. It looks like core binding only works for jobs confined to a single execute host. If a job spans more than one, the "-binding striding:32:1" option will prevent the job from running on 2 nodes with 16 slots each. The correct option should be "-binding striding:16:1".
I have a job which wants 32 slots, which can only be satisfied by using 2 hosts with 16 slots each. If I set "-binding striding:32:1", the job fails to be scheduled because "cannot run on host ... because it offers only 16 core(s), but 32 needed".
What seems to work is to specify only the number available per host, i.e. "-binding striding:16:1". Or, perhaps, "-binding pe striding:16:1".
Daniel Gruber at Univa wrote a Go API for Univa Grid Engine job submission verifiers (JSV). His testing indicated that it was a fair bit faster than TCL or Perl, the recommended JSV languages for a production environment.
I decided it was as good a time as any to dabble a little in Go, seeing as I had a simple problem to solve. Users occasionally make mistakes, and submit parallel jobs without requesting a parallel environment with the appropriate number of slots. It could be that they missed the PE line, and a job is assigned only one slot, but ends up actually using 8 (or 16, or whatever). This means the execute host(s) are over-subscribed when other jobs are also scheduled on those same hosts.
I also wanted to take advantage of Univa's new support for cgroups in order to make sure jobs are restricted in terms of their CPU and memory usage. It also helps with process cleanup when the jobs complete (cleanly or not).
This is pretty straightforward to do. Check the job qsub parameters/options, and set binding appropriately. The source is at my github.
/*
 * Requires https://github.com/dgruber/jsv
 */
package main

import (
	"github.com/dgruber/jsv"
	"strconv"
	"strings"
)

func jsv_on_start_function() {
	//jsv_send_env()
}

func job_verification_function() {
	//
	// Prevent jobs from accidental oversubscription
	//
	const intel_slots, amd_slots = 16, 64
	var modified_p bool = false
	if !jsv.JSV_is_param("pe_name") {
		jsv.JSV_set_param("binding_strategy", "linear_automatic")
		jsv.JSV_set_param("binding_type", "set")
		jsv.JSV_set_param("binding_amount", "1")
		jsv.JSV_set_param("binding_exp_n", "0")
		modified_p = true
	} else {
		if !jsv.JSV_is_param("binding_strategy") {
			var pe_max int
			var v string
			v, _ = jsv.JSV_get_param("pe_max")
			pe_max, _ = strconv.Atoi(v)
			var hostlist string
			hostlist, _ = jsv.JSV_get_param("q_hard")
			hostlist = strings.SplitAfterN(hostlist, "@", 2)[1]
			jsv.JSV_set_param("binding_strategy", "striding_automatic")
			jsv.JSV_set_param("binding_type", "pe")
			if strings.EqualFold("@intelhosts", hostlist) {
				if pe_max < intel_slots {
					jsv.JSV_set_param("binding_amount", strconv.Itoa(pe_max))
				} else {
					jsv.JSV_set_param("binding_amount", strconv.Itoa(intel_slots))
				}
			} else if strings.EqualFold("@amdhosts", hostlist) {
				if pe_max < amd_slots {
					jsv.JSV_set_param("binding_amount", strconv.Itoa(pe_max))
				} else {
					jsv.JSV_set_param("binding_amount", strconv.Itoa(amd_slots))
				}
			}
			jsv.JSV_set_param("binding_step", "1")
			modified_p = true
		}
	}
	if modified_p {
		jsv.JSV_correct("Job was modified")
		// show qsub params
		jsv.JSV_show_params()
	}
	return
}

/* example JSV 'script' */
func main() {
	jsv.Run(true, job_verification_function, jsv_on_start_function)
}
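The arithmetic buried in the Go code above is just a per-host cap: the binding amount is the requested slot count, limited to the number of cores on one host of the chosen type. A minimal Python sketch of that calculation follows; the function name binding_amount is made up, and the host group names and core counts are the ones from this cluster.

```python
# Cores per host for each host group, as in the Go constants above.
SLOTS_PER_HOST = {"@intelhosts": 16, "@amdhosts": 64}


def binding_amount(pe_max, hostgroup):
    """Return the number of cores to bind per host, or None when the
    host group is not recognized (hypothetical helper)."""
    per_host = SLOTS_PER_HOST.get(hostgroup.lower())
    if per_host is None:
        return None
    return min(pe_max, per_host)


print(binding_amount(32, "@intelhosts"))  # 16
print(binding_amount(8, "@amdhosts"))     # 8
```

The first call is the case from the UPDATE at the top of this post: a 32-slot job on 16-core hosts should bind 16 cores per host, not 32.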
2014-03-25
Exclusive host access in Grid Engine
UPDATE - 2014-03-28: Univa Grid Engine 8.1.7 (and possibly earlier) has a simpler way to set this up. One just needs to define the "exclusive" complex, without setting up a separate exclusive queue with a forced complex:
#name shortcut type relop requestable consumable default urgency
exclusive excl BOOL EXCL YES YES 0 1000
Unfortunately, I just discovered a slight deficiency with this approach. That complex must be attached to specific hosts. This means modifying each exec host using "qconf -me hostname".
ORIGINAL POST BELOW:
Here at Drexel's URCF, we use Univa Grid Engine (Wikipedia article). One of the requirements that frequently comes up is for jobs to have exclusive access to the compute hosts that the jobs occupy. A common reason is that a job may need more memory per process than is available on any single host.
Some resource managers and schedulers like Torque allow one to reserve nodes exclusively. In Grid Engine (GE), it is not a built-in feature. However, there is a way to accomplish the same thing. This post expands a little on Dan Gruber's blog post, for people like me who are new to GE.
Here, we assume there is only one queue named all.q. And we have two host groups: @intelhosts, and @amdhosts.
One can create a Boolean resource, a.k.a. complex, named "exclusive", which can be requested. That resource is forced to have the value TRUE in a new queue called exclusive.q so that only jobs that request "exclusive" will be sent to that queue.
exclusive excl BOOL == FORCED NO 0 0
Once the complex is created, create a new queue named exclusive.q which spans all hosts, with a single slot per host. Set it to be subordinate to all.q -- this means that if there are any jobs in all.q on a host, exclusive.q on that host is suspended. And set the "exclusive" Boolean complex to be TRUE.
qname exclusive.q
...
slots 1
subordinate_list all.q=1
complex_values exclusive=TRUE
...
Modify all.q and set it subordinate to exclusive.q -- this ensures that if there is a job in exclusive.q on a host, all.q on that host is suspended:
subordinate_list exclusive.q=1
And now, to get a parallel job on Intel hosts, the job script would have something like this:
#$ -l excl
#$ -q *@@intelhosts
#$ -pe mvapich 128