2014-04-23

Grid Engine Job Submission Verifier (JSV) using Go

UPDATE 2018-01-11: Please see the updated post here.

UPDATE: Well, this does not quite work the way I wanted, due to my sketchy understanding of core binding. It looks like a core binding request applies to each execute host separately, not to the job as a whole. If a job spans more than one host, the "-binding striding:32:1" option will prevent it from running on 2 nodes with 16 slots each. The correct option is "-binding striding:16:1".

I have a job that requests 32 slots, which can only be satisfied by using 2 hosts with 16 slots each. If I set "-binding striding:32:1", the job fails to be scheduled with "cannot run on host ... because it offers only 16 core(s), but 32 needed".

What seems to work is to specify only the number of cores available per host, i.e. "-binding striding:16:1", or perhaps "-binding pe striding:16:1".
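Since the binding amount has to fit on each execute host, the value to request is the smaller of the job's slot count and the number of cores on one host. A minimal sketch of that arithmetic, assuming every host the job can land on has the same core count; `perHostBindingAmount` and `stridingOption` are hypothetical helpers for illustration, not part of Grid Engine or the jsv package:

```go
package main

import "fmt"

// perHostBindingAmount returns how many cores to request in a
// "-binding striding:<amount>:1" option. The binding request is applied
// on each execute host, so it can never exceed the cores on one host.
func perHostBindingAmount(slots, coresPerHost int) int {
	if slots < coresPerHost {
		return slots
	}
	return coresPerHost
}

// stridingOption formats the qsub option for a per-host amount,
// using a step size of 1 as in the examples above.
func stridingOption(amount int) string {
	return fmt.Sprintf("-binding striding:%d:1", amount)
}

func main() {
	// A 32-slot job spread over 16-core hosts: ask for 16 cores per host.
	fmt.Println(stridingOption(perHostBindingAmount(32, 16))) // prints -binding striding:16:1
	// An 8-slot job fits on one 16-core host: ask for all 8.
	fmt.Println(stridingOption(perHostBindingAmount(8, 16))) // prints -binding striding:8:1
}
```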

Daniel Gruber at Univa wrote a Go API for Univa Grid Engine job submission verifiers (JSV). His testing indicated that it was a fair bit faster than Tcl or Perl, the recommended JSV languages for a production environment.

I decided it was as good a time as any to dabble a little in Go, seeing as I had a simple problem to solve. Users occasionally make mistakes and submit parallel jobs without requesting a parallel environment (PE) with the appropriate number of slots. For example, they may leave out the PE line, so the job is assigned only one slot but actually uses 8 (or 16, or however many) cores. The execute host(s) are then oversubscribed whenever other jobs are scheduled on the same hosts.

I also wanted to take advantage of Univa's new support for cgroups, to make sure jobs are restricted in their CPU and memory usage. Cgroups also help with process cleanup when jobs finish (cleanly or not).

This is pretty straightforward to do: check the job's qsub parameters/options, and set the core binding appropriately. The source is on my GitHub.

/*
 * Requires https://github.com/dgruber/jsv
 */

package main

import (
    "strconv"
    "strings"

    "github.com/dgruber/jsv"
)

// Cores per execute host in each host group.
const intelSlots, amdSlots = 16, 64

func jsvOnStartFunction() {
    //jsv_send_env()
}

func jobVerificationFunction() {
    // Prevent jobs from accidentally oversubscribing execute hosts.
    modified := false

    if !jsv.JSV_is_param("pe_name") {
        // Serial job (no parallel environment): bind it to a single core.
        jsv.JSV_set_param("binding_strategy", "linear_automatic")
        jsv.JSV_set_param("binding_type", "set")
        jsv.JSV_set_param("binding_amount", "1")
        jsv.JSV_set_param("binding_exp_n", "0")
        modified = true
    } else if !jsv.JSV_is_param("binding_strategy") && jsv.JSV_is_param("q_hard") {
        // Parallel job without an explicit binding request. The binding
        // amount applies per execute host, so cap it at the cores per host.
        v, _ := jsv.JSV_get_param("pe_max")
        peMax, err := strconv.Atoi(v)

        // q_hard looks like "queue@@hostgroup"; SplitAfterN leaves
        // "@hostgroup" in the second element. Check the length first:
        // a queue request without "@" would otherwise panic on [1].
        q, _ := jsv.JSV_get_param("q_hard")
        parts := strings.SplitAfterN(q, "@", 2)

        perHost := 0
        if len(parts) == 2 {
            switch {
            case strings.EqualFold(parts[1], "@intelhosts"):
                perHost = intelSlots
            case strings.EqualFold(parts[1], "@amdhosts"):
                perHost = amdSlots
            }
        }

        // Only set a binding when both the slot count and the per-host
        // core count are known, so binding_amount is never left unset.
        if err == nil && peMax > 0 && perHost > 0 {
            amount := peMax
            if amount > perHost {
                amount = perHost
            }
            jsv.JSV_set_param("binding_strategy", "striding_automatic")
            jsv.JSV_set_param("binding_type", "pe")
            jsv.JSV_set_param("binding_amount", strconv.Itoa(amount))
            jsv.JSV_set_param("binding_step", "1")
            modified = true
        }
    }

    if modified {
        jsv.JSV_correct("Job was modified")

        // Show the resulting qsub parameters.
        jsv.JSV_show_params()
    }
}

/* example JSV 'script': hand the callbacks to the jsv run loop */
func main() {
    jsv.Run(true, jobVerificationFunction, jsvOnStartFunction)
}
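The JSV callbacks only run inside Grid Engine, which makes the binding decision awkward to test. One option is to pull that decision into a pure function over a plain map of qsub parameters. The sketch below does that; the map representation, `bindingParams`, `hostgroupFromQueue`, and the PE name "mpi" are hypothetical and not part of the dgruber/jsv API, while the parameter names (`pe_name`, `pe_max`, `q_hard`, `binding_*`) and host groups are the ones the JSV uses. It assumes `q_hard` holds a single `queue@@hostgroup` request.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Cores per execute host in each host group.
const intelSlots, amdSlots = 16, 64

// hostgroupFromQueue returns everything after the first "@" in a hard
// queue request, e.g. "all.q@@intelhosts" -> "@intelhosts".
// It reports false when the request contains no "@".
func hostgroupFromQueue(qHard string) (string, bool) {
	i := strings.Index(qHard, "@")
	if i < 0 {
		return "", false
	}
	return qHard[i+1:], true
}

// bindingParams returns the binding parameters to set for a job,
// or nil if the job should be left unchanged.
func bindingParams(params map[string]string) map[string]string {
	// Serial job: bind to exactly one core.
	if _, ok := params["pe_name"]; !ok {
		return map[string]string{
			"binding_strategy": "linear_automatic",
			"binding_type":     "set",
			"binding_amount":   "1",
			"binding_exp_n":    "0",
		}
	}
	// Parallel job that already asks for a binding: respect it.
	if _, ok := params["binding_strategy"]; ok {
		return nil
	}
	peMax, err := strconv.Atoi(params["pe_max"])
	if err != nil || peMax < 1 {
		return nil
	}
	group, ok := hostgroupFromQueue(params["q_hard"])
	if !ok {
		return nil
	}
	var perHost int
	switch {
	case strings.EqualFold(group, "@intelhosts"):
		perHost = intelSlots
	case strings.EqualFold(group, "@amdhosts"):
		perHost = amdSlots
	default:
		return nil // unknown host group: no per-host core count to cap at
	}
	// The binding amount applies per host, so cap it at one host's cores.
	amount := peMax
	if amount > perHost {
		amount = perHost
	}
	return map[string]string{
		"binding_strategy": "striding_automatic",
		"binding_type":     "pe",
		"binding_amount":   strconv.Itoa(amount),
		"binding_step":     "1",
	}
}

func main() {
	job := map[string]string{"pe_name": "mpi", "pe_max": "32", "q_hard": "all.q@@intelhosts"}
	fmt.Println(bindingParams(job)["binding_amount"]) // prints 16

	// No "@" in the queue request: the job is left unchanged.
	noGroup := map[string]string{"pe_name": "mpi", "pe_max": "32", "q_hard": "all.q"}
	fmt.Println(bindingParams(noGroup) == nil) // prints true
}
```

The sketch leaves a job untouched when `q_hard` is missing or names an unknown host group, so it never sets a striding strategy without an amount. With the logic in a pure function, the JSV callback can shrink to a thin layer of `JSV_get_param`/`JSV_set_param` calls, and the rules can be checked with `go test`.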
