2015-03-17

Ganglia procstat.py fix to handle process names containing underscores

This has bugged me for a while: Ganglia's Python module procstat.py, which monitors process CPU and memory usage, did not show any data for Grid Engine's qmaster, which has a process name of "sge_qmaster". It turns out this is because it parses the process name out of the metric name on the assumption that the process name contains no underscores. This snippet is from the get_stat(name) function in procstat.py:

if name.startswith('procstat_'):
    fir = name.find('_')
    sec = name.find('_', fir + 1)
    proc = name[fir + 1:sec]
    label = name[sec + 1:]
I just submitted a pull request to change this to something which handles process names with some number of underscores. My version is here, and the snippet to replace the above:

if name.startswith('procstat_'):
    nsp = name.split('_')
    proc = '_'.join(nsp[1:-1])
    label = nsp[-1]
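To see the difference between the two snippets side by side, here is a small self-contained sketch. The function names parse_old and parse_new are mine, not from procstat.py, and the "cpu" and "mem" labels are just illustrative. It assumes the label (the part after the last underscore) never contains an underscore itself.

```python
def parse_old(name):
    """Original logic: assumes the process name has no underscores."""
    fir = name.find('_')
    sec = name.find('_', fir + 1)
    return name[fir + 1:sec], name[sec + 1:]


def parse_new(name):
    """Replacement logic: everything between the prefix and the last
    underscore is the process name; the last field is the label."""
    nsp = name.split('_')
    return '_'.join(nsp[1:-1]), nsp[-1]


if __name__ == '__main__':
    for metric in ('procstat_httpd_cpu', 'procstat_sge_qmaster_mem'):
        print(metric, parse_old(metric), parse_new(metric))
```

For "procstat_sge_qmaster_mem", the old logic yields the process name "sge" and the label "qmaster_mem", which is why no data showed up. The new logic yields "sge_qmaster" and "mem".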

2015-02-11

FlexLM and host names

If you ever get an error with lmstat like:

lmgrd is not running: License server machine is down or not responding. (-96,7:2 "No such file or directory")

but only from some machines outside your domain, check that the SERVER line in your license file specifies the FQDN of the license server. The default is to use just the hostname.

SERVER myserver.mydom.com XXXXXXXXXXXX NNNNN

Most search hits on that error message say things about firewalls.
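If you have several license files to check, something like the following rough sketch can flag SERVER lines that use a short hostname. The function name short_server_hosts is made up for this post, and the "no dot means not an FQDN" test is only a heuristic that I am assuming is good enough here. It will also flag special values such as this_host.

```python
def short_server_hosts(license_text):
    """Return hostnames on SERVER lines that do not look fully qualified."""
    short = []
    for line in license_text.splitlines():
        fields = line.split()
        # A SERVER line looks like: SERVER <host> <hostid> [<port>]
        if len(fields) >= 2 and fields[0] == 'SERVER':
            host = fields[1]
            # Heuristic (assumption): a name without a dot is not an FQDN.
            if '.' not in host:
                short.append(host)
    return short


if __name__ == '__main__':
    sample = ("SERVER myserver XXXXXXXXXXXX NNNNN\n"
              "SERVER myserver.mydom.com XXXXXXXXXXXX NNNNN\n")
    print(short_server_hosts(sample))
```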

2015-02-09

Grid Engine PE script (prologue) for Abaqus

UPDATE: Well, a closer look at some of the files Abaqus generates during its run indicates that Abaqus (or, technically, Platform MPI) is aware of Grid Engine and can figure out the host list by itself.

Abaqus 6.13 uses Platform MPI, but also uses its own "environment file" for the MPI hostfile. (Search for "mp_host_list" in the official documentation.) So, I cooked up this PE script (aka prologue) to write the abaqus_v6.env file in the job directory:

#!/usr/bin/env python
import os

### PE startup script to set up Abaqus MPI "hostfile"
### Based on documented env file format

machinefile = os.environ['PE_HOSTFILE']
abaqenvfile = "abaqus_v6.env"

# Each PE_HOSTFILE line starts with: hostname number_of_slots
machinelines = []
with open(machinefile, "r") as mf:
    for l in mf:
        lsplit = l.split()
        machinelines.append( [lsplit[0], int(lsplit[1])] )

with open(abaqenvfile, "w") as envfile:
    envfile.write("mp_mode=MPI\n")
    envfile.write("mp_host_list=%s\n" % (str(machinelines)))
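To show what ends up in abaqus_v6.env, here is the same conversion as a standalone function run on a made-up hostfile. The node names, slot counts, and queue names are hypothetical, and host_list_line is my own name for the helper. I am assuming the usual Grid Engine PE_HOSTFILE layout, where the first two fields on each line are the host name and the number of slots.

```python
def host_list_line(pe_hostfile_text):
    """Build the mp_host_list line from the text of a PE_HOSTFILE."""
    machinelines = []
    for line in pe_hostfile_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        machinelines.append([fields[0], int(fields[1])])
    return "mp_host_list=%s" % (str(machinelines))


# Hypothetical two-node job with 16 slots on each node.
sample = ("node01 16 all.q@node01 UNDEFINED\n"
          "node02 16 all.q@node02 UNDEFINED\n")

if __name__ == '__main__':
    print(host_list_line(sample))
    # prints: mp_host_list=[['node01', 16], ['node02', 16]]
```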

2014-12-30

Storage Manager(s) software

I am really curious about a storage device management GUI named "SMclient". At my previous job at Wake Forest, we used an IBM GPFS storage system. The software used to manage it was SMclient, with SM, I assume, meaning "Storage Manager". It's a Java-based GUI. There is also a command line interface, SMcli.

Here at Drexel's URCF, we have a Dell High Availability NFS storage system, using an MD3260 storage device and HA front-ends using Red Hat's High Availability Add-On and XFS. Anyway, the GUI used to manage the MD3260 storage device is also SMclient, which looks identical to IBM's SMclient. Nothing in the "About" window mentions the history of the software.

Anyone out there know the history of this software?

2014-12-29

Mellanox Infiniband network cards on Linux

Sometimes, when one updates the firmware for Mellanox Infiniband cards, the MAC/hardware address gets changed. This usually happens if the IB card is OEM, i.e. made by Mellanox but stamped with a different company's name.

When the MAC gets changed, the network interface will not come up. The fix is to update the HWADDR field in /etc/sysconfig/network-scripts/ifcfg-ib0 and /etc/sysconfig/network-scripts/ifcfg-ib1. Use "ip link list" to display the new MAC.
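If there are many nodes to fix, the edit can be scripted. Below is a minimal sketch that rewrites the HWADDR line in the text of an ifcfg file. The function name set_hwaddr is my own, and the addresses in the example are placeholders, not real hardware addresses. It only transforms a string. Reading and writing the actual file (and backing it up first) is left out.

```python
import re


def set_hwaddr(ifcfg_text, new_hwaddr):
    """Return ifcfg_text with its HWADDR= line set to new_hwaddr.

    If there is no HWADDR line, one is appended.
    """
    new_line = 'HWADDR=%s' % new_hwaddr
    pattern = re.compile(r'^HWADDR=.*$', re.MULTILINE)
    if pattern.search(ifcfg_text):
        # Use a function as the replacement so new_line is taken literally.
        return pattern.sub(lambda m: new_line, ifcfg_text)
    if ifcfg_text and not ifcfg_text.endswith('\n'):
        ifcfg_text += '\n'
    return ifcfg_text + new_line + '\n'


if __name__ == '__main__':
    # Placeholder values, for illustration only.
    old = "DEVICE=ib0\nHWADDR=OLD:ADDRESS\nONBOOT=yes\n"
    print(set_hwaddr(old, "NEW:ADDRESS"))
```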

2014-12-16

RHEL 6.4 kernel 2.6.32-358.23.2, Mellanox OFED 2.1-1.0.6, and Lustre client 2.5.0

I am planning some upgrades for the cluster that I manage. As part of the updates, it would be good to have MVAPICH2 with GDR (GPU-Direct RDMA -- yes, that's an acronym of an acronym). MVAPICH2-GDR, which is provided only as binary RPMs, only supports Mellanox OFED 2.1.

Now, our cluster runs RHEL6.4, but with most non-kernel and non-glibc packages updated to whatever is in RHEL6.5. The plan is to update everything to whatever is in RHEL6.6, except for the kernel, leaving that at 2.6.32-358.23.2, which is the last RHEL6.4 kernel update. The reason for staying with that version of the kernel is Lustre.

We have a Terascala Lustre filesystem appliance. The latest release of TeraOS uses Lustre 2.5.0. Upgrading the server is pretty straightforward, according to the Terascala engineers. Updating the client is a bit trickier. Currently, the Lustre support matrix says that Lustre 2.5.0 is supported only on RHEL6.4.

The plan of attack is this:

  1. Update a base node with all RHEL packages, leaving the kernel at 2.6.32-358.23.2
  2. Upgrade Mellanox OFED from 1.9 to 2.1
  3. Build lustre-client-2.5.0 and upgrade the Lustre client packages

Updating the base node is straightforward. Just use "yum update", after commenting out the exclusions in /etc/yum.conf. If you had updated the redhat-release-server-6Server package, which defines which RHEL release you have, you can downgrade it. (See RHEL Knowledgebase, subscription required.) First, install the last (as of 2014-12-15) RHEL6.4 kernel, and then do the downgrade:
# yum install kernel-2.6.32-358.23.2.el6
# reboot
# yum downgrade redhat-release-server-6Server

Check with "cat /etc/redhat-release".

Next, install Mellanox OFED 2.1-1.0.6. You can install it directly using the provided installation script, or if you are paranoid like me, you can use the provided script to build RPMs against the exact kernel update you have installed.

Get the tarball directly from Mellanox. Extract, and make new RPMs:
# tar xf MLNX_OFED_LINUX-2.1-1.0.6-rhel6.4-x86_64.tgz
# cd MLNX_OFED_LINUX-2.1-1.0.6-rhel6.4-x86_64
# ./mlnx_add_kernel_support.sh -m .
...
# cp /tmp/MLNX_OFED_LINUX-2.1-1.0.6-rhel6.4-x86_64-ext.tgz .
# tar xf MLNX_OFED_LINUX-2.1-1.0.6-rhel6.4-x86_64-ext.tgz
# cd MLNX_OFED_LINUX-2.1-1.0.6-rhel6.4-x86_64-ext
# ./mlnxofedinstall
# reboot

Strictly speaking, the reboot is unnecessary: you can stop and restart a couple of services and the new OFED will load.

Next, for Lustre. Get the SRPM from Intel (who bought Whamcloud). You will notice that it is for kernel 2.6.32-358.18.1. Not mentioned is the fact that by default, it uses the generic OFED that Red Hat rolls into its distribution. To use the Mellanox OFED, a slightly different installation method must be used.

# rpm -ivh lustre-client-2.5.0-2.6.32_358.18.1.el6.x86_64.src.rpm
# cd ~/rpmbuild/SOURCES
# cp lustre-2.5.0.tar.gz ~/tmp
# cd ~/tmp
# tar xf lustre-2.5.0.tar.gz
# cd lustre-2.5.0
# ./configure --disable-server --with-o2ib=/usr/src/ofa_kernel/default
# make rpms
# cd ~/rpmbuild/RPMS/x86_64
# yum install lustre-client-2.5.0-2.6.32_358.23.2.el6.x86_64.x86_64.rpm \
lustre-client-modules-2.5.0-2.6.32_358.23.2.el6.x86_64.x86_64.rpm \
lustre-client-tests-2.5.0-2.6.32_358.23.2.el6.x86_64.x86_64.rpm \
lustre-iokit-2.5.0-2.6.32_358.23.2.el6.x86_64.x86_64.rpm
To make the lustre module load at boot, I have a kludge: in /etc/init.d/netfs, right after the line
STRING=$"Checking network-attached filesystems"
add
modprobe lustre
Reboot, and then check:
# lsmod | grep lustre
lustre                921744  0
lov                   516461  1 lustre
mdc                   199005  1 lustre
ptlrpc               1295397  6 mgc,lustre,lov,osc,mdc,fid
obdclass             1128062  41 mgc,lustre,lov,osc,mdc,fid,ptlrpc
lnet                  343705  4 lustre,ko2iblnd,ptlrpc,obdclass
lvfs                   16582  8 mgc,lustre,lov,osc,mdc,fid,ptlrpc,obdclass
libcfs                491320  11 mgc,lustre,lov,osc,mdc,fid,ko2iblnd,ptlrpc,obdclass,lnet,lvfs
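The same check can be scripted, for example to run across many nodes. This is a rough sketch that parses lsmod-style text. The function names and the list of "required" modules are my own assumptions: I picked lustre, lnet, and ko2iblnd (the latter shows up above as a user of lnet). Note that it wants the full lsmod output, not the grep-filtered version, since the ko2iblnd line itself does not contain the string "lustre".

```python
def loaded_modules(lsmod_text):
    """Return the set of module names from lsmod-style output."""
    names = set()
    for line in lsmod_text.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module Size Used by" header.
        if not fields or fields[0] == 'Module':
            continue
        names.add(fields[0])
    return names


def missing_modules(lsmod_text, required=('lustre', 'lnet', 'ko2iblnd')):
    """Return the required modules that are not loaded, sorted by name."""
    return sorted(set(required) - loaded_modules(lsmod_text))


if __name__ == '__main__':
    import subprocess
    # Run lsmod on the local machine and report anything missing.
    text = subprocess.check_output(['lsmod']).decode()
    print(missing_modules(text))
```

An empty list means all the modules in the (assumed) required list are loaded.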


2014-11-06

Mounting an HTC One on Ubuntu 14.04 Trusty; Re-flashing the ROM

Ever since I unlocked my HTC One (M7), I have had no voicemail app, because the phone was tied to a specific provider. Now, I'm trying to figure out how to re-flash the device to a generic ROM.

The first thing was to try to connect it to my Linux machine. However, upon plugging the phone into the USB port, an error appeared: "Unable to find matching udev device".

Apparently, this is fairly common and happens with other devices. The error boils down to a bug in Ubuntu's default Media Transfer Protocol (MTP) library. The fix is to install a later version.

First, add a new repository:
$ sudo add-apt-repository ppa:webupd8team/unstable
$ sudo apt-get update
Then, install the mtpfs package.

UPDATE: HTC has a free bootloader unlocking utility. You just have to sign up for their free developer program. Then, follow the instructions here: http://www.htcdev.com/bootloader/unlock-instructions

The one thing they do not mention is that you have to have root privileges for the fastboot utility to work, so:
$ sudo ./fastboot oem get_identifier_token
Then, they email you a file for unlocking. Next, re-flash the phone with a vendor-appropriate ROM from here: http://www.htcdev.com/devcenter/downloads

For AT&T (my phone's original ROM) and T-Mobile (my current provider), there is no binary image for flashing. I didn't bother looking to see if you could compile the source. Instead, I went with the Android Ice Cold Project (AICP). I discovered it via this step-by-step article. They also have a download of the full Google Apps.