Linux Follies: bright cluster manager

2022-01-12

Nvidia acquires Bright Computing

Nvidia solidifies its reach into HPC, acquiring Bright Computing. More coverage at HPCwire. This follows its acquisition of Mellanox, which closed in April 2020.

2020-09-14

Switching to SSSD (from nslcd) in Bright Cluster Manager 9

In Bright Cluster Manager 9.0, cluster nodes still use nslcd for LDAP authentication. Since we already have sssd working in Bright CM 6 (by necessity, due to an issue with Univa Grid Engine and nslcd; see previous posts), we might as well switch to sssd on Bright CM 9, too. The cluster now runs RHEL 8.

First, we disable the nslcd service on all nodes. How to do this was not obvious, since removing it from the device services did nothing: the service just kept coming back enabled. That is, do “remove nslcd ; commit” and then “list”, and nslcd simply reappears.

Examining that service in the device view showed that it “belongs to a role,” but it is not listed in any role, nor in the category of that node.

[foocluster]% category use compute-cat
[foocluster->category[compute-cat]]% services
[foocluster->category[compute-cat]->services]% list
Service (key) Monitored Autostart
------------------------ ---------- ----------

It turns out that nslcd is part of a hidden role which is not visible to the user. So, you have to write a loop to disable nslcd on each node. Within cmsh:

[foocluster]% device
[foocluster->device]% foreach -v -n node001..node099 (services; use nslcd; set monitored no ; set autostart no)
[foocluster->device]% commit
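
The same loop can be driven from a script. The following is a minimal Python sketch, not something from the Bright documentation: the helper names build_disable_command and run_cmsh are hypothetical, and it assumes cmsh is on the PATH of the head node and accepts a command string through its -c option. It only prints the command unless dry_run is turned off.

```python
import subprocess


def build_disable_command(service, first_node, last_node):
    """Return a cmsh command string that turns off monitoring and
    autostart for `service` on every node in the given range."""
    node_range = f"{first_node}..{last_node}"
    return (
        f"device; foreach -v -n {node_range} "
        f"(services; use {service}; set monitored no; set autostart no); "
        "commit"
    )


def run_cmsh(command, dry_run=True):
    """Print the command, and run it through cmsh unless dry_run is set."""
    print(command)
    if not dry_run:
        # Assumption: cmsh is on PATH and takes a command string via -c.
        subprocess.run(["cmsh", "-c", command], check=True)


if __name__ == "__main__":
    run_cmsh(build_disable_command("nslcd", "node001", "node099"))
```

Keeping the string builder separate from the runner makes it easy to review the exact cmsh command before anything is committed.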

To modify the node image, I modify the image on one node, and then do “grabimage -w” in cmsh on the head node.

You will need to install these packages:

openldap-clients
sssd
sssd-ldap
openssl-perl

Next, sssd setup. This may depend on your installation. The installation here uses the LDAP server set up by Bright CM, which uses SSL for encryption with both server and client certificates. (All self-signed with a dummy CA in the usual way.) The following /etc/sssd/sssd.conf shows only the non-empty sections. Your configuration may need to be different depending on your environment.

[domain/default]
id_provider = ldap
autofs_provider = ldap
auth_provider = ldap
chpass_provider = ldap
ldap_uri = ldaps://fooserver.cm.cluster
ldap_search_base = dc=cm,dc=cluster
ldap_id_use_start_tls = False
ldap_tls_reqcert = demand
ldap_tls_cacertdir = /cm/local/apps/openldap/etc/certs
cache_credentials = True
enumerate = False
entry_cache_timeout = 600
ldap_network_timeout = 3
ldap_connection_expire_timeout = 60

[sssd]
config_file_version = 2
services = nss, pam
domains = default

[nss]
homedir_substring = /home

Then, restrict ownership and permissions on the file:

# chown root:root /etc/sssd/sssd.conf
# chmod 600 /etc/sssd/sssd.conf
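
Since sssd is strict about who owns its configuration file and who can read it, a small check can be run on a node after the image is deployed. This is a minimal Python sketch; has_safe_permissions is a hypothetical helper, and the expected_uid parameter exists only so the function can be tried on a scratch file without being root.

```python
import os
import stat


def has_safe_permissions(path, expected_uid=0):
    """True if `path` is a regular file owned by expected_uid with mode 0600."""
    info = os.stat(path)
    return (
        stat.S_ISREG(info.st_mode)
        and info.st_uid == expected_uid
        and stat.S_IMODE(info.st_mode) == 0o600
    )


if __name__ == "__main__":
    print(has_safe_permissions("/etc/sssd/sssd.conf"))
```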

I did not have to change /etc/openldap/ldap.conf.

The next step is to switch to using sssd for authentication. But first, stop and disable the nslcd service:

# systemctl stop nslcd
# systemctl disable nslcd

The old authconfig-tui utility is gone. The new one is authselect: you will have to force it to overwrite existing authentication configurations.

# authselect select sssd --force

There are other options to authselect, e.g. “with-mkhomedir”. See authselect(8) and authselect-profiles(5) for details. Other options may also require other packages to be installed.

Then, start and enable the sssd service. Check that user ID info can be retrieved:

# id someuser
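
The same check can be scripted, which is handy when testing several accounts or several nodes. A minimal Python sketch follows; it uses the standard pwd module, which goes through the same name service lookup that id does, and user_resolves is a hypothetical helper name.

```python
import pwd


def user_resolves(username):
    """True if the system's name service can resolve `username`."""
    try:
        pwd.getpwnam(username)
    except KeyError:
        # getpwnam raises KeyError when no passwd entry is found.
        return False
    return True


if __name__ == "__main__":
    for name in ("root", "someuser"):
        print(name, user_resolves(name))
```

Note that a local account such as root resolves from /etc/passwd, so only an LDAP-only account actually exercises sssd.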

Back on the head node, do “grabimage -w”.

Then, modify the node category to add the sssd service, setting it to autostart and to be monitored.

2020-08-04

Scripting Bright Cluster Manager 9.0 with Python

It has been more than 6 years since the previous post about using the Python API to script Bright Cluster Manager (CM). Time for an update.

I have to do the same as before: change the “category” on a whole bunch of nodes.

NB the Developer Manual has some typos, where it makes it look like you can specify categories as strings of their names, e.g. cluster.get_by_type('Node')

2019-10-08

Still more on SSSD in Bright Cluster Manager - cache file

It has been a few years since I got SSSD to work in Bright Cluster Manager 6, and I just figured out one little thing that has been an annoyance the whole time. There has been a spurious group hanging on: it has the same GID as an existing group, but a different group name.

Since Bright CM 6 did not handle SSSD out of the box, it also did not handle the SSSD cache file. More accurately, it did not ignore the file in the software image, so the grabimage command would copy the cache file along with the rest of the image to the provisioning server, which would then propagate it to the nodes in the category.

The fix is simple: add /var/lib/sss/db/* to the various exclude list settings in the category.

To reset the cache:
    service sssd stop
    /bin/rm -f /var/lib/sss/db/cache_default.ldb
    service sssd start
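
If there is more than one domain, or the cache file has a different name, the removal step can be generalized. The following Python sketch is an assumption-laden variant of the commands above: it deletes every *.ldb file in the cache directory rather than only cache_default.ldb, and remove_cache_files is a hypothetical helper. As above, sssd should be stopped before running it and started again afterwards.

```python
import glob
import os


def remove_cache_files(db_dir="/var/lib/sss/db"):
    """Delete every *.ldb file in db_dir and return the removed paths."""
    removed = []
    for path in sorted(glob.glob(os.path.join(db_dir, "*.ldb"))):
        os.remove(path)
        removed.append(path)
    return removed


if __name__ == "__main__":
    # Run only while the sssd service is stopped.
    for path in remove_cache_files():
        print("removed", path)
```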

I did try "sss_cache -E" which is supposed to clear the cache, but found that it did not work as I expected: the spurious group still appeared with "getent group".

2014-07-01

Using the NVIDIA Python plugin for Ganglia monitoring under Bright Cluster Manager

The github repo for Ganglia gmond Python plugins contains a plugin for monitoring NVIDIA GPUs. This presumes that the NVIDIA Deployment Kit, which contains the NVML (management library), is installed via the normal means into the usual places. If you are using Bright Cluster Manager, you would have used Bright's cuda60/tdk to do the installation. That means that the libnvidia-ml.so library is not in one of the standard library directories. To fix it, just modify the /etc/init.d/gmond init script. Near the top, modify the LD_LIBRARY_PATH:

export LD_LIBRARY_PATH=/cm/local/apps/cuda/libs/current/lib64
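
To confirm that the library can actually be found with that setting, one option is to try loading it from Python in a shell where the same LD_LIBRARY_PATH has been exported. This is only a sketch: can_load is a hypothetical helper, and the variable has to be set before the Python process starts, because the dynamic loader reads it at startup.

```python
import ctypes


def can_load(library_name):
    """True if the dynamic loader can find and open `library_name`."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the library is missing or cannot be opened.
        return False
    return True


if __name__ == "__main__":
    print(can_load("libnvidia-ml.so"))
```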

The modifications to Ganglia Web, however, are out of date. I will make another post once I figure out how to modify Ganglia Web to display the NVIDIA metrics.

UPDATE: Well, turns out there seems to be no need to modify the Ganglia Web installation. Under the host view, there is a tab for "gpu metrics" which shows 22 available metrics.

2014-05-14

SSSD setup in a Bright Cluster

UPDATE 2015-05-28: Also, on the master node, the User Portal web application needs a setting in /etc/openldap/ldap.conf:

TLS_REQCERT never

At my current job, we use Bright Cluster Manager and Univa Grid Engine on RHEL 6.5. We were seeing issues where submitted jobs ended up in an "Error" state, especially if many jobs were submitted in a short period, either an array or a shell script loop running qsub iteratively. The error reason was:

can't get password entry for user "juser". Either the user does not exist or NIS error!

However, when I logged into the assigned compute node, running "id", or even some C code that does user lookups, succeeded.

By default, our installation used nslcd for LDAP lookups. Univa suggested switching to SSSD (System Security Services Daemon) as Red Hat had phased out nslcd. The Fedora site has a good overview.

The switch to using SSSD turned out to be fairly easy, with some hidden hiccups. Running authconfig-tui, keeping the existing settings, and hitting "OK" immediately turned off nslcd and started up sssd instead. All the attendant changes were made, too: chkconfig settings, /etc/nsswitch.conf. However, I found that users could not change passwords on the login nodes. They could log in, but the passwd command failed with "system offline".

Turns out, SSSD requires an encrypted connection to the LDAP server for password changes. This is a security requirement so that the new password is not sent in the clear from the client node to the LDAP server. (See this forum post by sgallagh.) This means an SSL certificate needs to be created. Self-signed will work if the following line is added to /etc/sssd/sssd.conf:

[domain/default]
...
ldap_tls_reqcert = never

To create the self-signed cert:
root # cd /etc/pki/tls/certs
certs # make slapd.pem
certs # chown ldap:ldap slapd.pem

Then, edit /cm/local/apps/openldap/etc/slapd.conf to add the following lines:
TLSCACertificateFile /etc/pki/tls/certs/ca-bundle.crt
TLSCertificateFile /etc/pki/tls/certs/slapd.pem
TLSCertificateKeyFile /etc/pki/tls/certs/slapd.pem

Also, make sure there is an access section covering shadowLastChange; my config did not have one:

access to attrs=loginShell,shadowLastChange
    by group.exact="cn=rogroup,dc=cm,dc=cluster" read
    by self write
    by * read

Then, restart the ldap service.

UPDATE: Adding a link to official Red Hat documentation: https://access.redhat.com/solutions/42746

2014-01-15

Scripting Bright Cluster Manager

At my new position as Sr. SysAdmin at Drexel's University Research Computing Facility (URCF), we use Bright Cluster Manager. I am new to Bright, and I am finding it very nice, indeed. One of its best features is programmatic access via a Python API. In about half an hour, I figured out enough to modify the node categories of all the nodes in the cluster.

Node categories group nodes which have similar configurations and roles. An example configuration may be a list of remote filesystem mounts, and an example role may be a Grid Engine compute node with 64 job slots. The cluster at URCF has 64-core AMD nodes and 16-core Intel nodes, so I created a category for each of these. Then, I needed to change the node categories from the default to the architecture-specific categories. The script below did it for the Intel nodes.