Nvidia extends its reach into HPC by acquiring Bright Computing. More coverage at HPC Wire. This follows its acquisition of Mellanox, which was completed in April 2020.
How-to's and technical news about Linux and open computing, with a sprinkling of Python.
2022-01-12
2020-09-14
Switching to SSSD (from nslcd) in Bright Cluster Manager 9
In Bright Cluster Manager 9.0, cluster nodes still use nslcd for LDAP authentication. Since we have sssd working in Bright CM 6 (by necessity, due to an issue with Univa Grid Engine and nslcd; see previous posts), we might as well switch to sssd on Bright CM 9, too. The cluster now runs RHEL 8.
First, we disable the nslcd service on all nodes. How to do this was not obvious: removing it from the device services had no effect, because the service just kept coming back enabled. That is, after “remove nslcd ; commit”, a “list” shows nslcd again.
Examining that service in the device view showed that it “belongs to a role,” but it is not listed in any role, nor in the category of that node.
[foocluster]% category use compute-cat
[foocluster->category[compute-cat]]% services
[foocluster->category[compute-cat]->services]% list
Service (key) Monitored Autostart
------------------------ ---------- ----------
It turns out that nslcd is part of a hidden role which is not visible to the user. So, you have to write a loop to disable nslcd on each node. Within cmsh:
[foocluster]% device
[foocluster]% foreach -v -n node001..node099 (services; use nslcd; set monitored no ; set autostart no)
[foocluster]% commit
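If you would rather drive this from a script than type it into an interactive cmsh session, the same loop can be passed to cmsh on the command line. The following is only a sketch: it assumes that cmsh accepts a semicolon-separated command string via "-c", and the function name build_disable_service_cmd is my own invention, not part of any Bright API. It only builds the command; running it is left commented out.

```python
import shlex


def build_disable_service_cmd(node_range, service):
    """Return an argv list that runs the cmsh foreach loop non-interactively.

    Assumption: `cmsh -c "<commands>"` executes a semicolon-separated
    command string, mirroring what was typed interactively above.
    """
    cmsh_script = (
        "device; "
        f"foreach -v -n {node_range} "
        f"(services; use {service}; set monitored no; set autostart no); "
        "commit"
    )
    return ["cmsh", "-c", cmsh_script]


if __name__ == "__main__":
    argv = build_disable_service_cmd("node001..node099", "nslcd")
    # Print the command so it can be reviewed before anything is changed.
    print(shlex.join(argv))
    # To actually run it on the head node:
    #   import subprocess
    #   subprocess.run(argv, check=True)
```

Printing first and running second is deliberate: a wrong node range here touches every node it matches.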
To modify the node image, I make the changes on one running node, and then do “grabimage -w” in cmsh on the head node.
You will need to install these packages:
- openldap-clients
- sssd
- sssd-ldap
- openssl-perl
Next, sssd setup. This may depend on your installation. The installation here uses the LDAP server set up by Bright CM, which uses SSL for encryption with both server and client certificates. (All self-signed with a dummy CA in the usual way.) The following /etc/sssd/sssd.conf shows only the non-empty sections. Your configuration may need to be different depending on your environment.
[domain/default]
id_provider = ldap
autofs_provider = ldap
auth_provider = ldap
chpass_provider = ldap
ldap_uri = ldaps://fooserver.cm.cluster
ldap_search_base = dc=cm,dc=cluster
ldap_id_use_start_tls = False
ldap_tls_reqcert = demand
ldap_tls_cacertdir = /cm/local/apps/openldap/etc/certs
cache_credentials = True
enumerate = False
entry_cache_timeout = 600
ldap_network_timeout = 3
ldap_connection_expire_timeout = 60
[sssd]
config_file_version = 2
services = nss, pam
domains = default
[nss]
homedir_substring = /home
Then, set the ownership and permissions:
# chown root:root /etc/sssd/sssd.conf
# chmod 600 /etc/sssd/sssd.conf
I did not have to change /etc/openldap/ldap.conf.
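Because sssd is strict about the ownership and mode of sssd.conf, and because a domain listed in [sssd] without a matching [domain/...] section is an easy mistake to make, a quick check before grabbing the image can save a reboot cycle. Here is a minimal sketch using only the Python standard library. The function name check_sssd_conf and the expected_uid parameter are hypothetical, and the checks reflect only the settings shown in this post, not everything sssd itself validates.

```python
import configparser
import os
import stat


def check_sssd_conf(path="/etc/sssd/sssd.conf", expected_uid=0):
    """Return a list of human-readable problems found in an sssd.conf file."""
    problems = []

    # Ownership and mode, matching the chown/chmod commands above.
    st = os.stat(path)
    if st.st_uid != expected_uid:
        problems.append(f"owner uid is {st.st_uid}, expected {expected_uid}")
    mode = stat.S_IMODE(st.st_mode)
    if mode != 0o600:
        problems.append(f"mode is {mode:04o}, expected 0600")

    # Every domain named in [sssd] should have its own [domain/NAME] section.
    cfg = configparser.ConfigParser(interpolation=None)
    with open(path) as fh:
        cfg.read_file(fh)
    if not cfg.has_section("sssd"):
        problems.append("missing [sssd] section")
        return problems
    raw = cfg.get("sssd", "domains", fallback="")
    domains = [d.strip() for d in raw.split(",") if d.strip()]
    if not domains:
        problems.append("no domains listed in [sssd]")
    for name in domains:
        if not cfg.has_section(f"domain/{name}"):
            problems.append(f"[domain/{name}] section is missing")
    return problems


if __name__ == "__main__":
    for problem in check_sssd_conf():
        print("sssd.conf:", problem)
```

Run it as root on the node whose image you are about to grab; no output means none of these particular checks failed.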
The next step is to switch to using sssd for authentication. But first, stop and disable the nslcd service:
# systemctl stop nslcd
# systemctl disable nslcd
The old authconfig-tui utility is gone. Its replacement is authselect; you will have to force it to overwrite the existing authentication configuration.
# authselect select sssd --force
There are other options to authselect, e.g. “with-mkhomedir”. See authselect(8) and authselect-profiles(5) for details. Other options may also require other packages to be installed.
Then, start and enable the sssd service. Check that user ID info can be retrieved:
# id someuser
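The same check can be scripted, which is handy if you want to verify several accounts at once. This is a small sketch, not a replacement for "id": Python's pwd and grp modules go through the system's NSS configuration, so once authselect has pointed NSS at sssd, these lookups should be answered by sssd. The function name lookup_user is made up for this example.

```python
import grp
import pwd


def lookup_user(name):
    """Return (uid, primary group name, home directory), or None if unknown."""
    try:
        entry = pwd.getpwnam(name)
    except KeyError:
        # NSS (and therefore sssd) does not know this user.
        return None
    try:
        group = grp.getgrgid(entry.pw_gid).gr_name
    except KeyError:
        # The primary GID has no group entry; fall back to the number.
        group = str(entry.pw_gid)
    return entry.pw_uid, group, entry.pw_dir


if __name__ == "__main__":
    for user in ["someuser"]:
        print(user, lookup_user(user))
```

A result of None for an account that exists in LDAP means the lookup is not reaching the directory, so check the sssd service and its logs first.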
Back on the head node, do “grabimage -w”.
Then, modify the node category to add the sssd service, setting it to autostart and to be monitored.
2020-08-04
Scripting Bright Cluster Manager 9.0 with Python
2019-10-08
Still more on SSSD in Bright Cluster Manager - cache file
Since Bright CM 6 did not handle SSSD out of the box, it also did not handle the SSSD cache file. More accurately, it did not ignore the file in the software image, so the grabimage command would copy the cache file to the provisioning server along with the rest of the image, and it would then be propagated to all nodes in the category.
The fix is simple: add /var/lib/sss/db/* to the various exclude list settings in the category.
To reset the cache:
service sssd stop
/bin/rm -f /var/lib/sss/db/cache_default.ldb
service sssd start
I did try "sss_cache -E" which is supposed to clear the cache, but found that it did not work as I expected: the spurious group still appeared with "getent group".
2014-07-01
Using the NVIDIA Python plugin for Ganglia monitoring under Bright Cluster Manager
export LD_LIBRARY_PATH=/cm/local/apps/cuda/libs/current/lib64
The modifications to Ganglia Web, however, are out of date. I will make another post once I figure out how to modify Ganglia Web to display the NVIDIA metrics.
UPDATE: Well, turns out there seems to be no need to modify the Ganglia Web installation. Under the host view, there is a tab for "gpu metrics" which shows 22 available metrics.
2014-05-14
SSSD setup in a Bright Cluster
TLS_REQCERT never
At my current job, we use Bright Cluster Manager and Univa Grid Engine on RHEL 6.5. We were seeing issues where submitted jobs ended up in an "Error" state, especially if many jobs were submitted in a short period, either as an array job or from a shell script loop calling qsub repeatedly. The error reason was:
can't get password entry for user "juser". Either the user does not exist or NIS error!
However, logging into the assigned compute node and running "id", or even some C code that did user lookups, succeeded.
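Since the failures showed up mainly when many jobs started at once, a single lookup is not a very convincing test. One way to get closer to that load is to fire many lookups concurrently and count how many fail. The sketch below does that with a thread pool. It is only an approximation of what Grid Engine does at job start, the function name count_lookup_failures is hypothetical, and a clean run does not prove the name service is healthy under real job load.

```python
import pwd
from concurrent.futures import ThreadPoolExecutor


def count_lookup_failures(user, attempts=1000, workers=16):
    """Look up `user` `attempts` times across `workers` threads.

    Returns the number of lookups that raised KeyError, i.e. the number
    of times the name service claimed the user does not exist.
    """
    def one_lookup(_):
        try:
            pwd.getpwnam(user)
            return 0
        except KeyError:
            return 1

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(one_lookup, range(attempts)))


if __name__ == "__main__":
    failures = count_lookup_failures("juser")
    print(f"{failures} failed lookups")
```

For a user that exists, any nonzero count is worth investigating, since it means the lookup answer changed from one call to the next.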
By default, our installation used nslcd for LDAP lookups. Univa suggested switching to SSSD (System Security Services Daemon) as Red Hat had phased out nslcd. The Fedora site has a good overview.
The switch to using SSSD turned out to be fairly easy, with some hidden hiccups. Running authconfig-tui, keeping the existing settings, and then hitting "OK" immediately turned off nslcd and started up sssd instead. All the attendant changes were made, too: chkconfig settings, /etc/nsswitch.conf. However, I found that users could not change passwords on the login nodes. They could log in, but the passwd command failed with "system offline".
Turns out, SSSD requires an encrypted connection to the LDAP server for password changes. This is a security requirement so that the new password is not sent in the clear from the client node to the LDAP server. (See this forum post by sgallagh.) This means an SSL certificate needs to be created. Self-signed will work if the following line is added to /etc/sssd/sssd.conf:
[domain/default]
...
ldap_tls_reqcert = never
To create the self-signed cert:
root # cd /etc/pki/tls/certs
certs # make slapd.pem
certs # chown ldap:ldap slapd.pem
Then, edit /cm/local/apps/openldap/etc/slapd.conf to add the following lines:
TLSCACertificateFile /etc/pki/tls/certs/ca-bundle.crt
TLSCertificateFile /etc/pki/tls/certs/slapd.pem
TLSCertificateKeyFile /etc/pki/tls/certs/slapd.pem
UPDATE: Adding a link to official Red Hat documentation: https://access.redhat.com/solutions/42746
2014-01-15
Scripting Bright Cluster Manager
Node categories group nodes which have similar configurations and roles. An example configuration might be a list of remote filesystem mounts, and an example role might be a Grid Engine compute node with 64 job slots. The cluster at URCF has 64-core AMD nodes and 16-core Intel nodes, so I created a category for each of these. Then, I needed to change the node categories from the default to the architecture-specific categories. The script below did it for the Intel nodes.