At my current job, we use Bright Cluster Manager and Univa Grid Engine on RHEL 6.5. We were seeing issues where submitted jobs ended up in an "Error" state, especially if many jobs were submitted in a short period, either an array or a shell script loop running qsub iteratively. The error reason was:
can't get password entry for user "juser". Either the user does not exist or NIS error!
However, logging into the assigned compute node and running "id" or even some C code to do user lookups passed.
By default, our installation used nslcd for LDAP lookups. Univa suggested switching to SSSD (System Security Services Daemon) as Red Hat had phased out nslcd. The Fedora site has a good overview.
The switch to using SSSD turned out to be fairly easy, with some hidden hiccups. Running authconfig-tui and keeping the existing settings, and then hitting "OK" immediately turned off nslcd and started up sssd, instead. All the attendant changes were made, too: chkconfig settings, /etc/nsswitch.conf. However, I found that users could not change passwords on the login nodes. They could login, but the passwd command failed with "system offline".
Turns out, SSSD requires an encrypted connection to the LDAP server for password changes. This is a security requirement so that the new password is not sent in the clear from the client node to the LDAP server. (See this forum post by sgallagh.) This means an SSL certificate needs to be created. Self-signed will work if the following line is added to /etc/sssd/sssd.conf:
ldap_tls_reqcert = never
To create the self-signed cert:
root # cd /etc/pki/tls/certs
certs # make slapd.pem
certs # chown ldap:ldap slapd.pem
Then, edit /cm/local/apps/openldap/etc/slapd.conf to add the following lines:
Also, make sure there is a section - my config did not have access to shadowLastChange:
access to attrs=loginShell,shadowLastChange
by group.exact="cn=rogroup,dc=cm,dc=cluster" read
by self write
by * read
Then, restart the ldap service.