I ran into a similar issue on a cluster I have
access to. The problem came down to the fact that
Linux nscd does not cache any group information for
LDAP queries (the same may also be true for NIS). Because
of the jobs running, 240 nodes would each request a bunch
of files, with every file needing gid info, so the LDAP
server would get bogged down in repeated requests for
gid numbers, many of which should have been cached.
Perhaps you are running into the same issue. If you
find a resolution, would you mind passing it along?
If this is not the issue for you, sorry to have
bothered you.
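
A rough way to check for this on a node, sketched in Python under a
few assumptions: the helper name time_group_lookups is made up for
illustration, and the gid tested is simply that of the current
process. It times several lookups of the same gid in a row. If the
later lookups take about as long as the first, the answer is probably
not being served from a cache, though this is only a hint and not proof.

```python
import grp
import os
import time


def time_group_lookups(gid, repeats=5):
    """Return the seconds taken by each of `repeats` lookups of `gid`."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        try:
            # Resolved through the system's NSS setup (files, LDAP, NIS, ...)
            grp.getgrgid(gid)
        except KeyError:
            # Unknown gid; the lookup was still attempted, so keep the timing
            pass
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Assumption: the current process's gid is one that LDAP serves
    for i, seconds in enumerate(time_group_lookups(os.getgid()), start=1):
        print(f"lookup {i}: {seconds * 1000:.3f} ms")
```

Running it with a gid that only exists in LDAP, rather than in the
local group file, should give the clearest picture.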