We run four OpenLDAP 2.4.16 servers as two provider/consumer pairs: one
pair for our staff systems and one for our teaching facilities.
They are all Solaris 10u7 Xen virtual hosts.
The staff pair runs fine, and the consumer on the teaching pair runs fine.
The provider on the teaching pair also runs fine until it gets hit by a
heavy load, e.g. the start of a lab when ~100 PCs try to authenticate
their users. At that point it refuses to serve LDAP requests, although
traffic is still coming into the box and existing connections seem OK.
The break point is about 35 PCs; below that there isn't a problem.
Restarting slapd cures the problem and off we go until the start of the
next big lab.
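For the next big lab I'll try watching the connection and descriptor
counts, something like this (a rough sketch, assuming slapd is on the
default port 389 and there is a single slapd process):

    # count established LDAP connections as the PCs pile in
    netstat -an | grep '\.389 ' | grep -c ESTABLISHED
    # count the file descriptors slapd currently holds open
    pfiles `pgrep slapd` | grep -c S_IFSOCK

If the descriptor count stalls somewhere around 250 just as the hang
starts, that would point at a per-process file limit rather than slapd
itself.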
I've run at various log levels but haven't been able to see any obvious
messages. All I see, even when everything is fine, are messages of the
form:
send_search_entry: conn 11639 ber write failed.
connection_read(38): no connection!
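Since loglevel 256 only gives stats, the next thing I'll try is adding
connection-level logging on top of it, so accepts and closes show up
around the hang, i.e. in slapd.conf:

    # stats (256) plus conns (8); symbolic names are accepted too
    loglevel stats conns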
The slapd.conf (minus the syncprov bit) is:
include /usr/local/etc/openldap/schema/core.schema
include /usr/local/etc/openldap/schema/cosine.schema
include /usr/local/etc/openldap/schema/inetorgperson.schema
include /usr/local/etc/openldap/schema/nis.schema
include /usr/local/etc/openldap/schema/duaconf.schema
include /usr/local/etc/openldap/schema/local.schema
pidfile /var/openldap/run/slapd.pid
argsfile /var/openldap/run/slapd.args
conn_max_pending 200
idletimeout 60
sizelimit 2000
loglevel 256
database bdb
suffix "dc=my,dc=domain"
rootdn "cn=me,dc=my,dc=domain"
rootpw {SSHA}guess
directory /var/openldap/openldap-data
index cn,entryCSN,entryUUID,gidNumber,ipHostNumber,memberUid eq
index objectclass,uid,uidNumber,uniqueMember eq
cachefree 16
cachesize 1500
checkpoint 0 60
dncachesize 1500
idlcachesize 3000
access to attrs=userPassword
        by self write
        by anonymous auth
        by dn.base="cn=fred,ou=Profile,dc=my,dc=domain" read
        by * none
access to *
        by self write
        by users read
        by * read
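I'm also considering enabling the monitor backend so I can watch the live
connection count without restarting anything (untested here; it needs a
database monitor stanza and an ACL that allows reading it):

    # in slapd.conf, after the bdb database section:
    database monitor

    # then, while a lab is starting:
    ldapsearch -x -H ldap://localhost \
        -b "cn=Current,cn=Connections,cn=Monitor" -s base monitorCounter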
The only entry in DB_CONFIG is set_cachesize 0 26214400 0, and cache hits
are at 99%.
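In case it turns out to be BDB rather than descriptors: a hang under
concurrent load can also be lock-table exhaustion, so a fuller DB_CONFIG
might look like this (the numbers are guesses and would need to be sized
against db_stat -c output):

    set_cachesize 0 26214400 0
    # raise the lock tables; the BDB defaults (1000 each) can run out
    # under ~100 concurrent binds, and exhaustion looks like a hang
    set_lk_max_locks 3000
    set_lk_max_objects 3000
    set_lk_max_lockers 3000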
I'm stumped for a cause or solution. Can anyone either give me a pointer
as to what to look for in the logs, or suggest a possible cause? Could it
be hitting the 256 open file descriptor limit?
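If it is the descriptor limit, this is how I'd check and test it on
Solaris 10 (assuming a single slapd process; 4096 is just an arbitrary
larger value):

    # show the limits the running slapd actually has
    plimit `pgrep slapd`
    # raise the descriptor limit on the live process
    prctl -n process.max-file-descriptor -r -v 4096 -i process `pgrep slapd`
    # or permanently, in the init script before slapd starts:
    ulimit -n 4096

With each PC holding a handful of connections, plus the BDB index files
slapd keeps open, running out of 256 descriptors at around the 35-PC mark
seems plausible.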