Hi everyone,
I have noticed a problem regarding the new writetimeout feature in
OpenLDAP 2.4.17. I have a setup with one master and four replicas running
on Solaris 10, using the hdb-backend with BDB 4.4 from the opencsw
repository. Clients are Sun, Linux and Mac boxes in every possible
variation.
I updated my servers last Friday from version 2.4.15 and now some
cronjobs accessing the directory were sporadically failing over the
weekend.
I monitored the server logs (log level 'stats') while running the
involved scripts repeatedly and got something like the following results
after fail-runs on all servers:
(IPs and DNs altered)
Jul 20 11:42:43 ldapserver slapd[9053]: [ID 848112 local4.debug]
conn=12479 fd=79 ACCEPT from IP=192.168.1.1:50210 (IP=0.0.0.0:389) Jul
20 11:42:43 ldapserver slapd[9053]: [ID 270379 local4.debug] conn=12479
op=0 EXT oid=1.3.6.1.4.1.1466.20037 Jul 20 11:42:43 ldapserver
slapd[9053]: [ID 560212 local4.debug] conn=12479 op=0 STARTTLS Jul 20
11:42:43 ldapserver slapd[9053]: [ID 875301 local4.debug] conn=12479
op=0 RESULT oid= err=0 text= Jul 20 11:42:43 ldapserver slapd[9053]: [ID
105384 local4.debug] conn=12479 fd=79 TLS established tls_ssf=256 ssf=256
Jul 20 11:42:43 ldapserver slapd[9053]: [ID 215403 local4.debug]
conn=12479 op=1 BIND dn="uid=dummyuser,ou=System,dc=example,dc=com"
method=128 Jul 20 11:42:43 ldapserver slapd[9053]: [ID 600343
local4.debug] conn=12479 op=1 BIND
dn="uid=dummyuser,ou=System,dc=example,dc=com" mech=SIMPLE ssf=0 Jul 20
11:42:43 ldapserver slapd[9053]: [ID 588225 local4.debug] conn=12479
op=1 RESULT tag=97 err=0 text= Jul 20 11:42:43 ldapserver slapd[9053]:
[ID 469902 local4.debug] conn=12479 op=2 SRCH
base="ou=people,dc=example,dc=com" scope=2 deref=2
filter="(objectClass=posixAccount)" Jul 20 11:42:43 ldapserver
slapd[9053]: [ID 744844 local4.debug] conn=12479 op=2 SRCH attr=uid
userpassword uidnumber gidnumber gecos homedirectory loginshell Jul 20
11:43:00 ldapserver slapd[9053]: [ID 485650 local4.debug] conn=12479
fd=79 closed (writetimeout)
I don't have the writetimeout keyword configured on any of the boxes and
the affected script doesn't do any writes anyway. Also, the problem only
seems to arise if the client takes a while to process a search result.
The failing scripts are Net::LDAP based Perl scripts running on some old
SPARC boxes, so it took them up to half a minute and more to complete.
Setting writetimeout to a high enough value seems solve the problem, but
referring to the docs, this shouldn't happen with the keyword unset or
set to 0.
Is this a bug or did I miss something? Did anyone else encounter this so
far?