[Date Prev][Date Next] [Chronological] [Thread] [Top]

Re: DN index delete failed

To: openldap-software@openldap.org
Subject: Re: DN index delete failed
From: Norman Gaywood <ngaywood@une.edu.au>
Date: Mon, 28 Jul 2008 11:49:45 +1000
Content-disposition: inline
In-reply-to: <20080725053203.GA11657@turing.une.edu.au>
Organization: University of New England, Armidale NSW, Australia.
References: <20080725053203.GA11657@turing.une.edu.au>
User-agent: Mutt/1.5.18 (2008-05-17)

Thanks for the responses. To sum up, after a few more "DN index delete
failed" messages on my LDAP master server, I slapcat'ed, deleted and
slapadd'ed my databases and everything seems fine again.

The responses I received were all generally trying to figure out what
I might have done to cause the problem. Be it errors in rpm upgrade,
running database maintenance programs as the wrong user, or some other
event.

In fact, there was an event a few days before the errors that may have
caused the problem. On one of my slaves, a full disk caused replication
to fail. The slave began like this:

Jul 22 08:10:01 dyaus slapd2.3[2132]: bdb(dc=une,dc=edu,dc=au): write: 0x9a829f8d, 6878: No space left on device 
Jul 22 08:10:01 dyaus slapd2.3[2132]: bdb(dc=une,dc=edu,dc=au): write: 0x9a829f8d, 6878: No space left on device 
Jul 22 08:10:01 dyaus slapd2.3[2132]: null_callback: error code 0x50 
Jul 22 08:10:01 dyaus slapd2.3[2132]: syncrepl_entry: rid 001 be_modify failed (80) 
Jul 22 08:10:01 dyaus slapd2.3[2132]: do_syncrepl: rid 001 retrying (1 retries left) 
Jul 22 08:10:02 dyaus slapd2.3[2132]: bdb(dc=une,dc=edu,dc=au): write: 0x9a829f8d, 7034: No space left on device 
Jul 22 08:10:02 dyaus slapd2.3[2132]: bdb(dc=une,dc=edu,dc=au): write: 0x9a829f8d, 7034: No space left on device 
Jul 22 08:10:02 dyaus slapd2.3[2132]: null_callback: error code 0x50 
Jul 22 08:10:02 dyaus slapd2.3[2132]: syncrepl_updateCookie: rid 001 be_modify failed (80) 

Jul 22 08:30:35 dyaus slapd2.3[2132]: bdb(dc=une,dc=edu,dc=au): write: 0x9a829f8d, 30286: No space left on device 
Jul 22 08:30:35 dyaus slapd2.3[2132]: bdb(dc=une,dc=edu,dc=au): write: 0x9a829f8d, 30286: No space left on device 
Jul 22 08:30:35 dyaus slapd2.3[2132]: null_callback: error code 0x50 
Jul 22 08:30:35 dyaus slapd2.3[2132]: do_syncrepl: rid 001 retrying (1 retries left)

On the master at the corresponding time, I have:

Jul 22 08:10:01 janus slapd[2049]: conn=31060 fd=17 ACCEPT from IP=129.180.1.107:33547 (IP=0.0.0.0:389) 
Jul 22 08:10:01 janus slapd[2049]: conn=31060 op=0 BIND dn="cn=syncuser,dc=une,dc=edu,dc=au" method=128 
Jul 22 08:10:01 janus slapd[2049]: conn=31060 op=0 BIND dn="cn=SyncUser,dc=une,dc=edu,dc=au" mech=SIMPLE ssf=0 
Jul 22 08:10:01 janus slapd[2049]: conn=31060 op=0 RESULT tag=97 err=0 text= 
Jul 22 08:10:01 janus slapd[2049]: conn=31060 op=1 SRCH base="dc=une,dc=edu,dc=au" scope=2 deref=0 filter="(objectClass=*)" 
Jul 22 08:10:01 janus slapd[2049]: conn=31060 op=1 SRCH attr=* structuralObjectClass entryCSN 

A delay on this connection:

Jul 22 08:30:35 janus slapd[2049]: conn=31060 op=2 UNBIND 
Jul 22 08:30:35 janus slapd[2049]: conn=31060 fd=17 closed 

Then on the master, the logs started to look like this. Note that there
were no disk problems on the master.

Jul 22 09:39:40 janus slapd[2049]: send_search_entry: conn 33464  ber write failed. 
Jul 22 09:39:40 janus slapd[2049]: conn=33464 fd=17 closed (connection lost on write) 
Jul 22 09:39:40 janus slapd[2049]: connection_read(17): no connection! 

Followed, on each replication attempt, by:

Jul 22 09:40:14 janus slapd[2049]: send_search_entry: conn 33498  ber write failed. 
Jul 22 09:40:14 janus slapd[2049]: conn=33498 fd=17 closed (connection lost on write) 

Jul 22 09:40:16 janus slapd[2049]: send_search_entry: conn 33501  ber write failed. 
Jul 22 09:40:16 janus slapd[2049]: conn=33501 fd=17 closed (connection lost on write) 

Jul 22 09:44:34 janus slapd[2049]: send_search_entry: conn 33759  ber write failed. 
Jul 22 09:44:34 janus slapd[2049]: conn=33759 fd=17 closed (connection lost on write) 

etc. Many repeats of this, until the master server crashed and I restarted it:

Jul 22 10:06:50 janus slapd[2049]: send_ldap_response: ber write failed 
Jul 22 10:38:15 janus slapd[29187]: hdb_db_open: unclean shutdown detected; attempting recovery. 
Jul 22 10:38:17 janus slapd[29187]: slapd starting 

While I was searching for what might have caused the crash. I got a few
more:

Jul 22 10:41:36 janus slapd[29187]: send_search_entry: conn 71  ber write failed. 
Jul 22 10:41:36 janus slapd[29187]: conn=71 fd=17 closed (connection lost on write) 

On the slave many no space errors until:

Jul 22 10:48:10 dyaus slapd2.3[2132]: bdb(dc=une,dc=edu,dc=au): write: 0x9a829f8d, 2072163: No space left on device 
Jul 22 10:48:10 dyaus slapd2.3[2132]: => bdb_idl_delete_key: c_del id failed: No space left on device (28) 
Jul 22 10:48:10 dyaus slapd2.3[2132]: Attribute index delete failure
Jul 22 10:48:10 dyaus slapd2.3[2132]: null_callback: error code 0x50 
Jul 22 10:48:10 dyaus slapd2.3[2132]: syncrepl_entry: rid 001 be_modify failed (80) 
Jul 22 10:48:10 dyaus slapd2.3[2132]: do_syncrepl: rid 001 retrying (1 retries left) 
Jul 22 10:48:11 dyaus slapd2.3[2132]: do_syncrep1: rid 001 ldap_sasl_bind_s failed (-1) 
Jul 22 10:48:11 dyaus slapd2.3[2132]: do_syncrepl: rid 001 retrying 
Jul 22 10:48:12 dyaus slapd2.3[2132]: do_syncrep1: rid 001 ldap_sasl_bind_s failed (-1) 

Yes the master had crashed again, hence the bind fail. At this point I had
discovered the disk problem and had made space available on the slave.

The master was restarted. Here is the crash and restart:

Jul 22 10:48:09 janus slapd[29187]: send_ldap_response: ber write failed 
Jul 22 10:48:09 janus slapd[29187]: ch_calloc of 1 elems of 46912585677634 bytes failed 
Jul 22 10:50:40 janus slapd[29859]: hdb_db_open: unclean shutdown detected; attempting recovery. 
Jul 22 10:50:40 janus slapd[29859]: slapd starting 

Things seemed to be back to normal. My concern then was that a bad slave
had caused the master to crash. I then noticed that Buchan Milne had
just released a new version of his RPMS, so I upgraded the master. From
the yum logs you can see the version transition:

Jan 11 09:45:02 Updated: openldap.x86_64 2.3.39-3.rhel5
Jul 22 11:04:15 Updated: openldap - 2.3.43-1.rhel5.x86_64

A few days later I started getting the error I reported originally on
the master:

Jul 25 02:41:51 janus slapd[31075]: conn=7142 op=1 DEL dn="uid=pcass,ou=People,dc=une,dc=edu,dc=au" 
Jul 25 02:41:51 janus slapd[31075]: conn=7142 op=1 RESULT tag=107 err=80 text=DN index delete failed 

Whew, that took some time to put that error report together. I hope I
didn't miss anything vital.

In summary I think I could say that with openldap.x86_64 2.3.39 I could
crash the master syncrepl server by causing disk write errors on a
slave. I don't know if this problem still exists in 2.3.43.

-- 
Norman Gaywood, Computer Systems Officer
University of New England, Armidale, NSW 2351, Australia

ngaywood@une.edu.au            Phone: +61 (0)2 6773 3337
http://mcs.une.edu.au/~norm    Fax:   +61 (0)2 6773 3312

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html

References:
- DN index delete failed
  - From: Norman Gaywood <ngaywood@une.edu.au>

Prev by Date: Re: make output of ldapsearch readable (disable base64)
Next by Date: Re: Re: slapd dies immediately with signal 9
Index(es):
- Chronological
- Thread