[Date Prev][Date Next] [Chronological] [Thread] [Top]

Deleted DNs, and the great quest.

To: openldap-technical@openldap.org
Subject: Deleted DNs, and the great quest.
From: Jorgen Lundman <lundman@lundman.net>
Date: Wed, 27 Apr 2011 11:38:33 +0900
User-agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-GB; rv:1.9.1.18) Gecko/20110320 SeaMonkey/2.0.13


SunOS ldapslave01.unix 5.10 Generic_141445-09 i86pc i386 i86pc
db-4.8.30.NC.tar.gz
openldap-2.4.23.tgz

Hello list,

We have had a rare bug hit us now and then, that appears to be quite persistent.I know it isn't fun to read through someone else's setup and troubles, but I amhoping to at least get some hints at what to look at next.


Let us imagine our setup as:

uid=user1,o=a.gmail.com,ou=mail,dc=company,dc=com
uid=user2,o=a.gmail.com,ou=mail,dc=company,dc=com
uid=user1,o=b.gmail.com,ou=mail,dc=company,dc=com
uid=user1,o=c.gmail.com,ou=mail,dc=company,dc=com

(Not literally gmail.com, replacing customer domains manually, but same "style"of service).


ldapmaster -> slave1  ->  localhost slapd (pop/imap)
           -> slave2  ->  localhost slapd (postfix)
           -> slave3  ->  localhost slapd (ftp)
           -> slave4  ->  localhost slapd (smtp/forward)

All using syncrepl. One master, only ever updated by the provisioning engine.

4 immediate slaves (full ldap tree), then about 150 work servers, runninglocalhost slapd with only applicable tree. Ie, only "ou=mail" for themail-related servers.



What happens now and then, is that:

o=b.gmail.com,ou=mail,dc=company,dc=com

gets deleted. Gone.

But you can access record "uid=user1,o=b.gmail.com,ou=mail,dc=company,dc=com"just fine. But not "o=b.gmail.com,ou=mail,dc=company,dc=com". How is thatpossible? Mismatching indexes?

Amusingly, POP and IMAP work fine since they authenticate against the final userrecord "uid=user1,o=b.gmail.com,ou=mail,dc=company,dc=com". but Postfix willfail, as it verifies that the domain exists"o=b.gmail.com,ou=mail,dc=company,dc=com". syncrepl for accounts under tree"o=b.gmail.com,ou=mail,dc=company,dc=com" appear to stop until the record is fixed.

There is an easy fix, if we modify the record"o=b.gmail.com,ou=mail,dc=company,dc=com" on the master, it gets pushed out.

It is not always the same record, but it does seem to be a class of certainrecords. And the missing records seem to happen most on certain servers. 80% ofthe time, on the vmx servers. It far worse when they disappear on the slaves,since many localhost slapds will then not get synced.

Now, this happens more frequently when we do deletions. Usually around the firstof the month, since that is when we delete customers who left. Alas, we movedthe entire LDAP DNS tree to its own LDAP master, so have been deleting this treefor 20 days, and will keep deleting for 12 more days. So, at the moment, we havethis error once, or twice, a day.

At the moment, I do not really know what goes wrong. I am hoping that perhaps Ican get some good ideas on where to look, and what to look for.

This morning, we lost "i.gmail.com" on vmx02. If I grep for it in the logs, weonly find:


Apr 27 09:17:51 vmx02.unix slapd[13990]: [ID 338579 local4.debug] nonpresent_cal

lback: rid=275 nonpresent UUID 82d9570a-cdd6-102f-9364-65cc77dfbe36, dno=i.gmail.com,ou=mail,dc=company,dc=com

Apr 27 09:18:11 vmx02.unix slapd[13990]: [ID 905397 local4.debug]syncrepl_del_nonpresent: rid=275 be_deleteo=i.gmail.com,ou=mail,dc=company,dc=com (66)




# grep nonpresent_callback /var/log/slaplog| wc  -l
  517707

# grep syncrepl_del_nonpresent /var/log/slaplog|wc -l
       9


What is also curious, is the UUID, for example:

slaplog.20110424.gz:Apr 24 15:31:26 vmx02.unix slapd[13990]: [ID 338579local4.debug] nonpresent_callback: rid=275 nonpresent UUID82d9570a-cdd6-102f-9364-65cc77dfbe36, dn o=i.gmail.com,ou=mail,dc=company,dc=com

slaplog.20110424.gz:Apr 24 23:41:00 vmx02.unix slapd[13990]: [ID 338579local4.debug] nonpresent_callback: rid=275 nonpresent UUID82d9570a-cdd6-102f-9364-65cc77dfbe36, dn o=i.gmail.com,ou=mail,dc=company,dc=com

Persistent uuid? Is it ok that UUID is repeated? Are unique over time, orsometimes recycled?



If I move from vmx02, to its LDAP provider, in this case, ldapslave03 (rid = 329)


Apr 27 09:12:27 ldapslave03.unix slapd[20759]: [ID 338579 local4.debug] nonprese

nt_callback: rid=329 present UUID d6215e0e-1c55-102f-8da7-0b6e3175d2bb, dno=i.gmail.com,ou=mail,dc=company,dc=com

/export/backup/ldapslave03.unix# gzgrep d6215e0e-1c55-102f-8da7-0b6e35d2bbslaplog.20110*

Jan  6 07:15:39
Feb  8 18:47:48
Apr 24 15:22:21
Apr 24 23:36:22

Is it worth pursuing the repeated UUID? Is that some stuck record that keepsretrying over and over? Is it that is constantly trying to delete these records,and correctly failing, except now and then, it succeeds like that of vmx02 thismorning?




Then there is the slightly different message, with error "66".

Apr 27 09:20:06 ldapslave03.unix slapd[20759]: [ID 976560 local4.debug]syncprov_sendresp: cookie=


Apr 27 09:20:06 ldapslave03.unix last message repeated 1 time

Apr 27 09:20:06 ldapslave03.unix slapd[20759]: [ID 905397 local4.debug] syncrepl
_del_nonpresent: rid=329 be_delete o=i.gmail.com,ou=ftp,dc=company,dc=com (66)

Apr 27 09:20:06 ldapslave03.unix slapd[20759]: [ID 976560 local4.debug]syncprov_sendresp: cookie=


Apr 27 09:20:06 ldapslave03.unix last message repeated 4 times

I believe 66 is LDAP_NOT_ALLOWED_ON_NONLEAF, which is correct. It has thousandsof accounts in there, so it definitely should not delete it. But why does it tryto delete it in the first place? It is not us trying to delete half the tree(well, you know, pretty sure...ish.. it isn't us) so where is it coming from?







Or should I look at that empty "cookie="? Why that empty, it starts back at:

Apr 27 09:17:29 ldapslave03.unix slapd[20759]: [ID 859137 local4.debug]slap_queue_csn: queing 21798d80 20110427000316.304365Z#000000#000#000000

Apr 27 09:17:29 ldapslave03.unix slapd[20759]: [ID 976560 local4.debug]syncprov_sendresp: cookie=rid=268,csn=20110427000316.304365Z#000000#000#000000

Apr 27 09:17:29 ldapslave03.unix slapd[20759]: [ID 976560 local4.debug]syncprov_sendresp: cookie=rid=278,csn=20110427000316.304365Z#000000#000#000000

Apr 27 09:17:29 ldapslave03.unix slapd[20759]: [ID 775493 local4.debug]slap_graduate_commit_csn: removing a2bee60 0110427000316.304365Z#000000#000#000000

Apr 27 09:17:29 ldapslave03.unix slapd[20759]: [ID 905397 local4.debug]syncrepl_del_nonpresent: rid=329 be_deleteuid=$account,ou=users,ou=radius,dc=company,dc=com (0)

Apr 27 09:17:29 ldapslave03.unix slapd[20759]: [ID 502419 local4.debug] Entryuid=userX,o=$customerdomain.com,ou=mail,dc=company,dc=com CSN20110426223343.974235Z#000000#000#000000 older or equal to ctx20110426223343.974235Z#000000

#000#000000

Apr 27 09:17:29 ldapslave03.unix last message repeated 1 time

Apr 27 09:17:29 ldapslave03.unix slapd[20759]: [ID 976560 local4.debug]syncprov_sendresp: cookie=


Apr 27 09:17:29 ldapslave03.unix last message repeated 1 time

then a few hundred lines of empty "cookie=" lines.

It is also worth to note that very close to the time that it lost the recordtoday, we have:

Apr 27 09:11:25 vmx02.unix slapd[13990]: [ID 763815 local4.debug]connection_input: conn=6946 deferring operation: too many executing

and it is almost as if, it kills the replication connection randomly when itgets busy. Whatever record it was working on at the time is half processed?

Alas, there are no entries at all for any of these on the master. Possiblybecause master's loglevel is "sync", which don't seem to do much when it is theprovider. Recommended log level would be?

Now, there is version openldap-2.4.24 out, with a very large list of fixes, butI did not see anything immediately obvious. If there is a fix that "looks good"to fix this, I can go through the upgrade. But it is a 3 week process to make ithappen on all servers.

Worse is, I have tried to replicate the trouble in a test network, running theexact same data, and binaries, with a master, slave and localhost slapd all onone machine. But I can not make the trouble happen in a simulated environment.Possibly as I am not generating the same load. Quite annoying.

slapd.conf is pushed out with git, and is the same everywhere, except thatslaves have the syncrepl section, with separate rids, and memory cache size areset differently.



include         /usr/local/etc/openldap/schema/core.schema
include         /usr/local/etc/openldap/schema/cosine.schema
include         /usr/local/etc/openldap/schema/nis.schema
include         /usr/local/etc/openldap/schema/qmail.schema
include         /usr/local/etc/openldap/schema/local.schema
include         /usr/local/etc/openldap/schema/RADIUS-LDAPv3.schema
include         /usr/local/etc/openldap/schema/AmavisdLDAP.schema
include         /usr/local/etc/openldap/schema/analyse.schema
include         /usr/local/etc/openldap/schema/formail.schema
include         /usr/local/etc/openldap/schema/dlz.schema
include         /usr/local/etc/openldap/schema/filter.schema
pidfile         /usr/local/var/run/slapd.pid
argsfile        /usr/local/var/run/slapd.args
loglevel        sync

database monitor
access to dn.subtree="cn=Monitor"
       by dn="cn=admin,dc=company,dc=com" read
       by * none

access to attrs=userPassword
        by self write
        by anonymous auth
        by * none

access to *
        by self write
        by dn="cn=admin,dc=company,dc=com" write
        by * read

password-hash {CRYPT}

database        hdb
suffix          "dc=company,dc=com"
rootdn          "cn=admin,dc=company,dc=com"
directory       /usr/local/var/openldap-data
rootpw $secret

index   objectClass     eq
index   uid                     eq
index   uidNumber               eq
index   mail                    eq
index   mailAlternateAddress    pres,eq
index   deliveryMode            eq
index   accountStatus           eq
index   gecos                   eq
index   radiusGroupName         eq
index   o                       pres,eq
index   entryCSN,entryUUID      eq
index   gidNumber               eq
index   DNSType                 eq
index   DNSIPAddr               eq
index   DNSData                 eq
index   DNSHostName             eq

checkpoint 128 15
cachesize 6000
idlcachesize 18000

overlay syncprov
syncprov-checkpoint 100 10
syncprov-sessionlog 1000

dbconfig set_lk_detect DB_LOCK_DEFAULT
dbconfig set_lg_max 52428800
dbconfig set_cachesize 4 0 1
dbconfig set_flags db_log_autoremove
dbconfig set_lk_max_objects 1500
dbconfig set_lk_max_locks 1500
dbconfig set_lk_max_lockers 1500

syncrepl   rid=( $IP-4TH-OCTET + 256 )
                provider=ldap://172.20.12.163
                type=refreshAndPersist
                interval=00:00:00:30
                searchbase="dc=company,dc=com"
                filter="(objectClass=*)"
                attrs="*,+"
                scope=sub
                schemachecking=off
                bindmethod=simple
                binddn="cn=admin,dc=company,dc=com"
                credentials="$secret"
                retry="60 10 300 +"

updateref       ldap://172.20.12.113

(Actually, 172.20.12.163 and 172.20.12.113 are the same server, just aliases onsame nic. But updateref is never actually used, we only one writer that talksdirectly to master).

Follow-Ups:
- Re: Deleted DNs, and the great quest.
  - From: Quanah Gibson-Mount <quanah@zimbra.com>

Prev by Date: Problem regarding OpenLdap installation and Berkeley DB
Next by Date: Re: Deleted DNs, and the great quest.
Index(es):
- Chronological
- Thread