[Date Prev][Date Next] [Chronological] [Thread] [Top]

Re: (ITS#8125) MMR throws away valid changes causing drift



--On Friday, May 15, 2015 7:53 PM +0000 quanah@zimbra.com wrote:

Capturing what was discussed of this issue in IRC at the time:

<hyc> too bad. yes, there is a case where a refresh can delete entries that 
shouldn't have been deleted
<hyc> we discussed it here a month or so ago
<hyc> but if all servers were in persist phase with delta, that shouldn't 
happen
<JoBbZ> hm.. should have been out of refresh
<JoBbZ> it did a refresh @ zimbra.log.30:May  1 14:46:02 zm-mmr01 
slapd[43243]: do_syncrep2: rid=101 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> but that was hours before
<hyc> and it was also behind by 1.5 hours
<hyc> ...
<JoBbZ> is there a message logged when refresh mode ends?
<hyc> that's the message.
<hyc> at least, that is the message from the provider that it's done 
sending refresh messages
<hyc> then the consumer runs syncrepl_del_nonpresent based on the provider 
messages
<JoBbZ> zimbra.log.30:May  1 18:37:33 zm-mmr01 slapd[43243]: 
syncrepl_del_nonpresent: rid=102 be_delete 
uid=skoelzer,ou=people,dc=unitedfiber,dc=email (0)
<JoBbZ> to
<JoBbZ> zimbra.log.30:May  1 18:37:35 zm-mmr01 slapd[43243]: 
syncrepl_del_nonpresent: rid=101 be_delete 
uid=fsodell2,ou=people,dc=ntin,dc=net (32)
<JoBbZ> why would refresh mode end @ 14:146, and then 4 hours later, it 
does del_nonpresent?
<hyc> it might have had a lot of refresh messages to parse
<hyc> but 4 hours seems excessive, yeah
<JoBbZ> but it was replicating during that time too
<JoBbZ> May  1 15:00:00 zm-mmr01 slapd[43243]: syncrepl_message_to_op: 
rid=102 be_modify uid=gibbonsj,ou=people,dc=bevcomm,dc=net (0)
<JoBbZ> for example
<JoBbZ> but thousands and thousands
<hyc> that's rid=102
<hyc> you were looking at rid=101
<JoBbZ> May  1 18:42:18 zm-mmr01 slapd[43243]: do_syncrep2: rid=102 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May  1 18:42:20 zm-mmr01 slapd[43243]: do_syncrep2: rid=101 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> they both finished around the same time
<hyc> this is 4way so where's the 3rd rid
<JoBbZ> May  1 14:46:02 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> hm
<JoBbZ> this is interesting..
<JoBbZ> May  1 14:18:45 zm-mmr01 slapd[36133]: do_syncrep2: rid=100 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May  1 14:19:59 zm-mmr01 slapd[36133]: do_syncrep2: rid=101 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May  1 14:19:59 zm-mmr01 slapd[36133]: do_syncrep2: rid=102 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> so it did one sync then
<JoBbZ> May  1 14:46:02 zm-mmr01 slapd[43243]: do_syncrep2: rid=100 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May  1 14:46:02 zm-mmr01 slapd[43243]: do_syncrep2: rid=102 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May  1 14:46:02 zm-mmr01 slapd[43243]: do_syncrep2: rid=101 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> another sync then
<JoBbZ> a sync here
<JoBbZ> May  1 18:37:34 zm-mmr01 slapd[43243]: do_syncrep2: rid=102 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May  1 18:37:35 zm-mmr01 slapd[43243]: do_syncrep2: rid=101 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> and one at
<JoBbZ> May  1 18:42:18 zm-mmr01 slapd[43243]: do_syncrep2: rid=102 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> May  1 18:42:20 zm-mmr01 slapd[43243]: do_syncrep2: rid=101 
LDAP_RES_INTERMEDIATE - REFRESH_DELETE
<JoBbZ> ugh
<hyc> no rid=100 from the last 2?
<JoBbZ> nope
<JoBbZ> let me go look at the rid assignments rq
<hyc> that's suspicious
<JoBbZ> hm.. i bet it is losing connections
<JoBbZ> 101/102 are in nashville
<JoBbZ> 100 is in greenville, along with this server
<JoBbZ> yeah, full fallback triggered at 18:37:04
<JoBbZ> sigh
<JoBbZ> but this was a modrdn and an account add
<JoBbZ> no deletes
<hyc> was the Add the missing entry?
<JoBbZ> yeah
<hyc> yeah, that's the known issue then
<JoBbZ> ok, what about the modrdn?
<hyc> which isn't in an ITS so we should probably use the one you just filed
<hyc> haven't thought about the modrdn but it's probably the same thing.
<JoBbZ> ok
<hyc> in refresh without syncprov sessionlog, the provider has to send the 
consumer a list of entryUUIDs of every entry that exists on the provider
<hyc> the consumer then searches every entry in the consumer, compares to 
the list of entryUUIDs from the provider, and deletes any entry that's not 
in the provider's list
<hyc> the problem is that an entry can be missing from this particular 
provider's list because its update was coming from a different provider
<hyc> so it gets deleted even though it was properly received from the 
other provider
<JoBbZ> ok, makes sense
<hyc> if you have a syncprov sessionlog that's large enough to cover the 
span of disconnected time
<hyc> then the provider sends a list of deleted entryUUIDs instead of 
present UUIDs
<hyc> and this bug can't happen then


--Quanah

--

Quanah Gibson-Mount
Product Architect
Symas Corporation
Packaged, certified, and supported LDAP solutions powered by OpenLDAP:
<http://www.symas.com>