Re: (ITS#8281) Syncrepl refresh failure when slapd is restarted midstream
- To: openldap-its@OpenLDAP.org
- Subject: Re: (ITS#8281) Syncrepl refresh failure when slapd is restarted midstream
- From: hyc@symas.com
- Date: Fri, 23 Oct 2015 04:51:03 +0000
- Auto-submitted: auto-generated (OpenLDAP-ITS)
quanah@openldap.org wrote:
> Full_Name: Quanah Gibson-Mount
> Version: RE24 Sept 11, 2015
> OS: Linux 2.6
> URL: ftp://ftp.openldap.org/incoming/
> Submission from: (NULL) (75.111.52.177)
>
>
> Seeing a scenario where if slapd is stopped on a new MMR node while a full
> REFRESH is occurring, the state of that refresh is not tracked, and the wrong
> CSN value is stored.
> This dataset has 15,000 users. We see it get up to user 625:
>
> Oct 20 16:13:09 q2 slapd[18724]: syncrepl_entry: rid=100 be_search (0)
> Oct 20 16:13:09 q2 slapd[18724]: syncrepl_entry: rid=100
> uid=user625,ou=people,dc=q1,dc=aon,dc=zimbraview,dc=com
> Oct 20 16:13:09 q2 slapd[18724]: slap_queue_csn: queueing 0x44c7e30
> 20151020185526.862768Z#000000#000#000000
> Oct 20 16:13:09 q2 slapd[18724]: slap_graduate_commit_csn: removing 0x44c87c0
> 20151020185526.862768Z#000000#000#000000
> Oct 20 16:13:09 q2 slapd[18724]: syncrepl_entry: rid=100 be_add
> uid=user625,ou=people,dc=q1,dc=aon,dc=zimbraview,dc=com (0)
> Oct 20 16:13:09 q2 slapd[18724]: slapd stopped.
>
>
> Then when slapd is restarted:
>
> Oct 20 16:13:16 q2 slapd[18970]: do_syncrep2: rid=100
> cookie=rid=100,sid=001,csn=20151020201231.263989Z#000000#001#000000
> Oct 20 16:13:16 q2 slapd[18970]: slap_queue_csn: queueing 0x309dfd8
> 20151020201231.263989Z#000000#001#000000
> Oct 20 16:13:16 q2 slapd[18970]: slap_queue_csn: queueing 0x5054008
> 20151020201231.263989Z#000000#001#000000
> Oct 20 16:13:16 q2 slapd[18970]: slap_graduate_commit_csn: removing 0x49353c0
> 20151020201231.263989Z#000000#001#000000
> Oct 20 16:13:16 q2 slapd[18970]: slap_graduate_commit_csn: removing 0x4935060
> 20151020201231.263989Z#000000#001#000000
> Oct 20 16:13:16 q2 slapd[18970]: syncrepl_message_to_op: rid=100 be_add
> cn=q2.aon.zimbraview.com,cn=servers,cn=zimbra (0)
>
> which causes it to skip the other 14,000+ users.
After investigating the server setup, I found a few problems here.
The new server was being configured with sid=001, which was already assigned
to the original master. That's clearly going to screw things up.
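To make the sid clash concrete, here is a minimal sketch that pulls the SID
field out of the CSN in the restart log and flags server IDs used more than
once. The helper names (parse_csn, duplicate_sids) and the server labels are
hypothetical, chosen only for illustration; none of this is slapd code.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class CSN(NamedTuple):
    timestamp: str
    count: str
    sid: int
    mod: str


def parse_csn(text: str) -> CSN:
    """Split a CSN of the form timestamp#count#sid#mod (sid is hex)."""
    timestamp, count, sid, mod = text.split("#")
    return CSN(timestamp, count, int(sid, 16), mod)


def duplicate_sids(server_ids: Dict[str, int]) -> List[int]:
    """Return every SID that is assigned to more than one server."""
    counts = Counter(server_ids.values())
    return sorted(sid for sid, n in counts.items() if n > 1)


# CSN taken from the cookie in the restart log above.
cookie_csn = parse_csn("20151020201231.263989Z#000000#001#000000")
print("sid in cookie:", cookie_csn.sid)

# Hypothetical labels for the two nodes; both were given sid=001.
servers = {"original-master": 1, "new-node": 1}
print("duplicated sids:", duplicate_sids(servers))
```

With both nodes on the same SID, a CSN alone no longer tells you which server
originated a change.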
Aside from that, the new server was converted to MMR using dynamic config and
we have a sequencing problem: the conversion adds the syncprov overlay first,
and then adds the syncrepl config. This is actually the only safe order, since
the consumer will start as soon as it's added and syncprov will already be in
place, ready to propagate changes as needed.
But syncprov does a check in syncprov_db_open() to decide whether it should
generate an initial contextCSN on a new DB. This step is skipped if the
backend is configured for MMR (and must be skipped). The problem is that this
node *will be* configured for MMR, but it isn't yet, because the consumer
hasn't been dynamically configured yet. So syncprov generates its own
contextCSN, which is checkpointed on shutdown. The real contextCSN from the
master hasn't been received yet, since the refresh is still in progress when
the server is stopped. On the next restart this consumer presents its
generated contextCSN, which is newer than the original master's, so the
refresh doesn't resume from where it left off.
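The sequence above can be walked through with a toy model. This is only a
sketch under stated assumptions: the Database class, syncprov_open() and
provider_sends_refresh() are hypothetical stand-ins, not the real
syncprov_db_open() logic, and the comparison is reduced to a single CSN
instead of one per SID. MASTER_CSN is a made-up value that is merely older
than the generated one; GENERATED_CSN is the value from the restart log.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Database:
    is_mmr: bool = False               # set once the syncrepl consumer exists
    context_csn: Optional[str] = None  # None models a brand-new, empty DB


def syncprov_open(db: Database, fresh_csn: str) -> None:
    """Toy check: seed a contextCSN only on a new DB that is not MMR."""
    if db.context_csn is None and not db.is_mmr:
        db.context_csn = fresh_csn


def provider_sends_refresh(provider_csn: str, cookie_csn: Optional[str]) -> bool:
    """Simplified comparison: refresh only if the consumer is behind."""
    if cookie_csn is None:
        return True
    # Fixed-width CSNs in the same format order correctly as plain strings.
    return cookie_csn < provider_csn


MASTER_CSN = "20151020185526.862768Z#000000#001#000000"     # hypothetical
GENERATED_CSN = "20151020201231.263989Z#000000#001#000000"  # from the log

new_node = Database()
syncprov_open(new_node, GENERATED_CSN)  # overlay added first: not MMR yet
new_node.is_mmr = True                  # syncrepl added, refresh starts
# slapd stops mid-refresh; the generated value is what gets checkpointed.
resumes = provider_sends_refresh(MASTER_CSN, new_node.context_csn)
print("refresh resumes after restart:", resumes)
```

In this model, a node that is already flagged as MMR when the overlay opens
keeps an empty contextCSN, so its empty cookie would trigger a full refresh
on the next start.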
Generating a new contextCSN at startup is of questionable worth. We discussed
this a bit way back in 2004:
http://www.openldap.org/lists/openldap-devel/200408/msg00035.html
Perhaps we should just not do it; if a single-master provider starts up empty
and a consumer tries to talk to it and both have an empty cookie, the provider
should just respond "you're up to date".
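As a rough illustration of that proposal, the sketch below models the
provider's answer when no contextCSN was generated at startup. The function
name provider_reply() and its return strings are hypothetical, and the
single-CSN comparison is a simplification; the case of an empty provider
facing a non-empty cookie is not addressed above, so it is left open here.

```python
from typing import Optional


def provider_reply(provider_csn: Optional[str], cookie_csn: Optional[str]) -> str:
    """Toy decision function; None models an empty contextCSN or cookie."""
    if provider_csn is None and cookie_csn is None:
        # Proposed behaviour: nothing on either side, so nothing to send.
        return "up to date"
    if provider_csn is None:
        # Empty provider, non-empty cookie: outside the scope of this sketch.
        return "not covered by this sketch"
    if cookie_csn is None or cookie_csn < provider_csn:
        return "refresh"
    return "up to date"


print(provider_reply(None, None))
print(provider_reply("20151020185526.862768Z#000000#000#000000", None))
```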
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/