Re: (ITS#8396) syncprov hourly fails to answer syncrepl
--On Thursday, April 07, 2016 11:57 PM +0000 quanah@zimbra.com wrote:
Full summary:
The syncprov checkpoint operation causes the CSN to be lost for the first
write operation that occurs after the checkpoint. It is important to note
that no data is lost; all changes replicate as they should.
However, the replica CSN is not updated in this scenario, making it appear
that the replica is out of sync with the master. Adding the syncprov
overlay to a replica database works around this issue by forcing the
replica to track its internal CSNs, rather than relying on broadcasts from
the master.
It is trivial to reproduce this issue by setting a short checkpoint
interval with the syncprov-checkpoint parameter.
Example of the problem:
We have a script that modifies the userPassword attribute of an entry every
45 seconds, and syncprov-checkpoint is set to run every 5 minutes.
From the log we can see:
Apr 7 18:00:38 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr 7 18:05:53 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr 7 18:11:09 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr 7 18:16:25 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr 7 18:17:55 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr 7 18:21:41 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr 7 18:26:57 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Apr 7 18:32:13 zre-ldap002 slapd[29904]: syncprov_sendresp: cookie=rid=100
Stopping the script after the 18:32:13 operation, and examining the CSN
values on each server, we see the following.
master:
[zimbra@zre-ldap003 scripts]$ ldapsearch -x -LLL -H ldap://zre-ldap002.eng.zimbra.com -s base -b "dc=uvm,dc=edu" contextCSN
dn: dc=uvm,dc=edu
contextCSN: 20160407233212.979013Z#000000#000#000000
replica:
[zimbra@zre-ldap003 scripts]$ ldapsearch -x -LLL -H ldapi:// -s base -b "dc=uvm,dc=edu" contextCSN
dn: dc=uvm,dc=edu
contextCSN: 20160407233127.886702Z#000000#000#000000
Note that the CSNs are 45 seconds apart, which is the interval between our
writes. So the CSN left on the replica in this case is that of the write op
/prior/ to the checkpoint, as the replica ignores the syncprov send response
with the empty CSN (thus not updating its own CSN).
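For anyone who wants to check the gap without doing the arithmetic by hand,
here is a minimal sketch that parses the two contextCSN values shown above
and prints the difference in seconds. It assumes the usual
timestamp#count#sid#mod layout of a CSN; the helper names parse_csn and
csn_lag_seconds are made up for illustration and are not part of any
OpenLDAP tool.

```python
from datetime import datetime, timezone


def parse_csn(csn):
    """Split a CSN into (timestamp, count, sid, mod).

    Assumes the form YYYYmmddHHMMSS.uuuuuuZ#count#sid#mod, with the
    last three fields written in hexadecimal.
    """
    stamp, count, sid, mod = csn.split("#")
    when = datetime.strptime(stamp, "%Y%m%d%H%M%S.%fZ").replace(tzinfo=timezone.utc)
    return when, int(count, 16), int(sid, 16), int(mod, 16)


def csn_lag_seconds(master_csn, replica_csn):
    """Seconds by which the replica's CSN timestamp trails the master's."""
    return (parse_csn(master_csn)[0] - parse_csn(replica_csn)[0]).total_seconds()


# contextCSN values copied from the ldapsearch output above
master = "20160407233212.979013Z#000000#000#000000"
replica = "20160407233127.886702Z#000000#000#000000"

lag = csn_lag_seconds(master, replica)
print(f"replica trails master by {lag:.6f} seconds")
```

A lag close to the write interval (45 seconds here), on a replica that
otherwise holds the latest data, is consistent with the behavior described
in this report rather than with changes actually being lost.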
While it is of course best practice to run the syncprov overlay on the
replica to enforce internal CSN cohesion, it still should not be required,
and this is clearly a bug that can cause admins to incorrectly believe that
their servers are having replication issues.
--Quanah
--
Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration
A division of Synacor, Inc