contextCSN propagation problems
The worst problems we had after upgrading to 2.4.x seem to be over now,
and replication appears to work (at least mostly) as it should. One
problem that still remains is that the contextCSN attributes themselves
don't propagate as I wish they would. Before I start coding or filing
more ITSes I would like to have comments as to whether my analysis of
the problem seems correct or not.
First, a description of our configuration is needed. I think a test
script for this configuration would be nice to have, so I'm going to
create that before I do anything else. I currently think of it as
syncrepl-asymmetric, but suggestions for better names are welcome.
Now the configuration, which is centered around a central master server.
It has a glued database with a set of subordinates. The central master
is the (sole) master for most of these subordinates, but there is also
a set of remote site-masters, each of which has one subordinate database
that it is the master for and that is replicated back to the central
master.
The site-masters have a similar glued configuration, and they replicate
the glue entry from the central master. Different rootdn values on the
subordinates managed by syncrepl and on the one the site-master is the
master for prevent syncrepl from wiping out the content of that
database during the present phase. None of the site-masters receives all
the subordinate databases the central master has, and which ones they
receive varies. This is controlled by ACL rules on the central master.
On each of the sites (including where the central master is) there are
search-only servers that replicate the glue suffix from their
site-master (or the central master). The search servers have a single
database (for historical reasons), but their layout shouldn't matter
very much.
All of the master servers use syncprov on the glue database, and
everyone except the central master uses syncrepl on the glue database.
The central master cannot use it there, as that would cause it to
wipe out those subordinates that aren't on the site-master it replicates
from during the refresh phase. Different serverIDs are used on the
master servers, so an updated contextCSN set should include as many
values as there are master servers.
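As a rough illustration of the "one value per master" expectation, the
sketch below checks which serverIDs are missing from a contextCSN set.
It assumes the 2.4 CSN layout of four '#'-separated fields with the
serverID as the third field in hex. The function names and the sample
values are made up for the example, and fetching the attribute from a
server is left out.

```python
from typing import Iterable, Set


def sid_of(csn: str) -> int:
    """Return the serverID field of a CSN string as an integer.

    Assumes the layout timestamp#count#sid#mod, with sid in hex.
    """
    fields = csn.split("#")
    if len(fields) != 4:
        raise ValueError(f"not a CSN: {csn!r}")
    return int(fields[2], 16)


def missing_sids(context_csns: Iterable[str], expected: Set[int]) -> Set[int]:
    """Return the expected serverIDs that have no value in the set."""
    seen = {sid_of(value) for value in context_csns}
    return expected - seen


# Made-up contextCSN values for three masters with serverIDs 1, 2 and 3.
central_master = [
    "20090301101500.000000Z#000000#001#000000",
    "20090301101200.000000Z#000000#002#000000",
    "20090301100700.000000Z#000000#003#000000",
]
# A consumer that never received a value for serverID 3.
search_server = central_master[:2]

if __name__ == "__main__":
    print("central master is missing:", missing_sids(central_master, {1, 2, 3}))
    print("search server is missing:", missing_sids(search_server, {1, 2, 3}))
```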
Now to the problem. If a modification is made (on the central master) to
a subordinate database that isn't replicated to one of its consumers,
that consumer will not receive the updated contextCSN value from the
central master. This means that I cannot monitor the contextCSN values
to verify that replication is working as it should. The consumers will
also (after a restart) present an outdated contextCSN set during the
refresh phase, even though their database content is up to date. This
will be corrected when/if an update is made to a subordinate db that is
replicated to the consumer.
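The kind of monitoring I have in mind is roughly the comparison
sketched here: take the contextCSN set from the provider and from a
consumer and report the serverIDs where the consumer is behind. The
function names and sample values are invented for the example, and it
assumes fixed-width CSN strings so that two values with the same
serverID can be compared as plain strings. With the problem described
above, a consumer shows up as lagging even though its content is
current, so the result is not a reliable signal in this configuration.

```python
from typing import Dict, Iterable, Optional, Tuple


def newest_by_sid(values: Iterable[str]) -> Dict[str, str]:
    """Map each serverID (third '#'-separated field) to its newest CSN."""
    newest: Dict[str, str] = {}
    for value in values:
        sid = value.split("#")[2]
        # Fixed-width fields: string order equals CSN order for one SID.
        if sid not in newest or value > newest[sid]:
            newest[sid] = value
    return newest


def lagging_sids(
    provider_values: Iterable[str], consumer_values: Iterable[str]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {sid: (provider_csn, consumer_csn)} where the consumer is
    missing the value or holds an older one than the provider."""
    provider = newest_by_sid(provider_values)
    consumer = newest_by_sid(consumer_values)
    return {
        sid: (p_csn, consumer.get(sid))
        for sid, p_csn in provider.items()
        if consumer.get(sid) is None or consumer[sid] < p_csn
    }


# Made-up values: the provider advanced serverID 001 through a change
# in a subordinate database that this consumer does not replicate.
provider = [
    "20090301101500.000000Z#000000#001#000000",
    "20090301093000.000000Z#000000#002#000000",
]
consumer = [
    "20090301100000.000000Z#000000#001#000000",
    "20090301093000.000000Z#000000#002#000000",
]

if __name__ == "__main__":
    for sid, (p_csn, c_csn) in lagging_sids(provider, consumer).items():
        print(f"sid {sid}: provider {p_csn}, consumer {c_csn}")
```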
A worse problem is when the modification is made to a subordinate db
that the central master replicates from one of the site-masters. In this
case there will never be any updates from that site-master that should
be propagated. So the consumers will, until they restart, be stuck with
the contextCSN value for that remote site-master which they received
after the present phase.
Or, oh well, that is not quite true... Due to a bug, syncprov fails
(on the central master) to detect and filter out the contextCSN
update from syncrepl on a subordinate db. This means that the central
master will send an update of the glue entry to its consumers, so that
the value is propagated from the central master to its immediate
consumers. But it will not propagate further from the site-masters to
the search servers on their site. As this bug makes things work at
least partly as I wish, I'm a bit reluctant to have it fixed yet...
So far my proposed solution to this problem is that syncprov_matchops()
should, when a modification neither matches in test_filter() nor is a
deleted entry, send a sync info protocol message with the updated
contextCSN in the newCookie field to its consumers. Does this sound
like a valid solution? There seems to be support in syncrepl.c for
receiving these messages, and in syncprov.c for sending them. But
syncprov never actually sends them, as far as I can see.
--
Rein Tollevik
Basefarm AS