[Date Prev][Date Next]
[Chronological]
[Thread]
[Top]
Re: (ITS#5454) syncrepl refreshAndPersist stops receiving
Howard Chu skrev:
> rein@basefarm.no wrote:
>> Hm, I hope I have found the race condition that causes this :-) I'm now
>> running with the patch at the end to see if that solves it, only time
>> will tell..
And we haven't seen any replication problems since this was implemented
friday afternoon :-). But it is still a bit too early to conclude that
it has been fixed.
>> The race is that between the time selecting on the syncrepl socket is
>> enabled by the call to connection_client_enable() and the release of the
>> si_mutex a new message may arrive. If so, the next call to do_syncrepl
>> may fail in its attempt to trylock the mutex and no-one will re-enable
>> selecting on it again. My patch delays enabling of the socket until the
>> mutex has been released, which looks safe to me. Or can the access to
>> si->si_conn without a lock be a problem?
>
> How about just moving the enable to after the runqueue manipulation is
> done?
That would still leave a glitch in the window open, unless the initial
trylock is changed to a regular lock as your checking message suggests.
But, doing that could result in all the threads ending up waiting for
the lock if do_syncrepl of a refresh only replication is configured to
run too often. I assume the trylock is there to avoid this? What about
starting out with deferring the next scheduled call, as a scheduled call
is not wanted until the first has completed anyhow? And a rescedule
is already being done before do_syncrepl finishes.
Deferring should not be necessary if rtask->next_sched.tv_sec is zero,
i.e it has already been deferred. That would remove these calls for all
the cases when do_syncrepl is called as a result of a new message on a
persistent syncrepl connection.
> Just need to be sure that do_syncrepl() isn't entered again before
> si->si_conn gets initialized.
I'm not quite sure what you mean here. As I see it, the problem is that
do_syncrepl() must not be called after si->si_conn is enabled and the
lock is still held.
> It also occurs to me that we probably don't even need to manipulate the
> slapd runqueue in persist mode, when si->si_conn is already set. I.e.,
> in that case we can only have gotten here because of a listener event,
> and not because of a runqueue schedule.
That is probably true.
Rein