Re: (ITS#5451) syncprov_checkpoint deadlock bugfix?
rein@basefarm.no wrote:
>> We have seen deadlock/hang situations in our master server today, which look
>> like a deadlock caused by the si_csn_rwlock lock being held while
>> syncprov_checkpoint() is running. The patch at the end should (I hope) fix
>> this.
>
> Hm, it looks as if I was a bit too quick here; it just hung again :-(.
> And this time without anyone competing for this lock, only the lock on
> the glue suffix entry. I wonder if the upgrade from db 4.6.18
> to 4.6.21.1 I did yesterday is the real problem. I'll try to
> downgrade and see if that helps.
And it didn't :-( The problem seems to be that something read-locks our
glue suffix entry and then forgets about the lock. This quickly
causes the entire server to deadlock when the write lock required to
update the contextCSN in the suffix entry locks out all the readers.
So far the problem seems to be triggered by someone attempting to modify
an entry in a subordinate syncrepl consumer backend that results in a
referral to the backend master. But I haven't had very much time to
look into this problem yet, so I'm still on very thin ice here. I'll
return with a new ITS when I have found out more.
We are currently running with a workaround that simply reports the
write locks on the glue suffix entry as granted without actually taking
them. As the glue entry is the only entry in that database it should be
pretty safe, and potential corruption of that database is not a big problem.
> Please put this case on hold. Sorry!
It currently looks as if this patch addresses the symptom and not the real
problem, although I'm not sure what could happen if a checkpoint is
triggered while the suffix entry is locked by another thread. I do
believe it should be considered, if not as a bug fix then as an enhancement,
as holding locks for as short a time as possible is always a good thing.
You'll have to choose whether to close this ITS or use the patch.
Rein