[Date Prev][Date Next]
[Chronological]
[Thread]
[Top]
(ITS#5451) syncprov_checkpoint deadlock bugfix?
Full_Name: Rein Tollevik
Version: CVS head
OS: linux and solaris
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (81.93.160.250)
We have seen deadlock/hang situations in our master server today, which looks
like a deadlock caused by the si_csn_rwlock lock being held while
syncprov_checkpoint() is running. The patch at the end should (I hope) fix
this.
The scenario is a server with a the syncprov overlay on a glue database with
multiple bdb backends as subordinate. One of them is a syncrepl consumer,
although I don't know if that matters.
A modify operation on the syncrepl consumer causes a checkpoint of the
contextCSN. The resulting modify on the glue backend hangs in
bdb_cache_modify() awaiting the bdb_cache_entry_db_relock() to return a write
lock on the glue suffix entry. At the same time is a modify on one of the other
backends hanging in syncprov_op_response() waiting for a lock on the
si_csn_rwlock which the first thread holds. The remaining threads quickly
blocks waiting for a read lock on the glue suffix entry (the most common search
base). I recon the read locks are queued to give priority to the thread waiting
for the write lock.
What I haven't figured out is who that holds the lock on the glue suffix entry
in the first place... Anyway, holding the si_csn_rwlock while the contextCSN is
written to the database should not be necessary. But I don't know what
consequences it whould have if more than one thread should end up checkpointing
competing values, so I have added a new flag to the syncprov_info structure to
prevent simultaneons checkpoints from occuring rather than using locks to do
it.
Rein Tollevik
Basefarm AS
Index: OpenLDAP/servers/slapd/overlays/syncprov.c
diff -u OpenLDAP/servers/slapd/overlays/syncprov.c:1.1.1.13
OpenLDAP/servers/slapd/overlays/syncprov.c:1.12
--- OpenLDAP/servers/slapd/overlays/syncprov.c:1.1.1.13 Mon Mar 31 17:29:48
2008
+++ OpenLDAP/servers/slapd/overlays/syncprov.c Tue Apr 1 21:54:05 2008
@@ -128,6 +128,7 @@
int si_numops; /* number of ops since last checkpoint */
int si_nopres; /* Skip present phase */
int si_usehint; /* use reload hint */
+ int si_check; /* are we checkpointing? */
time_t si_chklast; /* time of last checkpoint */
Avlnode *si_mods; /* entries being modified */
sessionlog *si_logs;
@@ -1308,8 +1309,10 @@
slap_callback cb = {0};
BackendDB be;
+ ldap_pvt_thread_rdwr_rlock( &si->si_csn_rwlock );
mod.sml_numvals = si->si_numcsns;
mod.sml_values = si->si_ctxcsn;
+ ldap_pvt_thread_rdwr_runlock( &si->si_csn_rwlock );
mod.sml_nvalues = NULL;
mod.sml_desc = slap_schema.si_ad_contextCSN;
mod.sml_op = LDAP_MOD_REPLACE;
@@ -1335,6 +1338,8 @@
if ( mod.sml_next != NULL ) {
slap_mods_free( mod.sml_next, 1 );
}
+
+ si->si_check = 0;
}
static void
@@ -1612,7 +1617,7 @@
}
si->si_numops++;
- if ( si->si_chkops || si->si_chktime ) {
+ if ( !si->si_check && (si->si_chkops || si->si_chktime) ) {
if ( si->si_chkops && si->si_numops >= si->si_chkops ) {
do_check = 1;
si->si_numops = 0;
@@ -1627,12 +1632,14 @@
}
}
}
+
+ if ( do_check )
+ si->si_check = 1;
+
ldap_pvt_thread_rdwr_wunlock( &si->si_csn_rwlock );
if ( do_check ) {
- ldap_pvt_thread_rdwr_rlock( &si->si_csn_rwlock );
syncprov_checkpoint( op, rs, on );
- ldap_pvt_thread_rdwr_runlock( &si->si_csn_rwlock );
}
opc->sctxcsn.bv_len = maxcsn.bv_len;