[Date Prev][Date Next] [Chronological] [Thread] [Top]

(ITS#5451) syncprov_checkpoint deadlock bugfix?



Full_Name: Rein Tollevik
Version: CVS head
OS: linux and solaris
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (81.93.160.250)


We have seen deadlock/hang situations in our master server today, which looks
like a deadlock caused by the si_csn_rwlock lock being held while
syncprov_checkpoint() is running.  The patch at the end should (I hope) fix
this.

The scenario is a server with a the syncprov overlay on a glue database with
multiple bdb backends as subordinate.  One of them is a syncrepl consumer,
although I don't know if that matters.

A modify operation on the syncrepl consumer causes a checkpoint of the
contextCSN.  The resulting modify on the glue backend hangs in
bdb_cache_modify() awaiting the bdb_cache_entry_db_relock() to return a write
lock on the glue suffix entry.  At the same time is a modify on one of the other
backends hanging in syncprov_op_response() waiting for a lock on the
si_csn_rwlock which the first thread holds.  The remaining threads quickly
blocks waiting for a read lock on the glue suffix entry (the most common search
base). I recon the read locks are queued to give priority to the thread waiting
for the write lock.

What I haven't figured out is who that holds the lock on the glue suffix entry
in the first place...  Anyway, holding the si_csn_rwlock while the contextCSN is
written to the database should not be necessary. But I don't know what
consequences it whould have if more than one thread should end up checkpointing
competing values, so I have added a new flag to the syncprov_info structure to
prevent simultaneons checkpoints from occuring rather than using locks to do
it.

Rein Tollevik
Basefarm AS

Index: OpenLDAP/servers/slapd/overlays/syncprov.c
diff -u OpenLDAP/servers/slapd/overlays/syncprov.c:1.1.1.13
OpenLDAP/servers/slapd/overlays/syncprov.c:1.12
--- OpenLDAP/servers/slapd/overlays/syncprov.c:1.1.1.13	Mon Mar 31 17:29:48
2008
+++ OpenLDAP/servers/slapd/overlays/syncprov.c	Tue Apr  1 21:54:05 2008
@@ -128,6 +128,7 @@
 	int		si_numops;	/* number of ops since last checkpoint */
 	int		si_nopres;	/* Skip present phase */
 	int		si_usehint;	/* use reload hint */
+	int		si_check;	/* are we checkpointing? */
 	time_t	si_chklast;	/* time of last checkpoint */
 	Avlnode	*si_mods;	/* entries being modified */
 	sessionlog	*si_logs;
@@ -1308,8 +1309,10 @@
 	slap_callback cb = {0};
 	BackendDB be;
 
+	ldap_pvt_thread_rdwr_rlock( &si->si_csn_rwlock );
 	mod.sml_numvals = si->si_numcsns;
 	mod.sml_values = si->si_ctxcsn;
+	ldap_pvt_thread_rdwr_runlock( &si->si_csn_rwlock );
 	mod.sml_nvalues = NULL;
 	mod.sml_desc = slap_schema.si_ad_contextCSN;
 	mod.sml_op = LDAP_MOD_REPLACE;
@@ -1335,6 +1338,8 @@
 	if ( mod.sml_next != NULL ) {
 		slap_mods_free( mod.sml_next, 1 );
 	}
+
+	si->si_check = 0;
 }
 
 static void
@@ -1612,7 +1617,7 @@
 		}
 
 		si->si_numops++;
-		if ( si->si_chkops || si->si_chktime ) {
+		if ( !si->si_check && (si->si_chkops || si->si_chktime) ) {
 			if ( si->si_chkops && si->si_numops >= si->si_chkops ) {
 				do_check = 1;
 				si->si_numops = 0;
@@ -1627,12 +1632,14 @@
 				}
 			}
 		}
+
+		if ( do_check )
+			si->si_check = 1;
+
 		ldap_pvt_thread_rdwr_wunlock( &si->si_csn_rwlock );
 
 		if ( do_check ) {
-			ldap_pvt_thread_rdwr_rlock( &si->si_csn_rwlock );
 			syncprov_checkpoint( op, rs, on );
-			ldap_pvt_thread_rdwr_runlock( &si->si_csn_rwlock );
 		}
 
 		opc->sctxcsn.bv_len = maxcsn.bv_len;