
2.3.11/BDB bdb_cache_find_id deadlock



Hey,

Server: OpenLDAP 2.3.11
Backend: BDB 4.2.52 + patches

The server is a replica fed from a master, and is otherwise used for
read operations only.

I'm looking at a deadlock we're currently suffering from.  Some threads
are still serving, but the majority are stuck, with this backtrace:

Thread 47 (Thread 1643199408 (LWP 2240)):
#0  0x400007a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x4026fa86 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0x40039b7f in __db_pthread_mutex_lock_openldap_slapd_rhl_42 () from /usr/lib/tls/i686/libslapd_db-4.2.so
#3  0x400b6f7c in __lock_get_openldap_slapd_rhl_42 () from /usr/lib/tls/i686/libslapd_db-4.2.so
#4  0x400b6864 in __lock_get_openldap_slapd_rhl_42 () from /usr/lib/tls/i686/libslapd_db-4.2.so
#5  0x400b67a2 in __lock_get_pp_openldap_slapd_rhl_42 () from /usr/lib/tls/i686/libslapd_db-4.2.so
#6  0x08102f6d in bdb_cache_entry_db_relock ()
#7  0x08103836 in bdb_cache_find_id ()
#8  0x080d7047 in bdb_search ()
#9  0x0807644c in fe_op_search ()
#10 0x08075bd3 in do_search ()
#11 0x08073d5a in connection_done ()
#12 0x08175848 in ldap_pvt_thread_pool_destroy ()
#13 0x4026d341 in start_thread () from /lib/tls/libpthread.so.0
#14 0x4034efee in clone () from /lib/tls/libc.so.6


In back-bdb/cache.c bdb_cache_find_id(), we have:

  if ( locker2 != locker ) {
    /* If we're using the per-thread txn, release all
     * of its page locks now.
     */
    DB_LOCKREQ list;
    list.op = DB_LOCK_PUT_ALL;
    list.obj = NULL;
    bdb->bi_dbenv->lock_vec( bdb->bi_dbenv, locker2,
                             0, &list, 1, NULL );
    /* If this txn was deadlocked, we must abort it
     * and invalidate this per-thread txn.
     */
    if ( rc == DB_LOCK_DEADLOCK ) {
      bdb_txn_get( op, bdb->bi_dbenv, &ltid, 1 );
    }
  }

Shouldn't the return value of the 'lock_vec' call be assigned to 'rc'
here? As quoted, 'rc' still holds whatever the earlier code left in it.
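
To make the question concrete, here is a minimal sketch of the two readings.
It does not use Berkeley DB or slapd at all: mock_lock_vec, mock_txn_reset
and the MOCK_* constants are hypothetical stand-ins invented for this
illustration, and it takes no position on which reading the back-bdb code
intends.

```c
#include <assert.h>

/* Hypothetical stand-ins: none of these are Berkeley DB or slapd names. */
enum { MOCK_OK = 0, MOCK_DEADLOCK = -1 };

static int mock_txn_resets = 0;   /* how often the per-thread txn was replaced */

/* Stand-in for lock_vec(): pretends to release locks, reports a status. */
static int mock_lock_vec(int status_to_report) { return status_to_report; }

/* Stand-in for the bdb_txn_get( ..., 1 ) call that replaces the txn. */
static void mock_txn_reset(void) { mock_txn_resets++; }

/* Shape of the quoted code: the lock_vec result is dropped, so only the
 * status left over from the earlier step can trigger the reset. */
static int release_ignoring_result(int rc, int lock_vec_status)
{
    (void)mock_lock_vec(lock_vec_status);
    if (rc == MOCK_DEADLOCK)
        mock_txn_reset();
    return rc;
}

/* Alternative shape: a deadlock reported by either step triggers the reset. */
static int release_checking_result(int rc, int lock_vec_status)
{
    int rc2 = mock_lock_vec(lock_vec_status);
    if (rc == MOCK_DEADLOCK || rc2 == MOCK_DEADLOCK)
        mock_txn_reset();
    return rc;
}
```

With rc == MOCK_OK and a lock_vec status of MOCK_DEADLOCK, the first
function leaves the txn alone and the second resets it. That is the only
case in which the two differ.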

The only other thing I can think of is that we are downgrading the wrong
lock in the preceding code, which would have mattered due to the
DB_LOCK_PUT_ALL applying to the transaction:

     if ( rc == 0 ) {
        /* If we succeeded, downgrade back to a readlock. */
        rc = bdb_cache_entry_db_relock( bdb->bi_dbenv, locker,
                                        *eip, 0, 0, lock );
     } else {

I would have thought this call was redundant in the case where locker !=
locker2, since DB_LOCK_PUT_ALL would clear up the write-lock we claim
for the transaction.
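
The reasoning above can be sketched with a toy lock table. This is only a
model under stated assumptions: the toy_* names are hypothetical, there is
no conflict detection or waiting, and nothing here reproduces the Berkeley
DB lock manager. It only shows the idea that a "put all" request for one
locker drops every lock recorded for that locker, including a write lock,
without any separate downgrade step.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical types, not Berkeley DB's lock manager. */
enum toy_mode { TOY_NONE = 0, TOY_READ, TOY_WRITE };

struct toy_lock {
    unsigned locker;      /* id of the locker holding this lock */
    int object;           /* id of the locked object, e.g. an entry */
    enum toy_mode mode;   /* TOY_NONE marks a free slot */
};

#define TOY_MAX 8
static struct toy_lock toy_table[TOY_MAX];

/* Record a lock; returns 0 on success, -1 if the table is full. */
static int toy_get(unsigned locker, int object, enum toy_mode mode)
{
    for (size_t i = 0; i < TOY_MAX; i++) {
        if (toy_table[i].mode == TOY_NONE) {
            toy_table[i].locker = locker;
            toy_table[i].object = object;
            toy_table[i].mode = mode;
            return 0;
        }
    }
    return -1;
}

/* Analogue of a DB_LOCK_PUT_ALL request: drop every lock held by
 * `locker`. Returns the number of locks released. */
static int toy_put_all(unsigned locker)
{
    int released = 0;
    for (size_t i = 0; i < TOY_MAX; i++) {
        if (toy_table[i].mode != TOY_NONE && toy_table[i].locker == locker) {
            toy_table[i].mode = TOY_NONE;
            released++;
        }
    }
    return released;
}

/* Count the write locks still recorded on `object`, by any locker. */
static int toy_writers(int object)
{
    int n = 0;
    for (size_t i = 0; i < TOY_MAX; i++)
        if (toy_table[i].mode == TOY_WRITE && toy_table[i].object == object)
            n++;
    return n;
}
```

If locker 2 (standing in for the transaction) holds a write lock on object
42 plus a read lock elsewhere, toy_put_all(2) releases both and
toy_writers(42) drops to zero. Locks recorded under a different locker id
are untouched.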

I notice that this code has disappeared with revision 1.106 of cache.c
though, so perhaps that clears the issue I'm seeing as well.

Any thoughts ?

Regards,


Nick.