Re: slapd hangs at 100% cpu in sched_yield (ITS#2030)
I had a look at the current code for OPENLDAP_REL_ENG_2_1 via web CVS, and it does not look like the cause of the problem has been fixed. The problem is that the bdb backend may or may not successfully get a lock when you ask for one. At around line 74 of servers/slapd/back-bdb/search.c the LDAP server calls LOCK_ID(), but does not check the return value, and continues on the assumption that it has a valid lock. This lock is passed from function to function until we reach roughly lines 849-921 in cache.c, where
state != CACHE_ENTRY_READY
and
rc = bdb_cache_entry_db_lock ( env, locker, ep, rw, 0, lock ); fails because the lock is invalid, so we log, yield and try again. This is an endless loop unless we can guarantee that we have a valid bdb lock in the first place.
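To illustrate the kind of check I mean, here is a minimal sketch. It is not the slapd code: fake_lock_id() and search_begin() are hypothetical stand-ins, and returning ENOMEM when no locker entries are left is an assumption made for the simulation. The point is only that the caller looks at the return value and stops, rather than carrying an uninitialised locker into the cache code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the lock ID allocator: returns 0 and stores
 * a locker ID on success, or a nonzero error (ENOMEM is assumed here)
 * when the lock table has no free locker entries. */
static int free_lockers = 1;

static int fake_lock_id(unsigned int *idp)
{
    static unsigned int next_id = 1000;

    assert(idp != NULL);
    if (free_lockers <= 0)
        return ENOMEM;      /* table exhausted, *idp is left untouched */
    free_lockers--;
    *idp = next_id++;
    return 0;
}

/* Sketch of the start of a search: refuse to continue without a locker. */
static int search_begin(unsigned int *locker)
{
    int rc = fake_lock_id(locker);

    if (rc != 0)
        return rc;          /* caller should fail the operation here */
    return 0;               /* *locker is valid from this point on */
}
```

With one free locker entry, the first call succeeds and the second returns the error to the caller instead of letting the search proceed with a locker that was never allocated.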
This only affects busy systems: we have been running 2.1.3 for a while on a couple of servers that do not get many requests, without any problems. On busy servers the back-bdb code runs out of bdb locks under load, so I would assume that we should check that we get a valid lock every time we ask for one. That is exactly what my patch does (although there may be better ways of doing it). I will test again when 2.1.4 is released and let you know whether the problem is fixed in the OpenLDAP source; I have a test on our system that consistently locks up the LDAP server within 1-2 minutes of starting it.
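One of the possibly better ways would be to bound the retry, so that a lock table that stays exhausted produces an error instead of another spin. A rough sketch follows, under stated assumptions: get_locker_bounded(), locker_alloc_fn and flaky_alloc() are hypothetical names, and POSIX sched_yield() stands in for ldap_pvt_thread_yield().

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical allocator signature, shaped like lock_id(env, &locker). */
typedef int (*locker_alloc_fn)(void *env, unsigned int *idp);

/* Try to allocate a locker at most max_tries times, yielding between
 * attempts. Returns 0 on success, otherwise the last error seen
 * (EINVAL if max_tries is not positive). */
static int get_locker_bounded(locker_alloc_fn alloc, void *env,
                              unsigned int *idp, int max_tries)
{
    int rc = EINVAL;

    assert(alloc != NULL && idp != NULL);
    for (int i = 0; i < max_tries; i++) {
        rc = alloc(env, idp);
        if (rc == 0)
            return 0;
        sched_yield();      /* let other threads release their lockers */
    }
    return rc;
}

/* Test stub: fails a configurable number of times, then succeeds. */
static int failures_left = 0;

static int flaky_alloc(void *env, unsigned int *idp)
{
    (void)env;
    if (failures_left > 0) {
        failures_left--;
        return ENOMEM;      /* assumed error code for the simulation */
    }
    *idp = 42;
    return 0;
}
```

The retry limit is arbitrary; the only property that matters is that the caller eventually gets an error it can report to the client.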
Steven
On Tue, 20 Aug 2002 07:01:13 Kurt D. Zeilenga wrote:
>This problem has likely been fixed in HEAD (available via CVS). Please test.
>I don't think your patch is an appropriate fix.
>At 11:30 PM 2002-08-18, steven.wilton@team.eftel.com wrote:
>>Full_Name: Steven Wilton
>>Version: 2.1.3
>>OS: Linux (Debian 3.0)
>>URL: ftp://ftp.openldap.org/incoming/
>>Submission from: (NULL) (203.24.100.137)
>>
>>
>>This is actually a fix for the problem where slapd hangs using 100% CPU load
>>with the busy thread doing sched_yield() continuously. I ran slapd with the
>>"-d 1" flag, and came up with the following:
>>
>>=> bdb_back_search
>>bdb(o=EFTEL): Lock table is out of available locker entries
>>bdb_dn2entry_rw("o=eftel")
>>=> bdb_dn2id_matched( "o=eftel" )
>>====> bdb_cache_find_entry_dn2id("o=eftel"): 1 (1 tries)
>>bdb(o=EFTEL): Locker does not exist
>>====> bdb_cache_find_entry_id( 1 ): 1 (busy) 2
>>locker = 1429
>>bdb(o=EFTEL): Locker does not exist
>>====> bdb_cache_find_entry_id( 1 ): 1 (busy) 2
>>locker = 1429
>>
>>
>>The last 3 lines then continue endlessly. The problem is that the locker is not
>>allocated correctly in the first place, due to the load on the server. I came
>>up with the following patch, which seems to have fixed the problem (I can't get
>>slapd to hang any more under the same load). I had thought about inserting a
>>small (100ms) sleep before the sched_yield, but am not sure what the most
>>portable way of doing this is.
>>
>>--- openldap-2.1.3/servers/slapd/back-bdb/back-bdb.h.orig Mon Aug 19 13:01:27 2002
>>+++ openldap-2.1.3/servers/slapd/back-bdb/back-bdb.h Mon Aug 19 13:01:56 2002
>>@@ -153,7 +153,7 @@
>> #define TXN_COMMIT(txn,f) txn_commit((txn), (f))
>> #define TXN_ABORT(txn) txn_abort((txn))
>> #define TXN_ID(txn) txn_id(txn)
>>-#define LOCK_ID(env, locker) lock_id(env, locker)
>>+#define LOCK_ID(env, locker) while(lock_id(env, locker)) {ldap_pvt_thread_yield();}
>> #define LOCK_ID_FREE(env, locker) lock_id_free(env, locker)
>> #else
>> #define LOCK_DETECT(env,f,t,a) (env)->lock_detect(env, f, t, a)
>>@@ -165,7 +165,7 @@
>> #define TXN_COMMIT(txn,f) (txn)->commit((txn), (f))
>> #define TXN_ABORT(txn) (txn)->abort((txn))
>> #define TXN_ID(txn) (txn)->id(txn)
>>-#define LOCK_ID(env, locker) (env)->lock_id(env, locker)
>>+#define LOCK_ID(env, locker) while((env)->lock_id(env, locker)) {ldap_pvt_thread_yield();}
>> #define LOCK_ID_FREE(env, locker) (env)->lock_id_free(env, locker)
>> #endif
>>