RE: OpenLDAP high CPU usage when performing mass changes



Hi Howard,

>> At a guess, based on the minimal amount of information here, you've run into the glibc malloc fragmentation issue,
>> and switching to tcmalloc might avoid the problem.
What's the quickest way to validate this on a slapd that is currently running at 99% CPU, before falling back on tcmalloc?
Can the process's smaps reveal it, for example if we see a large number of 64MB regions?
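
To make the smaps idea concrete, here is a rough Python sketch that counts anonymous mappings shaped like glibc malloc arena heaps. It assumes 64-bit Linux, where glibc reserves each non-main arena heap as a 64MB-aligned, 64MB region that usually appears as an "rw-p" mapping followed directly by a "---p" remainder. The function names are hypothetical, and a high count only shows that many arenas exist; it is a hint, not proof of fragmentation.

```python
import re
import sys

# Assumption: glibc reserves 64 MiB per non-main arena heap on 64-bit Linux.
HEAP_MAX = 64 * 1024 * 1024

# Header line of a maps/smaps entry: "start-end perms offset dev inode [path]"
HEADER = re.compile(r"^([0-9a-f]+)-([0-9a-f]+) (\S{4}) \S+ \S+ \d+\s*(.*)$")


def parse_mappings(text):
    """Return (start, end, perms, path) for each mapping header in smaps text."""
    mappings = []
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            start, end = int(m.group(1), 16), int(m.group(2), 16)
            mappings.append((start, end, m.group(3), m.group(4)))
    return mappings


def count_arena_like(text):
    """Count anonymous, 64 MiB-aligned regions that look like arena heaps."""
    maps = parse_mappings(text)
    count = 0
    for i, (start, end, perms, path) in enumerate(maps):
        # Only anonymous, writable, 64 MiB-aligned mappings are candidates.
        if path or perms != "rw-p" or start % HEAP_MAX:
            continue
        if end - start == HEAP_MAX:
            # The whole reservation is already writable.
            count += 1
            continue
        if i + 1 < len(maps):
            nstart, nend, nperms, npath = maps[i + 1]
            # Writable part followed by the still-protected remainder.
            if (not npath and nperms == "---p" and nstart == end
                    and nend - start == HEAP_MAX):
                count += 1
    return count


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python3 arena_count.py <pid of slapd>
    with open("/proc/%s/smaps" % sys.argv[1]) as f:
        print(count_arena_like(f.read()))
```

Reading /proc/<pid>/smaps normally requires running as the slapd user or root. Comparing the count before and after a mass update is likely more informative than a single reading.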

Thanks
++Cyrille

-----Original Message-----
From: openldap-technical-bounces@OpenLDAP.org [mailto:openldap-technical-bounces@OpenLDAP.org] On Behalf Of Howard Chu
Sent: Friday, March 16, 2012 8:32 AM
To: Jeffrey Crawford
Cc: OpenLDAP technical list
Subject: Re: OpenLDAP high CPU usage when performing mass changes

Jeffrey Crawford wrote:
> We are using openldap 2.4.26 with BDB 4.8 and have replication set up 
> in mirror mode for our main ldap database. There are a couple of other 
> replicas that have a subset of the data that the main cluster has but 
> we are seeing the following behavior on all of them.
>
> When performing mass updates via LDAP, let's say on the order of 30,000 
> entries being added to existing entries, we've noticed that the CPU 
> use of the slapd instances goes through the roof (between 65% and 95% 
> continuously) and seems to stay there until slapd is restarted.

When the CPU usage goes high like that it should be pretty easy to see where it's going, by getting a gdb stack trace of the running process.
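
As an illustration of the stack-trace suggestion, a small Python wrapper around gdb might look like the following. It is only a sketch: it assumes gdb is installed and that the caller is allowed to attach to slapd, and the function names are hypothetical. The gdb options used (-p, -batch, -ex) are standard.

```python
import subprocess
import sys


def gdb_backtrace_command(pid):
    """Build a gdb invocation that attaches, dumps every thread's stack, and exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def capture_backtrace(pid):
    """Run gdb against the given pid and return the backtrace text."""
    # Attaching stops the process briefly; gdb detaches when the batch ends.
    result = subprocess.run(gdb_backtrace_command(pid),
                            capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python3 slapd_bt.py <pid of slapd>
    print(capture_backtrace(int(sys.argv[1])))
```

Taking a few samples several seconds apart and comparing which frames keep recurring usually says more than a single trace. Debug symbols for slapd and libc make the output far easier to read.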

At a guess, based on the minimal amount of information here, you've run into the glibc malloc fragmentation issue, and switching to tcmalloc might avoid the problem.

> The problem is that this system has to be highly available, even for 
> writing, and when these updates "shock" the system, responsiveness 
> drops sharply while the processes are churning like that. I don't think 
> they are trying to catch up to the data changes, because if I let them 
> run a while after the updates are done (about 1 hour) and then 
> restart the instances, they go back to their normal state.

If you have the SYNC loglevel enabled, it should be obvious whether update traffic is the cause or not.

> So far the only way I've been able to mitigate the issue is to 
> repoint our LDAP proxy instances at a machine that is having less 
> trouble, restart the instances that are chugging along, then point 
> the proxies back at the one just restarted, and restart the others. 
> Not exactly a quick operation.
>
> I've played with cache settings for both OpenLDAP and BDB and have 
> reduced the frequency of this issue, but I can't seem to get rid 
> of it completely, and it shows up quite often after large data 
> manipulations. I'm at a loss as to how to debug this since nothing is 
> crashing. Any suggestions on how to find out what's causing this would 
> be very helpful. The logs are not throwing any warnings or posting 
> messages that seem out of the ordinary. I have played with the log 
> settings, but nothing seems to relate to anything that might explain 
> why we are seeing CPU usage go so high.

I would suggest you try out back-mdb in RE24. MDB uses 1/4 the total memory of BDB and it performs far fewer mallocs, so glibc malloc fragmentation should not be a problem. (I would have suggested 2.4.30, but the ITS#7190 fix is rather important if you have large volumes of delete operations. The other MDB-related ITSs, #7191 and #7196, are only crucial for non-X86 and non-Linux platforms.)

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/