OpenLDAP scalability problems
I'm running into a scalability problem with OpenLDAP 2.2.15, and possibly what appears to be a memory leak. I'm currently at a dead end, so if anyone can offer any tips I would appreciate it greatly. Apologies for the lengthy post.
Our LDAP server hosts about a dozen different databases. One of these is a very large Berkeley DB database - about 14 GB once loaded and indexed - with several million entries (I'm not sure exactly how many).
All client access to this database is read-only; it doesn't change after we load it. It is loaded with slapadd in a batch process.
During my current task, the only database I am accessing is this large one, and I am the only one using the server.
We have a custom schema for our data structure. I can post it if it becomes important.
What I am trying to do is export the entire database to a different format, accessing the content through Java's LDAP APIs.
I open two connections to the LDAP server. The first is used for the initial (paged results) query, which will return every concept code in the database - there should be about 300,000 of these in total.
As I iterate over these, I use the second LDAP connection to get further details about each concept - usually 2-3 additional queries per concept. Then I move on to the next concept, fetching the next page of results as necessary. A rough sketch of the loop is below.
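In case it makes the picture clearer, the loop looks roughly like this. This is a simplified sketch rather than my actual code - the host name, page size, filter, attribute names, and the fetchDetails() helper are all placeholders:

    import java.util.Hashtable;
    import javax.naming.Context;
    import javax.naming.NamingEnumeration;
    import javax.naming.directory.SearchControls;
    import javax.naming.directory.SearchResult;
    import javax.naming.ldap.Control;
    import javax.naming.ldap.InitialLdapContext;
    import javax.naming.ldap.LdapContext;
    import javax.naming.ldap.PagedResultsControl;
    import javax.naming.ldap.PagedResultsResponseControl;

    public class ConceptExport {
        static final String BASE = "service=Snomed-CT,dc=LexGrid,dc=org";
        static final int PAGE_SIZE = 1000; // placeholder page size

        public static void main(String[] args) throws Exception {
            Hashtable<String, String> env = new Hashtable<String, String>();
            env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
            env.put(Context.PROVIDER_URL, "ldap://ldaphost:389"); // placeholder host

            LdapContext pagedCtx = new InitialLdapContext(env, null);  // connection 1: paged scan
            LdapContext detailCtx = new InitialLdapContext(env, null); // connection 2: detail lookups

            SearchControls sc = new SearchControls();
            sc.setSearchScope(SearchControls.SUBTREE_SCOPE);
            sc.setReturningAttributes(new String[] { "conceptCode" });

            pagedCtx.setRequestControls(new Control[] {
                    new PagedResultsControl(PAGE_SIZE, Control.CRITICAL) });

            byte[] cookie;
            do {
                NamingEnumeration<SearchResult> page =
                        pagedCtx.search(BASE, "(conceptCode=*)", sc); // placeholder filter
                while (page.hasMore()) {
                    SearchResult concept = page.next();
                    fetchDetails(detailCtx, concept); // 2-3 extra queries per concept
                }

                // the cookie from the response control is needed to request the next page
                cookie = null;
                Control[] resp = pagedCtx.getResponseControls();
                if (resp != null) {
                    for (int i = 0; i < resp.length; i++) {
                        if (resp[i] instanceof PagedResultsResponseControl) {
                            cookie = ((PagedResultsResponseControl) resp[i]).getCookie();
                        }
                    }
                }
                pagedCtx.setRequestControls(new Control[] {
                        new PagedResultsControl(PAGE_SIZE, cookie, Control.CRITICAL) });
            } while (cookie != null && cookie.length > 0);

            pagedCtx.close();
            detailCtx.close();
        }

        // stand-in for the real per-concept detail queries
        static void fetchDetails(LdapContext ctx, SearchResult concept) throws Exception {
        }
    }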
When I get a certain distance through the results (it varies between about 50,000 and 220,000 concepts, and usually takes 5 to 6 hours to get that far), one of the queries for additional details takes a really, really long time - as in, it never comes back (properly). I think the call returns when the idletimeout value is hit, or possibly when my client-side timeout value is hit, but I'm not sure yet; I'm adding more debugging to help me sort this out. Then, of course, when I try to get further results, I find that my connections have been closed, which causes exceptions and generally mucks things up. And maybe I'm mistaken, but I'm under the impression that paged-results cookies are only valid while the connection that issued them is open, so there is no way for me to pick up where I left off and continue.
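For what it's worth, the client-side timeout I mentioned is just the JNDI environment properties for the Sun LDAP provider, set on the same env Hashtable shown in the sketch above. The values below are examples, not what I actually use, and I believe the read timeout property is only available in newer JDKs:

    // client-side timeouts for the Sun JNDI LDAP provider (example values)
    env.put("com.sun.jndi.ldap.connect.timeout", "30000");  // 30 seconds to establish a connection
    env.put("com.sun.jndi.ldap.read.timeout", "600000");    // 10 minutes to wait for a reply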
When I look at the slapd process in top while I am "hung" waiting for results, it looks like there has been a tremendous memory leak (sorry if variable-width fonts mangle the alignment):
top - 15:16:38 up 50 days, 7:05, 19 users, load average: 0.11, 0.08, 0.36
Tasks: 577 total, 1 running, 162 sleeping, 0 stopped, 414 zombie
Cpu(s): 1.2% us, 1.9% sy, 0.0% ni, 96.3% id, 0.4% wa, 0.0% hi, 0.2% si
Mem: 3115804k total, 3093044k used, 22760k free, 56548k buffers
Swap: 6152872k total, 207992k used, 5944880k free, 1939880k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18882 openldap 18 0 2343m 970m 1.4g S 0.0 31.9 0:00.16 slapd
This snapshot was taken while I was waiting for a query to return. You can see it is using 0% of the CPU time, but the memory usage is pretty ridiculous. It certainly seems to me that it shouldn't need 2+ GB of RAM simply to do a scan across the database. And it doesn't appear to be doing a darn thing, yet my client code is still waiting for it. If I run the query it hung on manually in a GUI LDAP client, it does return, but it's quite slow. One of my guesses is that I am starting to hit paging death, since the machine only has 3 GB of real memory.
Can anyone offer me any advice? Maybe I just have a really dumb configuration... I would love it if the fix were that simple. So here is my full config:
Here are my config options from the slapd.conf file that I use to start the server:
idletimeout 1800
threads 150
sizelimit 1000000
*Then there are about a dozen databases, each configured with a block like this:
*(this block is for the database that is currently giving me problems)
*****
database bdb
suffix "service=Snomed-CT,dc=LexGrid,dc=org"
directory /home/ldap/database/production/dbSNOMED-CT
limits * size.pr=10000 size.prtotal=none
checkpoint 512 30
index objectClass eq
index conceptCode eq
index language pres,eq
index dc eq
index sourceConcept,targetConcept,association,presentationId eq
index text,entityDescription pres,eq,sub,subany
****
*The file then ends with the top-level database:
database bdb
suffix "dc=org"
directory ../../database/production/dbusers
limits * size.pr=10000 size.prtotal=none
checkpoint 512 30
*Oh - there are also includes at the top of the config file for the custom schema, which I am leaving out.
Here is my DB_CONFIG file from the problematic database:
set_flags DB_TXN_NOSYNC
set_flags DB_TXN_NOT_DURABLE
set_cachesize 0 102400000 1
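If it matters: my reading of the Berkeley DB docs is that the set_cachesize line above works out to roughly a 100 MB cache in a single region. The format is set_cachesize <gbytes> <bytes> <ncache>, so a 1 GB cache, for example, would be written like the following (shown only to illustrate the syntax, not something I have tried):

    set_cachesize 1 0 1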
Thanks,
Dan