Re: performance
- To: OpenLDAP-devel@openldap.org
- Subject: Re: performance
- From: Howard Chu <hyc@symas.com>
- Date: Sat, 10 Nov 2007 07:14:11 -0800
- In-reply-to: <472B63D3.5000109@symas.com>
- References: <4718321A.60207@symas.com> <4719E4D6.1090307@symas.com> <471AD1F3.4060407@boreham.org> <471AF4CF.3000308@symas.com> <471CD358.4020802@hp.com> <4721A81C.7090709@symas.com> <47233031.1050802@symas.com> <4728FC95.9070608@symas.com> <472B63D3.5000109@symas.com>
- User-agent: Mozilla/5.0 (X11; U; Linux i686; rv:1.9a9pre) Gecko/2007100901 SeaMonkey/2.0a1pre
Howard Chu wrote:
The time to slapadd 5 million entries for back-hdb was 29 minutes:
time ../servers/slapd/slapd -Ta -f slapd.conf.5M -l ~hyc/5Mssha.ldif -q
real 29m15.828s
user 20m22.110s
sys 6m56.433s
Using a shared memory BDB cache instead of mmap'd files brought this time down
to 16 minutes. (For 1M entries, it was only 3 minutes.)
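The shared memory cache is just the shm_key directive in the back-hdb section of
slapd.conf; it tells BerkeleyDB to put its environment regions into SysV shared
memory instead of the default mmap'd files in the database directory. Roughly,
the underlying BDB calls look like this (a sketch only, with a made-up key and
cache size):

#include <db.h>

/* Sketch only: open a BDB environment whose regions live in SysV shared
 * memory (DB_SYSTEM_MEM) rather than in mmap'd files under "home". */
int open_shared_env(DB_ENV **envp, const char *home)
{
    DB_ENV *env;
    int rc;

    if ((rc = db_env_create(&env, 0)) != 0)
        return rc;

    env->set_cachesize(env, 1, 0, 1);   /* 1 GB cache, one segment */
    env->set_shm_key(env, 42);          /* base SysV shm key (made up) */

    rc = env->open(env, home,
        DB_CREATE | DB_INIT_MPOOL | DB_INIT_LOCK |
        DB_INIT_LOG | DB_INIT_TXN | DB_SYSTEM_MEM, 0);
    if (rc != 0) {
        env->close(env, 0);
        return rc;
    }
    *envp = env;
    return 0;
}

Nothing else about the database changes; only where the cache/lock/log regions
live.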
For ADAM it was about 3 hours:
time ldifde.exe -i -h -f 5M.ldif -s localhost -q 8
743.57u 475.56s 2:59:44.78 11.3%
For regular Active Directory it was just under 24 hours.
I noticed something familiar while looking at ldapsearch output from AD and
ADAM. Even though their ISAM database uses B-trees for its indices, and they
use integers for their primary keys (they call them DNTs, DN Tags, but they're
essentially the same as our entryIDs, though theirs are fixed at 32 bits and
ours are 64 bits on a 64 bit machine), they don't preserve the entry input
order. In fact, it looks like they're passing their 4-byte integer keys
directly to their B-tree routines in native little-endian byte order, which
totally screws up the sorting and locality properties of the B-tree. I noticed
this same misbehavior in the back-bdb code within about a week of starting to
work on it. (I patched it in November 2001 in back-bdb/init.c:1.42, and made the
corresponding fix to back-ldbm/init.c:1.28 in December 2001...)
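The fix is conceptually simple: either store the keys big-endian so that
byte-wise order matches numeric order, or give BerkeleyDB a comparison function
that orders integer keys numerically. A sketch of the latter (illustrative only,
against the BDB 4.x C API, not the actual back-bdb patch):

#include <string.h>
#include <db.h>

/* Order fixed-width integer keys numerically. Assumes 4-byte keys stored
 * in native byte order; purely illustrative. */
static int
intkey_cmp(DB *db, const DBT *a, const DBT *b)
{
    unsigned int ia, ib;

    memcpy(&ia, a->data, sizeof(ia));
    memcpy(&ib, b->data, sizeof(ib));

    return ia < ib ? -1 : (ia > ib ? 1 : 0);
}

/* Installed on each ID-keyed database before DB->open():
 *     db->set_bt_compare(db, intkey_cmp);
 */

With something like that in place, sequentially assigned entryIDs really do
insert sequentially.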
So essentially they have a fundamental design flaw that is costing them a huge
amount of both write and read performance. It's even more amusing that they
recognized they had a problem, but rather than fix the actual problem, they
wrote a defragmentation process that has to run on a regular schedule (or can
be invoked manually while the server is shut down) to try to put a band-aid over
the problem. Good old Microsoft.
Summary: B-trees are balanced trees that use ordered keys. If you store these
keys in this order, a retrieval should return them in the same order:
1
2
3
4
This characteristic offers a number of performance advantages - when you're
doing a bulk load, it is essentially the same as a sequential write - there's
very little disk seeking needed. When you're trying to find ranges of entries
within a node, you can use binary search to locate the desired targets instead
of having to do a pure linear search. And if you're retrieving a range of
entries they'll reside next to each other, which makes caching more effective.
But this only works if you insert keys in their natural, sequential order, or
your B-tree sort/comparison function knows something special about your keys'
characteristics. Most B-tree implementations just use memcmp for this purpose.
What happens when you feed little-endian integers to memcmp is incredibly bad,
although the effects won't be seen until you get to more than 256 items in the
database. The first 255 entries will all get added sequentially, without
difficulty. But when #256 comes, instead of just tacking on after #255, it has
to seek back to the beginning and do an insertion.
The sort order with little-endian integers looks like (in hex)
0100
0200
...
ff00
for entries 1-255. Then 256 comes along as
0001
which clearly belongs at the beginning of the above list. And 257,
0101
slots in right after entry #1. If all of the nodes were optimally filled up to
this point, the majority of them are going to need to be split as you add
entries from 256-511. And the same rewind seeking/splitting will occur again
for 512-767, and so on for everything up to entry #65535. Then entry #65536
aggravates the problem even further, as does entry #16777216.
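You can watch this happen without any B-tree at all: byte-wise sorting of 4-byte
little-endian keys (which is exactly what a memcmp-based comparator does to
them) scrambles the numeric order. A throwaway demo, not OpenLDAP code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NKEYS 260

static unsigned char keys[NKEYS][4];

/* byte-wise comparison - what a memcmp-based B-tree comparator sees */
static int bytewise(const void *a, const void *b)
{
    return memcmp(a, b, 4);
}

int main(void)
{
    int i;

    /* encode IDs 1..260 as 4-byte little-endian keys */
    for (i = 0; i < NKEYS; i++) {
        unsigned int id = i + 1;
        keys[i][0] = id & 0xff;
        keys[i][1] = (id >> 8) & 0xff;
        keys[i][2] = (id >> 16) & 0xff;
        keys[i][3] = (id >> 24) & 0xff;
    }

    qsort(keys, NKEYS, 4, bytewise);

    /* print the first ten keys in "B-tree order" */
    for (i = 0; i < 10; i++) {
        unsigned int id = keys[i][0] | (keys[i][1] << 8) |
            (keys[i][2] << 16) | ((unsigned int)keys[i][3] << 24);
        printf("%u\n", id);
    }
    return 0;
}

It prints 256, 1, 257, 2, 258, 3, 259, 4, 260, 5 - entry #256 sorts ahead of
everything, and each of 257-260 wedges itself in between keys that were already
in place, which is exactly the insert pattern that forces all the rewinding and
page splitting.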
I.e., this flaw turns what should be the best possible insert order for a
B-tree into the worst possible insert order. It's something that's easily
overlooked when you don't think about what you're doing, but its effect is so
obvious you really can't miss it. No wonder fragmentation is such a severe
problem in AD and ADAM...
I think this also explains to a large degree why, even though all of these
packages use B-trees, OpenLDAP imports data an order of magnitude faster than
ADAM, and nearly two orders of magnitude faster than AD. It also explains why
our import speeds scale pretty much linearly with the number of objects,
without extensive multithreading tricks, while theirs scale sub-linearly,
even using multiple threads in parallel. (When you compare actual CPU time
used, instead of wall clock time, the difference is another 3-4x greater at 1M
entries. More than that at 5M.)
By now there's not much point in testing Active Directory any further; it's so
miserably bad there'd be nothing to gain. ADAM isn't quite as bulky as AD, but
it still has a lot of the same design constraints and flaws.
--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/