
Re: wasteful data structures: AVL tree

To: openldap-devel@openldap.org
Subject: Re: wasteful data structures: AVL tree
From: Emmanuel Lécharny <elecharny@gmail.com>
Date: Thu, 29 Jan 2015 13:28:09 +0100
In-reply-to: <54C9E07C.5070000@symas.com>
References: <54C9A6F4.3020707@symas.com> <54C9DD58.3000808@gmail.com> <54C9E07C.5070000@symas.com>
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:31.0) Gecko/20100101 Thunderbird/31.4.0

On 29/01/15 08:25, Howard Chu wrote:
> Emmanuel Lécharny wrote:
>> On 29/01/15 04:20, Howard Chu wrote:
>>> ITS#8038 (syncrepl hanging onto its presentlist) only came to my
>>> attention due to the amount of memory involved. On a refresh of a DB
>>> with 2.8M entries I saw the consumer using about 320MB just for the
>>> presentlist. This list consists solely of 16 byte entryUUIDs; 2.8M
>>> items should have used no more than 48MB. An AVL node itself is 28
>>> bytes on 64-bit platform, plus 16 bytes for the struct berval wrapped
>>> around the UUID.
>>>
>>> I'm looking into adding an in-memory B+tree library to liblutil. For
>>> the type of fixed-size records we're usually storing in AVL trees, a
>>> Btree will be much more compact and higher performance since it will
>>> need rebalancing far less frequently.
>>>
>> Why use a B+tree? Wouldn't a hash map be a more appropriate data
>> structure? EntryUUID ordering seems like overkill...
>
> I'm not fond of hashes, they're always cache-unfriendly and most of
> them have very poor dynamic growth behavior. Since we don't know in
> advance how many IDs are being stored, growth/resizing is a major
> concern. Tree structures are generally preferred because they have
> very good incremental growth performance, and B+trees have the best
> CPU cache behavior.
Hashes have three problems:
- first, as you say, growing a hash usually means copying the whole
table
- second, they can degenerate when many keys collide in the same bucket
- third, on average roughly 30% of their slots sit empty
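
To make the first point concrete, here is a minimal sketch in Python (not
OpenLDAP code; the OpenTable class, its 0.7 load limit and its doubling
policy are assumptions chosen for illustration). It is a toy
open-addressing set that counts how many stored keys have to be
re-inserted each time the table grows:

```python
class OpenTable:
    """Toy open-addressing hash set with linear probing (illustrative only)."""

    def __init__(self, capacity=8, max_load=0.7):
        self.slots = [None] * capacity   # None marks an empty slot
        self.count = 0                   # number of stored keys
        self.max_load = max_load         # assumed fill limit before growing
        self.moved = 0                   # keys re-inserted because of growth

    def _probe(self, slots, key):
        # Walk forward from the home slot until we find the key or a hole.
        i = hash(key) % len(slots)
        while slots[i] is not None and slots[i] != key:
            i = (i + 1) % len(slots)
        return i

    def _grow(self):
        # Growing means allocating a bigger array and rehashing every key.
        old = self.slots
        new = [None] * (2 * len(old))
        for key in old:
            if key is not None:
                new[self._probe(new, key)] = key
                self.moved += 1
        self.slots = new

    def add(self, key):
        if self.count + 1 > self.max_load * len(self.slots):
            self._grow()
        i = self._probe(self.slots, key)
        if self.slots[i] is None:
            self.slots[i] = key
            self.count += 1

    def __contains__(self, key):
        return self.slots[self._probe(self.slots, key)] == key
```

Every call to _grow() touches all keys stored so far, which is the
copying cost mentioned above. The unused part of the array after a
growth step is where the empty slots come from.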

Now, on average, with well-distributed data, they have some
major advantages:
- their average lookup cost is O(1), which a tree cannot match
- they use little memory, as a hash is generally backed by an
array containing the data plus a flag that indicates a follow link if
the bucket is shared by more than one element
- adding and deleting elements in a hash map is generally not expensive
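
As a rough back-of-the-envelope check of the memory argument, the sketch
below (Python) plugs in the figures quoted earlier in this thread: 2.8M
entries, 16-byte entryUUIDs, a 28-byte AVL node and a 16-byte struct
berval. The hash layout is an assumption: a flat array of 16-byte slots
with about 30% of them empty, ignoring the follow-link flag and any
allocator overhead.

```python
ENTRIES = 2_800_000          # entries in the refresh quoted above
UUID_BYTES = 16              # size of one entryUUID
AVL_NODE_BYTES = 28          # AVL node size quoted above
BERVAL_BYTES = 16            # struct berval wrapper quoted above
HASH_EMPTY_FRACTION = 0.30   # average emptiness claimed in this message


def estimate():
    """Return estimated sizes in bytes under the assumptions stated above."""
    raw = ENTRIES * UUID_BYTES
    # One node + one berval + the UUID payload per entry.
    avl = ENTRIES * (AVL_NODE_BYTES + BERVAL_BYTES + UUID_BYTES)
    # Flat slot array sized so that ~30% of the slots stay empty.
    flat_hash = raw / (1 - HASH_EMPTY_FRACTION)
    return {"raw": raw, "avl": avl, "flat_hash": flat_hash}
```

Under these assumptions the raw UUIDs take 44.8 MB, the AVL layout about
168 MB and the flat hash about 64 MB. These numbers are estimates only
and do not account for the 320MB actually observed.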

Compared with a B+tree, which is stable at O(log N), a hash is
faster, uses less memory, and is easier to implement. The most
critical point is that adding to or removing from a hash is far less
expensive than for a B-tree. You can also protect a hash against
concurrent access at a much finer grain than a B+tree, by splitting the
buckets into blocks of sub-buckets, with one lock per block.
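
The per-block locking idea could look roughly like the following sketch
(Python; StripedSet, fill_concurrently and the choice of 16 blocks are
hypothetical names and values, and a built-in set stands in for each
block of sub-buckets). Two threads only contend when their keys hash to
the same block:

```python
import threading


class StripedSet:
    """Hash set split into blocks, each guarded by its own lock (sketch)."""

    def __init__(self, stripes=16):
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._blocks = [set() for _ in range(stripes)]

    def _index(self, key):
        # The key's hash selects the block, and therefore the lock.
        return hash(key) % len(self._locks)

    def add(self, key):
        i = self._index(key)
        with self._locks[i]:
            self._blocks[i].add(key)

    def discard(self, key):
        i = self._index(key)
        with self._locks[i]:
            self._blocks[i].discard(key)

    def __contains__(self, key):
        i = self._index(key)
        with self._locks[i]:
            return key in self._blocks[i]

    def __len__(self):
        # Lock one block at a time; the total is only a snapshot.
        total = 0
        for lock, block in zip(self._locks, self._blocks):
            with lock:
                total += len(block)
        return total


def fill_concurrently(target, n_threads=4, per_thread=1000):
    """Have several threads insert disjoint integer ranges into target."""
    def worker(start):
        for k in range(start, start + per_thread):
            target.add(k)

    threads = [threading.Thread(target=worker, args=(n * per_thread,))
               for n in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

In CPython the global interpreter lock limits real parallelism, so this
only shows the locking structure, not the speed-up a C implementation
might get.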

At this point, some real-world experiments are needed to validate those
approaches.

