[Date Prev][Date Next] [Chronological] [Thread] [Top]

Indexing of UTF-8 attributes

To: <openldap-software@OpenLDAP.org>
Subject: Indexing of UTF-8 attributes
From: "James Courtney" <Jcourtney@inphonic.com>
Date: Fri, 20 Jun 2003 12:19:41 -0700
Content-class: urn:content-classes:message
Thread-index: AcM3YOZgCxhfKWrKSG+MP9YKxBOePg==
Thread-topic: Indexing of UTF-8 attributes

Hello all,
I'm quite new to the LDAP world but our company uses a product based on OpenLDAP (Mirapoint messaging server integrated with some Cisco legacy voicemail/telephony systems) which defines several attributes which must be searched, sometimes with wildcards. For these searches we have applied eq and sub index types. This platform also allows us to specify an index type subinitial which is a "starts with" index. One of our attributes is resulting in very expensive and lengthy searches despite being indexed equivalently to other attributes. I'll give two examples now:

ATTRIBUTE SAMPLE VALUES
-------------------- ----------------------------------
givenname "fred", "jamey", "mary", "johannes"
mailalternateaddress "FAX=1234567890@domain.com"

The givenname attribute searches quickly when indexed and the rather strange looking mailalternateaddress is the attribute with poor search performance and is used by our application in ways I admit I don't fully understand. We've integrated several pieces of software together to create our system and it's one or more of these pieces that make use of this attribute.

Pretty much all searches on givenname work efficiently and complete in a fraction of a second.

Equality searches on mailalternateaddress work very efficiently and complete in a fraction of a second.

Without a sub or subinitial index on the mailalternateaddress attribute a wildcard query like "(mailalternateaddress=FAX=8668123456*)" (which is what the application in question routinely tries to do) results in a 3-3.5 minute effort on the part of our QA LDAP system.

With a sub index on the mailalternateaddress attribute a query like "(mailalternateaddress=FAX=8662123456*)" runs in just over 1 minute and a query like "(mailalternateaddress=FAX=8775123456*)" runs in 50+ seconds. This is still basically unacceptable for our production environment.

With a subinitial index (I don't see this supported for OpenLDAP in general) on the mailalternateaddress attribute a query like "(mailalternateaddress=FAX=8662123456*)" runs in about 1 minute, 10+ seconds and a query like "(mailalternateaddress=FAX=8775123456*)" run in about the same.

So I am suspicious that the common prefix of our mailalternateaddress attributes is fouling things up...

All of the mailalternateaddress attributes in our system (one for each of about 20,000 users) starts with "FAX=" and nearly all start with "FAX=866" as nearly all of the telephone numbers our users are assigned are start with 866. This is 7 characters that are identical for nearly all of the 20,000 user entries. Since the 8th character for all is a digit we know there are only 10 possibilities for the 8th character. To give a better idea here are the results of a query from our system which indicate the number of telephone numbers starting with each of the 4-digit prefixes in our system:

PREFIX COUNT
------ -----
8662 5584
8663 2200
8664 1857
8665 2171
8666 559
8667 2407
8668 3934
8772 1141
8773 208
8774 363
8775 212
8776 242
8777 314
8778 346

My theory was that the string indexing is performed on the first several (let's say 8 characters for UTF-8 or any 8-bit character encoding) to produce unique values (hash bins) up to the number supported by an 8-byte long-int. It appears something less simplistic is being done though as that would really only be good for a subinitial index, not a true substring index and these perform basically identically. It's possible that the platform we're using claims a subinitial index type which it internally maps to the sub or substring index type of the underlying OpenLDAP store. I'm not sure why this would be done though. Probably the strongest evidence debunking my theory is that the numbers prefixed with 8775 don't query more quickly than those starting with 8662 which should be the case. HMMM...

Caching does help significantly with the mailalternateaddress queries but I'm not sure at all that we can set our production system up to keep such large queries in cache permanently. Far better would be a working index.

Before I get many emails telling me how asinine the type of query being performed is and questioning the use of such an attribute I would like to point out that I agree but remind folks that this is not an application that we have written. We do have access to source code which could be changed but it's fairly opaque and I'd like to know that I can't improve the LDAP performance significantly before I seriously consider code changes.

Any light you LDAP gurus out there can shed on my quandary regarding index performance would be most appreciated. Thanks to all in advance!

Jamey

James Courtney
InPhonic Inc.
Hayward, CA

Follow-Ups:
- Re: Indexing of UTF-8 attributes
  - From: Dieter Kluenter <dieter@dkluenter.de>

Prev by Date: openldap down and pam ldap
Next by Date: RE: access to schema info from backend
Index(es):
- Chronological
- Thread