Re: substring index oddity
John Madden wrote:
"uid=*0371*" dn
# numResponses: 125
# numEntries: 124
real 0m0.052s
Further research on the "allidsthreshold" concept mentioned in the old list thread
led me to SLAPD_LDBM_MIN_MAXIDS, which, at 8192-4, is likely too low for a
million objects that were created sequentially. Unfortunately, I'm running Debian
for a reason -- going back to compiling from source (as I do now) is a last
resort.
(Since I'm using bdb, is the #define even relevant?)
No. Nothing about LDBM has any relevance to back-bdb.
The, uh, "good news" is that the numEntries I'm seeing for the "bad" query is far
below 8188, just 1111. So perhaps this isn't an allids problem? For reference,
the searches with numEntries:
uid=*222* : 29 seconds
# numEntries: 3700
As already noted, by default a substring term must have at least 4
characters in order for indexing to have any effect.
uid=*222 : 0.063 seconds
# numEntries: 1000
uid=*2*22 : 0.14 seconds
# numEntries: 3439
And then, just for fun I did:
uid=*2 : 29 seconds
# numEntries: 100000
subinitial and subfinal default to a minimum of two characters, so this
search didn't use the index either.
uid=*22 : 0.41 seconds
# numEntries: 10000
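The pattern in these timings follows from how substring index keys are extracted. Here is a rough sketch of that idea (not OpenLDAP's actual code; only the default minimum lengths come from the thread: 4 characters for an "any" substring, 2 for initial/final):

```python
# Toy model of substring-index key extraction, illustrating why short
# substring terms fall back to a full scan. The length defaults mirror
# slapd's documented behavior; the extraction logic itself is a
# simplified sketch, not OpenLDAP's implementation.

ANY_LEN = 4      # minimum length for an indexed *foo* (any) term
IF_MINLEN = 2    # minimum length for indexed foo* / *foo terms

def index_keys(term, kind):
    """Return the index keys usable for a substring filter term,
    or [] if the term is too short to use the index at all."""
    if kind == "any":
        if len(term) < ANY_LEN:
            return []                      # no keys -> full scan
        # slide a window of ANY_LEN over the term
        return [term[i:i + ANY_LEN] for i in range(len(term) - ANY_LEN + 1)]
    if len(term) < IF_MINLEN:
        return []                          # too short for initial/final too
    if kind == "initial":
        return [term[:i] for i in range(IF_MINLEN, min(len(term), ANY_LEN) + 1)]
    return [term[-i:] for i in range(IF_MINLEN, min(len(term), ANY_LEN) + 1)]

# uid=*222*  -> "any" term of length 3: no keys, hence the 29-second scan
print(index_keys("222", "any"))        # []
# uid=*0371* -> length 4: indexed, hence the 0.052s result
print(index_keys("0371", "any"))       # ['0371']
# uid=*2     -> "final" term of length 1: below the minimum of 2
print(index_keys("2", "final"))        # []
# uid=*22    -> length 2: indexed
print(index_keys("22", "final"))       # ['22']
```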
...So 10,000 entries can be returned from an indexed search, well over the 8188. Is
there another allids-like limit someplace?
There is something like that limit; the default is around 131072 in
back-bdb (131072 in memory, 65536 for the on-disk index slots), but it
doesn't quite mean the same as allids meant in back-ldbm. That is, when
back-ldbm hit its limit, it would return ALLIDS which literally meant
"all entries in the database." When back-bdb gets to this limit, it
turns into a range. The range has a low end at the first/lowest of the
~131,000 IDs, and a high end at the last/highest of the ~131,000 IDs.
Once you get into operating as a range, you may get degraded
performance, but not quite as bad as back-ldbm. That is, the performance
degrades depending on how discontiguous the original list of IDs was. If
they were all contiguous, then there is no degradation. If there were
large gaps in the ID list, then performance degrades.
e.g., if the threshold for converting to a range was only 4 IDs, and you
had the list 1,2,3,4: when you add ID 5 to the list and get the range
1-5, there is no performance penalty. But if you had the list 1,6,10,15
and add the ID 20 to the list and get the range 1-20, then performance
suffers because the search will be examining 20 entries instead of just
5. It's still better than examining every entry in the DB, the way
back-ldbm does, but people running larger DBs probably still should
increase these limits.
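Howard's example can be sketched in a few lines. This uses his hypothetical threshold of 4; the real back-bdb limits are the ~131072/65536 compile-time constants mentioned above, and the real data structures are different:

```python
# Sketch of the IDL-to-range collapse described above, with a
# hypothetical threshold of 4 IDs (real slapd uses much larger
# compile-time constants).

THRESHOLD = 4

def add_id(idl, new_id):
    """idl is either a sorted list of IDs or a ('range', lo, hi) tuple."""
    if isinstance(idl, tuple):                 # already a range: just widen it
        _, lo, hi = idl
        return ("range", min(lo, new_id), max(hi, new_id))
    idl = sorted(set(idl + [new_id]))
    if len(idl) > THRESHOLD:                   # over the limit: collapse
        return ("range", idl[0], idl[-1])
    return idl

def candidates(idl):
    """Number of entries a search would have to examine."""
    if isinstance(idl, tuple):
        _, lo, hi = idl
        return hi - lo + 1
    return len(idl)

# Contiguous: 1,2,3,4 + 5 -> range 1-5, still only 5 candidates.
r1 = add_id([1, 2, 3, 4], 5)
print(r1, candidates(r1))      # ('range', 1, 5) 5

# Gappy: 1,6,10,15 + 20 -> range 1-20: 20 candidates for 5 real matches.
r2 = add_id([1, 6, 10, 15], 20)
print(r2, candidates(r2))      # ('range', 1, 20) 20
```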
--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc
OpenLDAP Core Team http://www.openldap.org/project/