[Date Prev][Date Next]
[Chronological]
[Thread]
[Top]
substring filter/index algorithms
Does someone on this list know how to implement and tune substring
indexing, or know where to ask? It looks to me like OpenLDAP's
substring filtering algorithm, and partly the indexing algorithm, can be
significantly improved. I'd appreciate advice on how to do this - or
advice not to bother:-)
Currently it works like this (thanks to Howard for the explanation):
A subany index will store a hash of each 4-byte substring of an
indexed string, e.g. "abcdef" gets index-values for *abcd*, *bcde* and
*cdef*. A subinitial index gets initial substrings of 2-4 characters:
ab*, abc* and abcd*. Similar for subfinal. A search for
(cn=*123456789*) is treated as (&(cn=*1234*)(cn=*3456*)(cn=*5678*))
for indexing purposes, stepping 2 characters at a time.
The numbers 4 and 2 can now be configured.
I can see several things to improve with that, but I'd rather not
reinvent an outdated approach:
- Unless I've missed something, the final "9" in the filter above
is not used for indexing.
It would be better to include (cn=*6789*) (&...) filter above, in
addition to or instead of the final term.
- With a short filter like (cn=*abcde*), one might want to squeeze
all info one can get out of it, and use (&(cn=*abcd*)(cn=*bcde*)).
With a long filter like (cn=*Hallvard*Furuseth*), maybe
(&(cn=*Hall*)(cn=*vard*)(cn=*Furu*)(cn=*seth*)) is quite enough.
So perhaps instead of configuring that filtering splits substrings by
steps of e.g. 2 characters, maybe it would be more useful to configure
some sort of measure of how much information to try to extract from
the index.
- If one wants indices to work for 2- or 3-character strings, one cannot
retain 4-character hashes. So the above filter must either check
search for (&(cn=*123*)(cn=*345*)(cn=*567*)(cn=*789*)), which gives
less overlap and thus more false positives, or it must check
(&(cn=*123*)(cn=*234*)(cn=*345*)..........(cn=*789*)), which gives a
lot more index lookups.
So it would be nice to be able to configure that e.g. both 3- and
4-character substrings are indexed, though that does use more disk
space and more hash values.
- Some of the above would be nice to configure per index instead of
globally, and that 'how much information is wanted' bit might
depend on how much information to expect from the other filter
components, but I'm not going there at this time unless someone
tells me it's easy:-)
--
Hallvard