Re: normalised UTF-8, should it be "decomposed", or "composed"?

To: Howard Chu <hyc@highlandsun.com>
Subject: Re: normalised UTF-8, should it be "decomposed", or "composed"?
From: Stig Venaas <Stig@OpenLDAP.org>
Date: Wed, 20 Feb 2002 15:39:51 +0100
Cc: John Hughes <john@Calva.COM>, "'OpenLDAP DEVEL'" <openldap-devel@OpenLDAP.org>
In-reply-to: <NMEFLNHODBAOPDKNNJALEEPJCFAA.hyc@highlandsun.com>; from hyc@highlandsun.com on Wed, Feb 20, 2002 at 06:23:56AM -0800
References: <NMEFLNHODBAOPDKNNJALEEPICFAA.hyc@highlandsun.com> <NMEFLNHODBAOPDKNNJALEEPJCFAA.hyc@highlandsun.com>
User-agent: Mutt/1.2.5i

On Wed, Feb 20, 2002 at 06:23:56AM -0800, Howard Chu wrote:
> Thinking about this more, it might make sense to add this behavior onto
> the existing approxMatch stuff. Currently the approx code strips any
> 8 bit characters from the input strings. To make it slightly more general,
> we could first decompose the strings using compatibility mapping (NFKD).
> It looks like the liblunicode currently doesn't handle compatibility
> decompositions though.

Yes, I agree. I had some plans for this myself, but never got that far.
I don't think I have time to add NFKD now (I need to check how much work
it would be), but what we can easily do right away, and should, is
simply skip the composition step in approximate match (leaving us with
NFD) and then strip 8-bit characters. I'll look into this very soon.
Before releasing 2.1 we should try to finish anything that affects indexes,
so that people don't need to recreate them later. Optimizations like
checking for normalized forms can easily be done between minor versions.

Stig
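
The approach discussed above (decompose without recomposing, then strip
8-bit characters) can be sketched roughly as follows. This is only an
illustration in Python using the standard unicodedata module, not the
liblunicode or slapd code. The function name approx_key is hypothetical,
and "8-bit characters" is assumed here to mean any code point outside
7-bit ASCII, which is what dropping every UTF-8 byte with the high bit
set amounts to.

```python
import unicodedata


def approx_key(text: str, form: str = "NFD") -> str:
    """Hypothetical helper: decompose, then drop non-ASCII code points.

    Assumption: "8-bit characters" means code points >= 0x80, i.e. every
    UTF-8 byte with the high bit set.
    """
    # Decompose only; no recomposition step follows.
    decomposed = unicodedata.normalize(form, text)
    # Base letters survive, combining marks (all non-ASCII) are dropped.
    return "".join(ch for ch in decomposed if ord(ch) < 0x80)


# Precomposed U+00EB and "e" + U+0308 reduce to the same key.
print(approx_key("Zo\u00eb"))            # Zoe
print(approx_key("Zoe\u0308"))           # Zoe

# U+FB01 (fi ligature) has only a compatibility decomposition,
# so NFD leaves it intact and the strip step removes it entirely.
print(approx_key("\ufb01le"))            # le
print(approx_key("\ufb01le", "NFKD"))    # file
```

The last two calls show why NFKD would be the more general choice: with
NFD alone, characters that have only a compatibility mapping are lost
rather than reduced to their ASCII equivalents.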

Follow-Ups:
- Re: normalised UTF-8, should it be "decomposed", or "composed"?
  - From: "Kurt D. Zeilenga" <Kurt@OpenLDAP.org>
- RE: normalised UTF-8, should it be "decomposed", or "composed"?
  - From: "Howard Chu" <hyc@highlandsun.com>

References:
- RE: normalised UTF-8, should it be "decomposed", or "composed"?
  - From: "Howard Chu" <hyc@highlandsun.com>
- RE: normalised UTF-8, should it be "decomposed", or "composed"?
  - From: "Howard Chu" <hyc@highlandsun.com>
