
Re: LMDB and text encoding

To: Hallvard Breien Furuseth <h.b.furuseth@usit.uio.no>
Subject: Re: LMDB and text encoding
From: Timur Kristóf <timur.kristof@gmail.com>
Date: Thu, 29 Jan 2015 14:35:23 +0100
Cc: openldap-devel@openldap.org
In-reply-to: <54CA0928.2010807@usit.uio.no>
References: <CAFF-SiUrJKGvG_z5vKgn13KX6oSbWQmLDj0VqGXMsuzJT5JBEg@mail.gmail.com> <54CA083A.5070505@usit.uio.no> <54CA0928.2010807@usit.uio.no>

I've had a brief chat with Hallvard on IRC. We came up with several
possible solutions, although each of them has its drawbacks. Writing
cross-platform code that supports Unicode is always a messy business.
I vote for option 4, but would like to hear everyone's opinions before
starting work on any of them.

1) Separate widechar functions

Add functions such as mdb_env_open_w that would call the widechar
APIs. The drawback of this approach is that it would require a lot of
duplicate code, which is hard to maintain. It would also pollute the
LMDB header file.

2) New flag

Introduce a new flag (such as MDB_USE_WCHAR) that would tell
mdb_env_open to cast the path parameter to wchar_t* under the hood and
call the widechar variant of the Windows API.

Advantage: only the string concatenation code would need to be duplicated
Drawback: it is really, really ugly
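
To show what I mean by ugly, here is a minimal sketch of the cast. The
flag EXAMPLE_USE_WCHAR and the function example_path_chars are made up
for illustration and are not part of LMDB; the function only counts
characters instead of opening anything.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical flag, standing in for something like MDB_USE_WCHAR. */
#define EXAMPLE_USE_WCHAR 0x1u

/*
 * Hypothetical stand-in for a path-taking function under option 2.
 * The parameter is declared as char*, but when the flag is set the
 * caller promises that it really points at wchar_t data.
 */
static size_t example_path_chars(const char *path, unsigned int flags)
{
    assert(path != NULL);
    if (flags & EXAMPLE_USE_WCHAR) {
        /* The compiler cannot check this cast; a wrong flag means garbage. */
        return wcslen((const wchar_t *)path);
    }
    return strlen(path);
}
```

A caller would have to write example_path_chars((const char *)L"testdb",
EXAMPLE_USE_WCHAR), so a mismatch between the flag and the actual
string type is not caught at compile time.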

3) Require UTF-16 on Windows

Since Microsoft discourages the use of their ANSI APIs, we could say
that we require UTF-16 on Windows. We can make a type such as
mdb_uchar_t that we would typedef to char on Unix and wchar_t on
Windows, and then change the function signatures to use this
type.

Drawback: users who want to write cross-platform code would need to
#ifdef their calls to mdb_env_open
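
Roughly, the typedef and the resulting caller code could look like the
sketch below. Only the name mdb_uchar_t comes from the proposal above;
example_path_units and example_caller are hypothetical, and the
function just counts code units rather than opening an environment.

```c
#include <assert.h>
#include <stddef.h>

/* The proposed path character type: wchar_t on Windows, char elsewhere. */
#ifdef _WIN32
typedef wchar_t mdb_uchar_t;
#else
typedef char mdb_uchar_t;
#endif

/* Hypothetical stand-in for a function whose signature uses the new type. */
static size_t example_path_units(const mdb_uchar_t *path)
{
    size_t n = 0;
    assert(path != NULL);
    while (path[n] != 0)
        n++;
    return n;
}

/* What cross-platform callers would end up writing. */
static size_t example_caller(void)
{
#ifdef _WIN32
    return example_path_units(L"testdb");
#else
    return example_path_units("testdb");
#endif
}
```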

4) Require UTF-8 on Windows

Let's say we require the path parameter to be encoded in UTF-8, even
on Windows. Then under the hood we can convert it to UTF-16 and call
the widechar APIs. This doesn't lead to loss of performance because
Windows itself converts to UTF-16 anyway if you use their ANSI
functions.
This is the least ugly and perhaps the easiest-to-implement solution
we found. UTF-8 is easy to produce (most libraries can output it, or
the user could use u8"..." from C++11, etc.)

Advantage: this is the easiest to implement; code that worked before
(with ASCII paths) will work without modification, and we don't need
to duplicate any code.
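
For illustration, the conversion step could look something like the
portable sketch below. The helper example_utf8_to_utf16 is
hypothetical and not part of LMDB; on Windows the real code would more
likely call MultiByteToWideChar with CP_UTF8 and pass the result to
the widechar API, so treat this only as a picture of what happens
under the hood.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Convert a NUL-terminated UTF-8 string to NUL-terminated UTF-16.
 * Returns the number of UTF-16 code units written, not counting the
 * terminator, or (size_t)-1 on malformed input or a too-small buffer.
 */
static size_t example_utf8_to_utf16(const char *in, uint16_t *out, size_t out_cap)
{
    /* Smallest code point allowed for each sequence length (rejects overlongs). */
    static const uint32_t min_cp[4] = { 0, 0x80, 0x800, 0x10000 };
    const unsigned char *s = (const unsigned char *)in;
    size_t n = 0;

    assert(in != NULL && out != NULL);
    if (out_cap == 0)
        return (size_t)-1;

    while (*s) {
        uint32_t cp;
        int extra, i;
        size_t units;

        /* Decode the lead byte: payload bits and number of continuation bytes. */
        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return (size_t)-1;
        s++;

        /* Each continuation byte must look like 10xxxxxx (also catches early NUL). */
        for (i = 0; i < extra; i++, s++) {
            if ((*s & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (uint32_t)(*s & 0x3F);
        }

        /* Reject overlong forms, surrogate code points and out-of-range values. */
        if (cp < min_cp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;

        /* Always leave room for the terminating 0. */
        units = (cp >= 0x10000) ? 2 : 1;
        if (n + units > out_cap - 1)
            return (size_t)-1;

        if (units == 2) {
            /* Code points above the BMP become a surrogate pair. */
            cp -= 0x10000;
            out[n++] = (uint16_t)(0xD800 | (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            out[n++] = (uint16_t)cp;
        }
    }
    out[n] = 0;
    return n;
}
```

Plain ASCII paths come out unchanged, one code unit per byte, which is
why existing callers would not need to change anything.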
