
Re: large write amplification

To: Леонид Юрьев <leo@yuriev.ru>, "Shu, Xinxin" <xinxin.shu@intel.com>
Subject: Re: large write amplification
From: Howard Chu <hyc@symas.com>
Date: Tue, 5 May 2015 08:42:16 +0100
Cc: "openldap-technical@openldap.org" <openldap-technical@openldap.org>
In-reply-to: <CAO2+NUDgzUjeL2uuX=dvB7Fvm-VyqHbLurJEw1fgZ6OyHr-HHA@mail.gmail.com>
References: <75674D092A819E4189E91166C74CB90D01537600@shsmsx102.ccr.corp.intel.com> <CAO2+NUDgzUjeL2uuX=dvB7Fvm-VyqHbLurJEw1fgZ6OyHr-HHA@mail.gmail.com>
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0 SeaMonkey/2.37a1

Леонид Юрьев wrote:

Hi, Xinxin.

I will try to answer briefly, without going into details:

- So that readers are never blocked by a writer, LMDB provides a
snapshot of the data, indexes and directory for each completed
transaction.

- Most of the db-pages (those not changed by a particular
transaction) are "shared" between such snapshots. But any change to
the data itself, and its reflection in the btree-indexes (including
the particular table, free-db, main-db and so forth), requires new
pages to be used and written to the disk.

- In a large db, a small "one-byte" change may make a lot of db-pages
"dirty" (usually 4K each). For example, one add/del/mod operation in
an LDAP db a few GB in size requires about 50-100 page-level IOPS.

Correct, up to this last point. The degree of amplification is greatly overstated.


See http://symas.com/mdb/ondisk/

The number of pages touched depends on the height of the B+tree, which is O(logN) of the number of records. Even a tree of multiple terabytes is unlikely to reach beyond a height of 5.
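
To make the logarithmic growth concrete, here is a rough sketch rather than a measurement. It assumes 4K pages and a branch fanout of 200 entries per page; the fanout is an illustrative figure chosen for this example, not an LMDB constant, and long keys would lower it and make the tree taller. The function name btree_height is likewise invented here.

```python
def btree_height(db_bytes, page_size=4096, fanout=200):
    """Estimate B+tree levels (leaf level included) for a DB of db_bytes.

    page_size and fanout are illustrative assumptions, not LMDB constants.
    """
    # Ceiling division: treat the whole DB as leaf pages (slight overestimate).
    pages = max(1, -(-db_bytes // page_size))
    height = 1
    # Each branch level indexes `fanout` pages of the level below it.
    while pages > 1:
        pages = -(-pages // fanout)
        height += 1
    return height


for label, size in [("1 MiB", 1 << 20), ("1 GiB", 1 << 30),
                    ("1 TiB", 1 << 40), ("4 TiB", 4 << 40)]:
    print(label, btree_height(size))
```

Under these assumptions a 1 GiB database comes out at 4 levels, and both 1 TiB and 4 TiB come out at 5: growing the data a thousandfold adds only one level.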

The minimum write amplification may be on the order of 8 pages for a trivial write. But that also tends to be the maximum write amplification.
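
One way a figure of this order can arise, as a back-of-the-envelope sketch only: with copy-on-write, a commit rewrites one page per level of each tree whose root-to-leaf path changes, plus a meta page. The tree heights below are hypothetical values picked for illustration, and the helper names are invented for this example; none of this is taken from LMDB's code.

```python
PAGE_SIZE = 4096  # bytes; the "usually 4K" page size mentioned in the thread


def pages_written(tree_heights, meta_pages=1):
    """Pages a copy-on-write commit writes: one per level of every tree
    whose root-to-leaf path is copied, plus the meta page(s)."""
    return sum(tree_heights.values()) + meta_pages


def write_amplification(bytes_changed, tree_heights):
    """Bytes written to disk per byte of user data changed."""
    return pages_written(tree_heights) * PAGE_SIZE / bytes_changed


# Hypothetical heights for the three trees named earlier in the thread.
heights = {"table": 4, "main-db": 2, "free-db": 1}
print(pages_written(heights))             # 8
print(write_amplification(100, heights))  # 327.68
```

Because the heights grow only logarithmically with database size, the page count for a trivial write barely moves as the database grows, which is why the minimum and the maximum end up close together.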


Leonid.

P.S.
For high-load use cases I made a few changes in our fork of OpenLDAP/LMDB.
We call one of these features "LIFO reclaiming".
It gives us a 10-50 times performance boost, especially by exploiting
the write-back cache of the storage subsystem.
We now use it in our production (telco) environment.
But currently it is not safe in all cases, see
https://github.com/ReOpen/ReOpenLDAP/issues/2 and
https://github.com/ReOpen/ReOpenLDAP/issues/1.

The LIFO approach inherently breaks the safety guarantees of the LMDB concurrency design, as I have already explained.


--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/
