
Re: better malloc strategies



Happy New Year!

Howard Chu wrote:
I've tested again with libhoard 3.5.1, and it's actually superior to libumem for both speed and fragmentation. Here are some results against glibc 2.3.3, libhoard, and libumem and tcmalloc:

With the latest code in HEAD, the difference in speed between Hoard, Umem, and TCmalloc pretty much disappears. Compared to the results from November, they're about 5-15% faster in the single-threaded case and significantly faster in the multi-threaded case. Looking at glibc's performance, it's pretty clear that our new Entry and Attribute slabs helped a fair amount, but they weren't a complete cure. At this point I think the only cure for glibc malloc is not to use it...
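
For anyone trying to reproduce the comparison, here is a minimal sketch of how an alternate allocator can be dropped in at runtime via LD_PRELOAD, which is the usual approach for Hoard, libumem, and tcmalloc (the library and slapd paths are placeholders, and this isn't necessarily how these particular runs were launched):

    # Preload one allocator before starting slapd; paths are placeholders.
    LD_PRELOAD=/usr/local/lib/libhoard.so    /usr/local/libexec/slapd -f slapd.conf
    LD_PRELOAD=/usr/lib/libumem.so           /usr/local/libexec/slapd -f slapd.conf
    LD_PRELOAD=/usr/local/lib/libtcmalloc.so /usr/local/libexec/slapd -f slapd.conf
    # With no LD_PRELOAD, slapd uses the default glibc malloc.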


November 22 figures:
        glibc            hoard            umem             tcmalloc
        size  time       size  time       size  time       size  time
        (MB)  (mm:ss)    (MB)  (mm:ss)    (MB)  (mm:ss)    (MB)  (mm:ss)
start    732              740              734              744
single  1368  01:06.04   1802  00:24.52   1781  00:38.46   1454  00:23.86
single  1369  01:47.15   1804  00:17.12   1808  00:18.93   1531  00:17.10
single  1384  01:48.22   1805  00:16.66   1825  00:24.56   1548  00:16.58
single  1385  01:48.65   1805  00:16.61   1825  00:16.17   1548  00:16.61
single  1384  01:48.87   1805  00:16.50   1825  00:16.37   1548  00:16.74
single  1384  01:48.63   1806  00:16.50   1825  00:16.22   1548  00:16.78
single  1385  02:31.95   1806  00:16.50   1825  00:16.30   1548  00:16.67
single  1384  02:43.56   1806  00:16.61   1825  00:16.20   1548  00:16.68
four    2015  02:00.42   1878  00:46.60   1883  00:28.70   1599  00:34.24
four    2055  01:17.54   1879  00:47.06   1883  00:39.45   1599  00:41.09
four    2053  01:21.53   1879  00:40.91   1883  00:37.90   1599  00:41.45
four    2045  01:20.48   1879  00:30.58   1883  00:39.59   1599  00:56.45
four    2064  01:26.11   1879  00:30.77   1890  00:47.71   1599  00:40.74
four    2071  01:29.01   1879  00:40.78   1890  00:44.53   1610  00:40.87
four    2053  01:30.59   1879  00:38.44   1890  00:39.31   1610  00:34.12
four    2056  01:28.11   1879  00:29.79   1890  00:39.53   1610  00:53.65
CPU1          15:23.00         02:20.00         02:43.00         02:21.00
CPU final     26:50.43         08:13.99         09:20.86         09:09.05

The start size is the size of the slapd process right after startup, with the process totally idle. The id2entry DB is 1.3GB with about 360,000 entries, the BDB cache is set to 512MB, the entry cache to 70,000 entries, and cachefree to 7000. The subsequent figures are the size of the slapd process after running a single search that filters on an unindexed attribute, so the search basically spans the entire DB. The entries range in size from a few KB to a couple of megabytes.
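
For context, here is a rough sketch of how those cache settings would look in slapd.conf for back-hdb; the suffix and directory are placeholders, and the exact configuration used for these tests isn't shown here:

    database        hdb
    suffix          "dc=example,dc=com"
    directory       /var/openldap-data
    # entry cache of 70,000 entries, freeing 7000 at a time when full
    cachesize       70000
    cachefree       7000
    # 512MB BDB cache (0 GB + 536870912 bytes, 1 segment)
    dbconfig        set_cachesize 0 536870912 1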

The process sizes are in megabytes. There are 380,836 entries in the DB. The CPU1 line is the amount of CPU time the slapd process had accumulated after the last single-threaded test. The multi-threaded test consists of starting the identical search 4 times in the background from a shell script, using "wait" to wait for all of them to complete, and timing the script. The CPU final line is the total amount of CPU time the slapd process had used at the end of all the tests.
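
For concreteness, a minimal sketch of that test script, assuming ldapsearch against a local slapd; the base DN and the unindexed filter attribute are placeholders:

    #!/bin/sh
    # search4.sh - start the identical unindexed search 4 times in the
    # background, then wait for all of them to finish.  The script itself
    # is timed from the outside:  time sh search4.sh
    for i in 1 2 3 4; do
        ldapsearch -x -H ldap://localhost/ -b "dc=example,dc=com" \
            "(someUnindexedAttr=*foo*)" > /dev/null 2>&1 &
    done
    wait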


Here are the numbers for HEAD as of today:
        glibc            hoard            umem             tcmalloc
        size  time       size  time       size  time       size  time
        (MB)  (mm:ss)    (MB)  (mm:ss)    (MB)  (mm:ss)    (MB)  (mm:ss)
start    649              655              651              660
single  1305  01:02.21   1746  00:32.73   1726  00:21.93   1364  00:19.97
single  1575  00:13.53   1786  00:12.57   1753  00:13.74   1396  00:12.92
single  1748  00:14.23   1797  00:13.34   1757  00:14.54   1455  00:13.84
single  1744  00:14.06   1797  00:13.45   1777  00:14.44   1473  00:13.92
single  1533  01:48.20   1798  00:13.45   1777  00:14.15   1473  00:13.90
single  1532  01:27.63   1797  00:13.44   1790  00:14.14   1473  00:13.89
single  1531  01:29.70   1798  00:13.42   1790  00:14.10   1473  00:13.87
single  1749  00:14.45   1798  00:13.41   1790  00:14.11   1473  00:13.87
four    2202  00:33.63   1863  00:23.11   1843  00:23.49   1551  00:23.37
four    2202  00:38.63   1880  00:23.23   1859  00:22.59   1551  00:23.71
four    2202  00:39.24   1880  00:23.34   1859  00:22.77   1564  00:23.57
four    2196  00:38.72   1880  00:23.23   1859  00:22.71   1564  00:23.65
four    2196  00:39.41   1881  00:23.40   1859  00:22.67   1564  00:23.96
four    2196  00:38.82   1880  00:23.13   1859  00:22.79   1564  00:23.41
four    2196  00:39.02   1881  00:23.18   1859  00:22.83   1564  00:23.27
four    2196  00:38.90   1880  00:23.12   1859  00:22.82   1564  00:23.48
CPU1          06:44.07         01:53.00         02:01.00         01:56.34
CPU final     12:56.51         05:48.56         05:52.21         05:47.77

Looking at the glibc numbers really makes you wonder what it's doing: running "OK" for a while, then chewing up CPU, then coming back. It looks like a pretty expensive garbage-collection pass. As before, the system was otherwise idle and there was no paging/swapping/disk activity during the tests. The overall improvements probably come mainly from the new cache-replacement code and from splitting/rearranging some cache locks. (Oprofile has turned out to be quite handy for identifying problem areas... though playing with the cacheline alignment only netted a 0.5-1% effect, which is pretty forgettable.)
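
For reference, a rough sketch of the sort of oprofile session involved, using the legacy opcontrol interface; the commands are illustrative rather than a record of the actual runs:

    opcontrol --no-vmlinux      # no kernel image, userspace profiling only
    opcontrol --start
    # ... run the search workload against slapd ...
    opcontrol --dump
    opreport -l /usr/local/libexec/slapd | head -30    # top symbols in slapd
    opcontrol --shutdown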

It's worth noting that we pay a pretty high overhead cost for using BDB locks here, but that cost seems to amortize out as the number of threads increases. Put another way, this dual-core machine is scaling as if it were a quad: it takes about a 4x increase in job load to see a 2x increase in execution time. For example, with Hoard, 16 concurrent searches complete in about 41 seconds, and 32 complete in only 57 seconds. That's starting to approach the speed of the back-hdb entry cache: when all the entries are in the entry cache (as opposed to the BDB cache), a single search completes in about 0.7 seconds, which works out to roughly 500,000 entries/second.

--
  -- Howard Chu
  Chief Architect, Symas Corp.  http://www.symas.com
  Director, Highland Sun        http://highlandsun.com/hyc
  OpenLDAP Core Team            http://www.openldap.org/project/