Re: hash functions distribution test

From: Aleksandar Lazic <al-haproxy#none.at>
Date: Sun, 30 Dec 2007 14:18:40 +0100


Hi Willy,

On Son 30.12.2007 08:50, Willy Tarreau wrote:
>Hi Aleks,
>
>On Fri, Dec 28, 2007 at 11:16:13AM +0100, Aleksandar Lazic wrote:
>> >What is important is that the spreadth of the number of entries is
>> >relatively low around the average. Except for those obviously bad
>> >above, all are between avg/2 and avg*2.
>>
>> Hm, I'am not the expert in hash-function but could this be better?
>
>yes, the smallest the distance between min and max, the smoothest the
>distribution. The most important of course is to have the lowest MAX,
>but monitoring both values generally tell you what *may* happen with
>other data sets.

Ok I have know added 3 output values:

I called it "Gaussian bell" ;-)
gb_max=
gb_min=

and the diff between max/avg, avg/min
diff=

The new version is know avalilabe at http://none.at/test_hash_dis_v4.tgz

 From the datas below I have the following ranking;-)

SuperFastHash2
APHash

For urls ;-)

>> For example that a hashfunction generate a more or less equal
>> min=max, could this be the 'perfect hashing' issue?
>
>yes, could be. But once again, this will change with other data :-)

Of course ;-)
That's the reason why anybody can download an test it ;-)

>> I think we should reduce the length of the used URL not to stop
>> before the query string, imho.
>
>Not necessarily, it depends on what you want to achieve. For some
>algorithms, it's better to use the full string, for others it's better
>to truncate.

Ok an optin will be nice like:

use [no] query_string

or so for haproxy ;-)

>> Do you mean the haproxy_uri_hash fuction with the 11 bits?!
>
>No I mean that most hash functions report 32-bit values, which means
>that the result is distributed between 0 and 2^32-1. But as the
>functions rely on bit manipulation, there is generally little relation
>between high and low bits, so if you take them separately, they will
>most likely not be much randomized. It's just the global set of bits
>which constitutes a good random value.
>
>If you only use 2048 values by applying a modulo on the value, you end
>up only selecting the lowest 11 bits of the result, completely ignoring
>the higher bits. There are various methods to mix all the bits before
>returning, but all of them consume a lot of CPU cycles.
>
>The alternative well-known solution consists in using a prime integer
>for the hash size. Instead of using 2048, you can either use the
>immediate lower prime integer (2039) or the next one (2053). You can
>set this in TABLESIZE, and carefully remove TABLEMASK and TABLESHIFT
>which are not used anywhere else.

Done, thanks for hint ;-)

>BTW, I've just checked and I can confirm that chtbl_insert() does a
>simple modulo :
>
> bucket = htbl->h(data) % htbl->buckets;

Ok.

>> Hm, ok what this mean, sorry I'am not so fit with statistiks :-(
>
>This just indicates that there is a relation between all returned
>values, and that some of them are returned more often than others. This
>is most likely accentuated by the modulo which eliminates high bits.

Well, that's sounds bad. The most 'test suits' for hashtables use the modulo to find the bucket, what is a better way from your point of view?!

>For your convenience, I've tried to apply %2048 and %2039 to the logs
>of the requests coming to my site (25k requests, all in the same dirs
>because there aren't many things on my site). Here's the diff between
>modulo 2048 and 2039. You will notice that except for sax_hash, djb33x,
>bernstein and fnv_hash, the distribution has narrowed for others (sd
>reduced). Also, note how the poor haproxy_server_hash and hashpjw are
>reduced by this simple operation!

Thanks I have also run the tests now with 2039 and got the follwoing resuslts, with some new hash functions ;-)

I have sorted it by time.

---
TABLESIZE is 2039
Starting haproxy_server_hash
   time=5620 ms
   Table size is 244359
   tot=244359, used=2039, min=83, max=166, avg=119.843, sd=10.993, gb_max=1.385, gb_min=1.444 diff=0.059

Starting JSHash
   time=5681 ms
   Table size is 244359
   tot=244359, used=2039, min=84, max=156, avg=119.843, sd=11.221, gb_max=1.302, gb_min=1.427 diff=0.125

Starting SuperFastHash2
   time=5694 ms
   Table size is 244359
   tot=244359, used=2039, min=86, max=158, avg=119.843, sd=10.475, gb_max=1.318, gb_min=1.394 diff=0.075

Starting APHash
   time=5705 ms
   Table size is 244359
   tot=244359, used=2039, min=85, max=156, avg=119.843, sd=10.741, gb_max=1.302, gb_min=1.410 diff=0.108

Starting SuperFastHash
   time=5707 ms
   Table size is 244359
   tot=244359, used=2039, min=86, max=158, avg=119.843, sd=10.475, gb_max=1.318, gb_min=1.394 diff=0.075

Starting kr_hash
   time=5716 ms
   Table size is 244359
   tot=244359, used=2039, min=82, max=158, avg=119.843, sd=11.298, gb_max=1.318, gb_min=1.461 diff=0.143

Starting hash_djbx33
   time=5725 ms
   Table size is 244359
   tot=244359, used=2039, min=76, max=156, avg=119.843, sd=10.753, gb_max=1.302, gb_min=1.577 diff=0.275

Starting bernstein
   time=5727 ms
   Table size is 244359
   tot=244359, used=2039, min=87, max=158, avg=119.843, sd=10.880, gb_max=1.318, gb_min=1.378 diff=0.059

Starting sax_hash
   time=5744 ms
   Table size is 244359
   tot=244359, used=2039, min=84, max=155, avg=119.843, sd=10.792, gb_max=1.293, gb_min=1.427 diff=0.133

Starting haproxy_uri_hash
   time=5755 ms
   Table size is 244359
   tot=244359, used=2039, min=84, max=159, avg=119.843, sd=10.810, gb_max=1.327, gb_min=1.427 diff=0.100

Starting RSHash
   time=5779 ms
   Table size is 244359
   tot=244359, used=2039, min=85, max=158, avg=119.843, sd=11.003, gb_max=1.318, gb_min=1.410 diff=0.092

Starting oat_hash
   time=5764 ms
   Table size is 244359
   tot=244359, used=2039, min=87, max=159, avg=119.843, sd=10.931, gb_max=1.327, gb_min=1.378 diff=0.051

Starting fnv_32a_str
   time=5784 ms
   Table size is 244359
   tot=244359, used=2039, min=85, max=163, avg=119.843, sd=10.884, gb_max=1.360, gb_min=1.410 diff=0.050

Starting PJWHash
   time=5789 ms
   Table size is 244359
   tot=244359, used=2039, min=86, max=159, avg=119.843, sd=11.074, gb_max=1.327, gb_min=1.394 diff=0.067

Starting ELFHash
   time=5789 ms
   Table size is 244359
   tot=244359, used=2039, min=87, max=164, avg=119.843, sd=10.673, gb_max=1.368, gb_min=1.378 diff=0.009

Starting wt_hash
   time=5796 ms
   Table size is 244359
   tot=244359, used=2039, min=82, max=162, avg=119.843, sd=11.267, gb_max=1.352, gb_min=1.461 diff=0.110

Starting hashword
   time=5815 ms
   Table size is 244359
   tot=244359, used=2039, min=85, max=153, avg=119.843, sd=10.797, gb_max=1.277, gb_min=1.410 diff=0.133

Starting fnv_hash
   time=5820 ms
   Table size is 244359
   tot=244359, used=2039, min=79, max=159, avg=119.843, sd=10.846, gb_max=1.327, gb_min=1.517 diff=0.190

Starting hashpjw
   time=5837 ms
   Table size is 244359
   tot=244359, used=2039, min=90, max=157, avg=119.843, sd=10.828, gb_max=1.310, gb_min=1.332 diff=0.022
---

Now sorted it by sd.

---
TABLESIZE is 2039
Starting SuperFastHash2
   time=5694 ms
   Table size is 244359
   tot=244359, used=2039, min=86, max=158, avg=119.843, sd=10.475, gb_max=1.318, gb_min=1.394 diff=0.075

Starting SuperFastHash
   time=5707 ms
   Table size is 244359
   tot=244359, used=2039, min=86, max=158, avg=119.843, sd=10.475, gb_max=1.318, gb_min=1.394 diff=0.075

Starting ELFHash
   time=5789 ms
   Table size is 244359
   tot=244359, used=2039, min=87, max=164, avg=119.843, sd=10.673, gb_max=1.368, gb_min=1.378 diff=0.009

Starting APHash
   time=5705 ms
   Table size is 244359
   tot=244359, used=2039, min=85, max=156, avg=119.843, sd=10.741, gb_max=1.302, gb_min=1.410 diff=0.108

Starting hash_djbx33
   time=5725 ms
   Table size is 244359
   tot=244359, used=2039, min=76, max=156, avg=119.843, sd=10.753, gb_max=1.302, gb_min=1.577 diff=0.275

Starting sax_hash
   time=5744 ms
   Table size is 244359
   tot=244359, used=2039, min=84, max=155, avg=119.843, sd=10.792, gb_max=1.293, gb_min=1.427 diff=0.133

Starting hashword
   time=5815 ms
   Table size is 244359
   tot=244359, used=2039, min=85, max=153, avg=119.843, sd=10.797, gb_max=1.277, gb_min=1.410 diff=0.133

Starting haproxy_uri_hash
   time=5755 ms
   Table size is 244359
   tot=244359, used=2039, min=84, max=159, avg=119.843, sd=10.810, gb_max=1.327, gb_min=1.427 diff=0.100

Starting hashpjw
   time=5837 ms
   Table size is 244359
   tot=244359, used=2039, min=90, max=157, avg=119.843, sd=10.828, gb_max=1.310, gb_min=1.332 diff=0.022

Starting fnv_hash
   time=5820 ms
   Table size is 244359
   tot=244359, used=2039, min=79, max=159, avg=119.843, sd=10.846, gb_max=1.327, gb_min=1.517 diff=0.190

Starting bernstein
   time=5727 ms
   Table size is 244359
   tot=244359, used=2039, min=87, max=158, avg=119.843, sd=10.880, gb_max=1.318, gb_min=1.378 diff=0.059

Starting fnv_32a_str
   time=5784 ms
   Table size is 244359
   tot=244359, used=2039, min=85, max=163, avg=119.843, sd=10.884, gb_max=1.360, gb_min=1.410 diff=0.050

Starting oat_hash
   time=5764 ms
   Table size is 244359
   tot=244359, used=2039, min=87, max=159, avg=119.843, sd=10.931, gb_max=1.327, gb_min=1.378 diff=0.051

Starting haproxy_server_hash
   time=5620 ms
   Table size is 244359
   tot=244359, used=2039, min=83, max=166, avg=119.843, sd=10.993, gb_max=1.385, gb_min=1.444 diff=0.059

Starting JSHash
   time=5681 ms
   Table size is 244359
   tot=244359, used=2039, min=84, max=156, avg=119.843, sd=11.221, gb_max=1.302, gb_min=1.427 diff=0.125

Starting RSHash
   time=5779 ms
   Table size is 244359
   tot=244359, used=2039, min=85, max=158, avg=119.843, sd=11.003, gb_max=1.318, gb_min=1.410 diff=0.092

Starting PJWHash
   time=5789 ms
   Table size is 244359
   tot=244359, used=2039, min=86, max=159, avg=119.843, sd=11.074, gb_max=1.327, gb_min=1.394 diff=0.067

Starting wt_hash
   time=5796 ms
   Table size is 244359
   tot=244359, used=2039, min=82, max=162, avg=119.843, sd=11.267, gb_max=1.352, gb_min=1.461 diff=0.110

Starting kr_hash
   time=5716 ms
   Table size is 244359
   tot=244359, used=2039, min=82, max=158, avg=119.843, sd=11.298, gb_max=1.318, gb_min=1.461 diff=0.143
---

Now sorted it by diff.

---
TABLESIZE is 2039
Starting ELFHash
   time=5789 ms
   Table size is 244359
   tot=244359, used=2039, min=87, max=164, avg=119.843, sd=10.673, gb_max=1.368, gb_min=1.378 diff=0.009

Starting hashpjw
   time=5837 ms
   Table size is 244359
   tot=244359, used=2039, min=90, max=157, avg=119.843, sd=10.828, gb_max=1.310, gb_min=1.332 diff=0.022

Starting fnv_32a_str
   time=5784 ms
   Table size is 244359
   tot=244359, used=2039, min=85, max=163, avg=119.843, sd=10.884, gb_max=1.360, gb_min=1.410 diff=0.050

Starting oat_hash
   time=5764 ms
   Table size is 244359
   tot=244359, used=2039, min=87, max=159, avg=119.843, sd=10.931, gb_max=1.327, gb_min=1.378 diff=0.051

Starting haproxy_server_hash
   time=5620 ms
   Table size is 244359
   tot=244359, used=2039, min=83, max=166, avg=119.843, sd=10.993, gb_max=1.385, gb_min=1.444 diff=0.059

Starting bernstein
   time=5727 ms
   Table size is 244359
   tot=244359, used=2039, min=87, max=158, avg=119.843, sd=10.880, gb_max=1.318, gb_min=1.378 diff=0.059

Starting PJWHash
   time=5789 ms
   Table size is 244359
   tot=244359, used=2039, min=86, max=159, avg=119.843, sd=11.074, gb_max=1.327, gb_min=1.394 diff=0.067

Starting SuperFastHash2
   time=5694 ms
   Table size is 244359
   tot=244359, used=2039, min=86, max=158, avg=119.843, sd=10.475, gb_max=1.318, gb_min=1.394 diff=0.075

Starting SuperFastHash
   time=5707 ms
   Table size is 244359
   tot=244359, used=2039, min=86, max=158, avg=119.843, sd=10.475, gb_max=1.318, gb_min=1.394 diff=0.075

Starting RSHash
   time=5779 ms
   Table size is 244359
   tot=244359, used=2039, min=85, max=158, avg=119.843, sd=11.003, gb_max=1.318, gb_min=1.410 diff=0.092

Starting haproxy_uri_hash
   time=5755 ms
   Table size is 244359
   tot=244359, used=2039, min=84, max=159, avg=119.843, sd=10.810, gb_max=1.327, gb_min=1.427 diff=0.100

Starting APHash
   time=5705 ms
   Table size is 244359
   tot=244359, used=2039, min=85, max=156, avg=119.843, sd=10.741, gb_max=1.302, gb_min=1.410 diff=0.108

Starting wt_hash
   time=5796 ms
   Table size is 244359
   tot=244359, used=2039, min=82, max=162, avg=119.843, sd=11.267, gb_max=1.352, gb_min=1.461 diff=0.110

Starting JSHash
   time=5681 ms
   Table size is 244359
   tot=244359, used=2039, min=84, max=156, avg=119.843, sd=11.221, gb_max=1.302, gb_min=1.427 diff=0.125

Starting sax_hash
   time=5744 ms
   Table size is 244359
   tot=244359, used=2039, min=84, max=155, avg=119.843, sd=10.792, gb_max=1.293, gb_min=1.427 diff=0.133

Starting hashword
   time=5815 ms
   Table size is 244359
   tot=244359, used=2039, min=85, max=153, avg=119.843, sd=10.797, gb_max=1.277, gb_min=1.410 diff=0.133

Starting kr_hash
   time=5716 ms
   Table size is 244359
   tot=244359, used=2039, min=82, max=158, avg=119.843, sd=11.298, gb_max=1.318, gb_min=1.461 diff=0.143

Starting fnv_hash
   time=5820 ms
   Table size is 244359
   tot=244359, used=2039, min=79, max=159, avg=119.843, sd=10.846, gb_max=1.327, gb_min=1.517 diff=0.190

Starting hash_djbx33
   time=5725 ms
   Table size is 244359
   tot=244359, used=2039, min=76, max=156, avg=119.843, sd=10.753, gb_max=1.302, gb_min=1.577 diff=0.275
---

Cheers

Aleks
Received on 2007/12/30 14:18

This archive was generated by hypermail 2.2.0 : 2007/12/30 14:30 CET