On Sun, Dec 30, 2007 at 02:18:40PM +0100, Aleksandar Lazic wrote:
> Hi Willy,
>
> On Son 30.12.2007 08:50, Willy Tarreau wrote:
> >Hi Aleks,
> >
> >On Fri, Dec 28, 2007 at 11:16:13AM +0100, Aleksandar Lazic wrote:
> >>>What is important is that the spreadth of the number of entries is
> >>>relatively low around the average. Except for those obviously bad
> >>>above, all are between avg/2 and avg*2.
> >>
> >>Hm, I'am not the expert in hash-function but could this be better?
> >
> >yes, the smallest the distance between min and max, the smoothest the
> >distribution. The most important of course is to have the lowest MAX,
> >but monitoring both values generally tell you what *may* happen with
> >other data sets.
>
> Ok I have know added 3 output values:
>
> I called it "Gaussian bell" ;-)
> gb_max=
> gb_min=
>
> and the diff between max/avg, avg/min
> diff=
>
> The new version is know avalilabe at http://none.at/test_hash_dis_v4.tgz
>
> From the datas below I have the following ranking;-)
>
> SuperFastHash2
> APHash
>
> For urls ;-)
You know, those are the type of exercices that I must not try on the week-end because I cannot stop. This morning I thought "just a quick test" ...
I've written a small program to graph the distribution of the number of entries. It outputs general statistics on stderr and (entries, freq) values on stdout.
It allowed me to update wt_hash several times and get a smoother one now (for my values, ...). Interestingly, it returns good shapes whatever the modulo value, even on powers of 2 because it mixes bits everywhere. I attach the program for your experimentations.
> Ok an optin will be nice like:
>
> use [no] query_string
>
> or so for haproxy ;-)
I would prefer to create a new algorithm name (unless we find a lot of other options which justify having options for algorithms).
> >This just indicates that there is a relation between all returned
> >values, and that some of them are returned more often than others. This
> >is most likely accentuated by the modulo which eliminates high bits.
>
> Well, that's sounds bad. The most 'test suits' for hashtables use the
> modulo to find the bucket, what is a better way from your point of
> view?!
either choose an algorithm which is insensible to the hash size (for instance, jenkins hash is known to be quite good), or use a prime integer. But using primes causes real divides while powers of two are cheaper.
> >For your convenience, I've tried to apply %2048 and %2039 to the logs
> >of the requests coming to my site (25k requests, all in the same dirs
> >because there aren't many things on my site). Here's the diff between
> >modulo 2048 and 2039. You will notice that except for sax_hash, djb33x,
> >bernstein and fnv_hash, the distribution has narrowed for others (sd
> >reduced). Also, note how the poor haproxy_server_hash and hashpjw are
> >reduced by this simple operation!
>
> Thanks I have also run the tests now with 2039 and got the follwoing
> resuslts, with some new hash functions ;-)
[...]
A pattern will begin to get out of those tests.
Cheers,
Willy
This archive was generated by hypermail 2.2.0 : 2007/12/30 17:00 CET