Hi Aleks,
On Thu, Dec 27, 2007 at 02:24:43AM +0100, Aleksandar Lazic wrote:
> Hi all ;-),
>
> today I have finished the first attempt to see which hash function
> distribute well in a hashtable ;-)
OK that's interesting. I've modified your code to report statistics instead of the entries values. The number of buckets used, the min and max entries per bucket, the avg entries/bucket and the standard deviation on the number of entries per bucket are reported. BTW, I had to remove 3 functions from the test because they did never complete after several minutes :
I suspect that there are some collisions which upset the chtbl_insert function.
Here's what it reports for an old squid access_log :
willy#pcw:hash_div$ ./test_hash_dis
Starting wt_hash
time=4005 ms
Table size is 184867
tot=184867, used=2048, min=60, max=127, avg=90.267090, sd=10.006124
Starting SuperFastHash2
time=4057 ms
Table size is 184867
tot=184867, used=2048, min=58, max=126, avg=90.267090, sd=9.504802
Starting SuperFastHash
time=3976 ms
Table size is 184867
tot=184867, used=2048, min=58, max=126, avg=90.267090, sd=9.504802
Starting hashpjw
time=818 ms
Table size is 184867
tot=184867, used=1884, min=0, max=930, avg=90.267090, sd=156.832357
Starting hash_djbx33
time=3991 ms
Table size is 184867
tot=184867, used=2048, min=54, max=129, avg=90.267090, sd=9.903117
Starting bernstein
time=4125 ms
Table size is 184867
tot=184867, used=2048, min=55, max=126, avg=90.267090, sd=9.861861
Starting fnv_32a_str
time=4021 ms
Table size is 184867
tot=184867, used=2048, min=60, max=126, avg=90.267090, sd=9.290003
Starting hashword
time=4158 ms
Table size is 184867
tot=184867, used=2048, min=58, max=123, avg=90.267090, sd=9.469444
Starting sax_hash
time=4024 ms
Table size is 184867
tot=184867, used=2048, min=59, max=120, avg=90.267090, sd=9.653340
Starting fnv_hash
time=4226 ms
Table size is 184867
tot=184867, used=2048, min=57, max=124, avg=90.267090, sd=9.262789
Starting oat_hash
time=4049 ms
Table size is 184867
tot=184867, used=2048, min=59, max=125, avg=90.267090, sd=9.533937
We also notice that the time is almost always the same (4 seconds), except for the only bad function (hash_pjw) which has not used all of its buckets. I really suspect that chtbl_insert() is suboptimal and that all the time is lost in it instead of in the hash functions themselves, because on the same machine, the hash functions all run between 1 and 10 millions hashes per second, while here we're more around 50k hashes/s.
FNV shows a slightly better distribution and deviation than others, and
according to my memory, it was about as fast as wt_hash on x86 but slower
on some achitectures where gcc could not optimize well the multiply. However,
I think it was easily attackable due to the fact that it relies on a simple
multiply followed by a xor ; basically if you pass to it values in this form,
I think it might always return zero in the last byte :
xxxxxxx1
xxxxxx10
xxxxx100
xxxx1000
xxx10000
xx100000
x1000000
10000000
00000000
I attach the patch to report the stats.
Best regards,
Willy
/* include the test hash files */
@@ -565,6 +566,43 @@
return;
} /* end print_table () */
+
+static void print_stats(const CHTbl *htbl) {
+ int i, j, used, tot;
+ double avg, nb1, sd, totsq;
+ unsigned int min, max;
+
+ min = ~0UL;
+ tot = max = used = 0;
+ totsq = 0;
+
+ fprintf(stdout, " Table size is %d\n", chtbl_size(htbl));
+ for (i = 0; i < TBLSIZ; i++) {
+ j = list_size(&htbl->table[i]);
+ if (j)
+ used++;
+ if (j < min)
+ min = j;
+ if (j > max)
+ max = j;
+ tot += j;
+ }
+
+ avg = (double)tot / (double)TBLSIZ;
+ for (i = 0; i < TBLSIZ; i++) {
+ j = list_size(&htbl->table[i]);
+ nb1 = (double)j - avg;
+ totsq += nb1 * nb1;
+ }
+
+ sd = sqrt(totsq/(double)TBLSIZ);
+
+ fprintf(stdout,
+ " tot=%d, used=%d, min=%d, max=%d, avg=%f, sd=%f\n",
+ tot, used, min, max, avg, sd);
+ return;
+} /* end print_table () */
+
#define run_test(fct, args) { \
@@ -587,6 +625,7 @@
}
#define run_test2(fct,counter, url) { \
+ struct timeval tv1, tv2; \
unsigned long count; \
CHTbl htbl; \
fprintf(stdout, "Starting %s\n", #fct); \
@@ -596,6 +635,7 @@
fflush(stdout); \
return 1; \
} \
+ gettimeofday(&tv1, NULL); \
for (count = 0; count < counter; count++) { \
if (chtbl_insert(&htbl, url[count],strlen(url[count])) != 0){ \
fprintf(stdout, "error @chtbl_insert :-(\n"); \
@@ -603,7 +643,15 @@
return 1; \
} \
} \
- print_table(&htbl); \
+ gettimeofday(&tv2, NULL); \
+ tv2.tv_sec -= tv1.tv_sec; \
+ tv2.tv_usec -= tv1.tv_usec; \
+ if (tv2.tv_usec < 0) { \
+ tv2.tv_usec += 1000000; \
+ tv2.tv_sec -= 1; \
+ } \
+ printf(" time=%lu ms\n", (tv2.tv_sec*1000 + tv2.tv_usec/1000)); \
+ print_stats(&htbl); \
}
int main(){
@@ -614,7 +662,7 @@
i = 0;
This archive was generated by hypermail 2.2.0 : 2007/12/27 08:15 CET