Re: hash functions distribution test

From: Willy Tarreau <w#1wt.eu>
Date: Thu, 27 Dec 2007 07:53:45 +0100


Hi Aleks,

On Thu, Dec 27, 2007 at 02:24:43AM +0100, Aleksandar Lazic wrote:
> Hi all ;-),
>
> today I have finished the first attempt to see which hash function
> distribute well in a hashtable ;-)

OK that's interesting. I've modified your code to report statistics instead of the entries values. The number of buckets used, the min and max entries per bucket, the avg entries/bucket and the standard deviation on the number of entries per bucket are reported. BTW, I had to remove 3 functions from the test because they did never complete after several minutes :

I suspect that there are some collisions which upset the chtbl_insert function.

Here's what it reports for an old squid access_log :

willy#pcw:hash_div$ ./test_hash_dis
Starting wt_hash
  time=4005 ms
  Table size is 184867
  tot=184867, used=2048, min=60, max=127, avg=90.267090, sd=10.006124 Starting SuperFastHash2
  time=4057 ms
  Table size is 184867
  tot=184867, used=2048, min=58, max=126, avg=90.267090, sd=9.504802 Starting SuperFastHash
  time=3976 ms
  Table size is 184867
  tot=184867, used=2048, min=58, max=126, avg=90.267090, sd=9.504802 Starting hashpjw
  time=818 ms
  Table size is 184867
  tot=184867, used=1884, min=0, max=930, avg=90.267090, sd=156.832357 Starting hash_djbx33
  time=3991 ms
  Table size is 184867
  tot=184867, used=2048, min=54, max=129, avg=90.267090, sd=9.903117 Starting bernstein
  time=4125 ms
  Table size is 184867
  tot=184867, used=2048, min=55, max=126, avg=90.267090, sd=9.861861 Starting fnv_32a_str
  time=4021 ms
  Table size is 184867
  tot=184867, used=2048, min=60, max=126, avg=90.267090, sd=9.290003 Starting hashword
  time=4158 ms
  Table size is 184867
  tot=184867, used=2048, min=58, max=123, avg=90.267090, sd=9.469444 Starting sax_hash
  time=4024 ms
  Table size is 184867
  tot=184867, used=2048, min=59, max=120, avg=90.267090, sd=9.653340 Starting fnv_hash
  time=4226 ms
  Table size is 184867
  tot=184867, used=2048, min=57, max=124, avg=90.267090, sd=9.262789 Starting oat_hash
  time=4049 ms
  Table size is 184867
  tot=184867, used=2048, min=59, max=125, avg=90.267090, sd=9.533937

We also notice that the time is almost always the same (4 seconds), except for the only bad function (hash_pjw) which has not used all of its buckets. I really suspect that chtbl_insert() is suboptimal and that all the time is lost in it instead of in the hash functions themselves, because on the same machine, the hash functions all run between 1 and 10 millions hashes per second, while here we're more around 50k hashes/s.

FNV shows a slightly better distribution and deviation than others, and according to my memory, it was about as fast as wt_hash on x86 but slower on some achitectures where gcc could not optimize well the multiply. However, I think it was easily attackable due to the fact that it relies on a simple multiply followed by a xor ; basically if you pass to it values in this form, I think it might always return zero in the last byte :   xxxxxxx1
  xxxxxx10
  xxxxx100
  xxxx1000
  xxx10000
  xx100000
  x1000000
  10000000
  00000000

I attach the patch to report the stats.

Best regards,
Willy

 /* include the test hash files */
@@ -565,6 +566,43 @@

   return;
 } /* end print_table () */                          

+
+static void print_stats(const CHTbl *htbl) {
+  int        i, j, used, tot;
+  double avg, nb1, sd, totsq;
+  unsigned int min, max;
+
+  min = ~0UL;
+  tot = max = used = 0;
+  totsq = 0;
+
+  fprintf(stdout, "  Table size is %d\n", chtbl_size(htbl));
+  for (i = 0; i < TBLSIZ; i++) {
+    j = list_size(&htbl->table[i]);
+    if (j)
+      used++;
+    if (j < min)
+      min = j;
+    if (j > max)
+      max = j;
+    tot += j;
+  }
+  
+  avg = (double)tot / (double)TBLSIZ;
+  for (i = 0; i < TBLSIZ; i++) {
+    j = list_size(&htbl->table[i]);
+    nb1 = (double)j - avg;
+    totsq += nb1 * nb1;
+  }
+
+  sd = sqrt(totsq/(double)TBLSIZ);
+
+  fprintf(stdout,
+          "  tot=%d, used=%d, min=%d, max=%d, avg=%f, sd=%f\n",
+          tot, used, min, max, avg, sd);
+  return;
+} /* end print_table () */
+			

 /* run the tests */  

 #define run_test(fct, args) { \
@@ -587,6 +625,7 @@
 }  

 #define run_test2(fct,counter, url) {					\
+    struct timeval tv1, tv2;						\
     unsigned long count;						\
     CHTbl   htbl;                                                       \
     fprintf(stdout, "Starting %s\n", #fct);				\
@@ -596,6 +635,7 @@
       fflush(stdout);                                                   \
       return 1;								\
     }                                                                   \
+    gettimeofday(&tv1, NULL);						\
     for (count = 0; count < counter; count++) {				\
       if (chtbl_insert(&htbl, url[count],strlen(url[count])) != 0){ 	\
 	fprintf(stdout, "error @chtbl_insert :-(\n");			\
@@ -603,7 +643,15 @@
         return 1;                                                       \
 	}								\
     }									\
-    print_table(&htbl);                                                 \
+    gettimeofday(&tv2, NULL);						\
+    tv2.tv_sec -= tv1.tv_sec;						\
+    tv2.tv_usec -= tv1.tv_usec;						\
+    if (tv2.tv_usec < 0) {						\
+      tv2.tv_usec += 1000000;						\
+      tv2.tv_sec  -= 1;							\
+    }									\
+    printf("  time=%lu ms\n", (tv2.tv_sec*1000 + tv2.tv_usec/1000));	\
+    print_stats(&htbl);                                                 \
 }  

 int main(){
@@ -614,7 +662,7 @@  

   i = 0;  

Received on 2007/12/27 07:53

This archive was generated by hypermail 2.2.0 : 2007/12/27 08:15 CET