Bench of haproxy

From: Vincent Bernat <>
Date: Wed, 04 May 2011 16:54:06 +0200


 I have tried to bench HAProxy 1.4.15 using two Spirent Avalanche appliances. The HAProxy box uses the following network cards:

 driver: igb
 version: 2.1.0-k2
 firmware-version: 1.2-1

 Those are multiqueue cards, and their interrupts are spread across the 8 processors for both TX and RX.

 The setup is fairly simple. The HAProxy box is connected to a Nortel 5530 switch using an active/active bond (balance-xor on the Linux side, MLT on the Nortel side). Both Avalanches are also connected to this switch using 2 links each. One of them acts as a reflector (web server). Each link is mapped to a set of clients (for the regular Avalanche) or acts as a set of servers (for the reflector).

 Offloading is enabled.

 rx-checksumming: on
 tx-checksumming: on
 scatter-gather: on
 tcp segmentation offload: on
 udp fragmentation offload: off
 generic segmentation offload: on

 MTU is set to 1500 (no jumbo frames)

 ╭─────────────────────────────────────────────────╮
 │ 5530 switch ┌─┐ ┌─┐  ┌─┐ ┌─┐  ┌─┐ ┌─┐ ┌─┐ ┌─┐   │
 │             └┬┘ └┬┘  └┬┘ └┬┘  └┬┘ └┬┘ └─┘ └─┘   │
 ╰──────────────┼───┼────┼───┼────┼───┼────────────╯
                │   │    │   │    │   └─────────────────┐
                │   │    │   │    └─────────────────┐   │
 ╭──────────────┼───┼─╮╭─┼───┼───────────────╮      │   │
 │ Avalanche   ┌┴┐ ┌┴┐││┌┴┐ ┌┴┐ Avalanche    │      │   │
 │ (clients)   └─┘ └─┘││└─┘ └─┘ (reflector)  │      │   │
 ╰────────────────────╯╰─────────────────────╯      │   │
                                         ╭──────────┼───┼─╮
                                         │ HAProxy ┌┴┐ ┌┴┐│
                                         │         └─┘ └─┘│
                                         ╰────────────────╯

 The Avalanche simulates 256 clients on each port to target the 4 IPs that are configured in HAProxy. The reflector simulates 4 web servers, 2 on each port. Those servers serve 1 KB pages. Here is my haproxy configuration:


  global
          log   local0
          log   local1 notice
          user  haproxy
          group haproxy
          nbproc 1
          stats socket /var/run/haproxy.socket

  defaults
          log     global
          mode    http
          option  httplog
          option  dontlognull
          option  splice-auto
          retries 3
          option  redispatch
          contimeout      5s
          clitimeout      50s
          srvtimeout      50s

  listen poolbench
          mode    http
          option  splice-response
          stats   enable
          option  httpchk /
          option  dontlog-normal
          option  log-health-checks
          balance roundrobin
          server  real1
          server  real2
          server  real3
          server  real4

 HA-Proxy version 1.4.15 2011/04/08
 Copyright 2000-2010 Willy Tarreau <>

 Build options :

   TARGET  = linux26
   CPU     = generic
   CC      = gcc
   CFLAGS  = -O2 -g -fno-strict-aliasing
   OPTIONS = USE_PCRE=1

 Default settings :
   maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents =  200

 Encrypted password support via crypt(3): yes

 Available polling systems :

      sepoll : pref=400,  test result OK
       epoll : pref=300,  test result OK
        poll : pref=200,  test result OK
      select : pref=150,  test result OK
 Total: 4 (4 usable), will use sepoll.

 With this configuration, I get 10 000 HTTP req/s. The haproxy process takes 100% CPU. Changing "maxconn" or disabling splice does not change anything. If I use 6 haproxy processes, I can get to 30 000 HTTP req/s. All haproxy processes take 100% CPU in this case. Moreover, I am pretty sure that the Avalanche is not the bottleneck, since we can bench more than 120 000 HTTP req/s with the same setup. I have tried to pin haproxy to 1 CPU (with taskset) and I still get 10 000 HTTP req/s.

 Now, if I look at the published benchmarks, I can see that I should be able to achieve 40 000 HTTP req/s. This is four times what I am able to achieve. What is wrong with my setup? Why does enabling/disabling splice not affect my results? Is there a way to fetch the 2.6.27-wt5 kernel used for the tests?

 A side question now. Enabling the use of multiple processes would make it possible to leverage the power of modern multi-core machines (we now get 6 cores per CPU on recent servers). However, this is discouraged. One drawback is the inability to get reliable stats. Is this problem being worked on? We could spawn a master process that exposes the stats socket. This master would grab stats from the other processes, using the same protocol as on the socket but over pipes, then aggregate the results and send them back to the client.
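 To make the idea concrete, here is a rough sketch of what such a master could do, assuming one stats socket per child process and the CSV output of "show stat". The socket paths and the fetch_stats/aggregate helper names are made up for illustration; this is not haproxy code, just the aggregation idea.

```python
import csv
import io
import socket

def fetch_stats(path):
    """Ask one child's stats socket for its CSV statistics ("show stat")."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(path)
    s.sendall(b"show stat\n")
    chunks = []
    while True:
        data = s.recv(4096)
        if not data:
            break
        chunks.append(data)
    s.close()
    return b"".join(chunks).decode()

def aggregate(csv_texts):
    """Merge "show stat" CSV outputs from several processes.

    Rows are keyed on (pxname, svname); numeric fields are summed,
    non-numeric fields (status, etc.) keep the first value seen.
    """
    merged = {}
    for text in csv_texts:
        # The CSV header line starts with "# "; strip that prefix so
        # DictReader sees the plain field names.
        reader = csv.DictReader(io.StringIO(text.lstrip("# ")))
        for row in reader:
            key = (row["pxname"], row["svname"])
            if key not in merged:
                merged[key] = dict(row)
                continue
            for field, value in row.items():
                try:
                    merged[key][field] = str(int(merged[key][field]) + int(value))
                except (ValueError, TypeError):
                    pass  # non-numeric field: keep the first value

    return merged

# A master would then do something like (paths are hypothetical):
#   stats = aggregate(fetch_stats(p) for p in
#                     ["/var/run/haproxy1.socket", "/var/run/haproxy2.socket"])
```

 Whether the master talks to the children over pipes or over their existing sockets, the aggregation step itself stays the same.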

 Thanks for any insight on the performance part.

This archive was generated by hypermail 2.2.0 : 2011/05/04 17:00 CEST