Re: Performance problems with 1.3.20

From: Willy Tarreau <w#1wt.eu>
Date: Thu, 13 Aug 2009 23:00:41 +0200


Hi,

On Wed, Aug 12, 2009 at 12:50:00PM -0700, James Hartshorn wrote:
> Hi,
>
> We run Haproxy on Amazon ec2 for http load balancing.  On Monday
> (august 11) we upgraded seven of our load balancers in two of our
> products to 1.3.20 from 1.3.15.8 (four servers, all of one product)
> and 1.3.18 (three servers, all of the other product).  We kept the
> config files the same.  We finished replacing the load balancers by
> 2300 UTC on aug 11, and at about 0900 UTC Aug 12 the first cluster
> (the one upgraded from 1.3.15.8) started showing performance issues,
> enough to cause our monitoring systems to go off.  Response times were
> several seconds.

Please enable the stats page; it will show you a lot of useful information in such cases, most specifically the session rate and the number of concurrent sessions.
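
For example, a snippet like the following exposes the stats page (the port, URI and credentials are only placeholders, pick whatever suits your setup):

    listen stats 0.0.0.0:8888
        mode http
        stats enable
        stats uri /stats
        stats refresh 5s
        stats auth admin:changeme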

>  Logging on to one of the load balancers I saw normal
> cpu and memory, but looking at netstat -anp I saw more than 30k lines
> there, the majority in TIME_WAIT state.

TIME_WAIT is completely normal. Assuming your system is running with default settings (60s timeout on finwait/timewait), 30k TIME_WAIT sessions means you're getting 500 connections per second.

>  For background, the load
> balancers each point to the same pool of about 60 servers, which at
> the time were doing about 20-30 sessions per server, and the servers
> reporting about 80 requests per second (nominally 60% of peak).

80 req/s cumulated or per server? It seems extremely low for a cumulative count, but if it's per server, it means 4800 req/s cumulated, which is roughly in the range of what we have observed on another site running on EC2, the limit most likely being caused by virtualization and/or shared CPU resources.

>  At
> this point we put the old load balancers back into production and
> found them to be still working fine.

that's what I find strange then :-/

>  At around 1200 UTC Aug 12 a
> nearly identical state occured on the other set of load balancers (the
> ones upgraded from 1.3.18).
>
> If anyone can see any issues please let me know.
>
> I have pasted a representative haproxy.cfg file below:

> global
> #log 127.0.0.1 local0 info
> #log 127.0.0.1 local1 notice
> #log loghost local0 info
> maxconn 75000
> chroot /var/lib/haproxy
> user haproxy
> group haproxy
> daemon
> #debug
> #quiet
>
> defaults
> #log global
> mode http
> #option httplog
> option dontlognull
>     option  redispatch
> retries 3
> maxconn 75000

Your defaults and frontend maxconn values should be slightly lower than the global one, so that a single frontend can never fill the whole process. BTW, 75000 seems a bit optimistic for a virtualized environment...
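
For instance, something along these lines (the numbers are only an illustration, not a recommendation for your workload):

    global
        maxconn 20000

    defaults
        maxconn 19500

    frontend openx *:80
        maxconn 19500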

> contimeout 5000
> clitimeout 50000
> srvtimeout 2000

Is there a reason for 50s on the client side and only 2s on the server side? I suppose that when your servers slow down, you're killing a lot of requests by sending 504 responses.
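
Just as an illustration (these values are assumptions, adjust them to what your traffic really needs), something less asymmetric would look like:

    contimeout 5000
    clitimeout 30000
    srvtimeout 30000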

> frontend openx *:80
> #log global
> maxconn 75000
>        option forwardfor
>        default_backend openx_ec2_hosted_http
>
> backend openx_ec2_hosted_http
>        mode http
>        #balance roundrobin
>        balance leastconn
>        option abortonclose
>        option httpclose
>        #remove the line below if not 1.3.20
>        #option httpchk HEAD /health.chk

Why is there a special case for this line and 1.3.20? Are you sure you don't change it when you switch to another version? If so, it may be the reason your servers are flapping.

>        timeout queue 500

Same here, 500ms for a queue seems very short (though it does look consistent with the 2s server timeout).

>        #option forceclose

Just in case you had enabled it: avoid using forceclose, as you may reach a point where the system refuses to allocate a source port for haproxy to connect to the server.

(...)

Other than the points above, I don't see anything really wrong. Please do enable the stats and save a report. Check the "Dwn" and "Chk" columns for your servers; you might notice they're flapping because they take too much time to respond to health checks.
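
If they are flapping, relaxing the check timings on the server lines may help; for example (the server name, address and timings below are only placeholders):

    server web1 10.0.0.1:80 check inter 5000 fall 5 rise 2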

Regards,
Willy
