Re: where to start with 503 errors

From: Willy Tarreau <w@1wt.eu>
Date: Fri, 30 Jul 2010 17:42:23 +0200


Hi Matt,

On Wed, Jul 28, 2010 at 12:29:15PM -0600, Matt Banks wrote:
> OK, this is somewhat funny, but I'm mostly done with this email and a VERY similar sounding problem was just asked a few minutes ago...
>
> All,
>
> Long story short(ish):
>
> We put haproxy in front of a few servers that generate dynamic pages from a database. Here's a crude description of the setup:
>
> HAProxy -> 2 to 10 Apache servers -> Gateway (connection to db) -> Local caching database server ---(LAN or WAN)-> Database
>
> The point is that if the page is cached, the local caching db server will reply very fast. If not, it may take a few seconds to respond.

Those delays are precisely multiples of 3 seconds, I guess?

> We've also found that we basically HAVE to use keep alive (eg loading an image takes well under a second to load without HAProxy and perhaps .5 to 1.5 seconds with keepalive on whereas with keepalive off, the same image on the same page takes 12-18 seconds) if that makes a difference.

Yes, with keep-alive you have one session, without it you have many. Losing a SYN or a SYN/ACK when establishing a connection implies a 3 second retransmit delay. So with keep-alive disabled, each object comes in a separate session, causing more connection establishments and thus amplifying the impact of these retransmission delays.
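
Just to illustrate, this is roughly what a keep-alive friendly setup with connection retries looks like in 1.4 syntax; none of this is taken from your config and the timer values are only indicative :

       defaults
           mode http
           option http-server-close   # keep-alive towards the client, one server connection per request
           timeout http-keep-alive 10s
           timeout client  30s
           timeout server  30s
           timeout connect 5s         # slightly above the 3s SYN retransmit, so one retransmitted SYN still gets a chance
           retries 3                  # retry a connection attempt that failed or timed out
           option redispatch          # allow the last retry to go to another server

The point of keeping "timeout connect" a bit above 3 seconds is precisely the retransmit delay described above.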

> Here's where things get a bit... tricky?
>
> We have httpcheck disabled. This is essentially because it's not working for us - at least how we'd like it to be. In a nutshell, we're getting a LOT of false positives where a server is listed as "up going down" or down when in reality, a non-cached page was simply taking a couple seconds (probably 3-5 but definitely less than 10) to load.

This is also typical of high packet loss rate.
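
That said, if you want to re-enable the checks without them flapping on slow uncached pages, the usual approach is a longer check interval and an explicit check timeout. A rough sketch only, where the URL, addresses and values are made up :

       backend apache_farm
           option httpchk GET /check.html   # small static page, so the check does not hit the database
           timeout check 10s                # tolerate a slow response without marking the server down
           server web1 192.168.0.11:80 check inter 5s rise 2 fall 3
           server web2 192.168.0.12:80 check inter 5s rise 2 fall 3

Checking a small static page instead of a dynamic one also avoids confusing a slow database with a dead Apache.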

> The point is, we get several 503 errors throughout the day. And they appear to be random. Apache never goes down nor reports an error. Frankly, I think what's happening is that haproxy is hitting a server which takes too long to respond, so it tries another server (which also doesn't have the page cached) and goes through the list until it gives up and reports a 503.

In my opinion, what is happening is that something is causing connections to fail between haproxy and the servers (since the health checks fail too). There are two common causes for this, both in the environment rather than in haproxy itself. A quick first check is the NIC's negotiated speed, duplex and link status :

       ethtool eth0
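
Besides the negotiated settings, the interface error counters are worth a look; not every driver exposes detailed statistics, but these read-only commands are harmless to try (eth0 is just an example name) :

       ethtool -S eth0 | grep -i err   # driver-dependent NIC statistics
       netstat -i                      # RX-ERR / TX-ERR / DRP per interface
       ip -s link show eth0            # errors and drops seen by the kernel

A non-zero and growing error counter on either side usually points to a cabling, duplex or hardware problem.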

You could also try to run an FTP test from/to the haproxy machine. You should easily be able to saturate the port when transferring large files (approx 11800 kB/s on 100 Mbps, 118 MB/s on 1 Gbps). Any significantly lower value indicates communication trouble. This will show you where the network runs well and where it runs poorly. Sometimes it is as simple as a broken NIC, wire or switch port (the latter has happened to me several times).
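
If setting up an FTP transfer is inconvenient, fetching a large static file over HTTP from each Apache server gives the same kind of figure; the address and file name below are placeholders :

       wget -O /dev/null http://192.168.0.11/big-test-file.bin

wget prints the average transfer rate at the end; compare it against the figures above, and repeat against each server to narrow down which segment is slow.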

> Meanwhile, if you go directly to the page on the Apache server, it loads fine. Or if you re-load using HAProxy, it works fine as well.
>
> I'm just wondering where to start with this. We have several sites experiencing the same problem, but since we're using roughly the same setup for each one, I'm not opposed to saying it could be how we have HAProxy set up.

There is no particular reason your config would cause such things to happen, and it certainly could not make the checks fail randomly. That's why I'm suggesting environment issues, which are a very recurring concern.

Regards,
Willy