
From: Willy Tarreau <w#1wt.eu>
Date: Sun, 25 Jan 2009 23:58:58 +0100

Hi Joseph,

On Fri, Jan 23, 2009 at 07:21:08PM -0500, Joseph Hardeman wrote:
> Hi Guys,
>
> Here is a question I am hoping someone has either seen before or has a
> suggestion for me.
>
> For the first time since we put haproxy in months ago, the primary
> haproxy we have did not respond in 10 seconds for the check on port
> 60000, which we have set as our health check port:
>
> listen health_check 0.0.0.0:60000
> mode health
>
> When nagios checks port 60000 it looks for the OK in the response. Two
> days ago, it did not get the OK in the max 10 second timeout.

Do you know whether the connection was at least established? I suspect it hung waiting for haproxy to accept it. Either the system's backlog was full and a few SYNs were dropped, or the process's maxconn was reached and no listener was accepting any more connections.

In fact, listeners in "mode health" are not scheduled at all and reply immediately after the accept. That's why I suspect one of the issues above.
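
To tell these two cases apart next time, the probe can time the TCP connect separately from the wait for the reply. What follows is only a rough sketch in Python, not something shipped with haproxy or nagios; the function name probe_health and the stage labels are made up for illustration, and it assumes the health listener sends a short reply containing "OK" right after the accept, as described above.

```python
import socket
import time


def probe_health(host, port, timeout=10.0):
    """Probe a health port and report which stage failed, if any.

    Returns (stage, connect_seconds, total_seconds, payload) where stage is
    one of "ok", "bad_reply", "no_reply" or "connect_failed".
    These labels are illustrative, not part of any real tool.
    """
    start = time.monotonic()
    try:
        # Stage 1: TCP handshake. A timeout here hints at dropped SYNs.
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return ("connect_failed", None, time.monotonic() - start, b"")
    connected = time.monotonic()
    with sock:
        # Stage 2: wait for the reply within what is left of the budget.
        remaining = max(timeout - (connected - start), 0.001)
        sock.settimeout(remaining)
        try:
            payload = sock.recv(64)
        except OSError:
            # Connection established, but nothing came back in time.
            return ("no_reply", connected - start,
                    time.monotonic() - start, b"")
    stage = "ok" if b"OK" in payload else "bad_reply"
    return (stage, connected - start, time.monotonic() - start, payload)
```

A "connect_failed" result after the full timeout would point towards dropped SYNs, while "no_reply" would mean the kernel completed the handshake but the process never got to accept the connection. Logging the two durations on every check also shows whether the reply time was creeping up before the failure.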

> I am
> running haproxy on a Dell R200, Dual Core 2.4GHz, with 2G of memory. I
> have gone over the system logs and have not been able to find anything
> wrong. I do have a script that is called via SNMP that calculates the
> number of uniq IP's hitting the external IP on port 80 and at the time
> it failed over there was only 2 IP's hitting it. Because this is for a
> client who can not have any down time I allow a single time out,
> checking the status every minute, before failing haproxy over to the
> backup system. On the next check haproxy responded ok, but it had
> already fallen over and no traffic was hitting it.

It is very dangerous to make such a decision based on a single fault. If you need to fail over very fast, you should check twice as often and tolerate at least one failure. There are multiple reasons why such a failure can occur: the system might have been doing backups or swapping, the network interface's transceiver might have been renegotiating due to a transient error, etc. It is also possible that the nagios probe itself was having difficulties (swap, network, CPU, ...) and was a victim of its own load.
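
The decision rule I have in mind can be sketched as below. This is only an illustration in Python, not nagios configuration; the class name FailoverDecider and the tolerated_failures parameter are hypothetical. With a check every 30 seconds and one tolerated failure, a real outage is still detected within about a minute, but a single slow reply no longer triggers the failover.

```python
class FailoverDecider:
    """Trigger a failover only after more than N consecutive failed checks."""

    def __init__(self, tolerated_failures=1):
        # Number of consecutive failures accepted before failing over.
        self.tolerated_failures = tolerated_failures
        self.consecutive_failures = 0

    def record(self, check_ok):
        """Record one check result; return True if failover should start."""
        if check_ok:
            # Any successful check resets the streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures > self.tolerated_failures


decider = FailoverDecider(tolerated_failures=1)
for result in (True, False, True, False, False):
    if decider.record(result):
        print("fail over to the backup")
```

In this run the isolated failure is ignored and only the two failures in a row at the end trigger the message.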

> Has anyone else seen this happen where haproxy did not respond back or
> it has taken longer than 10 seconds to respond? I would think it might
> have been internet traffic, but the checks are from another system that
> is on the same network over Gig ports.

You could check the switch's error counters on each port, and the servers' counters as well. At least you seem to have a high-quality network if you caught such an anomaly only once in several months.
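
On the server side, assuming the machines run Linux, the per-interface counters are exposed under /sys/class/net/<interface>/statistics/. A small sketch to collect the ones related to errors could look like this (the function name, the selection of counters and the "eth0" interface name are my own choices for the example, so adapt them to your setup):

```python
from pathlib import Path

# Counters exposed by the Linux kernel for each network interface.
COUNTERS = ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped",
            "rx_crc_errors", "collisions")


def read_error_counters(interface, sysfs_root="/sys/class/net"):
    """Return a dict of error counters for one interface.

    A counter that cannot be read (missing file, unexpected content)
    is reported as None instead of raising.
    """
    stats_dir = Path(sysfs_root) / interface / "statistics"
    counters = {}
    for name in COUNTERS:
        try:
            counters[name] = int((stats_dir / name).read_text().strip())
        except (OSError, ValueError):
            counters[name] = None
    return counters


if __name__ == "__main__":
    # "eth0" is only an example interface name.
    for name, value in read_error_counters("eth0").items():
        print(name, value)
```

Running it from cron and comparing the values over time should show whether errors increase around the moment a check fails.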

Regards,
Willy

Re: Check on Port 60000 not responding in time