Re: Sending requests to servers that are marked as down?

From: Willy Tarreau <w#1wt.eu>
Date: Sat, 29 Nov 2008 21:32:07 +0100


Hello Kai,

On Fri, Nov 28, 2008 at 11:55:51AM +0000, Kai Krueger wrote:
> Hello list,
>
> we are trying to set up haproxy as a load balancer between several
> webservers handling long-running database queries (on the order of
> minutes). Altogether it is working quite nicely, however there are two
> somewhat related issues causing many requests to fail with a 503 error.
> As the backend webservers carry some load that is outside the control
> of haproxy, during some periods of time requests fail immediately with
> a 503 overload error from the backends. To work around this, we use the
> httpchk option to monitor the servers and mark them as down when they
> return 503 errors. However, there seem to be cases where requests get
> routed to backends despite them being correctly recognized as down
> (according to the haproxy stats page), causing these requests to fail
> too. Worse still, because of least connection scheduling, the entire
> queue gets immediately drained to that server once this happens,
> causing all of the requests in the queue to fail with 503.
> I haven't identified exactly under what circumstances this happens, as
> most of the time it works correctly. One guess is that there are still
> several connections open to the server when it gets marked down, and
> these happily continue to run to completion.

From your description, it looks like this is what is happening. However, this must be independent of the LB algo. What might be happening, though, is that the last server to be identified as failing gets all the requests.

> Is there a way to prevent this from happening?

Not really: since the connections are already established, it's too late. You can shorten your srvtimeout though, so that such requests get aborted earlier. Yours is really huge (20 minutes is enormous for HTTP). But anyway, once haproxy is connected to the server, the request is immediately sent, and even if you close, you only close one way, so the server will still process the request.
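
For instance, something along these lines in the defaults section (1.3 syntax; the values are only an illustration and not taken from your config, they must be tuned to the longest legitimate query you expect) :

    defaults
        mode        http
        contimeout  5000      # give up quickly when a server does not even accept the connection
        clitimeout  300000    # 5 minutes of client-side inactivity
        srvtimeout  300000    # abort the request if the server stays silent for 5 minutes

All the timeouts above are in milliseconds.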

> The second question is regarding requeuing. As the load fluctuates
> quite rapidly, periodically querying the backends to see whether they
> are overloaded seems somewhat too slow, leaving a window open between
> when the backends start rejecting requests and when haproxy notices
> this and takes that server down. Ideally, haproxy would recognize the
> error and automatically requeue the request to a different backend.

In fact, what I wanted to add in the past was the ability to either modulate weights depending on error ratios, or to speed up health checks when an error has been detected in a response, so that a failing server can be identified faster. Also, is it on purpose that your "inter" parameter is set to 10 seconds? With your config it takes at least 20 seconds to detect a faulty server. Depending on the load on your site, this might impact a huge number of requests. Isn't it possible to set shorter intervals? I commonly use 1s, and sometimes even 100ms on some servers (with a higher "fall" parameter). Of course it depends on the work the server performs for each health check.
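
As an illustration only (the server names, addresses and check URI below are made up, not taken from your config) :

    option httpchk GET /check
    server web1 10.0.0.1:80 check inter 1000 fall 5 rise 2
    server web2 10.0.0.2:80 check inter 1000 fall 5 rise 2

With "inter 1000" and "fall 5", a failing server is taken out after roughly 5 seconds instead of 20, at the cost of more frequent checks.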

> At
> the moment haproxy seems to pass through all errors directly to the
> client. Is there a way to configure haproxy to requeue on errors?

Clearly, no. It can only do so as long as the connection has not been established to the server. Once established, the request begins to flow towards the server, so it's too late.

> I think I have read somewhere that haproxy doesn't requeue because it
> does not know whether it is safe; however, these databases are
> completely read-only, so we know it is safe to requeue, as the requests
> have no side effects.

There are two problems with replaying a request. The first one is that the request is no longer in the buffer once the connection to the server is established. We could imagine mechanisms to keep it around under some circumstances. The second problem is that, as you say, haproxy cannot know which requests are safe to replay. HTTP defines idempotent methods such as GET, PUT, DELETE, ... which are in theory safe to replay. In practice, GET is not safe anymore, and with cookies it's even more complicated, because a backend server can associate a session with a request and advance that session's state on each request. Also imagine what can happen if the user presses Stop, clicks a different link, and by that time haproxy finally gives up on the former request and replays it on another server. You can seriously affect the usability of a site or even its security (you don't want a login request to be replayed after the user has already clicked the logout link, for instance).

So there are a lot of complex cases where replaying is very dangerous.

I'd be more tempted to add an option not to return anything upon certain types of errors, so that the browser itself can decide whether to replay or not. Oh, BTW, you can already do that using "errorfile": simply return an empty file for 502, 503 and 504, and your clients will decide for themselves whether to retry when a server fails to respond.
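
For example (the paths are only an illustration) :

    errorfile 502 /etc/haproxy/errors/empty.http
    errorfile 503 /etc/haproxy/errors/empty.http
    errorfile 504 /etc/haproxy/errors/empty.http

where empty.http is simply an empty file, so haproxy closes the connection without sending any response and the browser decides on its own whether to retry.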

> P.S. In case it helps, I have attached the configuration we use with
> haproxy (version 1.3.15.5, running on FreeBSD).

Thanks, that's a good reflex, people often forget that important part ;-)

Regards,
Willy