Re: Avoid 503 during failover to backup?

From: Willy Tarreau <w@1wt.eu>
Date: Wed, 3 Dec 2008 06:46:59 +0100


Hello Jim,

On Tue, Dec 02, 2008 at 11:28:33PM +0100, Jim Jones wrote:
> On Tue, 2008-12-02 at 20:12 +0100, Krzysztof Oledzki wrote:
> > On Tue, 2 Dec 2008, Jim Jones wrote:
> > >
> > > listen foo 0.0.0.0:8080
> > > balance leastconn
> > > cookie C insert indirect nocache
> > > option persist
> > > option redispatch
> > > option allbackups
> > > option httpchk HEAD /test HTTP/1.0
> > > fullconn 200
> > > server www1 192.168.0.1 weight 1 minconn 3 maxconn 100 cookie A check inter 5000
> > > server www2 192.168.0.2 weight 1 minconn 3 maxconn 100 cookie B check inter 5000
> > > server www3 192.168.0.3 weight 1 minconn 3 maxconn 100 cookie C check inter 5000
> > > server bu1 192.168.0.10 weight 1 minconn 3 maxconn 100 check inter 20000 backup
> > >
> > > When we shut down all www servers (www1-www3), haproxy will shortly
> > > afterwards route requests to the backup server - just as intended.
> > >
> > > Our problem is that *during* the failover some requests will get a 503
> > > response from haproxy: "No server available to serve your request".
> >
> > This is simply because haproxy needs some time to detect and mark all the
> > active servers (www1-www3) down and to activate the backup one (bu1).
> >
> > > More precisely: When we shut down all www servers and then make a
> > > request before the 5 second timeout has elapsed this request will
> > > receive the 503 response.
> >
> > It should take even longer (fall*inter = 3*5s=15s). However, you may use
> > "fastinter 1000" to make it much shorter.
>
> Thank you for the pointer to fastinter. We'll definitely play
> with that value to speed up the up/down detection process.
>
> > > Is there a way to avoid this gap and make the failover
> > > completely transparent?
> >
> > Currently backup servers are only activated if there are no other active
> > servers; moreover, the redispatcher (option redispatch) does not redispatch
> > to an inactive backup server. I have a patch that mitigates this behavior,
> > but as it was a quick&dirty solution I never intended to make it public.
> > Now I think I'll get to it, clean it up and post it here. ;)
>
> Hmm. Well, it would be really nice if haproxy kept re-scheduling
> failed requests until either a global timeout (conntimeout?) is reached
> or the request is served. Displaying a 503 to the user should be the
> very last resort.

Right now, only one attempt is made on another server when the redispatch option is set. It is the last retry which is performed on another server.
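
To illustrate, with a configuration along these lines (the values are only examples):

    retries 3
    option redispatch

the initial attempt and the first retries all go to the server that was initially chosen, and only the last retry is sent to another server picked by the load balancing algorithm.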

> Right now it seems to go like this ("server1" could be a
> synonym for a whole group of servers here):
>
> 1. Server1 goes down.
>
> 2. Request arrives, haproxy schedules it for server1 because
> it hasn't noticed yet that server1 is down.
>
> 3. Haproxy attempts to connect to server1 but times out
> and eventually displays a 503 to the user.

Except that the last retry goes to another server. Krzysztof has even provided a patch to avoid redispatching to the same server.

> 4. Further requests will fail the same way until haproxy finally
> notices that server1 is down and activates the backups.
>
>
> More desirable would be:
>
> 1. Server1 goes down
>
> 2. Request arrives, haproxy schedules it for server1 because
> it hasn't noticed yet that server1 is down
>
> 3. Haproxy attempts to connect to server1 but times out.
> It reschedules the request and tries again, picking a new server
> according to the configured balancing algorithm. It may even
> choose a backup now if, in the meantime, it noticed the failure
> of server1.

It must only do that after the retry counter has expired on the first server. In fact, we might change the behaviour to support multiple redispatches with a counter (just like retry) and set the retry counter to only 1 when we are redispatching. It's probably not that hard.
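
Just to sketch the idea, with a made-up "redispatches" keyword (nothing like this exists today, it is only to illustrate what the configuration could look like):

    retries 3
    # hypothetical counter, analogous to "retries": allow up to 3
    # redispatches, forcing the retry counter to 1 once redispatching
    redispatches 3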

> 4. Step 3 would repeat until conntimeout is reached or the
> request is successfully served. Only when the timeout is hit
> does the user get a 503 from haproxy.
>
> If haproxy worked like that, then 503s could be completely avoided by
> setting conntimeout to a value higher than the maximum time that it can
> take haproxy to detect failure of all non-backup servers. (unless the
> backups fail, too - but well, that *is* a case of 503 then)

You're thinking like this because you don't have any stickiness :-)

There are many people who don't like the redispatch option because it breaks their applications on temporary network issues. Sometimes, it's better to have the user get a 503 (disguised with an "errorfile"), wait a bit and click "reload", than to have the user completely lose his session, basket, etc... because a server has failed to respond for a few seconds.
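
For example, something like this (the path is only an example, and the file must contain a complete HTTP response, headers included):

    errorfile 503 /etc/haproxy/errors/503.http

That way the user sees a friendly "please wait a few seconds and retry" page instead of the raw 503.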

But I agree that for stateless applications, redispatching multiple times would be nice. However, we would not maintain a list of all attempted servers. We would just re-submit to the LB algorithm, hoping to get another server.

BTW, one feature I've wanted for a long time was the ability to switch a server to fastinter as soon as it returns a few errors, let's say 2-3 consecutive connect errors, timeouts, or invalid responses. I thought about using it on 500 too, but that would be cumbersome for people who load-balance outgoing proxies, which would just reflect the errors from the servers they are connecting to.

In your situation, this would help considerably because the fastinter could be triggered very early, and it would even save a few more seconds.
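
The "fastinter" parameter itself already exists as a check option on the server lines, as Krzysztof noted above; what is missing is triggering it from traffic errors instead of from checks only. On your config it would look like this:

    # checks run every 5s while the server is fully up; once a check has
    # failed, the following checks run at the 1s fastinter until the
    # server is marked down (fall defaults to 3)
    server www1 192.168.0.1 weight 1 minconn 3 maxconn 100 cookie A check inter 5000 fastinter 1000

That alone would already cut the worst-case detection time from about 15s to about 5s + 2*1s = 7s.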

Cheers,
Willy