Re: Avoid 503 during failover to backup?

From: Jim Jones <jjimjjones#googlemail.com>
Date: Tue, 02 Dec 2008 23:28:33 +0100


On Tue, 2008-12-02 at 20:12 +0100, Krzysztof Oledzki wrote:
> On Tue, 2 Dec 2008, Jim Jones wrote:
> >
> > listen foo 0.0.0.0:8080
> > balance leastconn
> > cookie C insert indirect nocache
> > option persist
> > option redispatch
> > option allbackups
> > option httpchk HEAD /test HTTP/1.0
> > fullconn 200
> > server www1 192.168.0.1 weight 1 minconn 3 maxconn 100 cookie A check inter 5000
> > server www2 192.168.0.2 weight 1 minconn 3 maxconn 100 cookie B check inter 5000
> > server www3 192.168.0.3 weight 1 minconn 3 maxconn 100 cookie C check inter 5000
> > server bu1 192.168.0.10 weight 1 minconn 3 maxconn 100 check inter 20000 backup
> > 
> > When we shut down all www servers (www1-www3) haproxy will shortly after
> > route requests to the backup server - just as intended.
> >
> > Our problem is that *during* the failover some requests will get a 503
> > response from haproxy: "No server available to serve your request".
>
> This is simply because haproxy needs some time to detect and mark all the
> active servers (www1-www3) down and to activate the backup one (bu1).
>
> > More precisely: When we shut down all www servers and then make a
> > request before the 5 second timeout has elapsed this request will
> > receive the 503 response.
>
> It should take even longer (fall*inter = 3*5s=15s). However, you may use
> "fastinter 1000" to make it much shorter.

Thank you for the pointer to fastinter. We'll definitely play with that value to speed up the process of going up/down.
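
To put rough numbers on that, here is a small sketch of the arithmetic. It is only an estimate, not haproxy's actual logic. It assumes the first failing check can come up to one full "inter" after the server dies, that the remaining fall-1 checks are spaced by "fastinter" when it is set, and it ignores the time each check takes to time out. The function name is made up for illustration.

```python
def worst_case_detection_ms(inter_ms, fall, fastinter_ms=None):
    """Rough upper bound (in ms) on the time needed to mark a server down.

    Assumptions (not taken from the haproxy source):
      * the first failed check happens at most one `inter` after the failure
      * the following fall-1 checks run every `fastinter` if it is set,
        otherwise every `inter`
      * the duration of each check itself is ignored
    """
    follow_up_ms = inter_ms if fastinter_ms is None else fastinter_ms
    return inter_ms + (fall - 1) * follow_up_ms


# Values from the config above: inter 5000, default fall of 3.
print(worst_case_detection_ms(5000, 3))        # 15000
print(worst_case_detection_ms(5000, 3, 1000))  # 7000
```

Under these assumptions "fastinter 1000" would shrink the window from about 15s to about 7s, so it narrows the gap but does not close it.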

> > Is there a way to avoid this gap and make the failover
> > completely transparent?
>
> Currently backup servers are only activated if there are no other active
> servers, moreover redispatcher (option redispatch) does not redispatch to
> an inactive backup server. I have a patch that mitigates this behavior but
> as it was a quick&dirty solution I have never intended to make public,
> but now I think I'll get on this, clean it and post it here. ;)

Hmm. Well, it would be really nice if haproxy kept rescheduling failed requests until either a global timeout (conntimeout?) is reached or the request is served. Displaying a 503 to the user should be the very last resort.

Right now it seems to go like this ("server1" could be a synonym for a whole group of servers here):

  1. Server1 goes down.
  2. Request arrives, haproxy schedules it for server1 because it hasn't noticed yet that server1 is down.
  3. Haproxy attempts to connect to server1 but times out and eventually displays a 503 to the user.
  4. Further requests will fail the same way until haproxy finally notices that server1 is down and activates the backups.

More desirable would be:

  1. Server1 goes down.
  2. Request arrives, haproxy schedules it for server1 because it hasn't noticed yet that server1 is down.
  3. Haproxy attempts to connect to server1 but times out. It reschedules the request and tries again, picking a new server according to the configured balancing algorithm. It may even choose a backup now if, in the meantime, it noticed the failure of server1.
  4. Step 3 would repeat until conntimeout is reached or the request is successfully served. Only when the timeout is hit does the user get a 503 from haproxy.

If haproxy worked like that, then 503s could be completely avoided by setting conntimeout to a value higher than the maximum time it can take haproxy to detect the failure of all non-backup servers (unless the backups fail too, but that *is* a legitimate case for a 503).
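
To make the proposal concrete, here is a toy simulation of the retry loop described in the four steps above. It is a sketch of the idea, not of how haproxy is implemented. Everything in it is hypothetical: the `Server` class, `pick` and `serve`, the simulated clock, and the `detection_delay` parameter that stands in for the health-check delay. The "balancing algorithm" is simply "first usable server".

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    dies_at: float = float("inf")  # time at which the server really stops answering
    backup: bool = False
    marked_down: bool = False      # the proxy's view, updated by health checks


def pick(servers):
    """Prefer active servers believed to be up; use backups only if none is left."""
    up = [s for s in servers if not s.marked_down]
    active = [s for s in up if not s.backup]
    candidates = active or up
    return candidates[0] if candidates else None


def serve(servers, start, deadline, connect_timeout, detection_delay):
    """Keep rescheduling a request until it is served or `deadline` seconds pass."""
    now = start
    while now - start < deadline:
        # Simulated health checks: a dead server is marked down after detection_delay.
        for s in servers:
            if now >= s.dies_at + detection_delay:
                s.marked_down = True
        target = pick(servers)
        if target is not None and now < target.dies_at:
            return f"200 via {target.name} at t={now:g}s"
        now += connect_timeout  # connect attempt timed out; reschedule and retry
    return "503"


def fresh():
    return [Server("www1", dies_at=0), Server("bu1", backup=True)]


# Overall timeout (20s) longer than the detection delay (15s): request is served.
print(serve(fresh(), start=1, deadline=20, connect_timeout=4, detection_delay=15))
# Overall timeout (10s) shorter than the detection delay: the user still gets a 503.
print(serve(fresh(), start=1, deadline=10, connect_timeout=4, detection_delay=15))
```

In this toy model the first call ends up on bu1 once www1 has been marked down, while the second gives up before detection has finished, which is exactly the relationship between the overall timeout and the detection time described above.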

cheers
- jj

Received on 2008/12/02 23:28
