Re: Avoid 503 during failover to backup?

From: Jim Jones <jjimjjones#googlemail.com>
Date: Wed, 03 Dec 2008 20:58:08 +0100


On Wed, 2008-12-03 at 19:40 +0100, Willy Tarreau wrote:
> On Wed, Dec 03, 2008 at 06:36:55PM +0100, Jim Jones wrote:
> > IMHO this is a problem regardless of the stickiness question.
> >
> > Even the people who use stickiness would likely prefer a seamless
> > failover to backup when all primary servers fail, instead of the
> > "503 gap" that they have now.
>
> No, as I explained, the risk (and cost) of losing all users' sessions for
> a temporary network outage makes this often unacceptable. For instance,
> you discover that your haproxy server has its network interface renegotiate
> the link with the switch, and you get an average of 3-5 seconds of outage.
> If you redispatch everyone in this case, you lose all users' baskets on
> an e-business site. Sometimes it's really better to return a clean page
> "Sorry for the inconvenience, please try again in a few minutes" than to
> break all users' sessions.

Yes, sometimes it is. But IMHO most of the time it would be preferable to have this clean page come from your backup servers instead of from haproxy itself. Who says the redispatch to backup has to be permanent? As far as I'm concerned the session cookie could be kept intact, so users can simply return to the primary servers once those are resurrected...

If I understand the haproxy config correctly, the primary servers and the backups could simply use different cookie values. That way we can have stickiness on both the primaries and the backups, independently of each other, without a problem.
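
To make this concrete, here is a rough sketch of the kind of setup I have in mind (names and addresses are invented, and I haven't verified this exact combination):

    listen webfarm 0.0.0.0:80
        mode http
        balance roundrobin
        # one cookie for the whole farm; each server gets its own value
        cookie SRVID insert indirect nocache
        # once every primary is gone, balance over all backups
        # instead of only the first one
        option allbackups
        # sticky primaries
        server app1 10.0.0.11:80 cookie app1 check
        server app2 10.0.0.12:80 cookie app2 check
        # sticky backups, only used while all primaries are down
        server bkp1 10.0.0.21:80 cookie bkp1 check backup
        server bkp2 10.0.0.22:80 cookie bkp2 check backup

With something like that the backups should only see traffic while all primaries are down, and they keep their own stickiness in the meantime.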

> Don't get me wrong, I'm not for returning bare 503 errors to the clients,
> otherwise I would not have implemented errorloc/errorfile. I'm just for
> not taking dangerous decisions for them when they would rather click a
> "retry" button.

Agreed. But a backup webserver gives you so much more flexibility for styling and presenting that button :-)

> Setting too large a retry count prevents this because the user might
> wait very long before getting a response, and will consider the site
> as down.

That's an interesting point. IMHO an error page, even (and maybe especially!) if it shows up instantly, is more likely to drive a user away than a long delay.

My observation of my own behaviour in such situations is this:

When I'm browsing a site and it suddenly starts to load very slowly, I'll often just switch tabs and do something else until it (hopefully) comes back.

When a site flashes an error page at me in the middle of doing something, the situation is different: I'm much more likely to give up quickly, after a few reload attempts.

I'm not sure why this is the case, but I guess it's because an error page has a "final" feel to it and forces a decision upon the user immediately ("Ok, it's broken: do I reload or go away?"), whereas a long delay, just as the name suggests, delays that decision for a bit.

Decisions translate to "work", and users will always choose the path of least resistance (== least work).

> Redispatching commonly breaks existing sessions. That's the
> reason why there are people who do not want to switch to another server.
>
> IMHO, a lot of features are still lacking in the backup area. We don't have
> spare servers, I mean servers which are there to compensate for the capacity
> loss caused by a few failures. We don't have failover servers either: I'd
> like to be able to associate servers in pairs (or even cascades?). If one
> server fails, retry on its failover. It could be used in environments where
> there is shared local storage. Etc... I'm clearly for trying to build the
> solutions which really address the real issues, not for hiding them.

I agree somewhat, but at the same time I think you may be overcomplicating the issue. Yes, there are corner cases to watch out for, but IMHO we can agree on the general requirements quite easily. The specific requirement I'm after is still: no 503s from haproxy, ever (unless all servers and all backups are down or time out).
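
For reference, the closest I have found with the options available today looks roughly like this (just a sketch, and the error file path is only an example); it still leaves the 503 gap this thread is about:

    defaults
        mode http
        # retry a failed connection a few times before giving up
        retries 3
        # on a connection failure, allow picking another server instead
        # of insisting on the one the persistence cookie points to
        option redispatch
        # last-resort page served by haproxy itself when nothing is left
        errorfile 503 /etc/haproxy/errors/503.http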

> In your case, I think that since it's a stateless application, a good solution
> would consist in always retrying on a different server instead of retrying on
> the same one. It's indeed wasteful to retry at the same place when other candidates
> might be available with no added cost. We could simply add something like
> "option stateless-retry" or "option always-redispatch" to ensure that each
> retry would be performed on a distinct server.

Yes, that sounds like it might be a way.

> In fact, that's not perfectly true. Some hash-based LB algorithms would
> still retry on the same server. But we could very well mix a retry counter
> with the hash result in this case, in order to find another working server.

I don't know what the best approach is here. Maybe a per-request blacklist could also work (keep a list of servers that have already been tried and found dead)?

> I think that this problem needs more thinking than code. The resulting patch
> will probably be a 10-liner, only once we agree on what is needed.

Agreed.

best regards
-jj :-)