Re: Avoid 503 during failover to backup?

From: Willy Tarreau <w#1wt.eu>
Date: Wed, 3 Dec 2008 19:40:01 +0100


On Wed, Dec 03, 2008 at 06:36:55PM +0100, Jim Jones wrote:
> Imho this is a problem regardless of the stickiness question.
>
> Even the people who use stickiness would likely prefer a seamless
> failover to backup when all primary servers fail, instead of the
> "503 gap" that they have now.

No, as I explained, the risk (and cost) of losing all users' sessions for a temporary network outage often makes this unacceptable. For instance, your haproxy server's network interface renegotiates its link with the switch, and you get an average of 3-5 seconds of outage. If you redispatch everyone in this case, you lose all users' shopping baskets on an e-business site. Sometimes it's really better to return a clean "Sorry for the inconvenience, please try again in a few minutes" page than to break all users' sessions.

Don't get me wrong, I'm not in favour of returning bare 503 errors to clients, otherwise I would not have implemented errorloc/errorfile. I'm just against taking dangerous decisions on their behalf when they would rather click a "retry" button.
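For example, a minimal setup could look like this (the file path is just an example, and the file must contain a complete HTTP response, headers included):

    backend app
        # serve a friendly page instead of the bare 503 when no
        # server is available
        errorfile 503 /etc/haproxy/errors/sorry.http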

Setting too large a retry count is not a solution either: the user might wait a very long time before getting a response, and will consider the site down. And redispatching commonly breaks existing sessions, which is precisely why some people do not want to switch to another server.
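To illustrate the trade-off (both directives exist today, the values are only examples):

    backend app
        retries 3            # keep this small, or the user waits too long
        option redispatch    # allows the last retry to go to another
                             # server, at the cost of breaking stickiness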

IMHO, a lot of features are still lacking in the backup area. We don't have spare servers, I mean servers which are there to compensate for the capacity loss caused by a few failures. We don't have failover servers either: I'd like to be able to associate servers in pairs (or even cascades?), so that if one server fails, we retry on its failover. That could be useful in environments where there is shared local storage, etc. I'm clearly for trying to build solutions which address the real issues, not for hiding them.
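For the record, the closest thing we have today is plain backup servers, something like this (addresses are just examples):

    backend app
        server web1  10.0.0.1:80  check
        server web2  10.0.0.2:80  check
        # only used once every non-backup server is down
        server back1 10.0.0.10:80 check backup
        server back2 10.0.0.11:80 check backup
        # by default only the first backup is used; this uses them all
        option allbackups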

In your case, since it's a stateless application, I think a good solution would consist in always retrying on a different server instead of the same one. It's indeed wasteful to retry on the same server when other candidates might be available at no added cost. We could simply add something like "option stateless-retry" or "option always-redispatch" to ensure that each retry is performed on a distinct server.
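To make the idea concrete, the configuration could look like this (hypothetical syntax of course, neither option exists yet):

    backend app
        retries 3
        # hypothetical: force each retry onto a distinct server
        option stateless-retry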

In fact, that's not entirely true: some hash-based LB algorithms would still retry on the same server. But we could very well mix a retry counter into the hash result in this case, in order to find another working server.
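A rough sketch in C of what I mean (illustrative names only, not haproxy's real internals):

    #include <stddef.h>

    struct server { int up; };

    /* Offset the hash by the retry count so that each retry maps to a
     * different server, then walk forward until a working one is found.
     */
    static struct server *pick_server(struct server *srv, int nsrv,
                                      unsigned int hash, int retries)
    {
        int i;
        for (i = 0; i < nsrv; i++) {
            struct server *s = &srv[(hash + retries + i) % nsrv];
            if (s->up)
                return s;
        }
        return NULL;  /* no working server left */
    }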

I think that this problem needs more thinking than code. The resulting patch will probably be a 10-liner once we agree on what is needed.

Regards,
Willy
