Re: Handling errors "502 Bad Gateway"

From: Willy Tarreau <w#1wt.eu>
Date: Thu, 9 Jun 2011 22:36:14 +0200


Hi Alexey,

On Thu, Jun 09, 2011 at 01:32:06PM +0400, Alexey Vlasov wrote:
> Hi!
>
> I've actually found a description of the same problem here: when
> Apache stops or restarts, haproxy returns a 502 error to users.
> http://www.formilux.org/archives/haproxy/0812/1575.html
>
> Here is an example of how it looks:
>
> # while true; do echo -n `date "+%T.%N "`" "; curl -s
> http://test-nl11-apache-aux2.com/uptime.php; echo; done
> 12:50:21.294819803 OK
> 12:50:21.481879293 OK
> 12:50:21.666777343 OK
> ...
> I stop Apache:
> # /opt/apache_aux2_pool1/current/sbin/apachectl -k stop
> I receive an error:
> ...
> 12:50:21.854037923 OK
> 12:50:22.039332296 OK
> 12:50:22.244071674 <html><body><h1>502 Bad Gateway</h1>
> The server returned an invalid or incomplete response.
> </body></html>
>
> 12:50:22.463404198 OK
> 12:50:22.653188547 OK
> ...
> ...
>
> Haproxy log attached.
>
> My haproxy.conf:
> ==========
> global
>     daemon
>     user haproxy
>     group haproxy
>     chroot /var/empty
>     ulimit-n 32000
>
> defaults
>     log 127.0.0.1 local1 notice
>     mode http
>     maxconn 2000
>     balance roundrobin
>     option forwardfor except 111.222.111.222/32
>     option redispatch
>     retries 10
>     stats enable
>     stats uri /haproxy?stats
>     timeout connect 5000
>     timeout client 150000
>     timeout server 150000
>
> listen backend_pool1 111.222.111.222:9099
>     option httplog
>     log 127.0.0.1 local2
>     cookie SERVERID insert indirect
>     option httpchk
>     capture request header Host len 40
>     server pool1 111.222.111.222:8099 weight 256 cookie backend1_pool1 check inter 500 fastinter 100 fall 1 rise 2 maxconn 500
>     server pool2 111.222.111.222:8100 weight 1 cookie backend1_pool2 check inter 800 fastinter 100 fall 1 rise 2 maxconn 250
>     server pool3 111.222.111.222:8101 backup
> ==========
>
> My goal is to make haproxy not return the 502 error to the user
> immediately, but instead have it retry the request N times at some
> interval, and only return the 502 error to the user if all retries
> fail. Can I do this somehow, or is there another suitable solution?

It's more complex than black-or-white. There is a solution so that you never see any error at all, but let me first explain what is happening and why it behaves that way.

When you restart Apache that way, you break existing connections at any point during their processing. Some were waiting for a request to come, some were processing the request, some were sending response headers, and some were sending response data. The 502 that you're seeing indicates that Apache had accepted the connection but did not finish sending headers, so most likely it was processing the request. A process killed before accepting the connection will at most cause a connection retry, and if it is killed after Apache has started sending a response, you won't see a 502 at all: the client will just get a truncated response.
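Incidentally, that connection-level retry is exactly what the "retries" and "option redispatch" lines already in your defaults section control; they only cover the connection phase and cannot help once the request has been sent to the server:

    defaults
        retries 10          # retry a *failed connection* up to 10 times
        option redispatch   # allow the last retry to go to another server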

There are two issues with retrying requests. The first one is related to the implementation (here haproxy, but any component will have a limit, even though a different one). Haproxy has a request buffer of limited size: a full request is buffered, parsed, processed and forwarded to the server, and from that point the request is no longer in haproxy's buffer. In theory, by adding a few more pointers, we could find the request in the buffer and retry it as long as the data have not been overwritten, and indeed we'll have to do that in the future, but more on this below. The problem is that some requests will definitely not fit in the buffer at all. Say you get a PUT request with a 10 MB file, and the server breaks the connection after you have forwarded 9 MB. You'd have to forward those 9 MB again, but it's clearly impossible to keep such large buffers around just for hypothetical retries. So there will always be a class of requests that cannot be replayed because of implementation limitations, whatever limits you set. And the same is true for the response: for all requests that were aborted after the server started to respond, we can't tell the client "hey, please ignore what I sent you till now, here's a new version instead".

The second issue comes from the HTTP specification. HTTP says that only idempotent requests may be replayed, which means requests whose effect on the server is exactly the same whether you perform them once or any number of times. A GET should be an idempotent request (in theory): if you retrieve a static file, fail in the middle and do it again, the server's state will not change. A GET with a query string should already be treated with caution. And a POST is definitely not idempotent. When you order a book on a site, you don't want a stupid load balancer between you and the site to silently post your order a second time because the first connection died in the middle. Same when you click "delete this mail" in your preferred webmail: you don't want the LB to send that twice.
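For reference, this is how RFC 2616 classifies the standard methods (this is the spec, not anything haproxy-specific; as noted above, real applications do not always respect it, hence the caution about GETs with a query string):

    GET, HEAD, PUT, DELETE, OPTIONS, TRACE  -> idempotent, may be replayed
    POST                                    -> not idempotent, must not be replayed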

So if a non-idempotent request fails, it will never be replayed at all, regardless of buffer capacities. That rule is mandated by the HTTP specification, and respecting it is very important.

This means that if you blindly restart an HTTP server which holds connections, whatever the proxy, LB or other component in front of it, you will always make the restart visible to some users.

So what can we do? Haproxy includes several ways to seamlessly act on your servers. There are two common methods people use, depending on how the work is split between their production teams:

The first method uses the maintenance mode, which can be triggered either from the web stats page or from the CLI on the stats socket ("disable server", "enable server").
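For example (a minimal sketch; the socket path is my own choice, and the socket must be declared with "level admin" for these commands to be accepted):

    # hypothetical addition to the global section:
    #   stats socket /var/run/haproxy.sock level admin

    # put the server into maintenance mode before restarting Apache:
    echo "disable server backend_pool1/pool1" | socat stdio /var/run/haproxy.sock

    # once Apache is back up, re-enable it:
    echo "enable server backend_pool1/pool1" | socat stdio /var/run/haproxy.sock

While a server is disabled, haproxy will not send it any new connections.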

The second method is different: it consists in making the server change its response to the health checks so that haproxy stops sending requests there, after which it can be restarted. There are two variants, a hard one (basically return 500 instead of 200 to the health check) and a soft one (respond 404 first). The hard one kicks every user off the server and is appropriate for static or otherwise stateless servers. The soft one keeps accepting requests from users who have a persistence cookie but does not assign new users to the server; it's for stateful servers (eg: the one where you're ordering your book and which holds your session). In both cases, either you wait for the logs to report that there's no activity anymore, or you decide that a few minutes after the announcement you automatically perform the restart.
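As an illustration of the soft variant (a sketch only; the /alive URL and the file path are assumptions to adapt to your setup), the matching haproxy feature is "http-check disable-on-404":

    listen backend_pool1 111.222.111.222:9099
        option httpchk GET /alive
        http-check disable-on-404
        server pool1 111.222.111.222:8099 cookie backend1_pool1 check inter 500 fall 1 rise 2

    # on the web server, before the restart:
    #   rm /var/www/htdocs/alive    (Apache now returns 404 on the check)
    # haproxy then stops assigning new users to pool1 while still serving
    # users who already carry the backend1_pool1 cookie.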

Some people script automatic config updates, haproxy upgrades or even system reboots using this second method. It's very convenient and flexible, and you never break any connection that way.

Hoping this helps,
Willy