Re: Why would haproxy send a request to a server that is down?

From: Willy Tarreau <w#1wt.eu>
Date: Fri, 14 May 2010 23:30:02 +0200


Hi Malcolm,

On Fri, May 14, 2010 at 12:07:53PM -0700, Malcolm Handley wrote:
> Hi, everyone.
>
> I'm having some trouble with the routing of requests to servers within a
> backend.
>
> Firstly, although I have "retries 3" in the defaults section of my config
> file I'm not seeing any evidence of retries. If a server is down but has not
> been detected as down by haproxy then a request may still get sent to it and
> a failure returned to the client. (This is the first bold line in the log
> below.)

Yes, this is expected if your "retries" value is not large enough to cover the time to detect that the server is down. Also, the retries are only performed on the same server. If you want the request to be redispatched to another server after the last attempt, you should use "option redispatch".

> Second, occasionally haproxy seems to route a request to a server that it
> knows is down. (This is the second bolded section below.)

No, if you look more closely, you'll see that the request was received at 04:03:20.475, *before* the server was marked down (04:03:21), and failed last attempt at 04:03:23. Since the request did not switch to another server on the last retry, I think you did not have "option redispatch" enabled.

> I could understand both of these if I were using cookies for routing and had
> not enabled redispatching. But I'm using "balance leastconn" with no mention
> of cookies in the config file. What else might I be doing that would force
> haproxy to use a downed backend and not retry requests?

Well, be careful, retries are always performed on the same server, except the last one which can be redispatched. I will study if we could force an early redispatch in case the server changes state during retries, but there's nothing certain in this area.

Regards,
Willy



> May 14 04:03:03 prod_lb0 haproxy[28398]: 67.112.125.46:61758
> [14/May/2010:04:02:44.517]
> ws_in ws_in/<NOSRV> -1/-1/-1/-1/18609 400 187 - - CR-- 7/7/0/0/0 0/0
> "<BADREQ>"
> May 14 04:03:17 prod_lb0 haproxy[28398]: *127.0.0.1:58921
> [14/May/2010:04:03:08.486]
> ws_in lists_ws/ws_2 51/0/0/-1/9049 502 204 - - SH-- 7/7/2/1/0 0/0 "GET
> /-/ping HTTP/1.1"*
> May 14 04:03:17 prod_lb0 haproxy[28074]: 67.112.125.46:59661
> [14/May/2010:03:04:22.220]
> ws_in lists_ws/ws_2 55/0/0/31/3535558 101 27943453 - - ---- 2/2/2/2/0 0/0
> "GET /app/etherlist/socket?session_id=9680378170&profiler=1 HTTP/1.1"
> May 14 04:03:17 prod_lb0 haproxy[28398]: 99.66.213.198:58624
> [14/May/2010:03:52:18.455]
> ws_in lists_ws/ws_2 455/0/0/2/659334 101 3348145 - - ---- 6/6/1/0/0 0/0 "GET
> /app/etherlist/socket?session_id=9385884222&profiler=1 HTTP/1.1"
> May 14 04:03:17 prod_lb0 haproxy[28074]: 98.210.108.197:43537
> [14/May/2010:03:04:22.477]
> ws_in lists_ws/ws_2 165/0/0/2/3535454 101 8809800 - - ---- 1/1/1/1/0 0/0
> "GET /-/socket?session_id=9646753365 HTTP/1.1"
> May 14 04:03:18 prod_lb0 haproxy[28074]: 70.36.139.123:56872
> [14/May/2010:03:04:22.511]
> ws_in lists_ws/ws_2 88/0/0/2/3535503 101 25165709 - - ---- 0/0/0/0/0 0/0
> "GET /-/socket?session_id=9669242145 HTTP/1.1"
> May 14 04:03:21 prod_lb0 haproxy[28398]: *Server lists_ws/ws_2 is DOWN,
> reason: Layer4 connection problem, info: "Connection refused", check
> duration: 0ms.*
> May 14 04:03:23 prod_lb0 haproxy[28398]: *127.0.0.1:58965
> [14/May/2010:04:03:20.475]
> ws_in lists_ws/ws_2 10/0/-1/-1/3032 503 212 - - SC-- 8/8/2/0/3 0/0 "GET
> /-/ping HTTP/1.1"*
> May 14 04:03:29 prod_lb0 haproxy[28398]: *Server lists_ws/ws_2 is UP,
> reason: Layer4 check passed, check duration: 0ms.*
> May 14 04:03:34 prod_lb0 haproxy[28398]: 99.66.213.198:59005
> [14/May/2010:04:02:43.444]
> ws_in ws/ws_1 15/0/1/30/50569 304 296 - - cD-- 9/9/4/2/0 0/0 "GET
> /-/static/luna/browser/images/loading.gif HTTP/1.1"

Received on 2010/05/14 23:30

This archive was generated by hypermail 2.2.0 : 2010/05/14 23:45 CEST