Hypermail

From: Willy Tarreau <w#1wt.eu>
Date: Sun, 30 Nov 2008 22:01:20 +0100

Hi Kai,

On Sun, Nov 30, 2008 at 08:24:17PM +0000, Kai Krueger wrote:
> >Could you try to change the LB algo from leastconn to roundrobin and
> >reproduce the test above ? If only leastconn triggers the issue, then
> >there might be a bug in its set_server_down() function. Maybe sometimes
> >a server is not removed from the tree.
> >
> Ok, I think I might have tracked the problem down a bit more. I still
> don't have a full grasp of which path exactly the requests go through,
> but I might have identified one potential way of preventing this. The
> problem is not that the server is not removed from the tree. That
> happens correctly. However there seem to be several paths in
> proto_http.c which bypass the load balancing tree with comments of the
> form "We used to have a free connection slot. Since we'll never use it,
> we have to inform the server that it may be used by another session. "
> However the sever could have gone down in the mean time and neither
> may_dequeue_tasks(server) nor process_srv_queue(server) check for the
> status of the server before wakeing up tasks on that server. Adding a
> check in either of these functions seem to do the trick, but again, I am
> not sure if this is the right fix, or if my analysis was correct.

Damn, you're perfectly right, shame on me ! I'm amazed that this problem has never showed up before. I think it is because the low maxconns and the long requests trigger it more easily (otherwise the server would have nothing left in the queue when completing an old request).

Could you please try this patch ? If it works, I'll merge it and release two more versions.

Thanks very much for your long work at tracking this issue down!

Willy

From 80b286a064eaec828b7fd10e98e3f945e8b244f3 Mon Sep 17 00:00:00 2001 From: Willy Tarreau <w#1wt.eu>
Date: Sun, 30 Nov 2008 21:51:58 +0100
Subject: [BUG] do not dequeue requests on a dead server

Kai Krueger reported a problem when a server goes down with active connections. A lot of connections were drained by that server. Kai did an amazing job at tracking this bug down to the dequeuing mechanism which forgets to check the server state before allowing a request to be sent to a server.

The problem occurs more often with long requests, which have a chance to complete after the server is completely marked down, and to find requests in the global queue which have not yet been fetched by other servers.

The fix consists in ensuring that a server is up before sending it any new request from the queue.

---
 include/proto/queue.h |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/proto/queue.h b/include/proto/queue.h
index 6899aee..be78b53 100644
--- a/include/proto/queue.h
+++ b/include/proto/queue.h
@@ -64,10 +64,11 @@ static inline struct pendconn *pendconn_from_px(const struct proxy *px) {
 }
 
 /* returns 0 if nothing has to be done for server <s> regarding queued connections,
- * and non-zero otherwise. Suited for and if/else usage.
+ * and non-zero otherwise. If the server is down, we always return zero. Suited for
+ * and if/else usage.
  */
 static inline int may_dequeue_tasks(const struct server *s, const struct proxy *p) {
-	return (s && (s->nbpend || p->nbpend) &&
+	return (s && (s->state & SRV_RUNNING) && (s->nbpend || p->nbpend) &&
 		(!s->maxconn || s->cur_sess < srv_dynamic_maxconn(s)));
 }
 
-- 
1.6.0.4

Received on 2008/11/30 22:01

Re: Sending requests to servers that are marked as down?