On Wed, Dec 26, 2007 at 02:30:54PM +0100, Krzysztof Oledzki wrote:
> >IMHO, this should only be done for non-deterministic algorithms such as
> >roundrobin, and later leastconn. But people who use hashes really expect
> >that the same client or request will be sent to the same server (or to an
> >equivalent backup server when that may happen).
>
> OK, what about PH, as there is a get_server_rr_with_conns fallback?
The RR fallback is there so that if a URL does *not* contain the required parameter (typically a user ID), then and only then do we use round-robin. In fact it's a poor man's stickiness on a URL parameter, and nothing more. But you're quite right to point the finger at this function, as it is this function itself which should take the server to avoid into account.
> >And in fact, in the PH code, if you find the same server, you return NULL,
> >which will immediately cause a 503. In this case, we'd probably prefer to
> >still return the same server and perform a last connection attempt to it
> >in the hope that it finally accepts the connection.
>
> Not exactly as if s->srv ==NULL then s->srv =
> get_server_rr_with_conns(s->be, s->srv):
Yes, you're right. What I meant is that instead of returning "no server" at the end of the function, it would be better to return the one we tried to avoid.
I agree with you that it's dangerous to return the value passed in argument. But in fact, since we try to exclude a server if it matches, what we should do would basically look like this:
    get_server_any_algo(be, srv, avoidme) {
        server *avoided = NULL;
        ...
        while (server_not_determined_yet) {
            if (server == avoidme)
                avoided = server;
            else
                break;
        }
        ...
        if (!server)
            server = avoided;
        return server;
    }
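
To make the idea above a bit more concrete, here is a minimal self-contained sketch in C. It is not HAProxy code: `struct server`, its `usable` flag and `pick_server_avoiding()` are hypothetical names made up for the illustration, and the selection is a plain round-robin walk over an array rather than any of the real algorithms.

```c
#include <stddef.h>

/* Hypothetical, simplified server descriptor (not HAProxy's struct server). */
struct server {
    const char *name;
    int usable;                /* non-zero if the server may receive traffic */
};

/*
 * Walk the array round-robin style starting at *pos and return the first
 * usable server which is not <avoidme>. If the only usable candidate is
 * <avoidme>, return it anyway instead of NULL, so that the caller can make
 * one last connection attempt rather than fail immediately.
 */
struct server *pick_server_avoiding(struct server *srv, size_t count,
                                    size_t *pos, const struct server *avoidme)
{
    struct server *avoided = NULL;
    size_t i;

    for (i = 0; i < count; i++) {
        struct server *s = &srv[(*pos + i) % count];

        if (!s->usable)
            continue;
        if (s == avoidme) {
            avoided = s;       /* remember it, keep looking for another one */
            continue;
        }
        *pos = (*pos + i + 1) % count;
        return s;
    }
    return avoided;            /* NULL only if no server is usable at all */
}
```

With this shape, the caller only gets NULL when there is really nothing left to try.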
> case BE_LB_ALGO_PH:
>         /* URL Parameter hashing */
>         s->srv = get_server_ph(s->be,
>                                s->txn.req.sol + s->txn.req.sl.rq.u,
>                                s->txn.req.sl.rq.u_l, s->srv);
>         if (!s->srv) {
>                 /* parameter not found, fall back to round robin on the map */
>                 s->srv = get_server_rr_with_conns(s->be, s->srv);
>                 if (!s->srv)
>                         return SRV_STATUS_FULL;
>         }
>         break;
>
> It should work, or maybe I overlooked something?
It should, but it's not what we would expect, since the goal of the hash is to *stick* to the same server as long as the information is there.
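
Just to illustrate what "stick" means here, a small sketch in C, under assumptions: this is not the real get_server_ph(), and `struct server`, `hash_bytes()` and `pick_by_url_param()` are hypothetical names with a toy hash. The only point is that the same parameter value always maps to the same server, and that NULL is returned only when the parameter is absent, which is the one case where falling back to round-robin makes sense.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified server descriptor. */
struct server {
    const char *name;
};

/* Toy string hash (djb2 style); any stable hash would do here. */
static unsigned long hash_bytes(const char *p, size_t len)
{
    unsigned long h = 5381;

    while (len--)
        h = h * 33 + (unsigned char)*p++;
    return h;
}

/*
 * Look for "<param>=" in the query string of <uri> and map its value onto
 * the server array. Returns NULL only when the parameter is not present
 * (or when there is no server at all).
 */
struct server *pick_by_url_param(struct server *srv, size_t count,
                                 const char *uri, const char *param)
{
    const char *q = strchr(uri, '?');
    size_t plen = strlen(param);

    if (!q || !count)
        return NULL;
    q++;
    while (*q) {
        const char *end = q + strcspn(q, "&");   /* end of this name=value */

        if ((size_t)(end - q) > plen && !strncmp(q, param, plen) &&
            q[plen] == '=') {
            const char *val = q + plen + 1;

            return &srv[hash_bytes(val, (size_t)(end - val)) % count];
        }
        q = *end ? end + 1 : end;
    }
    return NULL;
}
```

Two requests carrying the same user ID land on the same entry whatever the rest of the URL looks like, so silently moving such a request to another server defeats the purpose.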
> >This reminds me about a feature I've wanted to add for a long time now: two
> >different check intervals. The normal one (current "inter") and a fast one,
> >used during transitions (eg: "fastinter"), used to speed up state transition
> >when a failed check has been detected. The idea is that once this is in
> >place, it should be relatively simple to switch a server to fastinter as
> >soon as it experiences a redispatch. This will make it possible to detect
> >server failures very quickly even with slow checks.
>
> Yes, and as I remember we had already discussed it and agreed that
> it is important. Please also note that it often happens that a server is
> not defective but only saturated for ~1s, so if we also add a retrydelay
> parameter then instead of 4 tcp resets in a row we may get a tcp ack, if
> we wait ~1s after each refused connection.
Yes, I still have a note in my TODO list: "add a turn_around state". This is still not easy to add right now, but it should be easier once the FSMs are broken into pieces.
> >For instance, if we consider that a server is checked every 10 seconds and
> >has 3 retries ("fall 3"), 30 seconds will be necessary to detect a failure.
> >Now if we use "inter 10s fastinter 500ms", we will need 11 seconds to detect
> >a failure, or 1.5 second after the first redispatch occurs. In this example,
> >this is 20 times faster than what we can currently achieve.
>
> I have a half-ready patch for it. In my solution it is quite simple and
> looks similar to this:
>
> if (((sv->health < s->rise + s->fall - 1) || (sv->health)) &&
>     s->inter > SRV_CHK_INTER_THRES) then
>         cr = s->inter * global.tcr;
> else
>         cr = s->inter;
>
> This tcr is for "transition check rate" so you have a global 1-100%
> parameter for it, exactly like spread_checks.
Basically what we need indeed, except that I'd really like to have two different values and not just a factor between them. While most people would prefer to use faster checks during transitions, others will prefer to slow them down in order to give servers a chance to recover.
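
As a rough sketch of what two independent values would give, here is the arithmetic of the "inter 10s fastinter 500ms" example written out in C. Everything here is hypothetical (`struct check_cfg` and the three helpers do not exist anywhere); only the "inter", "fastinter" and "fall" names come from the discussion.

```c
/* Hypothetical per-server check settings, all times in milliseconds. */
struct check_cfg {
    unsigned inter;      /* normal interval between two checks */
    unsigned fastinter;  /* interval used while a transition is in progress */
    unsigned fall;       /* consecutive failed checks before marking down */
};

/* Delay before the next check: fastinter during a transition, else inter. */
unsigned next_check_delay(const struct check_cfg *c, int in_transition)
{
    return in_transition ? c->fastinter : c->inter;
}

/*
 * Worst case when the server dies right after a successful check: one full
 * normal interval until the first failed check, then (fall - 1) more checks
 * at the fast interval.
 */
unsigned worst_case_detection_ms(const struct check_cfg *c)
{
    return c->inter + (c->fall - 1) * c->fastinter;
}

/*
 * If a redispatch switches the server to fastinter at once, <fall> failed
 * checks at the fast interval are enough.
 */
unsigned detection_after_redispatch_ms(const struct check_cfg *c)
{
    return c->fall * c->fastinter;
}
```

With inter=10000, fastinter=500 and fall=3 this gives 11000 ms and 1500 ms, the figures from the example. Since the two values are independent, nothing prevents setting fastinter larger than inter for those who prefer to slow checks down during transitions.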
Cheers,
Willy
Received on 2007/12/26 15:01