
From: Willy Tarreau <w#1wt.eu>
Date: Fri, 15 Feb 2008 23:45:38 +0100

On Fri, Feb 15, 2008 at 10:25:48PM +0100, Krzysztof Oledzki wrote:
> >if timeout.check is set, use it as the maximum time for the whole test to
> >succeed, which means both the connect timeout, and the rest.
>
> This is exactly what I would like to avoid. It means that it will take
> "inter + (fall-1)*fastinter + fall*timeout.check" to detect a down server,

I forgot about fastinter, but it brings another issue then :

> for example:
>
> inter = 15s
> fastinter = 1s
> timeout connect = 4s
> timeout check = 10s
> fall = 4

It is not logical to have fastinter=1s with timeout connect=4s: if we expect that a connection may take up to 4s to establish, then fastinter will fire too early and the server will never come back up once it has gone down.

In fact, I know why connect=4s. It's because we want to be tolerant with connections which concern *traffic*. And if we use a low fastinter, it is because we want to quickly detect changes, possibly ignoring timeout.connect. That means that connect should not be used for health-checks because its value is always set to cover random errors and we want fast detection of trouble. Health check errors are covered by the "fall" parameter, not by the connect timeout.

Second, it is not reasonable to have check=10s with fastinter=1s, because this time we're sure that it will not work. As you said, check responses are fast, and they are also very stable over time. If we consider that fastinter can cover them, then timeout.check should be <= 1s too. Otherwise we create the problem where fastinter is too short.

> Total: 15 + 3*1 + 4*10 = 58s (~1m)
>
> What we currently have is:
> "inter + (fall-1)*fastinter + fall*min(inter, timeout.connect)"
> Total: 15 + 3*1 + 4*4 = 34s (~two times faster)
>
> Right, it still takes ~1m to detect an overloaded server - one that accepts
> SYNs but is too busy to send an answer, but in some (most?) cases we are
> faster.
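
For reference, the two totals quoted above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic from the quoted formulas; the function names are made up for illustration and do not correspond to anything in the HAProxy code.

```python
def detect_time_whole_check(inter, fastinter, fall, timeout_check):
    """inter + (fall-1)*fastinter + fall*timeout.check (all values in seconds)."""
    return inter + (fall - 1) * fastinter + fall * timeout_check


def detect_time_current(inter, fastinter, fall, timeout_connect):
    """inter + (fall-1)*fastinter + fall*min(inter, timeout.connect)."""
    return inter + (fall - 1) * fastinter + fall * min(inter, timeout_connect)


# Example values from the quoted message:
# inter=15s, fastinter=1s, timeout connect=4s, timeout check=10s, fall=4
print(detect_time_whole_check(15, 1, 4, 10))  # 58
print(detect_time_current(15, 1, 4, 4))       # 34
```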

I would like to propose something more in sync with your version, but slightly adapted for bizarre situations. In the examples, what I call "inter" means whichever interval currently applies (inter, fastinter, etc.).

Most common setups will use medium connect (about 4-5 seconds) to cover one lost packet, and short inter (about 1s) to quickly detect changes.
Some setups will use medium connect with large inter (60s) in order not to flood the server with checks, because we're not interested in quick changes. However, as you noted, it's not fair to allow that long for a connection attempt to succeed, and we should at least abort the check once we know that normal traffic would not succeed (meaning abort it after timeout.connect anyway).
Some setups use large connect (60s), at least because of the queue. We have to support them. Some of them will use short inter (1s) and we don't want that interval to shift to 1 minute because of the connect, and for others with very large inter (60s), we would still like to be able to stop a check if it does not succeed within a few seconds.

So I would say that we should always bound the connect timeout to min(timeout.connect, inter). Most common setups (first case) will remain unaffected. Second case will be affected but will get back to reality by testing the real service instead of something which might present an apparently up server which does not work for traffic. Third case will remain unaffected (connect=60, inter=1), but last case will not get any better (60, 60).
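
To make the four combinations concrete, here is a small Python sketch of the min(timeout.connect, inter) bound. The function name and the case labels are only illustrative, and the values (4s, 1s, 60s) are the rough figures used in the cases above.

```python
def check_connect_timeout(timeout_connect, inter):
    """Connect timeout applied to a health check, in seconds."""
    return min(timeout_connect, inter)


cases = [
    ("medium connect, short inter", 4, 1),
    ("medium connect, large inter", 4, 60),
    ("large connect, short inter", 60, 1),
    ("large connect, large inter", 60, 60),
]

for name, connect, inter in cases:
    bound = check_connect_timeout(connect, inter)
    print(f"{name}: connect={connect}s inter={inter}s -> {bound}s")
```

The last combination still yields 60s, which is the case the bound alone cannot improve.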

That's exactly where timeout.check is needed. What does it basically mean ? It means that we don't want a check to run for too long. If it does not *complete* within the expected time, kill it. Having it cover the whole sequence reduces the risk of delay shifts due to additions of many small numbers (what if the server responds 1 byte per second after all ?).

It also allows us to *reduce* the allowed connect time to the server for health checks without affecting timeout.connect which is initially for traffic.

So with your example values, we still remain at 34 seconds, but we can now reduce timeout.check to make it more meaningful :

  inter = 15s
  fastinter = 1s
  timeout connect = 4s
  timeout check = 1s (we don't care about retransmits, fall is there for that)
  fall = 4

we get :

"min(check,connect,inter) + (fall-1)*min(check,connect,fastinter) + fall*min(inter, connect, check)"
Total: 1 + 3*1 + 4*1 = 8s
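
The same total can be checked with a short Python sketch that evaluates the formula exactly as written above. The function and parameter names are only for illustration.

```python
def detect_time_proposed(inter, fastinter, fall, connect, check):
    """Evaluate the proposed formula term by term (all values in seconds)."""
    first = min(check, connect, inter)
    fast = (fall - 1) * min(check, connect, fastinter)
    attempts = fall * min(inter, connect, check)
    return first + fast + attempts


# inter=15s, fastinter=1s, timeout connect=4s, timeout check=1s, fall=4
print(detect_time_proposed(inter=15, fastinter=1, fall=4, connect=4, check=1))  # 8
```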

That way, existing setups benefit from the fix for the second case, and new ones can play with timeout.check to enforce the timeout on their checks without depending on other counter-intuitive timeout calculations.

Is that OK for you that way ? At least it is for me since I see how I can configure my proxies with this, and I also see how I can explain to users how to use it and what each parameter does. It simply boils down to this :

a health-check never lasts longer than inter|fastinter
a health-check never lasts longer than timeout.check
a health-check never takes longer than timeout.connect to establish
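
As a sketch, the three rules above could be combined into one small Python helper. This is not HAProxy code: the function name is hypothetical, and treating an unset timeout.check as None (meaning "no extra limit") is an assumption made for the example.

```python
def check_deadlines(interval, timeout_connect, timeout_check=None):
    """Return (connect_deadline, total_deadline) in seconds for one check.

    interval is whichever of inter/fastinter currently applies.
    timeout_check=None is an assumed stand-in for "timeout.check not set".
    """
    # Rules 1 and 2: the whole check is bounded by the interval and,
    # when set, by timeout.check.
    total = interval if timeout_check is None else min(interval, timeout_check)
    # Rule 3: the connect phase is bounded by timeout.connect, and cannot
    # outlast the check itself.
    connect = min(timeout_connect, total)
    return connect, total


print(check_deadlines(15, 4, 1))  # (1, 1)
print(check_deadlines(15, 4))     # (4, 15)
print(check_deadlines(1, 60))     # (1, 1)
```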

Best regards,
Willy

Received on 2008/02/15 23:45

Re: Changes in the check timeouts