Re: Changes in the check timeouts

From: Willy Tarreau <w#1wt.eu>
Date: Wed, 27 Feb 2008 09:34:29 +0100


Hi !

Getting back to this subject after 2 weeks...

On Sun, Feb 17, 2008 at 10:55:22PM +0100, Krzysztof Oledzki wrote:
> >Do not forget that large connect times are the prime signs
> >of failures if they happen often, and that if they happen occasionally,
> >they are just caused by a lost packet, and this lost packet may as well
> >happen in the response. That's why I think that timeout.check should cover
> >the whole check, which in your case should mostly look like this :
> >
> > <-- connect --> | <---------------- response ---------------->
> >       1ms                          5 seconds
> >
> >If you set timeout.check to 5 seconds, you cover the case above without
> >lost packet. But now, let's consider a lost packet in either part :
> >
> > <------- connect --------> | <---------------- response ---------------->
> >         3 seconds                              5 seconds
> >
> > <-- connect --> | <----------------------- response ---------------------->
> >       1ms                                  8 seconds
>
> If a packet is lost when a connection is already established it doesn't
> always mean that such retransmission takes 3s. I would rather say that in
> most cases it takes much less time. So it is possible for it to look like
> this:
>
> <-- connect --> | <----------------------- response ---------------->
>       1ms                  ~RTT + 5 seconds (< 8 seconds)

Not really, because health checks generally return very few bytes. For this to happen, the server would have to send at least two packets to the client, so that the client notices the gap and sends a SACK or explicitly asks for a retransmission. For most responses, which fit in one packet, the server would have to wait the full 3s without an ACK before retransmitting.

> The main question is whether it is better to have a precisely defined
> timeout of the whole check (connect + response read) or to have a precisely
> defined guaranteed time for a server to deal with a check-request.
>
> Let's assume that a server is quite busy, it needs 5s for an answer and
> check.timeout is set to 7s:
>
> 0. No packet lost:
> <-- connect --> | <----------------------- response -------------------->
>       1ms                               5 seconds
>
> 1. SYN packet lost (check.timeout covers full check):
>
> <--------------------- check.timeout: 7 seconds ----------------> FAILED
> <------- connect --------> | <---------------- response ---------------->
>         3 seconds                            5 seconds
>
> 2. SYN packet lost (check.timeout covers response read):
>                              <--------------------- check.timeout: 7 seconds -(...)->
> <------- connect --------> | <---------------- response ----------------> OK
>         3 seconds                            5 seconds
>
> IMHO, scenario #1 is better.

IMHO too. Let's not forget that transient failures in health checks do not cause problems, since the "fall" parameter exists to cover exactly this situation. My concern is not to hide a flaky server, but to be able to detect it just in time and, if possible, inform the admin that this server has a check failure rate higher than the others', indicating a real problem on this one.
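
To make the three quoted scenarios concrete, here is a minimal sketch of the two interpretations of the check timeout. It is only a model of the arithmetic in the diagrams, not HAProxy code; the function name check_passes and its parameters are made up for illustration, and the separate connect timeout is deliberately ignored.

```python
def check_passes(connect_s, response_s, timeout_check_s, covers_whole_check):
    """Return True if the check finishes before the check timeout expires.

    covers_whole_check=True  -> the timer starts together with the connect
    covers_whole_check=False -> the timer starts once the connection is up
    (hypothetical model; any separate connect timeout is not considered)
    """
    if covers_whole_check:
        elapsed = connect_s + response_s
    else:
        elapsed = response_s
    return elapsed <= timeout_check_s


TIMEOUT_CHECK = 7.0  # seconds, as in the quoted example

# 0. No packet lost: 1 ms connect, 5 s response
print(check_passes(0.001, 5.0, TIMEOUT_CHECK, covers_whole_check=True))   # True
# 1. SYN lost, timeout covers the full check: 3 s + 5 s > 7 s
print(check_passes(3.0, 5.0, TIMEOUT_CHECK, covers_whole_check=True))     # False
# 2. SYN lost, timeout covers only the response read: 5 s <= 7 s
print(check_passes(3.0, 5.0, TIMEOUT_CHECK, covers_whole_check=False))    # True
```

With the full-check interpretation the retransmitted SYN turns into one failed check, which is then left to "fall" to absorb; with the response-only interpretation it goes unnoticed.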

> So, the only problem is to make sure that the timeout used for
> check-connect is both not too short and not too long. What we have currently
> is min("timeout connect", "inter"). Maybe this one is wrong? If it is set
> too high, there is no way one can fix it by playing with timeout.check.
> Even if it covers the full check.

I agree.

> As I stated above, we should allow by default for 1 retransmission of a
> SYN. So *maybe* we can just hardcode it to quite safe ~3.5s? Or we can add
> another variable (you probably are starting to hate me ;) implicitly
> initialized to 3.5s (safe value) that one can change to: ~0.5s (no SYN
> retransmission is allowed) or >= ~8.5s (more SYN retransmissions are
> allowed).

Oh no, I really don't want to do that for two reasons :

My initial intention with fastinter was precisely to detect a failed server. I *want* to be able to set a connect timeout to 10 or 100 ms for the health checks, and have only <fall> cover the rare unexpected retransmits.
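
As an illustration of letting <fall> absorb the rare retransmit while a very short connect timeout still catches a dead server, here is a simplified sketch. It assumes a server is marked down after "fall" consecutive failed checks and does not model "rise" or anything else in the real state machine; server_state and failure_rate are hypothetical helper names.

```python
def server_state(results, fall=3):
    """Walk check results in order (True = check passed).

    Returns "DOWN" as soon as `fall` consecutive checks have failed,
    otherwise "UP". Recovery ("rise") is not modelled in this sketch.
    """
    consecutive_failures = 0
    for ok in results:
        if ok:
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= fall:
                return "DOWN"
    return "UP"


def failure_rate(results):
    """Fraction of failed checks, e.g. to compare one server with the others."""
    if not results:
        return 0.0
    return sum(1 for ok in results if not ok) / len(results)


# Occasional lost SYNs with a 100 ms connect timeout: isolated failures only
flaky = [True, True, False, True, True, False, True]
print(server_state(flaky), round(failure_rate(flaky), 2))  # UP 0.29

# A server that stopped answering: three failures in a row
dead = [True, False, False, False]
print(server_state(dead))  # DOWN
```

The isolated failures never take the server down, but they do show up in the failure rate, which is the signal worth reporting to the admin.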

> If timeout.check is not set:
> - a health-check never lasts longer than server.inter (current situation)
>
> If timeout.check is set:
> - a health-check never waits more than timeout.checkconnect (or just 3.5s)
> to connect
> - a health-check never lasts longer than timeout.check to read a response
>
> <CUT>
> >- a health-check never lasts longer than server.inter (current situation)
> Glad we agree on this one. ;)
>
> >- a health-check never lasts longer than timeout.check
> >- a health-check never waits more than inter|fastinter to connect
> >- a health-check never waits more than timeout.connect to connect
>
> IMHO fastinter set for example to 1s may be too short even for a connect
> timeout, as we want to allow a SYN retransmission.

No, in my case, I want to ensure we will *not* silently cover SYN retransmissions :-)

So I'm putting down all we have (and want) :

Then I'm wondering, wouldn't it be easier if we considered that both <inter> and <fastinter> act as a connect timeout, which can even be reduced to timeout.connect if that is smaller, and used timeout.check only as a grace period for the check itself once the connection succeeds ?

It seems logical enough to me, easy to explain and understand, and would fit all of our usages.
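
A rough sketch of that rule, to show how the values would combine. This only models the proposal as worded above and is not actual HAProxy code; check_timeouts and its parameter names are invented for the example, and the case where timeout.check is unset is left out.

```python
def check_timeouts(interval_s, timeout_check_s, timeout_connect_s=None):
    """Return (connect_timeout, response_timeout) in seconds for one check.

    interval_s        -- the check interval currently in use (inter or fastinter)
    timeout_check_s   -- timeout.check, used only after the connection succeeds
    timeout_connect_s -- timeout.connect, or None if not configured
    (hypothetical helper modelling the proposal, not HAProxy's implementation)
    """
    connect_timeout = interval_s
    if timeout_connect_s is not None:
        # timeout.connect only ever shortens the connect phase
        connect_timeout = min(connect_timeout, timeout_connect_s)
    return connect_timeout, timeout_check_s


# inter = 2 s, timeout.connect = 100 ms, timeout.check = 5 s
print(check_timeouts(2.0, 5.0, timeout_connect_s=0.1))  # (0.1, 5.0)

# fastinter = 1 s, timeout.connect = 5 s: the interval is the smaller bound
print(check_timeouts(1.0, 5.0, timeout_connect_s=5.0))  # (1.0, 5.0)
```

In the first case a SYN retransmission cannot be silently covered, since the connect phase is abandoned after 100 ms, while a busy server still gets the full 5 seconds to answer once connected.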

Regards,
Willy