Re: Changes in the check timeouts

From: Krzysztof Oledzki <ole#ans.pl>
Date: Sat, 16 Feb 2008 01:42:49 +0100 (CET)

On Fri, 15 Feb 2008, Willy Tarreau wrote:

> On Fri, Feb 15, 2008 at 10:25:48PM +0100, Krzysztof Oledzki wrote:
>>> if timeout.check is set, use it as the maximum time for the whole test to
>>> succeed, which means both the connect timeout, and the rest.
>>
>> This is exactly what I would like to avoid. It means that it will take
>> "inter + (fall-1)*fastinter + fall*timeout.check" to detect a down server,
>
> I forgot about fastinter, but it brings another issue then :
>
>> for example:
>>
>> inter = 15s
>> fastinter = 1s
>> timeout connect = 4s
>> timeout check = 10s
>> fall = 4
>
> it is not logical to have fastinter=1s with timeout connect=4s, because if
> we expect that we may take up to 4s to connect, then fastinter will kick
> too fast and we'll never get up after getting down.

Ah, right. I forgot how it is supposed to work. :(

> In fact, I know why connect=4s. It's because we want to be tolerant with
> connections which concern *traffic*. And if we use a low fastinter, it is
> because we want to quickly detect changes, possibly ignoring timeout.connect.
> That means that connect should not be used for health-checks because its
> value is always set to cover random errors and we want fast detection of
> trouble. Health check errors are covered by the "fall" parameter, not by
> the connect timeout.
>
> Second, it is not reasonable to have check=10s with fastinter=1s, because
> this time we're sure that it will really not work. As you said, check
> responses are fast, and also they're really stable in time. If we consider
> that fastinter can cover them, then timeout.check should be <= 1s too.
> Otherwise we create the problem where fastinter will be too short.
>
>> Total: 15 + 3*1 + 4*10 = 58s (~1m)
>>
>> What we currently have is:
>> "inter + (fall-1)*fastinter + fall*min(inter, timeout.connect)"
>> Total: 15 + 3*1 + 4*4 = 34s (~two times faster)
>>
>> Right, it still takes ~1m to detect an overloaded server - one that accepts
>> SYNs but is too busy to send an answer, but in some (most?) cases we are
>> faster.

Right, right. It seems it is too late for me to think reasonably after a long day full of meetings. :( Sorry.
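
For reference, the two totals quoted above can be reproduced with a short Python sketch. The function and variable names are illustrative only (they are not HAProxy identifiers); the code simply evaluates the formula "inter + (fall-1)*fastinter + fall*timeout" from the thread with the example values.

```python
def detection_time(inter, fastinter, fall, per_check_timeout):
    """Worst-case seconds to mark a server down, per the formula in the thread."""
    return inter + (fall - 1) * fastinter + fall * per_check_timeout

# Example values from the thread, in seconds.
inter, fastinter, connect, check, fall = 15, 1, 4, 10, 4

# Every check is allowed to run for the full timeout.check.
with_check = detection_time(inter, fastinter, fall, check)

# Current behaviour: every check is bounded by min(inter, timeout.connect).
current = detection_time(inter, fastinter, fall, min(inter, connect))

print(with_check, current)  # 58 34
```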

> I would like to propose something more in sync with your version, but
> slightly adapted for bizarre situations. In the examples, what I call
> "inter" will be whatever inter (normal, fast, etc...).
>
> - most common setups will use medium connect (about 4-5 seconds) to cover
> one lost packet, and short inter (about 1s) to quickly detect changes.

OK.

> - some setups will use medium connect with large inter (60s) in order not
> to flood the server with checks, because we're not interested in quick
> changes. However, as you noted, it's not fair to permit that long for
> a connection attempt to succeed, and we should at least kick the check
> off if we know that normal traffic will not succeed (meaning kick it
> off after timeout.connect anyway).

Right.

> - some setups use large connect (60s) at least because of queue. We have to
> support them. Some of them will use short inter (1s) and we don't want
> that interval to shift to 1 minute because of the connect, and for others
> with very large inter (60s), we would still like to be able to stop a check if
> it does not succeed within a few seconds.

Exactly.

> So I would say that we should always bound the connect timeout to
> min(timeout.connect, inter).

Yes, we could, but...

> Most common setups (first case) will remain unaffected. Second case will
> be affected but will get back to reality by testing the real service
> instead of something which might present an apparently up server which
> does not work for traffic. Third case will remain unaffected
> (connect=60, inter=1), but last case will not get any better (60, 60).

... but this is not so simple. A health check is not only a successful connection. It is also the work a server has to do to handle a request, and this work takes some time. If the server is loaded (but not overloaded) and the health-check script does something more than "return '200 OK'", then it may take a few seconds. This is especially true if the script checks, for example, whether it is able to read files from NFS/CIFS/iSCSI storage, connects to a database (possibly waiting for a free slot) and performs some selects to find out whether it is supposed to return a 200 or rather a 4xx/5xx code because someone scheduled a downtime, ...
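
To make the proposed bound concrete, here is a minimal Python sketch that evaluates min(timeout.connect, inter) for the four kinds of setup discussed in the quoted text. The helper name and the exact values (4s for "medium", 60s for "large", 1s for "short") are assumptions picked from the ranges mentioned in the thread.

```python
def check_connect_bound(timeout_connect, inter):
    """Connect timeout applied to a health check under the proposed rule."""
    return min(timeout_connect, inter)

# (timeout.connect, inter) in seconds; values are illustrative.
cases = {
    "medium connect, short inter": (4, 1),
    "medium connect, large inter": (4, 60),
    "large connect, short inter": (60, 1),
    "large connect, large inter": (60, 60),
}

bounds = {name: check_connect_bound(c, i) for name, (c, i) in cases.items()}
for name, bound in bounds.items():
    print(f"{name}: {bound}s")
```

The last case still leaves a 60s bound, which is the "(60, 60)" situation that the quoted text says "will not get any better". Note that this bound only covers connection establishment; it says nothing about how long the check script itself may run.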

> That's exactly where timeout.check is needed. What does it basically mean ?
> It means that we don't want a check to run for too long. If it does not
> *complete* within the expected time, kill it. Having it cover the whole
> sequence reduces the risk of delay shifts due to additions of many small
> numbers (what if the server responds 1 byte per second after all ?).

But we also want to give a health-check some chances to finish.

> It also allows us to *reduce* the allowed connect time to the server for
> health checks without affecting timeout.connect which is initially for
> traffic.
>
> So with your example values, we still remain at 34 seconds, but we can
> now reduce timeout.check to make it more meaningful :
>
> inter = 15s
> fastinter = 1s
> timeout connect = 4s
> timeout check = 1s (we don't care about retransmits, fall is there for that)
> fall = 4
>
> we get :
>
> "min(check,connect,inter) + (fall-1)*min(check,connect,fastinter) +
> fall*min(inter, connect, check)"
> Total: 1 + 3*1 + 4*1 = 8s
>
> That way, existing setups benefit from the fix for the second case, and
> new ones can play with timeout.check to enforce the timeout on their
> checks without depending on other counter-intuitive timeout calculations.
>
> Is that OK for you that way ? At least it is for me since I see how I
> can configure my proxies with this, and I also see how I can explain
> to users how to use it and what each parameter does. This simply boils
> down to this :
>
> - a health-check never lasts longer than inter|fastinter
> - a health-check never lasts longer than timeout.check
> - a health-check never takes longer than timeout.connect to establish
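
The proposed formula and the three rules above can be sketched in Python as follows. This is only an illustration of the arithmetic in the quoted text; the function names are made up here and do not correspond to HAProxy code.

```python
def detection_time_bounded(inter, fastinter, connect, check, fall):
    """Evaluate the proposed formula from the quoted text (seconds)."""
    first = min(check, connect, inter)
    retries = (fall - 1) * min(check, connect, fastinter)
    timeouts = fall * min(inter, connect, check)
    return first + retries + timeouts

def check_limits(interval, connect, check):
    """Apply the three rules: returns (max time to establish, max total time)."""
    total = min(interval, check)      # never longer than inter|fastinter or timeout.check
    establish = min(total, connect)   # never longer than timeout.connect to establish
    return establish, total

# inter=15s, fastinter=1s, timeout connect=4s, timeout check=1s, fall=4
print(detection_time_bounded(15, 1, 4, 1, 4))  # 8
print(check_limits(1, 4, 1))                   # (1, 1)
```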

In my situation, once a health-check connection has been successfully established, it typically takes 0-5 more seconds to finish the test, so a 10s timeout seems to be safe. I would like to detect that a server is down or broken ASAP, but at the same time I don't want to kick it out if it (or a database, a storage, a memcache engine, etc.) is saturated for a (short) moment. If something goes wrong I like to repeat the test soon, but not so soon that it causes false alarms or floods the server (fastinter = ~1s), so I need timeout.check to be large. Not very large, but large enough. Bounding everything to fastinter is not going to work for me. :( Bounding it to timeout.check can be fine, but only if no connect timeout occurs, because if one does, the server may run out of time to execute the health-check script. So, this is the reason why I designed it that way.
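
A rough Python sketch of this concern: if one limit covers the whole check, any time spent connecting is taken away from the script. The subtraction model, the helper name, the 3s connect delay and the 6s timeout.check are assumptions for illustration only, not a description of HAProxy's implementation.

```python
def script_budget(total_limit, connect_time):
    """Seconds left for the check script once the connection is established."""
    return max(0, total_limit - connect_time)

SCRIPT_WORST_CASE = 5  # "0-5 more seconds" to finish a test

scenarios = {
    "bounded by fastinter (1s)": script_budget(min(1, 10), 0),
    "timeout.check=10s, instant connect": script_budget(10, 0),
    "timeout.check=10s, 3s lost connecting": script_budget(10, 3),
    "timeout.check=6s (hypothetical), 3s lost connecting": script_budget(6, 3),
}

for name, budget in scenarios.items():
    verdict = "enough" if budget >= SCRIPT_WORST_CASE else "too short"
    print(f"{name}: {budget}s left, {verdict}")
```

With a limit that applies only after the connection is established, the script budget would stay the same regardless of how long the connect phase took.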

Finally, this "keep the old behavior if timeout.check is not set" rule was supposed to keep old configs 100% valid, but give a chance to tune everything if someone finds the time to read the new manual and decides to use the additional "check timeout".

Best regards,

                                 Krzysztof Olędzki

Received on 2008/02/16 01:42
