Re: Backend Server UP/Down Debugging?

From: Krzysztof Oledzki <ole#ans.pl>
Date: Sun, 30 Aug 2009 14:21:55 +0200 (CEST)

On Thu, 27 Aug 2009, Dmitry Sivachenko wrote:

> On Thu, Aug 27, 2009 at 08:45:23AM +0200, Krzysztof Oledzki wrote:
>>> On Wed, Aug 26, 2009 at 02:00:42PM -0700, Jonah Horowitz wrote:
>>>> I'm watching my servers on the back end and occasionally they flap. I'm wondering if there is a way to see why they are taken out of service. I'd like to see the actual response, or at least a HTTP status code.
>>>
>>> right now it's not archived. I would like to keep a local copy of
>>> the last request sent and response received which caused a state
>>> change, but that's not implemented yet. I wanted to clean up the
>>> stats socket first, but now I realize that we could keep at least
>>> some info (eg: HTTP status, timeout, ...) in the server struct
>>> itself and report it in the log. Nothing of that is performed right
>>> now, so you may have to tcpdump at best :-(
>>
>> As always, I have a patch for that, solving it nearly exactly like you
>> described it. ;) However for the last half year I have been rather silent,
>> mostly because it is a very important time in my private life, so I think
>> I'm partially excused. ;) I know that there are some unfinished tasks (acl
>> for example) so I'll try to push ASAP, maybe starting from the easier
>> patches, like this one. The rest will have to wait until I get back from
>> honeymoon.
>
>
> I see flapping servers in my logs too and also have no clue why
> haproxy disables them.
>
> If you have a patch to log the reason why the particular server
> was disabled, I'd love to test it (I run 1.4-dev2).

Please check the attached patch. This code is far from being ready for inclusion; however, I cleaned it up and ported it to 1.4-dev2, so it should work for you.

With this patch you should be able to see the status of the last check on the stats page and to check in your logs why servers are considered down, for example:

[WARNING] 241/141556 (1971) : Backup Server p1/s1 is DOWN, reason: Layer5-7 response error(10), code: 400, check duration: 0ms. 0 active and 4 backup servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 241/141702 (4006) : Backup Server p2/s6 is DOWN, reason: Layer5-7 timeout(6), check duration: 10001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
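
If you want to count these reasons across a larger log, a rough Python sketch along the following lines might help. It is not part of the patch. The regular expression is written only against the two sample lines above, so treat the exact format as an assumption that may change before inclusion; the field names (proxy, server, reason, status, code, duration_ms) are my own labels, not names used by haproxy.

```python
import re

# Assumed format, derived only from the two sample lines in this mail.
# The ", code: NNN" part is optional because the timeout line lacks it.
DOWN_RE = re.compile(
    r"Server (?P<proxy>[^/\s]+)/(?P<server>\S+) is DOWN, "
    r"reason: (?P<reason>.+?)\((?P<status>\d+)\)"
    r"(?:, code: (?P<code>\d+))?"
    r", check duration: (?P<duration_ms>\d+)ms"
)


def parse_down_line(line):
    """Return a dict describing a 'is DOWN' log line, or None if no match."""
    match = DOWN_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    # Convert the numeric fields; 'code' stays None when it is absent.
    for key in ("status", "code", "duration_ms"):
        if info[key] is not None:
            info[key] = int(info[key])
    return info


if __name__ == "__main__":
    sample = (
        "[WARNING] 241/141702 (4006) : Backup Server p2/s6 is DOWN, "
        "reason: Layer5-7 timeout(6), check duration: 10001ms. "
        "0 active and 0 backup servers left."
    )
    print(parse_down_line(sample))
```

Lines that do not match simply yield None, so the function can be applied to every line of a log file and the results filtered afterwards.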

Best regards,

                                 Krzysztof Olędzki

Received on 2009/08/30 14:21
