Re: node frequently goes down on another physical machine

From: Bedis 9 <bedis9#gmail.com>
Date: Wed, 29 Dec 2010 10:24:05 +0100


Hi Amit,

Try a "netstat -in" and see if you have any errors on your interfaces :)

It might help to figure out whether you have a duplex mismatch.
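
If you want to automate that check, here is a minimal sketch in Python that scans "netstat -in" output for non-zero error, drop and overrun counters. It assumes the Linux net-tools column names (RX-ERR, TX-ERR, ...); the function name and the sample text are made up for illustration and are not output from your machines.

```python
# Counters that should normally stay at zero on a healthy link.
ERROR_COLUMNS = ("RX-ERR", "RX-DRP", "RX-OVR", "TX-ERR", "TX-DRP", "TX-OVR")


def interfaces_with_errors(netstat_output):
    """Return {iface: {counter: value}} for every non-zero error counter."""
    header = None
    problems = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Iface":
            # Column names; the layout differs between net-tools versions,
            # so columns are looked up by name rather than by position.
            header = fields
            continue
        if header is None or len(fields) != len(header):
            continue  # skips the "Kernel Interface table" title and odd lines
        row = dict(zip(header, fields))
        bad = {}
        for name in ERROR_COLUMNS:
            value = row.get(name, "0")
            if value.isdigit() and int(value) > 0:
                bad[name] = int(value)
        if bad:
            problems[row["Iface"]] = bad
    return problems


# Hypothetical sample in the net-tools layout (assumed, not real data).
SAMPLE = """\
Kernel Interface table
Iface   MTU Met   RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0  812345    217      0      0   790112     35      0      0 BMRU
lo    16436   0    4021      0      0      0     4021      0      0      0 LRU
"""

for iface, counters in interfaces_with_errors(SAMPLE).items():
    print(iface, counters)
```

Feed it the real "netstat -in" text from each LB and TC instead of SAMPLE. Counters that keep growing between two runs on one interface are the thing to look for.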

cheers

On Wed, Dec 29, 2010 at 8:04 AM, Amit Nigam <amitnigam#gobindas.in> wrote:
> Hi Willy,
>
> Thanks for your support, it makes me believe I will solve this riddle.
> After updating to 1.4.10, syncing TC2 and LB1 times through NTP, and using
> options tcp-smart-connect and tcp-smart-accept, I have seen significant
> improvements in server downtimes, retries and redispatches. But I still see
> lots of retries, though there is only 1 redispatch at TC2.
>
>>> Now in new stats page I noticed one thing which was not in 1.3.22 is
>>> LastChk, but I wonder tc1 is showing L7OK/302 in 324ms and tc2 is
>>> showing L7OK/302 in 104ms while currently haproxy is running on LB1
>>> and there are 13 retries at TC2.
>>
>> The only explanation I can see is a network connection issue. What you
>> describe looks like packet loss over the wire. It's possible that one
>> of your NICs is dying, or that the network cable or switch port is
>> defective.
>>
>> You should try to perform a file transfer between the machine showing
>> issues and another one from the local network to verify this hypothesis.
>> If you can't achieve wire speed, it's possible you're having such a
>> problem. Then you should first move to another switch port (generally
>> easy), then swap the cable with another one (possibly swap the cables
>> between your two LBs if they're close) then try another port on the
>> machine.
>
> We are in production, and the servers are in a data center, so it won't be
> possible to swap cables.
> To check for packet loss I ran ping between LB1 and TC2 and TC1. The LB1 to
> TC1 avg time was 0.101 ms and the LB1 to TC2 avg time was 0.382 ms with
> 64-byte packets and 0% loss.
>>
>> Another possible explanation which becomes quite rare nowadays would
>> be that you'd be using a forced 100Mbps full duplex port on your switch
>> with a gigabit port on your server, which would negotiate half duplex.
>> You can check for that with "ethtool eth0" on your LBs and TCs.
>
> I checked: we are using a vSwitch for external and internal server
> communication, at 1000Mb full duplex among the v-servers on a physical
> machine. The v-servers are using 1000Mb full duplex v-adapters.
>
> Following are the current stats:
> TC1 Retr: 0, Redis: 0 Status OPEN 5h 52m UP, LastChk L7OK/302 in 321 ms,
> Server Chk: 4, Dwn 1, Dwntime 4m 17s.
> TC2 Retr: 1326, Redis: 1 Status OPEN 4h 1m UP, LastChk L7OK/302 in 87 ms,
> Server Chk: 90, Dwn 2, Dwntime 26s.
> Backend 5d 6m UP
>
> Anyway, what does LastChk signify?
>
> Regards,
> Amit
> ----- Original Message ----- From: "Willy Tarreau" <w#1wt.eu>
> To: "Amit Nigam" <amitnigam#gobindas.in>
> Cc: "Guillaume Bourque" <guillaume.bourque#gmail.com>;
> <haproxy#formilux.org>
> Sent: Monday, December 27, 2010 11:25 AM
> Subject: Re: node frequently goes down on another physical machine
>
>
>> Hi Amit,
>>
>> On Fri, Dec 24, 2010 at 12:24:55PM +0530, Amit Nigam wrote:
>> (...)
>> I see nothing wrong in your configs which could justify your issues.
>>
>>> Now in new stats page I noticed one thing which was not in 1.3.22 is
>>> LastChk, but I wonder tc1 is showing L7OK/302 in 324ms and tc2 is
>>> showing L7OK/302 in 104ms while currently haproxy is running on LB1
>>> and there are 13 retries at TC2.
>>
>> The only explanation I can see is a network connection issue. What you
>> describe looks like packet loss over the wire. It's possible that one
>> of your NICs is dying, or that the network cable or switch port is
>> defective.
>>
>> You should try to perform a file transfer between the machine showing
>> issues and another one from the local network to verify this hypothesis.
>> If you can't achieve wire speed, it's possible you're having such a
>> problem. Then you should first move to another switch port (generally
>> easy), then swap the cable with another one (possibly swap the cables
>> between your two LBs if they're close) then try another port on the
>> machine.
>>
>> Another possible explanation which becomes quite rare nowadays would
>> be that you'd be using a forced 100Mbps full duplex port on your switch
>> with a gigabit port on your server, which would negotiate half duplex.
>> You can check for that with "ethtool eth0" on your LBs and TCs.
>>
>>> Also can this issue be due to time differences between cluster nodes? As
>>> I have seen there is a time difference of around 2 minutes between
>>> physical machine 1 vms and physical machine 2 vms.
>>
>> While it's a bad thing to have machines running at different times, I
>> don't see why it could cause any such issue.
>>
>> Regards,
>> Willy
>>
>
>
>
>
Received on 2010/12/29 10:24
