From: Willy Tarreau <w#1wt.eu>
Date: Thu, 30 Dec 2010 16:17:55 +0100

On Wed, Dec 29, 2010 at 12:34:08PM +0530, Amit Nigam wrote:
> Hi Willy,
>
> Thanks for your support, makes me believe I would solve this riddle.
> After updating to 1.4.10, syncing the TC2 and LB1 clocks through NTP, and
> using options tcp-smart-connect and tcp-smart-accept, I have seen significant
> improvements in server downtimes, retries and redispatches. But I still
> see lots of retries, though there is only 1 redispatch at TC2.

I'm realizing that you should also check your servers' settings. It's possible that you have too short a backlog queue for the connection rate you're sending them.
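
As a rough sketch of what to look at on the server side, assuming the servers run Linux (the thread does not say so explicitly): the kernel caps the accept backlog with net.core.somaxconn and net.ipv4.tcp_max_syn_backlog, and counts overflowed listen queues in the TcpExt line of /proc/net/netstat (ListenOverflows, ListenDrops). The helper names below are made up for illustration, and the backlog the application itself passes to listen() still has to be checked in the application's own configuration.

```python
def parse_tcpext(netstat_text):
    """Return the TcpExt counters from /proc/net/netstat-style text as a dict.

    The file holds pairs of lines: a "TcpExt:" line of counter names
    followed by a "TcpExt:" line of values in the same order.
    """
    rows = [line.split() for line in netstat_text.splitlines()
            if line.startswith("TcpExt:")]
    if len(rows) < 2:
        return {}
    names, values = rows[0][1:], rows[1][1:]
    return dict(zip(names, (int(v) for v in values)))


def read_int(path):
    """Read a single integer from a /proc/sys entry."""
    with open(path) as f:
        return int(f.read().split()[0])


def report():
    """Print the backlog limits and overflow counters (Linux only, an assumption)."""
    print("somaxconn:", read_int("/proc/sys/net/core/somaxconn"))
    print("tcp_max_syn_backlog:",
          read_int("/proc/sys/net/ipv4/tcp_max_syn_backlog"))
    with open("/proc/net/netstat") as f:
        counters = parse_tcpext(f.read())
    for name in ("ListenOverflows", "ListenDrops"):
        print(name + ":", counters.get(name))
```

Calling report() twice on a server, a few minutes apart under load, shows whether ListenOverflows is growing; a counter that keeps increasing would be consistent with a backlog that is too short for the incoming connection rate.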

> We are in production, and the servers are also in a data center, so it won't
> be possible to swap cables.

OK.

> To ascertain packet loss I carried out ping between LB1 and TC2 and TC1.
> LB1 and TC1 avg time was 0.101 ms and LB1 to TC2 avg time was 0.382 ms on
> 64 byte packet with 0% loss.

OK, but how many packets was that? Was it at least a ping flood (ping -f)?
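
To illustrate why the packet count matters, here is a small sketch. It assumes losses are independent and uses a purely hypothetical loss rate of 0.1%; neither the rate nor the function names come from this thread. With a low loss rate, a short ping run will almost always report 0% loss, so only a large number of probes says anything.

```python
import math


def prob_clean_run(loss_rate, packets):
    """Probability that all `packets` probes get through when each one is
    lost independently with probability `loss_rate`."""
    return (1.0 - loss_rate) ** packets


def packets_needed(loss_rate, confidence=0.95):
    """Smallest number of probes for which at least one loss is observed
    with probability >= `confidence` (requires 0 < loss_rate < 1)."""
    if not 0.0 < loss_rate < 1.0:
        raise ValueError("loss_rate must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - loss_rate))


# Hypothetical example: 0.1% loss, 10 probes versus 10000 probes.
for n in (10, 10000):
    print(n, "probes: chance of seeing 0% loss =",
          round(prob_clean_run(0.001, n), 5))
print("probes needed for a 95% chance of seeing a loss:",
      packets_needed(0.001))
```

Under these assumptions a handful of pings is nearly certain to look clean, which is why a flood of several thousand packets is a more meaningful test.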

> >Another possible explanation which becomes quite rare nowadays would
> >be that you'd be using a forced 100Mbps full duplex port on your switch
> >with a gigabit port on your server, which would negotiate half duplex.
> >You can check for that with "ethtool eth0" on your LBs and TCs.
> I checked: we are using a vSwitch for external and internal server
> communication, 1000Mb full duplex among v-servers on a physical machine.
> The v-servers are using 1000Mb full-duplex v-adapters.

Sorry, but I'm not sure I'm following you here. I'm not familiar with this vSwitch or with v-servers/v-adapters, but the naming suggests they are just "virtual" adapters, in which case the link status would obviously not mean anything. However, it's quite common to observe packet losses in virtualized environments when the hosts are overloaded, and the sad part is that it's hard to debug since you don't have any usable counters. That creates lots of "v-problems" for the admins and "v-performance" for the visitors :-)

> Following are the current stats:
> TC1 Retr: 0, Redis: 0, Status OPEN 5h 52m UP, LastChk L7OK/302 in 321 ms,
> Server Chk: 4, Dwn 1, Dwntime 4m 17s.
> TC2 Retr: 1326, Redis: 1, Status OPEN 4h 1m UP, LastChk L7OK/302 in 87
> ms, Server Chk: 90, Dwn 2, Dwntime 26s.
> Backend 5d 6m UP
>
> Anyway, what does LastChk signify?

It is the result of the last health check that was performed before the stats page was accessed. Seeing a check take 321 ms, as on TC1, seems huge to me: either it's 321 ms of CPU time, which means your checks are extremely expensive, or it's 321 ms spent waiting for some resource, which means your server is having issues somewhere.
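
For reference, a LastChk string like the ones quoted above can be split into its parts. This is a minimal sketch, assuming the usual stats-page layout "STATUS/CODE in N ms": the status is written L7OK for a passed layer-7 check, the number after the slash is the check's result code (the HTTP status for an HTTP check), and the number after "in" is the check duration in milliseconds. The function name and the regular expression are illustrative, not part of HAProxy.

```python
import re

# STATUS, optional "/CODE", then "in <N> ms" (or "in <N>ms").
LASTCHK_RE = re.compile(
    r"^(?P<status>[A-Z0-9]+)(?:/(?P<code>-?\d+))?\s+in\s+(?P<ms>\d+)\s*ms$"
)


def parse_lastchk(text):
    """Split a LastChk string into status, result code and duration."""
    # Collapse whitespace so a value wrapped across lines still parses.
    normalized = " ".join(text.split())
    m = LASTCHK_RE.match(normalized)
    if m is None:
        raise ValueError("unrecognized LastChk value: %r" % text)
    code = m.group("code")
    return {
        "status": m.group("status"),
        "code": int(code) if code is not None else None,
        "duration_ms": int(m.group("ms")),
    }


for raw in ("L7OK/302 in 321 ms", "L7OK/302 in 87\nms"):
    print(parse_lastchk(raw))
```

Read this way, both servers answered the check with HTTP code 302, and the durations to compare are 321 ms for TC1 and 87 ms for TC2.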

Regards,
Willy

Received on 2010/12/30 16:17

Re: node frequently goes down on another physical machine