Re: Solaris x86 tuning...

From: Willy Tarreau <w#1wt.eu>
Date: Thu, 20 May 2010 07:04:02 +0200


On Wed, May 19, 2010 at 06:10:11PM -0600, Matt Banks wrote:
> FWIW, I appreciate the response, but I'm not sure how event-ports, epoll vs poll and select is going to cause load times to increase for us by 500% with one client hitting haproxy vs that same one client hitting apache directly. I can see it having a negative effect with a heavy load (exclusively) but that (to me - I claim no expertise in the matter) doesn't explain the performance hit of one single client loading one web page.

Indeed, it will only make a difference under very high load, which does not seem to be your case. Also, Solaris' poll() is particularly fast because it limits the number of responses, which limits processing latencies.

I once observed an issue like yours on Solaris. Our Sun support contact denied it, but it disappeared by itself after an upgrade. The issue was that some packets were delayed in the system before being delivered to local processes. This caused apache not to be woken up for 3s when an incoming connection came in. Running snoop+truss in parallel clearly showed the delay between the SYN/SYN-ACK/ACK sequence and the accept() call.

I think you should proceed the same way: truss + snoop. Observe how system calls translate into network packets and how network packets wake up system calls. Maybe you'll simply notice that everything works very well but you get a lot of connection retries or something like this.
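
To give the truss and snoop traces something reproducible to line up against, a small client-side probe can help. The following is a minimal sketch in Python, not part of the original advice: the function name time_request is hypothetical, and it assumes the proxy or server answers a plain HTTP/1.0 GET. It splits one request into the time to complete the TCP handshake and the time until the first response byte.

```python
import socket
import time


def time_request(host, port, path="/", timeout=10.0):
    """Return (connect_seconds, first_byte_seconds) for one HTTP GET.

    time_request is a hypothetical helper for illustration only.
    """
    start = time.monotonic()
    # create_connection returns once the kernel has finished the handshake,
    # which can happen before the server process calls accept().
    sock = socket.create_connection((host, port), timeout=timeout)
    connected = time.monotonic()
    try:
        request = "GET {} HTTP/1.0\r\nHost: {}\r\n\r\n".format(path, host)
        sock.sendall(request.encode("ascii"))
        sock.recv(1)  # block until the first response byte (or EOF) arrives
        first_byte = time.monotonic()
    finally:
        sock.close()
    return connected - start, first_byte - connected
```

Calling time_request a few times against haproxy and then against apache directly gives two numbers per run to compare. A slow connect would be consistent with lost or retried SYNs, while a fast connect followed by a slow first byte would be consistent with a delay between the handshake and accept() like the one described above. The packet and system call traces are still needed to confirm either reading.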

You should also test with tcp_strong_iss set to 0 (ndd /dev/tcp). I remember it was often set to 2 (default ?) which tells the system that it should generate very strong random sequence numbers, but I don't know if that makes use of any hardware-assist random number generator. It could be the case that it lacks entropy and needs time to gather enough to create a new connection.
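
For a controlled test, it helps to read the current value first so it can be restored afterwards. Below is a minimal sketch in Python that wraps the usual `ndd -get` / `ndd -set` invocations. The helper names ndd_command and run_ndd are hypothetical, and it assumes ndd is in the PATH and that the script runs with enough privilege to set the parameter.

```python
import shutil
import subprocess


def ndd_command(parameter, value=None, device="/dev/tcp"):
    """Build the argv to read (value=None) or set an ndd parameter.

    ndd_command is a hypothetical helper for illustration only.
    """
    if value is None:
        return ["ndd", "-get", device, parameter]
    return ["ndd", "-set", device, parameter, str(value)]


def run_ndd(parameter, value=None, device="/dev/tcp"):
    """Run ndd if it is installed; return its stdout, or None if it is absent."""
    if shutil.which("ndd") is None:
        return None  # ndd not found, probably not a Solaris host
    result = subprocess.run(
        ndd_command(parameter, value, device),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

A test run would be run_ndd("tcp_strong_iss") to record the current value, run_ndd("tcp_strong_iss", 0) to change it, then the page-load test, then a final call to put the recorded value back. Values set through ndd do not survive a reboot.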

It's possible too that you have issues with some network drivers. I got system freezes with the bge driver for some time under load (though this was fixed). It could be possible that you're using a NIC with a poor driver experiencing a massive drop rate, causing TCP retransmits. Snoop will show you that.

Oh and the usual point about network negotiation: please ensure that you're connected at gigabit speed. If you're at 100 Mbps, there is a good chance of a duplex mismatch between the two ends. And it's not easy to check on all drivers, check ndd /dev/{bge,e1000g,nge}. Ah, something interesting if you're running on "nge" (NVIDIA nForce driver): it supports 1000-half duplex, which is not in the Ethernet spec! That could be funny...

Regards,
Willy
