Re: TCP Splicing with HAProxy 1.4.8

From: Willy Tarreau <w#1wt.eu>
Date: Thu, 7 Oct 2010 23:33:40 +0200


Hi,

I won't repeat what David said; he's completely right. I'll respond to the point below:

> Are there specific traffic profiles which benefit most from the use of
> splicing? As I mentioned earlier I'm not seeing any performance improvement
> while using it - in fact I've seen degradation when transferring large files,
> which suggests to me that I'm possibly not using it appropriately.

It depends a lot on network hardware and drivers. At 10 Gbps, you have almost no choice, and in fact it works extremely well. On gigabit NICs, I've seen mixed results. Sometimes a splice() call returns only a single TCP segment, which results in many more calls than recv() would need and thus in lower performance. This has improved a lot since kernels around 2.6.27, because that was when we started experimenting with splice() and the developers were very responsive in fixing some issues. I still hit this case recently with a 2.6.32.x kernel on an ARM box (the crappy GuruPlug server I bought), where splice() was around 10% slower than standard recv(). With 2.6.35.x it's the opposite: splice() has become about 10% faster than recv(), so something has improved.
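To put rough numbers on the single-segment case, here is a back-of-the-envelope sketch in Python. The 1448-byte payload per segment, the 16 kB recv() buffer and the function name are assumptions chosen for illustration, not measurements, and framing overhead is ignored.

```python
import math

def calls_per_second(link_bps, bytes_per_call):
    """Receive-side calls per second needed to drain a link at full rate."""
    bytes_per_second = link_bps / 8
    return math.ceil(bytes_per_second / bytes_per_call)

GIGABIT = 1_000_000_000   # link speed in bits per second
SEGMENT_PAYLOAD = 1448    # assumed: one TCP segment returned per splice() call
RECV_BUFFER = 16384       # assumed: recv() fills a 16 kB buffer on each call

splice_calls = calls_per_second(GIGABIT, SEGMENT_PAYLOAD)
recv_calls = calls_per_second(GIGABIT, RECV_BUFFER)

print(f"splice(), one segment per call: {splice_calls} calls/s")
print(f"recv(), full buffer per call:   {recv_calls} calls/s")
```

Under these assumptions the single-segment case needs roughly eleven times as many calls on the receive side, which is the kind of overhead that can cancel out the benefit of the avoided copy.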

For splice() to be efficient, you need a NIC that supports LRO or at least a kernel that supports GRO. I recall an exchange I had with one guy at Zeus when we were debugging splice(). He was observing significantly lower performance on a quad-port Intel gigabit card with splice() than with recv(). After the fixes he got very similar results with both, which means that splice() did not help at all with this card and his version of the driver.

You should play a bit with ethtool. First, check that your NIC supports TSO and checksum offloads, and that GRO is enabled (ethtool -k). Second, see with "ethtool -c" if you can reduce the interrupt rate or increase the RX delay so that the NIC has a chance to merge multiple packets in the receive path. Obviously you'll need a decent NIC anyway. Don't expect anything from a Realtek or nForce ;-)
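The first check can be scripted. Below is a minimal sketch in Python that parses "ethtool -k" style output and lists which of the wanted offloads are not enabled. The sample text, the helper names and the WANTED list are assumptions for illustration; feature names have varied between ethtool versions, so adjust them to what your version prints.

```python
def parse_offloads(ethtool_output):
    """Map each 'feature: on|off' line of `ethtool -k` output to True/False."""
    features = {}
    for line in ethtool_output.splitlines():
        name, sep, state = line.partition(":")
        if not sep:
            continue
        words = state.split()
        if not words or words[0] not in ("on", "off"):
            continue  # skips the "Features for eth0:" header line
        features[name.strip()] = words[0] == "on"
    return features

# Assumed feature names; check them against your ethtool version.
WANTED = (
    "rx-checksumming",
    "tx-checksumming",
    "tcp-segmentation-offload",
    "generic-receive-offload",
)

def missing_offloads(ethtool_output, wanted=WANTED):
    """Return the wanted features that are absent or reported as off."""
    features = parse_offloads(ethtool_output)
    return [name for name in wanted if not features.get(name, False)]

# Illustrative sample, not captured from a real machine.
SAMPLE = """\
Features for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off [fixed]
"""

print(missing_offloads(SAMPLE))
```

On a real machine the input would come from running "ethtool -k" on the interface in question (the name eth0 here is only a placeholder).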

From my tests, Myricom's Myri10GE NICs benefit a lot from splice(). That's the NIC I use on the 10 Gbps tests at only 25% CPU. Another guy I know has got very good results with Intel's 10 GE NICs too. I've seen very low CPU figures at up to 5 Gbps of production traffic. I remember having noticed a small improvement on the old PCI-based TG3 NIC on my old notebook (Pentium-M at 1.7 GHz). I haven't done enough tests on e1000 since we got splice().

All in all, at gigabit speeds on decent hardware, the improvements should be minimal, as we're only talking about avoiding memory copies at 125 MB/s. Still, on small CPU-bound or FSB-bound hardware, it can be a nice improvement.
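For reference, the 125 MB/s figure is just the line rate divided by eight. A tiny Python sketch (the function name is made up for illustration) shows how the copy rate scales between gigabit and 10 Gbps:

```python
def copy_rate_mb_per_s(link_gbps):
    """Payload rate in MB/s (decimal megabytes) for a link of the given speed."""
    return link_gbps * 1000 / 8

for gbps in (1, 10):
    print(f"{gbps} Gbps -> {copy_rate_mb_per_s(gbps):.0f} MB/s")
```

Each byte that goes through recv() and send() is copied into user space and back, so the amount of copying that splice() can avoid grows tenfold between the two link speeds.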

Regards,
Willy