Re: HAProxy stuck on TCP Retransmission

From: Willy Tarreau <w#1wt.eu>
Date: Sat, 3 Dec 2011 22:50:02 +0100


Hi Simon,

On Wed, Nov 30, 2011 at 03:10:23AM +0000, Simon Schmid wrote:
> I have a HAProxy + NodeJS + Rails Setup, I use the NodeJS Server for file
> upload purposes.
>
> The problem I'm facing is that if I'm uploading files through haproxy to
> nodejs and a "TCP (Fast) Retransmission" occurs the TX rate on the client
> drops to zero for about 5-10 secs and gets flooded with TCP Retransmissions.
>
> This does not occur if I upload to NodeJS directly (TCP Retransmission
> happens too but it doesn't get stuck with dozens of retransmission attempts).
>
> My test setup is a simple HTML4 FORM (method POST) with a single
> file input field.
>
> The NodeJS Server only reads the incoming data and does nothing else.
>
> I've tested the upload from multiple machines, networks, browsers,
> always the same issue.
>
> I must admit that we have another server (in another datacenter) with
> the same setup where HAProxy is running fine... The only difference may be
> the system, the non-working system has Kernel 2.6.32-33-generic and
> the working one has 2.6.18-028.

I think you have spotted a kernel bug, which should be reported to your distro vendor. Haproxy knows nothing about TCP: it just uses sockets, and TCP itself is handled entirely by the kernel. Do not wait too long to report the issue, because from experience such bugs tend to vanish suddenly for days, weeks or months, and you would then have trouble reproducing it or testing a kernel patch.

> Here's a TCP Traffic Dump from the client while uploading a file:
>
> .....
> TCP 1506 [TCP segment of a reassembled PDU]
> --> everything is uploading fine until:
> TCP 1506 [TCP Fast Retransmission] [TCP segment of a reassembled PDU]
> TCP 66 [TCP Dup ACK 7392#1] 63265 > http [ACK] Seq=4844161 Ack=1
> Win=524280 Len=0 TSval=657047088 TSecr=79373730
> TCP 1506 [TCP Retransmission] [TCP segment of a reassembled PDU]
> --> the last message is repeated about 50 times for >>5-10 secs<<
> (TX drops to 0 on client, RX drops to 0 on server)
> TCP 1506 [TCP segment of a reassembled PDU]
> --> upload continues until the next TCP Fast Retransmission and the same
> thing happens again

This looks like a tshark capture, as it lacks all the important information (timestamps, sequence numbers, packet IDs, TTL, flags, ...). Could you please redo it with "tcpdump -s0 -Svvnpi eth0 -w bug.cap" and provide the resulting file?

You could also try disabling TCP checksum offload on your NIC with "ethtool -K". If this fixes the issue, then you may have either a buggy NIC or a buggy driver (or both).
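
As a rough sketch of that test, assuming the interface is eth0 (the same name as in the tcpdump command) and that your driver exposes the "rx" and "tx" checksum settings, something along these lines should do. The "run" helper is only an illustration: it prints the commands unless RUN=1 is set, because changing offload settings requires root and affects live traffic.

```shell
#!/bin/sh
# Sketch only: eth0 is an assumption, override with IFACE=<name>.
IFACE="${IFACE:-eth0}"

# Hypothetical helper: print the command unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Show the current offload settings (lowercase -k only reads them).
run ethtool -k "$IFACE"

# Disable RX and TX checksum offload (uppercase -K changes settings).
run ethtool -K "$IFACE" rx off tx off
```

Settings changed this way are not persistent across reboots, and "rx on tx on" restores them once the test is done.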

Regards,
Willy