Re: Strangest thing ever

From: Willy Tarreau <w#1wt.eu>
Date: Thu, 30 Oct 2008 22:24:07 +0100


Hi Marcus,

On Thu, Oct 30, 2008 at 02:54:05PM +0100, Marcus Herou wrote:
> Hi guys.
>
> I want to discuss something with you guys which have confused us for some
> time.
>
> Background:
> Aftonbladet (one of the biggest newspapers in Sweden), my coworkers at our
> internal network (and myself) and now a third party UK blogger have reported
> that everything that loads from script.tailsweep.com or media.tailsweep.com
> is sluggish or times out. This seems to be true for some users and it seems
> to be related to when they sit behind the same firewall/router and get the
> same external IP.
>
> Today the same problem appeared again and the strangest thing happened. I
> will try to explain in bullet form.
>
> * I issue GET http://script.tailsweep.com/host.txt which responds lightning
> fast 1000 times in a row. (host.txt contains the backend hostname so you
> should get a new one every new request)
> * Two of my coworkers do the same without any response at all (same internal
> network = same IP-address to HAProxy)
> * Debugging - Turning on DEBUG in HAProxy => No connect statements
> * Debugging - Turning on httplog in HAProxy => No matching request (tail
> -100f /var/log/syslog|grep host.txt)
> * tcpdump - Apparently something bad is happening here (SYN, SYN, SYN,
> SYN, SYN, SYN, ACK, TCP acked lost segment) but I am no wizard in tcp
> debugging. I have attached the dump file.
> * I can ssh into the machine exit and ssh again without probs
> * My coworkers cannot at all.
> * We move HAProxy to another machine and the same problems appears there and
> now we can ssh to the old machine (only when we stop haproxy not before
> that, even though we have zero traffic now.)
>
> Any ideas anyone ? This is killing me and I hate this so much I can
> die....buaahahaha. Have never seen something like this..

Your machine is simply overloaded with SYNs. Either it is under a SYN flood (possible), or it is not tuned to support the load you're sending to it and the SYN backlog is too small. Most distros ship with average parameters which are good for desktop usage and for SMB server usage, but never for hosting.

Check your /proc/sys/net/core/somaxconn. It limits the number of concurrent pending connections on the whole system. By default it's commonly 128, but on large systems you want it very large (say 30000). That alone will not be enough, because /proc/sys/net/ipv4/tcp_max_syn_backlog limits the same value too, but per socket (the min of both serves as the max for pending conns). It's often 1024 by default. You want it large too (e.g. 30000).
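To see where a machine currently stands, the two files named above can be read and compared. The following is only a minimal sketch: the helper names (read_int, pending_conn_limit) are made up for illustration, and the "min of both" rule is taken from the description above rather than checked against any particular kernel version.

```python
from pathlib import Path

# Root of the network sysctls; passed as a parameter so the sketch can be
# pointed at a copy of the tree instead of the live system.
PROC_NET = Path("/proc/sys/net")


def read_int(path):
    """Return the integer stored in a sysctl file, or None if it is unreadable."""
    try:
        return int(Path(path).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def pending_conn_limit(proc_net=PROC_NET):
    """Return (somaxconn, tcp_max_syn_backlog, smaller of the two).

    Assumption from the mail above: the minimum of both values acts as the
    ceiling for pending connections.
    """
    somaxconn = read_int(Path(proc_net) / "core" / "somaxconn")
    syn_backlog = read_int(Path(proc_net) / "ipv4" / "tcp_max_syn_backlog")
    known = [v for v in (somaxconn, syn_backlog) if v is not None]
    return somaxconn, syn_backlog, (min(known) if known else None)


if __name__ == "__main__":
    somaxconn, syn_backlog, limit = pending_conn_limit()
    print("somaxconn           =", somaxconn)
    print("tcp_max_syn_backlog =", syn_backlog)
    print("effective ceiling   =", limit)
```

If the ceiling printed is still in the low hundreds on a box that takes hosting-level traffic, that would be consistent with the symptoms described.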

There are several other important tunables, but these ones clearly act on the problem you describe. You have to restart any process affected by the limit when you change them. I understand that it's mainly haproxy in your situation, which should not be too hard :-)
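The reason for the restart, as far as I know, is that the backlog a program asks for is capped against somaxconn at the moment it calls listen(), so a socket that is already listening keeps its old limit. A minimal sketch of that step, with a made-up helper name (open_listener) and an illustrative backlog value:

```python
import socket


def open_listener(host="127.0.0.1", port=0, backlog=30000):
    """Open a listening TCP socket requesting the given backlog.

    The kernel silently caps `backlog` to the current somaxconn value when
    listen() is called, which is why raising somaxconn afterwards does not
    help a socket that is already listening.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))  # port 0 lets the OS pick a free port
    sock.listen(backlog)
    return sock
```

The same applies to haproxy: it only picks up the larger limit once it has been restarted and has opened its listening sockets again.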

If you're under a SYN flood attack, you may want to enable SYN cookies, which tend to provide some help. But doing so will make things worse for your firewall.
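To check whether SYN cookies are already on, the value in /proc/sys/net/ipv4/tcp_syncookies can be read and interpreted. This is a small sketch with a hypothetical helper (describe_syncookies); the meaning given for each value follows the kernel's sysctl documentation as I understand it, and older kernels may only know 0 and 1.

```python
from pathlib import Path

SYNCOOKIES_FILE = Path("/proc/sys/net/ipv4/tcp_syncookies")

# Meaning of each setting, per the kernel ip-sysctl documentation.
SYNCOOKIES_MODES = {
    0: "disabled",
    1: "sent when the SYN backlog of a socket overflows",
    2: "sent unconditionally",
}


def describe_syncookies(raw):
    """Translate the raw file content into a human-readable description."""
    try:
        value = int(raw.strip())
    except ValueError:
        return "unrecognised value"
    return SYNCOOKIES_MODES.get(value, "unrecognised value")


if __name__ == "__main__":
    try:
        print("SYN cookies:", describe_syncookies(SYNCOOKIES_FILE.read_text()))
    except OSError:
        print("SYN cookies: setting not available on this system")
```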

BTW, I'm wondering. Where did you take the tcpdump capture ? Before or after the firewall ? Your TTL makes me think it was after (which is OK). But if it was before, it might as well be the firewall which cannot create any more connections.

Last, everything I said above assumes that you don't have ip_conntrack or iptables loaded on your server. Otherwise, you need to tune them first (hash_size, conntrack_max, etc...) in order to ensure that it's not ip_conntrack which is saturated. However, when it is, it reports it in the logs (dmesg|grep -i full).
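The dmesg check can also be scripted if you want to run it regularly. The sketch below is a slightly narrower variant of the grep above: it keeps only lines that mention both "conntrack" and "full", case-insensitively. The function name (conntrack_full_lines) is invented for the example, and the exact wording of the kernel message is not assumed.

```python
import subprocess


def conntrack_full_lines(lines):
    """Return the log lines that mention both 'conntrack' and 'full'."""
    hits = []
    for line in lines:
        lowered = line.lower()
        if "conntrack" in lowered and "full" in lowered:
            hits.append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    try:
        # dmesg may need root on some systems; failures just yield no output.
        result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
        output = result.stdout
    except FileNotFoundError:
        output = ""
    for hit in conntrack_full_lines(output.splitlines()):
        print(hit)
```

No output means the kernel log holds no sign of a saturated conntrack table, at least in the part of the ring buffer that is still available.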

Regards,
Willy