Re: haproxy alleatory fails on config reload

From: Willy Tarreau <w#1wt.eu>
Date: Fri, 30 May 2008 06:43:34 +0200


On Fri, May 30, 2008 at 01:22:56AM +0200, Pablo Escobar wrote:
> Hi Krzysztof and Willy,
>
> I have compiled haproxy 1.3.15.1 for x86_64
>
> I found with dmesg some of this errors:
> ip_conntrack: table full, dropping packet.
>
> So I tougth I had found the problem but I was wrong. I doubled the
> ip_conntrack_max value and the dmesg error dissapeared but I still get the
> same problem. Around 5 reloads ok and then a wrong reload wich takes all my
> websites offline for EXACTLY 1 MIN.
>
> connecting to port 81 I see every backend up. connecting on port 80 I get 503
> error on every backend. exactly when the stats web arrives to "uptime
> 1:01min" I get every backend up on port 80. I am sure the 1min is not random.
> I have tried it 2 times with the same result.

1 minute might be the time needed to expire old sessions from your conntrack. You can try to reduce ip_conntrack_tcp_timeout_time_wait in /proc/sys/net/ipv4/netfilter to see if this has any effect.

Also, it is possible that your conntrack is buggy and does not initiate a new session on a SYN reusing a same source port (once your source ports wrap around).

You should *really* unload conntrack to see if it makes any difference.

> Other strange thing is that I am getting this on logs on every reload. on both
> good reload and wrong reloads

(...)
> May 30 00:47:58 127.0.0.1 haproxy[27763]: Proxy trac failed to enter pause
> mode.
> May 30 00:47:58 127.0.0.1 haproxy[27763]: Proxy trac failed to enter pause
> mode.

OK for those now I think I found why they report this error. haproxy is trying to pause all proxies, including the backends, then checks that they are correctly paused, which is stupid. I'll fix this. In my opinion, except for the unpleasant log messages, it should not cause any trouble.

> other strange thing is that in the wrong reload I got this on the log:
>
> May 30 00:47:42 127.0.0.1 haproxy[27763]: 127.0.0.1:35292
> [30/May/2008:00:47:42.500] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34106 - -
> PR-- 0/0/0/0/0 0/0 "GET /stats HTTP/1.1"
> May 30 00:47:42 127.0.0.1 haproxy[27763]: 127.0.0.1:35297
> [30/May/2008:00:47:42.990] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34084 - -
> PR-- 1/1/0/0/0 0/0 "GET /stats HTTP/1.1"
> May 30 00:47:43 127.0.0.1 haproxy[27763]: 127.0.0.1:35310
> [30/May/2008:00:47:43.507] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34053 - -
> PR-- 1/1/0/0/0 0/0 "GET /stats HTTP/1.1"
> May 30 00:47:43 127.0.0.1 haproxy[27763]: 127.0.0.1:35315
> [30/May/2008:00:47:43.622] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34055 - -
> PR-- 2/2/0/0/0 0/0 "GET /stats HTTP/1.1"
> May 30 00:47:44 127.0.0.1 haproxy[27763]: 127.0.0.1:35328
> [30/May/2008:00:47:44.109] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34056 - -
> PR-- 1/1/0/0/0 0/0 "GET /stats HTTP/1.1"
> May 30 00:47:44 127.0.0.1 haproxy[27763]: 127.0.0.1:35335
> [30/May/2008:00:47:44.555] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34056 - -
> PR-- 2/2/0/0/0 0/0 "GET /stats HTTP/1.1"
>
> 30/May/2008:00:47:44
> ---------- HERE STOPS LOGIN FOR AROUND ONE MINUTE ---------------------
> ---------- THEN CONTINUES LOGGING BUT TIMESTAMP IS 1SEC LATER -----------
Are you really sure about this ? The log messages are only 1 second later, so it like your machine paused for 1 minute. Last time I saw such a thing was on a machine regularly entering suspend mode due to excessive power management settings.

> May 30 00:47:45
>
> May 30 00:47:45 127.0.0.1 haproxy[27763]: 127.0.0.1:35338
> [30/May/2008:00:47:45.035] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34056 - -
> PR-- 0/0/0/0/0 0/0 "GET /stats HTTP/1.1"
> May 30 00:47:45 127.0.0.1 haproxy[27763]: 127.0.0.1:35355
> [30/May/2008:00:47:45.559] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34059 - -
> PR-- 2/2/0/0/0 0/0 "GET /stats HTTP/1.1"
> May 30 00:47:46 127.0.0.1 haproxy[27763]: 127.0.0.1:35371
> [30/May/2008:00:47:46.043] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34060 - -
> PR-- 3/3/0/0/0 0/0 "GET /stats HTTP/1.1"
> May 30 00:47:46 127.0.0.1 haproxy[27763]: 127.0.0.1:35373
> [30/May/2008:00:47:46.283] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34060 - -
> PR-- 2/2/0/0/0 0/0 "GET /stats HTTP/1.1"
> May 30 00:47:46 127.0.0.1 haproxy[27763]: 127.0.0.1:35376
> [30/May/2008:00:47:46.545] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34060 - -
> PR-- 1/1/0/0/0 0/0 "GET /stats HTTP/1.1"
> May 30 00:47:47 127.0.0.1 haproxy[27763]: 127.0.0.1:35389
> [30/May/2008:00:47:47.217] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34060 - -
> PR-- 5/5/0/0/0 0/0 "GET /stats HTTP/1.1"
> May 30 00:47:47 127.0.0.1 haproxy[27763]: 127.0.0.1:35406
> [30/May/2008:00:47:47.915] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34061 - -
> PR-- 2/2/0/0/0 0/0 "GET /stats HTTP/1.1"
> May 30 00:47:48 127.0.0.1 haproxy[27763]: 127.0.0.1:35415
> [30/May/2008:00:47:48.371] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34096 - -
> PR-- 3/3/0/0/0 0/0 "GET /stats HTTP/1.1"
>
> ¿¿maybe can affect that I am using syslog-ng?? I dont think so but.....

No I don't think so either.
One possibility would be the fact that conntrack is full and drops UDP logs, but I really don't think so because once the session is created, the same will be used again and again.

> The output of ethtool is this:
>
> Settings for eth0:
> Supported ports: [ TP ]
> Supported link modes: 10baseT/Half 10baseT/Full
> 100baseT/Half 100baseT/Full
> 1000baseT/Full
> Supports auto-negotiation: Yes
> Advertised link modes: 10baseT/Half 10baseT/Full
> 100baseT/Half 100baseT/Full
> 1000baseT/Full
> Advertised auto-negotiation: Yes
> Speed: 100Mb/s
> Duplex: Full
> Port: Twisted Pair
> PHYAD: 1
> Transceiver: internal
> Auto-negotiation: on
> Supports Wake-on: g
> Wake-on: d
> Link detected: yes
>
>
> Settings for eth1:
> Supported ports: [ TP ]
> Supported link modes: 10baseT/Half 10baseT/Full
> 100baseT/Half 100baseT/Full
> 1000baseT/Full
> Supports auto-negotiation: Yes
> Advertised link modes: 10baseT/Half 10baseT/Full
> 100baseT/Half 100baseT/Full
> 1000baseT/Full
> Advertised auto-negotiation: Yes
> Speed: 100Mb/s
> Duplex: Full
> Port: Twisted Pair
> PHYAD: 1
> Transceiver: internal
> Auto-negotiation: on
> Supports Wake-on: g
> Wake-on: d
> Link detected: yes
>
> Not sure if is possible what you mention " duplex mismatch between the haproxy
> server and the switch". I will try to explore it deeper tomorrow at the lab.

At least you have auto-neg enabled and full duplex reported, so you most likely have the same on the other end.

> The 503 error logs that I sent on my last mail I think was ok. I mean, the
> backend was really down and the 503 error was right. I just added the "trac"
> backend today and testing it I turned it off. The strange thing is that those
> are the only 503 errors on my log. On the 1 minute downtime I dont get any
> 503 error on my haproxy log (I can see the 503 just on my browser).
>
> I thik tomorrow I will try to compile latests haproxy version for x86 and do
> some testing....

First, disable ip_conntrack so that we know where to look for the issue. If conntrack is responsible for the problem, we need to understand why it is not clear from haproxy's logs and try to improve the logs so that tracking that down becomes easier in the future.

> many thanks again for all your help.
>
> I go to bed now. enough for one day :)

:-)

regards,
willy Received on 2008/05/30 06:43

This archive was generated by hypermail 2.2.0 : 2008/05/30 06:46 CEST