Re: 1.5 badly dies after a few seconds

From: Willy Tarreau <w#1wt.eu>
Date: Sat, 18 Sep 2010 09:41:50 +0200


Hi Joe,

On Thu, Sep 16, 2010 at 04:49:00PM +0200, R.Nagy József wrote:
> Some more details, let the production server suffer 2 more times to
> test a narrowed down config.
> The new config only worked as a rate limiter 1.5.dev haproxy instance,
> and had a running 1.3 instance in the background doing the real
> backend game.

I really appreciate your involvement in trying to get this issue solved.

> So for the 1.5 rate limiter -still dieing- config was narrowed down to:
>
> global
> log 127.0.0.1 daemon debug
> maxconn 1024
> chroot /var/chroot/haproxy2
> uid 99
> gid 99
> daemon
> quiet
> pidfile /var/run/haproxy-private2.pid

One thing that could be very useful would be to add the stats socket here in the global section :

	stats socket /tmp/haproxy.sock level admin mode 666
	stats timeout 1d

Then using the "socat" tool, you can connect to it and launch some commands to inspect the internal state :

 $ socat readline unix-connect:/tmp/haproxy.sock
 prompt

 > show info
 > show stat
 > show sess
 > show table
 > show table mySite-webfarm

I'm particularly interested in those outputs, as they will make it easier to find out whether we're facing a memory corruption, a resource shortage or some other trouble. If it's easier for you, you can also chain all the commands at once and avoid long copy-pastes :

 $ echo "show info;show stat;show sess;show table;show table mySite-webfarm" | socat stdio unix-connect:/tmp/haproxy.sock > haproxy-debug.log
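If one file per command is easier to read than a single log, the same commands can be sent one at a time. This is only a rough sketch: the function names `cmd_to_filename` and `dump_state` and the output directory are made up for illustration, and it assumes the socket path from the example above.

```shell
#!/bin/sh
# Sketch: save each stats-socket command's output to its own file.
# Function names and the output directory are illustrative assumptions.

# Turn a command such as "show table mySite-webfarm" into a file name
# by replacing spaces with underscores.
cmd_to_filename() {
    printf '%s\n' "$1" | tr ' ' '_'
}

# dump_state SOCKET OUTDIR
# Sends each command through socat and stores the reply in OUTDIR.
dump_state() {
    sock=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    for cmd in "show info" "show stat" "show sess" "show table" \
               "show table mySite-webfarm"; do
        file="$outdir/$(cmd_to_filename "$cmd").log"
        echo "$cmd" | socat stdio "unix-connect:$sock" > "$file"
    done
}
```

It would be called as "dump_state /tmp/haproxy.sock ./haproxy-debug", which should leave files such as show_info.log and show_sess.log in that directory.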

I'm just thinking about something else : there are basically two things that change with the OS :

  1. polling system

You may try to disable kqueue by adding "nokqueue" to the global section. I don't think it's the issue, because kqueue has not changed between 1.4 and 1.5 and there are some happy users of 1.4 on FreeBSD/OpenBSD.

  2. struct sizes

The pool allocator merges structs of similar sizes into the same pools. In the past, it has already happened that an uninitialized member that was always zero caused no trouble on most platforms but caused crashes on other ones, because there it contained data left over from another use. You can check pool sizes by starting haproxy in debug mode then issuing a kill -QUIT on it :

  terminal1$ haproxy -db -f $file.cfg
  terminal2$ killall -QUIT haproxy

Haproxy will then dump all of its pool statistics to stderr. In fact, you don't need to do that in production; you can do it on a test machine, because the output only depends on the binary itself and not on the environment.

> And yeah, died with the same socks error message as yesterday.
> (Server was hit by 30-40reqs/sec during this time, it died after ~30mins)

I noticed that the same error message can be found in two places. Could you please adapt them both so that they also dump the FD value ?

In src/session.c around line 215, please replace :

  Alert("accept(): cannot set the socket in non blocking mode. Giving up\n");

with :

  perror("fcntl");
  Alert("session_accept(): cannot set the socket %d in non blocking mode. Giving up\n", cfd);

And in src/frontend.c, around line 89, replace the same line with :

  perror("setsockopt");
  Alert("frontend_accept(): cannot set the socket %d in non blocking mode. Giving up\n", cfd);

I'm almost sure it's frontend_accept() that returns the error, and I'm interested in knowing the reported file descriptor, which is probably bogus, as well as the errno code.

Thanks a lot for what you can do !
Willy