Re: haproxy random failures on config reload

From: Pablo Escobar <pescobar#cipf.es>
Date: Thu, 29 May 2008 18:00:18 +0200


Hi,

Many thanks for your answer Willy, really helpful as always :)

Following your suggestions, I increased the check interval and reduced the rise parameter, and I have been testing throughout this week. For three days it didn't fail, but today the 503 error suddenly came back.

I haven't seen the "check" column growing in any of the backends.

I see quite strange behaviour. If I connect directly to the haproxy instance listening on port 81, I can see that all my backends are up, but I get the 503 error for around 10-15 seconds after the reload. This happens randomly.

Also, if a backend is down when I do the reload, I get the 503 on all my backends, not just the one that is down. Is this normal behaviour?

Right now I have a reverse proxy using Apache's mod_proxy which sends all my inbound HTTP traffic to the haproxy listening on the same machine on port 81. Could this be affecting it? I am doing it this way because having an Apache vhost that processes all my HTTP traffic lets me apply mod_rewrite, mod_security and mod_cband to all of it.
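To see whether the 503 window comes from haproxy itself or from the Apache mod_proxy layer in front of it, one option is to poll both right after a reload and count the 503 answers. The following is a rough sketch, assuming curl is installed; count_503 is a hypothetical helper name, not part of haproxy or Apache.

```shell
# count_503 URL [MAX]: poll URL once per second, MAX times (default 30),
# and print how many of those polls got an HTTP 503 answer.
# Hypothetical helper; assumes curl is installed.
count_503()
{
	url=$1
	max=${2:-30}
	hits=0
	i=0
	while [ "$i" -lt "$max" ]; do
		# -s: silent, -m 2: give up after 2 s, -o /dev/null: discard body,
		# -w: print only the status code ("000" if no answer at all)
		code=$(curl -s -m 2 -o /dev/null -w '%{http_code}' "$url") || true
		if [ "$code" = "503" ]; then
			hits=$((hits + 1))
		fi
		i=$((i + 1))
		sleep 1
	done
	echo "$hits"
}
```

For example, run `count_503 http://127.0.0.1:81/ 20` immediately after `/etc/init.d/haproxy reload`, then repeat the test against the Apache front end (the URLs here are placeholders for your own listener and vhost). If the port 81 listener alone already shows the 503s, the Apache layer is not the cause.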

If you are interested I can send you my vhost config and my haproxy config.

Many thanks in advance for any help.

Pablo.

P.S. I haven't forgotten about the docs on getting snmpd working with Perl support. Sorry for the delay; these last weeks I have been overloaded with work. I promise to send them this week.

On Friday 23 May 2008 22:15:32 Willy Tarreau wrote:
> On Thu, May 22, 2008 at 02:43:01PM +0200, Pablo Escobar wrote:
> > Hi to the list,
> >
> > I am having a strange issue when I try to reload haproxy's config. Out of
> > every 5 reload attempts, 1 or 2 usually fail. Most of the time haproxy
> > keeps its current state and reloads the new config, but randomly it does
> > a restart instead of a reload, so I get a 503 error on all my websites
> > until "rise 2" is accomplished on all my backends and everything comes
> > up again.
> >
> > The strangest thing about the failure is its randomness (maybe a bug?).
> > I may reload the config 5 times without any problem, and then it suddenly
> > fails.
> >
> > I need to add new ACLs quite often, and it's a pain to get a 503 error
> > each time I need a new ACL. I think the config is OK, because once
> > "rise 2" is accomplished and the 503 error disappears, my new ACLs work
> > fine.
> >
> > I am running haproxy version 1.3.15 on debian etch x86_64.
> >
> > In my /etc/init.d/haproxy file I have this function to reload, as
> > explained in the docs:
> >
> > haproxy_reload()
> > {
> >     $HAPROXY -f "$CONFIG" -p $PIDFILE -D $EXTRAOPTS -sf $(<$PIDFILE) \
> >         || return 2
> >
> >     return 0
> > }
> >
> > I also tried this function, which comes with the Debian lenny packages
> > and which I think is the same:
> >
> > haproxy_reload()
> > {
> >     $HAPROXY -f "$CONFIG" -p $PIDFILE -D $EXTRAOPTS -sf $(cat $PIDFILE) \
> >         || return 2
> >
> >     return 0
> > }
> >
> >
> > I verified that $PIDFILE points to /var/run/haproxy.pid in both
> > /etc/init.d/haproxy and /etc/haproxy/haproxy.cfg, and that it contains
> > the right PID for the haproxy process.
> >
> > Any ideas or suggestions about how to solve this problem? It is really
> > annoying. Right now, what I am doing is adding new ACLs at night when web
> > traffic is lower. It's a dirty solution and not useful when someone needs
> > a new ACL "for yesterday".
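Independently of the health-check explanation that follows, a reload function can at least refuse to replace a running process with one whose config does not parse. This is a minimal sketch, assuming the same HAPROXY, CONFIG, PIDFILE and EXTRAOPTS variables as the Debian init script. It runs haproxy's `-c` (check the config and exit) before the `-sf` reload, and reads the pidfile with `$(cat ...)` instead of `$(<...)`, since the latter is a bash extension that a plain POSIX /bin/sh may not support. It does not by itself address the 503 window discussed in this thread.

```shell
# Sketch of a reload helper; HAPROXY, CONFIG, PIDFILE and EXTRAOPTS are
# assumed to be set as in the Debian init script.
haproxy_reload()
{
	# Do not touch the running process if the new config does not parse.
	# Alerts from the check still go to stderr.
	$HAPROXY -c -f "$CONFIG" >/dev/null || return 2

	# Without a readable pidfile there is no old process to hand over from.
	[ -r "$PIDFILE" ] || return 2

	# $(cat ...) is left unquoted on purpose so that several PIDs become
	# separate arguments to -sf; EXTRAOPTS is split the same way.
	$HAPROXY -f "$CONFIG" -p "$PIDFILE" -D $EXTRAOPTS -sf $(cat "$PIDFILE") \
		|| return 2

	return 0
}
```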
>
> I think that it's not a reload problem, but a health-check problem. When
> the new process starts, it sends a first health-check to the servers,
> and the result of *this* health check indicates the initial server state.
> This is to ensure that we don't start with dead servers and that we don't
> take too much time detecting valid servers.
>
> If your servers regularly fail several health checks, it is quite possible
> that upon a restart they all fail and you have to wait for rise*inter to
> get the service again. This should also show up in the stats with
> constantly growing values in the "check" column.
>
> What you can do is to increase the check interval in order to give them
> more time to respond. But be careful, it also means that if one still
> fails, it will take more time to be seen UP again. In this case, the
> solution is to reduce the "rise" parameter. You can also set "rise" to
> zero but use a slowstart instead. That way, as soon as a check succeeds,
> the server will be seen as UP, but will take a slowly growing load. If
> all servers are in the same situation, they will still get all the load,
> but if only a few are in this situation, they will get so little load
> compared to others that they will have the time to stabilize.
>
> Hoping this helps,
> Willy
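Willy's rule of thumb above is that a server which fails its first check after a reload needs about rise*inter before it is seen UP again. The sketch below applies that estimate to every "server" line of a config file. It is an illustration, not a haproxy feature: the function name estimate_recovery is made up, it assumes haproxy's documented defaults (inter 2000 ms, rise 2) when a keyword is missing, reports 0 for servers without "check" (they are never health-checked and so stay UP), ignores any other timing keywords, and accepts an "s" or "ms" suffix only in case your haproxy version supports time units.

```shell
# estimate_recovery FILE: for each "server" line, print its name and the
# worst-case delay in ms (rise * inter) before it is seen UP again if its
# first health check after a reload fails.
estimate_recovery()
{
	awk '
	# Convert "5000", "5000ms" or "5s" to milliseconds.
	function to_ms(v) {
		if (v ~ /ms$/) { sub(/ms$/, "", v); return v + 0 }
		if (v ~ /s$/)  { sub(/s$/, "", v);  return v * 1000 }
		return v + 0
	}
	$1 == "server" {
		inter = 2000; rise = 2; check = 0
		for (i = 3; i <= NF; i++) {
			if ($i == "check") check = 1
			if ($i == "inter" && i < NF) inter = to_ms($(i + 1))
			if ($i == "rise" && i < NF) rise = $(i + 1) + 0
		}
		# Without "check" the server is never health-checked and stays UP.
		printf "%s %d\n", $2, (check ? rise * inter : 0)
	}' "$1"
}
```

For example, a server line ending in "check inter 5000 rise 2" is reported as 10000, i.e. about 10 seconds without service if its first check after a reload fails. This also shows the trade-off Willy mentions: raising inter without lowering rise makes that window longer.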
-- 
Pablo Escobar Lopez
Head of Infrastructure & IT Support
Bioinformatics Department
Centro de Investigación Príncipe Felipe (CIPF)
Tfn: (34) 96 328 96 80 ext: 1004
http://bioinfo.cipf.es
Received on 2008/05/29 18:00

This archive was generated by hypermail 2.2.0 : 2008/05/29 18:00 CEST