Re: Health check timeframe and incoming connections

From: Willy Tarreau <w#1wt.eu>
Date: Thu, 24 Dec 2009 06:48:28 +0100


On Wed, Dec 23, 2009 at 11:31:49AM -0800, Paul Hirose wrote:
> I didn't see my original post(s) come back from haproxy@ mail-list, so I'm
> trying again, hopefully consolidated a bit smaller :)

Yes, it was posted.

> I have the configuration lines:
> option httpchk
> balance roundrobin
> timeout connect 5s
> timeout client 5s
> timeout server 5s
> server LDAP1 1.1.1.1:389 check addr localhost port 9101 inter 5s downinter 120s fall 3
> server LDAP2 1.1.1.2:389 check addr localhost port 9102 inter 5s downinter 120s fall 3
>
> So haproxy checks every 5 seconds. Say LDAP1 is down, and after 3 failure
> responses, that realserver service is marked unavailable. What happens to
> incoming connections during that 15s? Half go to LDAP2 (since it's
> round-robin) and get answered OK. The others go to LDAP1 and stall?

Yes, precisely, because the "fall 3" really means "ignore any issue up to 3 consecutive times". When you're doing health checks, it's very common to fail a connection once in a while, which must generally not imply the server is dead.

> The
> early ones, I guess, will timeout because of the 5s timeout limit and then
> just get rejected or whatever and the client can try again (and hopefully
> get LDAP2?)

You need the "retries" and "option redispatch" parameters for that, but I don't see them in your config. For instance, "retries 3" means that up to 3 consecutive connection failures will be retried. With "option redispatch", the last retry is performed on a different server.
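Concretely, applied to your posted config it would look something like this (a sketch; exact placement depends on how you split your defaults and backend sections):

```
    balance roundrobin
    option httpchk
    option redispatch
    retries 3
    timeout connect 5s
    server LDAP1 1.1.1.1:389 check addr localhost port 9101 inter 5s downinter 120s fall 3
    server LDAP2 1.1.1.2:389 check addr localhost port 9102 inter 5s downinter 120s fall 3
```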

> What about ones near the end of that 15s time? If a
> connection is still in its 5s timeout server timeframe, and the server is
> marked as unavailable, will that connection be rerouted to the other
> server(s) in the pool?

Yes if you use "retries".

Normally, you should ensure that retries * timeout connect covers a complete health check cycle. You can also speed up failure detection using "fastinter". For instance, with "inter 5s fastinter 1s", it will take 5 seconds to detect the first failure, then 1s for each subsequent one, so with "fall 3" it takes 5+1+1 = 7 seconds to mark the server down.
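On your server lines that would give something like this (just an example of values, adjust to your needs):

```
    server LDAP1 1.1.1.1:389 check addr localhost port 9101 inter 5s fastinter 1s downinter 120s fall 3
```

So instead of the full 3 * 5s = 15s, the server is marked down about 7 seconds after it starts failing.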

> Second, is there a way to get customized health check scripts launched by
> haproxy itself?

No (see my previous mail about chroot).

> I realize you might not want haproxy to be held hostage by
> poorly written user scripts. Perhaps a forked child process or something?
> Right now, I use xinetd and launch a script whenever haproxy connects to
> port 9101 or 9102, and the script does an actual LDAP query and prints out
> HTTP 200 if the query returns a known good value, or HTTP 500 for any other
> result. I find myself now having to do many more such localhost xinetd
> type expansions, one for every server

Exactly. The other advantage of the inetd script is that it runs on the server, where we have the most information about its health. Some people I know use those scripts to check in the logs for the end of master-master replication of their LDAP servers. The server is not reported as available until replication is complete. You can't really check that remotely.
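Such a checker can stay quite small. A hypothetical sketch of the xinetd-launched script (the ldapsearch invocation, base DN and test entry are made-up placeholders, not Paul's actual setup):

```shell
#!/bin/sh
# xinetd runs this script with the TCP connection on stdin/stdout, so
# whatever we print becomes the response haproxy's httpchk sees.

# Turn a check's exit status into an HTTP status line.
respond() {
    if [ "$1" -eq 0 ]; then
        printf 'HTTP/1.0 200 OK\r\n\r\n'
    else
        printf 'HTTP/1.0 500 Internal Server Error\r\n\r\n'
    fi
}

# The actual LDAP query: succeed only if a known-good entry comes back.
ldapsearch -x -H ldap://localhost -b 'dc=example,dc=com' \
    '(uid=healthcheck)' uid 2>/dev/null | grep -q '^uid: healthcheck'
respond $?
```

The same skeleton works for any service: replace the query with whatever local test makes sense and keep the 200/500 convention.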

> Finally, I'm using the VRRP part of keepalived to handle failover of the
> actual load-balancer that runs haproxy. If anyone has any tips/hints on
> better integrating haproxy and keepalived, I'd greatly appreciate it. It's
> working for me right now, and I can unplug the network cable and it fails
> over to the other load-balancer (which also has haproxy running) and my
> clients go merrily on their way. But anything to smooth that out, or any
> gotchas I should watch out for.

If you use a recent version of keepalived, you have the vrrp_script checks that I implemented there, which allow you to track some local processes. I like to check for the presence of the haproxy process itself, so that if I accidentally kill it, the VRRP address migrates to the other server. I don't remember all the exact details, but the principle is the following:

vrrp_script check_haproxy {
      script "killall -0 haproxy"    # check for process presence
      interval 1                     # run every second
      fall 2                         # allow it to disappear once (eg: restart)
      rise 1
}
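If I remember right, the script also has to be referenced from the VRRP instance with track_script for it to have any effect (the instance name and contents here are just an example):

```
vrrp_instance VI_1 {
      # interface, priority, virtual_ipaddress, etc. as in your setup
      track_script {
            check_haproxy
      }
}
```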

But there are more reliable examples in the keepalived documentation with the proper syntax.

One thing we discussed with Alex (keepalived's author) was to have an external checker which would run a wide variety of tests and could inform both keepalived and haproxy about the results. But there are many issues to solve before we can achieve that, because the services need to be able to send some feedback to the checker (eg: "fastinter" changes the check speed, and we want to be able to adapt the checks depending on traffic errors). So that's not something we'll get soon at all :-)

Regards,
Willy

Received on 2009/12/24 06:48

This archive was generated by hypermail 2.2.0 : 2009/12/24 07:00 CET