Re: high cpu utilization

From: Willy Tarreau <w#1wt.eu>
Date: Sat, 16 Feb 2008 15:54:40 +0100


On Sat, Feb 16, 2008 at 07:41:21AM -0500, Marc Breslow wrote:
> Thanks Willy. I wanted to wait until a slower time to run strace as it
> sounded like it could interrupt or slow down our services. HAProxy is
> running at 50% CPU now with roughly 275 HTTP sessions and 100 TCP sessions.

What's the approximate session rate and data rate? If you have 5000 sessions per second or if you are forwarding 1 Gbps, that could be justified. Otherwise, even my small 500 MHz Geode only consumes that much CPU at 100 Mbps, while drawing under 3 W!

> I generated the trace file. I searched for "refused" and found things like
> 07:26:30.697634 send(372, "HEAD /staging.online HTTP/1.0\r\n\r"..., 33,
> MSG_DONTWAIT|MSG_NOSIGNAL) = -1 ECONNREFUSED (Connection refused) <0.000010>
>
> Is that an example of something that takes a lot of CPU for haproxy?

No, this looks like a health check; it consumes almost nothing. I was more worried about unchecked servers which would be down but regularly selected under high session rates (e.g. thousands of sessions per second).
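For reference, a minimal sketch of the checked-server setup being discussed (a hypothetical config fragment: the backend name, server names and addresses are made up; only the httpchk URI matches the health check seen in the strace output above):

```
backend web
    # Probe the same URI seen in the strace output above.
    option httpchk HEAD /staging.online HTTP/1.0
    # "check" makes haproxy poll each server; without it, a down server
    # keeps being selected and every client connection to it fails,
    # which is what burns CPU under high session rates.
    server web1 10.0.0.1:80 check inter 2500
    server web2 10.0.0.2:80 check inter 2500
```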

> Maybe we're not using haproxy in the most effective way.

I don't see why this would be the case. However, as I said, I'm more worried about the kernel, and my worries were amplified by the "top" output you posted which showed very high system CPU usage.

> We have a couple of
> spare web server instances in our cluster that are usually not online. The
> way that we bring them online is by creating the file that haproxy uses to
> see if it's up or down. So every 2.5s it's checking those two servers and
> finding they're down.

OK, but that's almost nothing. I don't remember the size of your machine, but on a typical 2 GHz single-proc machine, health-checks can go as high as 20000/s at full throttle, so you're about 50000 times below, which cannot be a problem.
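As a quick sanity check on that claim (a throwaway sketch; the two-servers-every-2.5-seconds figures come from the message above):

```shell
# Two spare servers, each checked every 2.5 s: well under one check/s,
# versus the ~20000 checks/s a 2 GHz machine can sustain at full throttle.
checks_per_sec=$(awk 'BEGIN { printf "%.1f", 2 / 2.5 }')
echo "$checks_per_sec checks/s"   # prints "0.8 checks/s"
```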

> We also have an entire duplicate haproxy configuration for our testing site
> which we'll add 1 or 2 servers into at any time. We add the servers in by
> touching a different file on the web server that haproxy is constantly
> polling for. 5 or 6 out of 6 of these instances are usually unavailable.

This is the correct way of doing this.

> Is that more overhead for haproxy than if the servers are always available?

Not at all. BTW, you can also run strace in statistics mode:

# strace -c -p $(pidof haproxy)

Wait 5 s, then press Ctrl-C. It will output something like this:

Process 12233 attached - interrupt to quit
Process 12233 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 77.38    0.025027          70       355         1 select
  8.75    0.002829           5       548           ioctl
  4.32    0.001397          11       133        33 read
  3.66    0.001185          13        90           write
  2.96    0.000957          10       100         3 sigreturn
  2.94    0.000950           3       276           gettimeofday
------ ----------- ----------- --------- --------- ----------------
100.00    0.032345                  1502        37 total

If you could post it here, it would help troubleshoot the problem.

> What else can I look for in the trace file?

Large times, and signs of one syscall looping on nothing, e.g. epoll_wait() returning zero and being called again immediately afterwards. That would indicate a big bug in haproxy. But since 1.2 and 1.3 are fairly different in this area and both exhibit the problem for you, I'm sceptical.
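One way to scan a capture for that pattern (a hedged sketch; the trace file name and the sample line are hypothetical, not taken from Marc's actual output):

```shell
# Count epoll_wait() calls that returned zero ready fds in a capture
# taken with something like: strace -tt -T -p $(pidof haproxy) -o haproxy.trace
# A large count of back-to-back zero returns would suggest a busy loop.
# The sample line below stands in for a real trace file.
sample='14:02:11.123456 epoll_wait(3, [], 200, 1000) = 0 <1.000123>'
echo "$sample" | grep -c 'epoll_wait([^)]*) = 0'   # prints "1"
```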

Regards,
Willy

Received on 2008/02/16 15:54

This archive was generated by hypermail 2.2.0 : 2008/02/16 16:00 CET