Re: nbproc>1, ksoftirqd and response time - just can't get it

From: Dmitriy Samsonov <dmitriy.samsonov#gmail.com>
Date: Mon, 18 Jul 2011 22:01:47 +0400


Hi!

2011/7/18 Willy Tarreau <w#1wt.eu>

> Hi,
>
> On Mon, Jul 18, 2011 at 06:42:54AM +0400, Dmitriy Samsonov wrote:
> > My test setup is three Dell r410 servers (dual Intel(R) Xeon(R) CPU X5650 @
> > 2.67GHz - 24 threads total, 128Gb RAM) all connected to 1Gbps network.
> >
> > One server is haproxy, configured to block all requests with
> > 'Accept-Encoding: none':
> >
> > global
> > daemon
> > maxconn 80000
> > option forwardfor
> > retries 10
>
> Something is missing here above, probably "defaults" and maybe a few other
> options before "option forwardfor", such as "mode http" and even a few
> timeouts.
>

Yes, the defaults section is the following:

defaults
        mode http
        maxconn 79500
        timeout client 20s
        timeout server 15s
        timeout queue  60s
        timeout connect 4s
        timeout http-request 5s
        timeout http-keep-alive 250
        #option httpclose
        option abortonclose
        balance roundrobin
        option forwardfor
        retries 10

Also, the conntrack module was loaded - a friend of mine was playing with iptables and did not remove it. Now there is no iptables at all:

dex9 ipv4 # sysctl -a | grep conntrack | wc -l
0
dex9 ipv4 # lsmod | grep xt_ | wc -l
0
dex9 ipv4 # lsmod | grep nf_ | wc -l
0
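
For reference, if conntrack ever ends up loaded again, my understanding is that it would at least need the table and hash enlarged and the timeouts shortened; something along these lines (values are purely illustrative, not applied here since the module is unloaded):

# illustrative conntrack tuning, only relevant if nf_conntrack is loaded
sysctl -w net.netfilter.nf_conntrack_max=1048576
echo 262144 > /sys/module/nf_conntrack/parameters/hashsize
# let established entries expire sooner than the multi-hour default
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=600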

I followed your recommendation and set affinity for the processes:

dex9 ipv4 # schedtool 16638 # haproxy's process
PID 16638: PRIO 0, POLICY N: SCHED_NORMAL, NICE 0, AFFINITY 0x2

dex9 ipv4 # schedtool 3 # ksoftirqd/0
PID 3: PRIO 0, POLICY N: SCHED_NORMAL, NICE 0, AFFINITY 0x1

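For completeness, the interrupt side of that binding would look something like the following (the IRQ number is only an example; it has to be looked up in /proc/interrupts for eth0 first):

# find the IRQ(s) used by eth0
grep eth0 /proc/interrupts
# pin them to core 0 (cpumask 0x1), next to ksoftirqd/0, while haproxy stays on core 1 (0x2)
echo 1 > /proc/irq/45/smp_affinity    # 45 is just an example IRQ number
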
Now in top it looks like this:
  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
16638 root     20   0  100m  36m  520 R   93  0.0 19:33.18  haproxy
    3 root     20   0     0    0    0 R   38  0.0  7:11.76  ksoftirqd/0

93% for haproxy and 38% for ksoftirqd/0.

I was lucky enough to reach a 66903 session rate (Max), and the average Cur value is around 40-42k.

Typical output from one of the two running ab2 instances is:

Server Software:

Server Hostname:        nohost
Server Port:            80

Document Path:          /
Document Length:        0 bytes

Concurrency Level:      1000
Time taken for tests:   470.484 seconds
Complete requests:      10000000
Failed requests:        0
Write errors:           0
Total transferred:      0 bytes
HTML transferred:       0 bytes
Requests per second:    21254.72 [#/sec] (mean)
Time per request:       47.048 [ms] (mean)
Time per request:       0.047 [ms] (mean, across all concurrent requests)
Transfer rate:          0.00 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0   34 275.9     11   21086
Processing:     0   13  17.8     11     784
Waiting:        0    0   0.0      0       0
Total:          2   47 276.9     22   21305

Percentage of the requests served within a certain time (ms)
  50%     22
  66%     26
  75%     28
  80%     30
  90%     37
  95%     41
  98%     47
  99%    266
 100%  21305 (longest request)

Typical output of vmstat is:
dex9 ipv4 # vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r b swpd free buff cache si so bi bo in cs us sy id wa
 1 0 0 131771328 46260 64016 0 0 2 1 865 503 1 6 94 0
 1 0 0 131770688 46260 64024 0 0 0 0 40496 6323 1 9 90 0
 1 0 0 131771920 46260 64024 0 0 0 0 47046 7212 1 8 91 0
 2 0 0 131771704 46260 64024 0 0 0 0 40864 6143 1 8 91 0
 1 0 0 131771688 46268 64024 0 0 0 12 36547 5268 1 8 91 0
 2 0 0 131771696 46268 64032 0 0 0 0 53189 8979 0 5 95 0
 3 0 0 131771584 46268 64024 0 0 0 0 31633 4025 0 4 96 0
 2 0 0 131771840 46268 64020 0 0 0 0 49723 9290 1 9 91 0
 2 0 0 131772448 46268 64028 0 0 0 0 44484 7008 1 8 91 0
 2 0 0 131771688 46276 64032 0 0 0 20 40132 4531 1 8 92 0
 2 0 0 131771456 46276 64028 0 0 0 0 36006 4445 0 8 91 0
 2 0 0 131772208 46276 64024 0 0 0 0 41325 5902 1 8 91 0
 2 0 0 131771208 46276 64032 0 0 0 0 44262 7427 1 5 94 0
 1 0 0 131771456 46276 64028 0 0 0 0 42403 5422 1 8 91 0
 2 0 0 131771944 46284 64020 0 0 0 12 46907 7419 1 7 93 0
 1 0 0 131772960 46284 64020 0 0 0 0 42772 6663 1 8 91 0
 2 0 0 131772832 46284 64024 0 0 0 0 45298 6695 1 8 91 0
 2 0 0 131772712 46284 64028 0 0 0 0 44604 5361 0 6 94 0
 1 0 0 131772464 46284 64028 0 0 0 0 39798 5105 0 4 96 0

Also, I've checked the version of the NIC's firmware:

dex9 ipv4 # ethtool -i eth0
driver: bnx2
version: 2.0.21
firmware-version: 6.2.12 bc 5.2.3
bus-info: 0000:01:00.0
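
Since you mention packet queuing and interrupt rate depending on NIC tuning, I suppose the ring sizes and interrupt coalescing are worth checking too; something like this (values are illustrative, I have not applied them yet):

ethtool -g eth0                      # show current RX/TX ring sizes
ethtool -G eth0 rx 2048 tx 2048      # enlarge rings (illustrative values)
ethtool -c eth0                      # show current coalescing settings
ethtool -C eth0 rx-usecs 100         # coalesce interrupts (illustrative value)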

Moreover, I've tried launching two ab2 instances locally:

dex9 ipv4 # ab2 -c 1000 -H 'Accept-Encoding: None' -n 10000000 http://localweb/
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking inbet.cc (be patient)
Completed 1000000 requests
Completed 2000000 requests
^C

Server Software:

Server Hostname:        localweb
Server Port:            80

Document Path:          /
Document Length:        0 bytes

Concurrency Level:      1000
Time taken for tests:   104.583 seconds
Complete requests:      2141673
Failed requests:        0
Write errors:           0
Total transferred:      0 bytes
HTML transferred:       0 bytes
Requests per second:    20478.13 [#/sec] (mean)
Time per request:       48.833 [ms] (mean)
Time per request:       0.049 [ms] (mean, across all concurrent requests)
Transfer rate:          0.00 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0   38 352.7      6   21073
Processing:     0   10  49.7      7   14919
Waiting:        0    0   0.0      0       0
Total:          1   48 365.8     13   21078

Percentage of the requests served within a certain time (ms)
  50%     13
  66%     19
  75%     26
  80%     35
  90%     36
  95%     37
  98%     39
  99%     67
 100%  21078 (longest request)

Two such ab2 processes are running, both at 100% CPU, and they saturate haproxy at 100% as well. The 'Cur' session rate is also around 40-44k/s.

Should I get rid of the Dell r410 and replace it with a Core i5? :)) Being serious, are there any other tips or tricks I can try to reach that amazing 100k/s session rate?
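
In the meantime I will try the keep-alive test you describe below; if I read it correctly, the frontend would become something like this (just my untested reading of your suggestion):

frontend public
    bind 192.168.0.1:80
    option http-server-close
    acl accepts_none hdr(Accept-Encoding) -i none
    redirect location / code 302 if accepts_none
    default_backend nginx

and the clients would run with keep-alive enabled:

ab2 -k -c 1000 -H 'Accept-Encoding: None' -n 10000000 http://192.168.0.1/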

> > frontend public
> > bind 192.168.0.1:80
> > default_backend nginx
> > acl accepts_none hdr(Accept-Encoding) -i none
> > errorfile 403 /raid/emptypage.txt
> > block if accepts_none
> >
> > backend nginx
> > server srv 127.0.0.1:80 maxconn 8192
> >
> > File /raid/emptypage.txt is an empty file made with 'touch
> > /raid/emptypage.txt'.
> >
> > I'm doing ab2 -c 1000 -H 'Accept-Encoding: None' -n 1000000
> > http://192.168.0.1/ on two other servers and get following:
> >
> > When nbproc = 1 haproxy saturates 100% of the cpu core it runs at, but the
> > server is running nice; I'm able to get a reply from nginx behind it by
> > using curl on my machine (curl http://192.168.0.1/). ab reports 16833
> > requests/second each, and the longest request is around 14 seconds.
>
> 16k req/s is quite low for such a server, especially since no request
> should
> be forwarded. You should check how the user/system CPU usage is spread, I
> suspect there could be an important part spent in the system. Such a test
> should better report numbers in the 100k range.
>
> Also, 14 seconds to respond is simply unacceptable. Something is not
> working at all. 14 milliseconds would even be high for this type of
> processing but could be explained by packet queuing and interrupt
> rate depending on the NIC tuning.
>
> > When I change nbproc to higher values (maximum is 24 as there are 24 threads
> > total) I can see the ksoftirqd/0 process saturating a cpu core, the network
> > becomes slow on the server, ab reports the same 16k-17k requests/second for
> > each client, but the longest request is always around 20-30 seconds.
>
> At such loads, you should not see ksoftirq trigger. 16kreq/s, assuming
> there is no keep-alive will basically result in 7 packets per request,
> (4 in, 3 out) so you're running around 112 kpps, or 64k IN and 48k out.
>
> Could you check with "vmstat 1" your interrupt rate ? Wouldn't your NIC
> be a BNX2 with an old firmware such as 1.9.6 which is known for causing
> a high number of packet losses ? (check with ethtool -i eth0).
>
> > I've seen such things, with ksoftirqd/0 running at 100% and the network
> > almost down, during DDoS attacks when there are too many iptables rules,
> > but what is happening now?
>
> Indeed, if you have many iptables rules, then your machine will be very
> sensitive to SYN floods because for each SYN it will be necessary to find
> the matching rule. Do you have ip_conntrack / nf_conntrack loaded ? I
> assume that it's the case. Did you tune it ? Default settings are fine
> for a desktop usage but not for a server. It is possible that all the
> time is wasted in the conntrack hash table lookup.
>
> > And what number to use at nbproc? Is it ok to have haproxy
> > running at 100%?
>
> It can be under extreme situations (eg: ddos) but it should not be the
> norm, otherwise your scaling abilities are seriously limited. Nbproc
> should normally be 1. With proper tuning this will give you much more
> raw performance than what you can get in multi-process and poor tuning.
> To give you an idea, at Exceliance, we can get 300kconn/s out of a single
> core of a dual-core i5 3 GHz. You're at 16k.
>
> What is very important too is to bind network interrupts and processes
> to different cores sharing the same L2 cache. If you force all your NIC
> interrupts to core 0 and bind haproxy to core 1, chances are that they
> will be on the same cache (check in /proc/cpuinfo). This is how I get
> the most performance, with IRQs taking between 50 and 80% of one core,
> and haproxy 100% of another core.
>
> > It looks like I can have 30k requests per second in my setup, is there any
> > way to make it higher? I've done some basic tuning like tcp_max_tw_buckets =
> > 1024*1024, tcp_tw_reuse = 1, tcp_max_syn_backlog = 30000. Am I running out
> > of options?
>
> There are a number of other options, such as net.core.somaxconn which
> also limits the backlog (128 by default!). But in your case, the apparent
> lack of NIC/process binding can be one cause, and if iptables is there,
> it surely is another one.
>
> It is easy to verify if iptables is likely involved : for this you'll
> enable keep-alive. Add this to your haproxy config :
>
> option http-server-close
>
> And replace :
>
> block if accepts_none
>
> with :
>
> redirect location / code 302 if accepts_none
>
>
> Then use "ab -k" instead of "ab" to run your tests. It will try to maintain
> connections open as long as possible. The only difference is that it will
> see status 302 instead of 403. But that's not a problem for a test.
>
> If you see that the numbers increase by a factor of approximately two, it
> will probably mean that you're bound by the packet count and/or by the CPU.
> If your numbers increase a lot more (eg: 5-10 times) then it will be very
> likely due to the fact that you don't stress conntrack anymore and don't
> spend your CPU cycles doing connection lookups.
>
> I really encourage you to try this as it's a simple test. Hint: on my
> 6 y-o laptop, I'm already getting your numbers above, and testing with
> keep-alive doubles the performance.
>
> Regards,
> Willy
>
>
Received on 2011/07/18 20:01
