Hypermail

From: Rainer Sabelka <sabelka#iue.tuwien.ac.at>
Date: Thu, 24 Jul 2008 16:35:47 +0200

Hi Willy,

thanks for you analysis!

On Friday 11 July 2008 22:09:13 Willy Tarreau wrote:
> Since you put me the doubt, I've just checked the code. The number of bytes
> logged is the number of bytes READ on the socket. So if you see 66749 bytes
> in the request buffer, it really means that only that many bytes (including
> the headers and the request line) have reached the process.

OK.

To see some more information from apache I have added %I to its log format. This way I get the total size of the request. However, this is the number of bytes before SSL decryption, so it is just a not very accurate estimate of the deciphered request.

So, here is my last 502 error. Haproxy says: Jul 24 09:18:41 127.0.0.1 haproxy[4541]: 127.0.0.1:37099 [24/Jul/2008:09:15:56.316] myapp_http myapp_default/web8 6038/0/0/-1/164829 502 204 - - SHVN 3/0/0/0 0/0 {87.xxx.yyy.zzz|161272|58502|0|58502} "POST /AttachDocument.Asp?SID=569137&action=143&state=0&binaryattach=1&TaskID=NEW_1&TSK_ORD_OrderID=180292 HTTP/1.0" So the content length is 161272 bytes, but the request size is only 58502 byte, which is not even the half.

And this is what apache says:

87.xxx.yyy.zzz - - [24/Jul/2008:09:15:56 +0200] "POST /AttachDocument.Asp?SID=569137&action=143&state=0&binaryattach=1&TaskID=NEW_1&TSK_ORD_OrderID=180292 HTTP/1.1" 502 123 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)" 161272 - 10.3.0.41 369 162850 - Here we see the content length header is again 161272 (not surprisingly) and the total request size is 162850 byte.
This makes 1578 bytes difference, which sounds reasonable for the headers and the SSL overhead when I compare it to other (successfull) requests. So it seems apache gets the complete request.

Another interesting detail:
From the lines before and after the above line of the apache access log file I conclude that it was written at 9:22:05. Haproxy wrote its logfile line at 09:18:41. So this means it took apache more than 4 minutes to send the 502 response to the client? Something seems really broken here.

> That would mean
> several possibilities (from the client to the server) :
> - MSIE not sending you all the data

It seems MSIE is behaving ok and sends all data.

> - the retransmit mechanism between windows and the OS hosting your apache
> proxies not correctly working (eg: no retransmit after a FIN has been
> sent, which could explain why it hits close to the end).

Also seems to work.

> - problem with apache (not much likely)

hmm I hope not, but could be possible

> - problem between apache's OS and haproxy's OS => possible, given the
> recent issues with TCP retransmits not correctly working on some
> linux 2.6.

Haproxy and apache are running on the same machine. Apache connects to haproxy locally via 127.0.0.1:84.
I think problems with TCP retransmits aren't an issue with local connections, are they?

> - problem in linux sometimes not waking up haproxy for the last data
> block (not much likely)
> - problem with haproxy not always being woken up when data comes in
> (possible, but sounds strange)
> - problem with haproxy not refreshing some timeouts which would
> finally strike (possible). However from your example above, the
> request lasted more than two minutes, which is enormous for only
> 66 kB of data (it is only 480 Bps), so having a timeout problem
> with that little data sounds really weird.

> - problem with haproxy sometimes not sending the last data chunk
> to the server (possible).

> - problem with the server not reading all the data (not ACKing the
> last block for instance, which would clobber the request buffer).
>
> To rule out the two last possibilities, you could try the following :
> - report the request buffer's data length in the logs. You should
> find it in sess->req->l. You could also check that sess->req->total
> matches the request size you have already logged, as it is the
> source for this data. The request buffer should be empty if we're
> waiting for the client, or populated if we're waiting for the
> server (one of the last two possibilities).

I've added those. The value of sess->req->total always matches s-
>logs.bytes_in. And sess->req->l is always 0.

So this would mean that the problem lies between apache and haproxy?

> Also, since the requests are small, you might want to retry using
> larger buffers so that one full request fits in the buffer. For that,
> you need to rebuild with SMALL_OPTS="-DBUFSIZE=131072" for 128kB
> buffers for instance. That way, you'll know how large of the request
> apache did send.

OK. I'll try that.
> While I am thinking about it, could you check that you have the
> client timeout set to the same value as the server timeout ?

Yes its the same, my (relevant part of the) config file is below.

> Last resort will unfortunately be using tcpdump to try to capture
> a full request both on the front and back sides. I really hope
> that you won't have to tcpdump on an SSL connection :-/

Well I should try. But, since these 502 errors happen relatively seldom compared to the total number of requests I'll have to capture really large amounts of data.

But before I start with tcpdump I think I'll try another kernel first. Can you recommend a specific version?

Thanks for you help.

Best regards,
-Rainer

global

        log 127.0.0.1 local6
        chroot /var/lib/haproxy
        user haproxy
        group haproxy
        daemon

defaults
        log     global
        mode    http
        option  httplog
        option  dontlognull
        retries 3
        option redispatch
        option persist
        # maxconn 2000
        contimeout      30000
        clitimeout      600000
        srvtimeout      600000

frontend myapp_http
        bind 10.3.0.41:80
        bind 10.4.0.41:80
        bind *:84
        capture request header X-Forwarded-For len 15
        capture request header content-length len 15
        default_backend myapp_default

backend myapp_default
        stats enable
        stats auth Administrator:******
        cookie  SERVERID insert indirect nocache
        balance roundrobin
        option httpclose
        option forwardfor except 127.0.0.1
        fullconn 300
        option httpchk HEAD /loadbalancer/server_available.html
        server web4  10.4.0.4:80 cookie srv4 check weight 7 minconn 4 maxconn 
30
        server web8  10.4.0.31:80 cookie srv8 check weight 50 minconn 16 
maxconn 60
        server web10 10.4.0.33:80 cookie srv10 check weight 50 minconn 16

maxconn 60 Received on 2008/07/24 16:35

Re: IIS times out on large post requests