Hypermail

From: Willy Tarreau <w#1wt.eu>
Date: Thu, 24 Jul 2008 21:42:54 +0200

Hi Rainer,

first, thanks for your update !

On Thu, Jul 24, 2008 at 04:35:47PM +0200, Rainer Sabelka wrote:
> To see some more information from apache I have added %I to its log format.
> This way I get the total size of the request. However, this is the number of
> bytes before SSL decryption, so it is just a not very accurate estimate of the
> deciphered request.

OK but it's useful anyway, now we know that MSIE sends all data.

> So, here is my last 502 error. Haproxy says:
> Jul 24 09:18:41 127.0.0.1 haproxy[4541]: 127.0.0.1:37099
> [24/Jul/2008:09:15:56.316] myapp_http myapp_default/web8 6038/0/0/-1/164829
> 502 204 - - SHVN 3/0/0/0 0/0 {87.xxx.yyy.zzz|161272|58502|0|58502} "POST
> /AttachDocument.Asp?SID=569137&action=143&state=0&binaryattach=1&TaskID=NEW_1&TSK_ORD_OrderID=180292
> HTTP/1.0"
>
> So the content length is 161272 bytes, but the request size is only 58502
> byte, which is not even the half.
>
> And this is what apache says:
>
> 87.xxx.yyy.zzz - - [24/Jul/2008:09:15:56 +0200] "POST
> /AttachDocument.Asp?SID=569137&action=143&state=0&binaryattach=1&TaskID=NEW_1&TSK_ORD_OrderID=180292
> HTTP/1.1" 502 123 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1;
> .NET CLR 1.1.4322; .NET CLR 2.0.50727)" 161272 - 10.3.0.41 369 162850 -
>
> Here we see the content length header is again 161272 (not surprisingly) and
> the total request size is 162850 byte.
> This makes 1578 bytes difference, which sounds reasonable for the headers and
> the SSL overhead when I compare it to other (successfull) requests.
> So it seems apache gets the complete request.
>
> Another interesting detail:
> From the lines before and after the above line of the apache access log file I
> conclude that it was written at 9:22:05.
> Haproxy wrote its logfile line at 09:18:41. So this means it took apache more
> than 4 minutes to send the 502 response to the client?
> Something seems really broken here.

Yes, that sounds really strange... It would mean that apache timed out. Probably that the 502 returned by apache is not the same as haproxy's. You may want to try adding "errorloc 502 /error502" in haproxy configuration. That way, instead of returning a 502, it will return "302 moved temporarily" with "/error502" in the location header. Apache should get the 302, but I would not be surprized that instead it still reports a 502 because the connection is frozen between both.

> > - problem with apache (not much likely)
>
> hmm I hope not, but could be possible
>
> > - problem between apache's OS and haproxy's OS => possible, given the
> > recent issues with TCP retransmits not correctly working on some
> > linux 2.6.
>
> Haproxy and apache are running on the same machine. Apache connects to haproxy
> locally via 127.0.0.1:84.
> I think problems with TCP retransmits aren't an issue with local connections,
> are they?

This could still happen because of network memory allocation for instance, though this should be rather rare (but may be as much rare as the problem you're seeing).

> > To rule out the two last possibilities, you could try the following :
> > - report the request buffer's data length in the logs. You should
> > find it in sess->req->l. You could also check that sess->req->total
> > matches the request size you have already logged, as it is the
> > source for this data. The request buffer should be empty if we're
> > waiting for the client, or populated if we're waiting for the
> > server (one of the last two possibilities).
>
> I've added those. The value of sess->req->total always matches s-
> >logs.bytes_in. And sess->req->l is always 0.
>
> So this would mean that the problem lies between apache and haproxy?

yes, exactly. That means 3 suspects now : apache, kernel and haproxy (in no particular order).

> > Also, since the requests are small, you might want to retry using
> > larger buffers so that one full request fits in the buffer. For that,
> > you need to rebuild with SMALL_OPTS="-DBUFSIZE=131072" for 128kB
> > buffers for instance. That way, you'll know how large of the request
> > apache did send.
>
> OK. I'll try that.

fine.

> > While I am thinking about it, could you check that you have the
> > client timeout set to the same value as the server timeout ?
>
> Yes its the same, my (relevant part of the) config file is below.

OK, very much a standard one.

> > Last resort will unfortunately be using tcpdump to try to capture
> > a full request both on the front and back sides. I really hope
> > that you won't have to tcpdump on an SSL connection :-/
>
> Well I should try. But, since these 502 errors happen relatively seldom
> compared to the total number of requests I'll have to capture really large
> amounts of data.

I have a few scripts to automatize this for you if you're interested. I'm using them for full-time capture in datacenters. They rotate files and monitor free disk space, then kill old files after a while. You just have to stop the script once you caught the problem again.

> But before I start with tcpdump I think I'll try another kernel first. Can you
> recommend a specific version?

Oh you're making me think about something nasty. All 2.6.24 and 2.6.25 kernels had a problem with FRTO which caused some TCP connections to be stuck after a few retransmits. This has been fixed in recent 2.6.25.X (don't precisely remember which one, but 2.6.25.10/11 are OK and .4 is not). And it seems like some distro's 2.6.24 are not fixed. I don't know if the problem pre-dates 2.6.24, but at least 2.6.22 was OK.

So if you're on 2.6.2[45], please upgrade to 2.6.25.11 (or even .12 in a few days). Otherwise, maybe tcpdump will be best in order to catch the real problem.

Regards,
Willy Received on 2008/07/24 21:42

Re: IIS times out on large post requests