Re: [PATCH 1/2] [MEDIUM] Collect & show information about last health check

From: Willy Tarreau <w#1wt.eu>
Date: Sat, 5 Sep 2009 19:00:05 +0200


Hi Krzysztof,

On Mon, Aug 31, 2009 at 09:27:40PM +0200, Krzysztof Piotr Oledzki wrote:
> >From cbfda2a0808cd5d5cbf17ec7f0d04f7091bec9cb Mon Sep 17 00:00:00 2001
> From: Krzysztof Piotr Oledzki <ole#ans.pl>
> Date: Mon, 31 Aug 2009 21:04:02 +0200
> Subject: [MEDIUM] Collect & show information about last health check
>
> Collect information about last health check result,
> including L5-7 code if possible (for example http or smtp
> return code) and time took to finish last check.
>
> Health check info is provided on both stats pages (html & csv)
> and logged when a server is marekd UP or DOWN. Currently active
> check are marked with an asterisk, but only in html mode.

I have tested this a bit, and I must say this is awesome. The idea of marking a currently active check is really smart too !

However, I found that it was hard to understand the status codes in the HTML stats page. Some people are already complaining about columns they don't trivially understand, but here I think that status codes are close to cryptic. Also, while it is not *that* hard to tell which one means what when you can compare all of them, they must be unambiguous when found individually.

I could propose some adjustments, but it's just a proposal, I'm open to other suggestions :

> Currently there are 9 status:
> UNK -> unknown

From my tests, this means this one is not tested. Maybe we should not indicate anything at all then, or just report "NONE" ? Also we would not want the "in 0 ms" indicator either for this case.

> SOCKERR -> socket error

Good idea to detect that one.

> L14OK -> check passed on layer 1-4, no upper layers testing enabled
> L14TMOUT -> layer 1-4 timeout
> L14UNR -> layer 1-4 unreachable, for example
> "Connection refused" (tcp rst) or "No route to host" (icmp)

I'd suggest "L4" instead of "L14". And also after all, layer 4 is what we're aware of, and people are used to associate it to TCP (eventhough TCP is rather layer 5 if we're purist). I'd suggest shortening the timeout label, something like L4TIM, L4TOUT, L4TMO, ... I think the former is more intuitive.

Would you agree to rename L14UNR to L4CON, to indicate a failure to connect ? In fact, "unreachable" is a bit too precise and not necessarily right. It's common to speak about unreachability in terms of routing, and I find it confusing to read UNR when a check fails on my own machine. Also, if one day we get more precise reports by exploiting errno a bit better, we will then be able to report a real unreachable error.

> L57OK -> check passed on layer 5-7
> L57TMOUT -> layer 5-7 (HTTP/SMTP/SSL) timeout
> L57INVRSP -> layer 5-7 invalid response - protocol error
> L57RSPERR -> layer 5-7 response error, for example HTTP 5xx

Same principle here, I'd suggest L7 instead of L57 for the same reasons as above. Let's shorten the TMOUT one too. We could also turn L57INVRSP into L7INV.

The response error indicates that the server was reached but that its status code was not accepted. Would you be OK with L7STS then ?

I also suggest that we reduce the column name "Last check" to "LastChk" to prevent the browser from wrapping it over two lines.

During my tests, I got a SOCKERR while the server was stopped. I think it's an asynchronous "connection refused" which triggered this error, probably here (but I may be wrong) :

@@ -370,8 +468,10 @@ static int event_srv_chk_w(int fd)

 			}
 			else if (ret == 0 || errno == EAGAIN)
 				goto out_poll;
-			else
+			else {
+				set_server_check_status(s, HCHK_STATUS_SOCKERR);
 				goto out_error;
+			}
 		}

I also found a missing comma at the end of the CSV title (you remember, we always get this wrong) :

> @@ -222,6 +224,7 @@ int print_csv_header(struct chunk *msg, int size)
> "chkfail,chkdown,lastchg,downtime,qlimit,"
> "pid,iid,sid,throttle,lbtot,tracked,type,"
> "rate,rate_lim,rate_max,"
> + "check_status,check_code,check_duration"
> "\n");

You should add one after 'check_duration', as all lines in the CSV are terminated with a comma.

So please tell me what you think about the comments above, if you want to repost an updated patch or if you prefer that I adjust yours, etc...

Thanks!
Willy Received on 2009/09/05 19:00

This archive was generated by hypermail 2.2.0 : 2009/09/05 19:15 CEST