Re: Backend Server UP/Down Debugging?

From: Willy Tarreau <w#1wt.eu>
Date: Sun, 30 Aug 2009 15:42:25 +0200


Hi Krzysztof,

On Sun, Aug 30, 2009 at 02:21:55PM +0200, Krzysztof Oledzki wrote:
> Please check the attached patch. This code is far from being ready for
> inclusion, however I cleaned it and ported to 1.4-dev2, so it should work
> for you.

This is a really nice work. I'm eseing a few FIXMEs with random status codes which look like debugging tags. I think they will not have any impact on the tests though.

My thought about this feature was to store two statuses :

Indeed, it's very common to see a server first timeout, then crash. Or it may also return 500 server error or things like this, which will not be available anymore afterwards. And we need to keep the last one because it's what helps people troubleshoot in production typically when one error hides another one.

However, this is something which can be implemented incrementally. So we can start with your patch and improve it later.

> With this patch you should be able to see status of last check in the
> stats page and to check in you logs, why servers are considered down, for
> example:
>
> [WARNING] 241/141556 (1971) : Backup Server p1/s1 is DOWN, reason: Layer5-7
> response error(10), code: 400, check duration: 0ms. 0 active and 4 backup
> servers left. Running on backup. 0 sessions active, 0 requeued, 0 remaining
> in queue.

This really is excellent. I'm already impatient to merge it ;-)

I have a few comments/questions below :

> @@ -474,40 +572,69 @@
> if (s->proxy->options & PR_O_HTTP_CHK) {
> /* Check if the server speaks HTTP 1.X */
> if ((len < strlen("HTTP/1.0 000\r")) ||
> - (memcmp(trash, "HTTP/1.", 7) != 0)) {
> + (memcmp(trash, "HTTP/1.", 7) != 0 ||
> + (trash[12] != ' ' && trash[12] != '\r')) ||
> + !isdigit(trash[9]) || !isdigit(trash[10]) || !isdigit(trash[11])) {

here you're adding some tests to more accurately check the HTTP status line. Is it just for cleanness or did you encounter cases where the check looked valid but was not ?

> + trash[12] = '\0';
> + s->check_code = atoi(&trash[9]);

you can use one of the str2* functions (see standard.h and standard.c) which automatically stops at the first non-digit character.

> /* check the reply : HTTP/1.X 2xx and 3xx are OK */
> - if (trash[9] == '2' || trash[9] == '3')
> + if (trash[9] == '2' || trash[9] == '3') {
> s->result |= SRV_CHK_RUNNING;
> - else if ((s->proxy->options & PR_O_DISABLE404) &&
> + set_server_check_status(s, HCHK_STATUS_L57OKD);

I think you wanted to put HCHK_STATUS_L57OK here, not OKD since we're in the 2xx/3xx state and not 404 disable. Or maybe I misunderstood the OKD status ?

> + } else if ((s->proxy->options & PR_O_DISABLE404) &&
> (s->state & SRV_RUNNING) &&
> - (memcmp(&trash[9], "404", 3) == 0)) {
> + (s->check_code == 404)) {
> /* 404 may be accepted as "stopping" only if the server was up */
> s->result |= SRV_CHK_RUNNING | SRV_CHK_DISABLE;
> + set_server_check_status(s, HCHK_STATUS_L57OKD);
> }
> - else
> + else {
> s->result |= SRV_CHK_ERROR;
> + set_server_check_status(s, HCHK_STATUS_L57RSPERR);
> + }

(...)

> else if (s->proxy->options & PR_O_SMTP_CHK) {
> + /* Check if the server speaks SMTP */
> + if ((len < strlen("000\r")) ||
> + (trash[3] != ' ' && trash[3] != '\r') ||
> + !isdigit(trash[0]) || !isdigit(trash[1]) || !isdigit(trash[2])) {
> + s->result |= SRV_CHK_ERROR;
> + set_server_check_status(s, HCHK_STATUS_L57INVRSP);
> + goto out_wakeup;
> + }
> +
> + trash[3] = '\0';
> + s->check_code = atoi(&trash[0]);

same here for the status code conversion.

> +
> /* Check for SMTP code 2xx (should be 250) */
> - if ((len >= 3) && (trash[0] == '2'))
> + if (trash[0] == '2') {
> s->result |= SRV_CHK_RUNNING;
> - else
> + set_server_check_status(s, HCHK_STATUS_L57OKD);

L57OK here too ?

Thanks,
Willy Received on 2009/08/30 15:42

This archive was generated by hypermail 2.2.0 : 2009/08/30 15:45 CEST