Re: [PATCH 00/26] Restart of haproxy without dropping connections

From: Willy Tarreau <w#1wt.eu>
Date: Thu, 10 Mar 2011 08:46:20 +0100


On Thu, Mar 10, 2011 at 03:52:58PM +0900, Simon Horman wrote:
> > Wow, I'm impressed, you managed to do that really quickly !
>
> Thanks. It's taken quite a bit of effort to get this far.

I have no doubt about that :-)

> > In the meantime, I have a few questions which come to mind :
> > - does the socket cache consider all of the "bind" parameters ?
> > (eg: mss, interface, transparent, ...)
>
> Yes, it attempts to take them into account, though in a very naive way.
> If they all match then the socket cache entry can be re-used.
> Otherwise the entry is invalidated.

OK.
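
To make the matching rule above concrete, here is a minimal sketch of a cache that re-uses an entry only when every bind parameter matches and invalidates it otherwise. It is an illustration only, not the code from the patch series: the names BindParams and SocketCache, the chosen fields and the integer file descriptors are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class BindParams:
    """Hypothetical subset of the parameters a "bind" line may carry."""
    address: str
    port: int
    mss: Optional[int] = None
    interface: Optional[str] = None
    transparent: bool = False


class SocketCache:
    """Hypothetical cache of listening sockets, keyed by address and port."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, int], Tuple[BindParams, int]] = {}

    def store(self, params: BindParams, fd: int) -> None:
        self._entries[(params.address, params.port)] = (params, fd)

    def lookup(self, params: BindParams) -> Optional[int]:
        """Return the cached fd if all parameters match, else invalidate."""
        key = (params.address, params.port)
        entry = self._entries.get(key)
        if entry is None:
            return None
        cached, fd = entry
        if cached == params:      # dataclass equality compares every field
            return fd
        # Naive policy: any mismatch drops the entry. A real implementation
        # would also have to close the socket here.
        del self._entries[key]
        return None
```

With this policy, changing only the mss on an existing address:port forces a fresh bind rather than silently re-using a socket set up with the old value.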

> > - what happens if the new config file uses some conflicting bind
> > entries ? Eg: old config used to listen on 192.168.1.1:80 and
> > the new one uses 0.0.0.0:80 ? Or even :80 for the old one (IPv4)
> > and :::80 for the new one (IPv6) ?
>
> I think there will be a problem. Let's talk about what would be a sensible
> thing to do.
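
For illustration, the two conflicts mentioned above (a specific address versus the wildcard, and IPv4 versus a dual-stack IPv6 wildcard) could be detected before any rebind is attempted. The sketch below is an assumption-laden example, not part of the patch set: the function name, the tuple representation and the dual_stack flag are hypothetical, and whether an IPv6 wildcard really covers IPv4 depends on the platform and on the IPV6_V6ONLY socket option.

```python
import ipaddress
from typing import Tuple

Bind = Tuple[str, int]  # (address, port), a hypothetical representation


def binds_may_conflict(old: Bind, new: Bind, dual_stack: bool = True) -> bool:
    """Return True if holding `old` is likely to make binding `new` fail."""
    (old_addr, old_port), (new_addr, new_port) = old, new
    if old_port != new_port:
        return False
    a = ipaddress.ip_address(old_addr)
    b = ipaddress.ip_address(new_addr)
    if a.version == b.version:
        # Same family: identical addresses, or a wildcard on either side.
        return a == b or a.is_unspecified or b.is_unspecified
    if not dual_stack:
        # Assumes IPV6_V6ONLY is set, so the two families are independent.
        return False
    # Different families: a dual-stack IPv6 wildcard also covers IPv4.
    v6 = a if a.version == 6 else b
    return v6.is_unspecified
```

For example, binds_may_conflict(("192.168.1.1", 80), ("0.0.0.0", 80)) reports a conflict, which is the first case raised above.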

In my opinion we should define how to fail a restart. Right now with -sf/-st, the new process is able to detect the failure to bind and exits with an error, which can be detected by management scripts and reported to the admin, who must immediately take action.

I think that a failure to restart smoothly means that the admin will have to take the decision to restart harder (-sf/-st). Sometimes the scripts will do that by themselves as an automatic fallback. So that makes me believe that all we want is for the master process to refuse to change anything in case an anomaly is detected, and to indicate its refusal. We cannot make it send a signal to the calling process because it does not know its pid, and even if it did, it would probably not have enough permissions to send it anyway. But we should find something the master process can act on that the calling one can detect (eg: among the horrible ideas, if it's not chrooted, it can change its pid and update the pidfile, but that's ugly).
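
The automatic fallback described above could look roughly like the following management-script sketch. It assumes that a refused smooth restart can be observed as a non-zero exit status of some command, which is precisely the mechanism still to be defined in this thread. The function name, both command lists and the injectable run callable are hypothetical.

```python
import subprocess
from typing import Callable, Optional, Sequence


def restart_with_fallback(
    smooth_cmd: Sequence[str],
    hard_cmd: Sequence[str],
    run: Optional[Callable[[Sequence[str]], int]] = None,
) -> str:
    """Try a smooth restart first, then fall back to a harder one.

    Returns "smooth", "hard" or "failed" so the caller can report to the admin.
    """
    if run is None:
        run = subprocess.call   # returns the command's exit status
    # Assumption: the smooth restart reports a refusal via a non-zero status.
    if run(smooth_cmd) == 0:
        return "smooth"
    # Fallback: start a new process that takes over from the old ones,
    # for instance a command line using -sf or -st with the old pids.
    if run(hard_cmd) == 0:
        return "hard"
    return "failed"
```

Passing the runner as a parameter keeps the decision logic testable without starting any process; a real script would leave run unset and supply actual command lines.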

> > - does the master send a signal to all children asking them to unbind
> > (as we did with -sf) ?
>
> Yes. It sends them a SIGUSR1.
>
> It does not send them a SIGTTOU. Because although that works fine
> I am not entirely sure how to sanely unwind the master at that point.

I think it's OK that way because the master still owns the sockets, so even in case of a late failure, it can restart new processes anyway.

> > - do the debug modes (-d/-db) disable the master_worker mode ?
>
> No. I can make that so if you like.

Yes that's the idea. Many users abuse -d/-db (including me) and it's important that they don't have to touch their config for this.

> > In fact I'm interested in any corner cases we should be aware of so
> > that we can clearly document them and indicate how to handle them
> > (eg: fall back to -st if it's not possible to rebind, etc...)
>
> I don't know of any off hand. But one of my reasons for posting
> the code was to get more eyes on it so we can find such cases.

Fine. Any failure to restart with the expected config is typically what I'd call a corner case. That's why I'd like us to find a way to fail safely so that proper action can be taken externally.

> Re-initialising the configuration was somewhat non-trivial, and I am
> sure that I have missed a few things.

I'm certain that some parts were a real nightmare ! I'll take time to review all your patches because yes, it's possible that you fell into some traps or had to make some decisions that might have other impacts. That's also why I'd like to get your code merged quickly, it's the best way to find what might have been overlooked.

Best regards,
Willy
