Hypermail

From: Simon Horman <horms#verge.net.au>
Date: Thu, 10 Mar 2011 17:40:19 +0900

On Thu, Mar 10, 2011 at 08:46:20AM +0100, Willy Tarreau wrote:
> On Thu, Mar 10, 2011 at 03:52:58PM +0900, Simon Horman wrote:
> > > Wow, I'm impressed, you managed to do that really quickly !
> >
> > Thanks. Its taken quite a bit of effort to get this far.
>
> I have no doubt about that :-)
>
> > > In the mean time, I have a few questions which come to mind :
> > > - does the socket cache consider all of the "bind" parameters ?
> > > (eg: mss, interface, transparent, ...)
> >
> > Yes, it attempts to take them into account, though in a very naive way.
> > If they all match then the socket cache entry can be re-used.
> > Otherwise the entry is invalidated.
>
> OK.
>
> > > - what happens if the new config file uses some conflicting bind
> > > entries ? Eg: old config used to listen on 192.168.1.1:80 and
> > > the new one uses 0.0.0.0:80 ? Or even :80 for the old one (IPv4)
> > > and :::80 for the new one (IPv6) ?
> >
> > I think there will be a problem. Lets talk about what would be a sensible
> > thing to do.
>
> In my opinion we should define how to fail a restart. Right now with
> -sf/-st, the new process is able to detect the failure to bind and
> exits with an error which can be detected by management scripts and
> reported to the admin which must immediately take action.

I think that -sr currently behaves that way. I'll check.

> I think that a failure to restart smoothly means that the admin will
> have to take the decision to restart harder (-sf/-st). Sometimes the
> scripts will do that by themselves as an automatic fallback. So that
> make me believe that all we want is the master process to refuse to
> change anything in case an anomaly is detected, and to indicate its
> refusal.

Right. The main thing that I was worrying about is that the master process probably won't know its that by the time the master process detects an anomaly its probably already changes things. So it will either need to know how to roll back, or detect problems before making changes to itself.

That said, there are probably only a few changes to the master that actually matter. For example, if it has loaded a configuration file with a bogus proxy port in it, then it doesn't need to roll that back, it just needs to tell the old workers to keep going - they still have the old config.

So I think that the main problem that I have is knowing what the master would need to rollback. Loggers spring to mind. I'm sure there are a few other things.

Note, chroot, daemon and master_worker currently can't be changed on restart, so we don't need to worry about them.

> We cannot make it send a signal to the calling process
> because it does not know its pid, and even if it knew it it would
> probably not have enough permissions to send it anyway. But we should
> find something the master process can act on that the calling one can
> detect (eg: among the horrible ideas, if it's not chrooted, it can
> change its pid and update the pidfile, but that's ugly).

The master process already has a facility to rewrite the pid file even if it is chrooted (it keeps the fd open) so that could work. The pidfile currently does change on restart - all worker's pids are updated, polling the pid file for something new and valid or an error status could work.

As an aside, having the worker pids in the pid file isn't strictly necessary. I only added them for consistency with the existing usage of the pidfile. But the master could actually just write the master pid and close the fd.

> > > - does the master send a signal to all children asking them to unbind
> > > (as we did with -sf) ?
> >
> > Yes. It sends them a SIGUSR1.
> >
> > It does not send them a SIGTTOU. Because although that works fine
> > I am not entirely sure how to sanely unwind the master at that point.
>
> I think it's OK that way because the master still owns the sockets, so
> even in case of a late failure, it can restart new processes anyway.
>
> > > - do the debug modes (-d/-db) disable the master_worker mode ?
> >
> > No. I can make that so if you like.
>
> Yes that's the idea. Many users abuse -d/-db (including me) and it's
> important that they don't have to touch their config for this.

In the case of -db, master_worker mode doesn't imply background (daemon) mode, so there shouldn't be any changes there by adding my current patches. Do you want me to disable it anyway?

> > > In fact I'm interested in any corner cases we should be aware of so
> > > that we can clearly document them and indicate how to handle them
> > > (eg: fall back to -st if it's not possible to rebind, etc...)
> >
> > I don't know of any off hand. But one of my reasons for posting
> > the code was to get more eyes on it so we can find such cases.
>
> Fine. Any failure to restart with the expected config typically is what
> I'd call a corner case. That's why I'd like that we find a way to safely
> fail so that proper action can be taken externally.
>
> > Re-initialising the configuration was somewhat non-trivial, and I am
> > sure that I have missed a few things.
>
> I'm certain that some parts were a real nightmare ! I'll take time to
> review all your patches because yes, it's possible that you fell in some
> traps or had to take some decisions that might have other impacts. That's
> also why I'd like to get your code quickly merged, it's the best way to
> find what might have been overlooked.
>
> Best regards,
> Willy
>
Received on 2011/03/10 09:40

Re: [PATCH 00/26] Restart of haproxy without dropping connections