Re: BGP / GSLB and HAProxy

From: Willy Tarreau <w#1wt.eu>
Date: Wed, 19 Dec 2007 22:34:55 +0100


Hi John,

On Wed, Dec 19, 2007 at 12:04:14PM -0500, John Marrett wrote:
> Willy,
>
> I find myself in the uncomfortable position of defending DNS based GSLB,
> I share some of your concerns on the subject, and have reviewed the
> documents you linked in the past. That said, I do believe that it offers
> some benefits, in some circumstances, allow me to explain inline.

I never said it does not offer any benefits (otherwise it would not be sold in the first place). It's just that it relies on information cached in multiple places over which you have no control. Most likely you have already performed quick DNS changes for some services, and noticed a small part of the traffic still going to the old IP for several days, and generally up to a week, because some providers cache excessively for their customers.

> > Anyway, DNS is horribly bad for high availability. I recently read a very
> > good article on the subject that comforted me in this feeling I've been
> > having for a long time. Basically, the problem finds its roots in caches.
> > DNS is only good to spread multiple IP addresses which are almost never
> > updated. Use BGP to find where the IPs are located.
>
> We don't expect this solution to respond to the issue of customers
> already connected to a dead site.

But it's not just the customers connected to the site, it's about customers getting their DNS entries from a shared cache located at their provider's, or reaching your site through proxies located in their enterprise. It's very common to see HTTP proxies configured with DNS caching for 24 hours. If you have a medium-sized company of 5000 people behind such a proxy, even if only one of them goes to your site at 8:00am and leaves, all the other ones will not be able to reach it until 8:00am the next day.
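
The proxy case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: CachingProxy is a hypothetical model (not any real proxy's behaviour), it keeps every answer for a fixed 24 hours regardless of the published TTL, and the two addresses are documentation addresses standing for the dead and the live site.

```python
HOUR = 3600

class CachingProxy:
    """Hypothetical proxy that keeps every DNS answer for a fixed time,
    whatever TTL the site published."""

    def __init__(self, cache_seconds):
        self.cache_seconds = cache_seconds
        self.cached_ip = None
        self.expires_at = 0

    def resolve(self, now, authoritative_ip):
        # Ask the authoritative DNS only when nothing is cached or the
        # proxy's own timer has expired.
        if self.cached_ip is None or now >= self.expires_at:
            self.cached_ip = authoritative_ip
            self.expires_at = now + self.cache_seconds
        return self.cached_ip

SITE_A = "192.0.2.10"     # stands for the site that dies after 8:00am
SITE_B = "198.51.100.10"  # stands for the surviving site

proxy = CachingProxy(cache_seconds=24 * HOUR)
at_8am = proxy.resolve(8 * HOUR, SITE_A)     # one user visits while A is up
at_noon = proxy.resolve(12 * HOUR, SITE_B)   # DNS already answers B
next_8am = proxy.resolve(32 * HOUR, SITE_B)  # proxy timer has expired
print(at_8am, at_noon, next_8am)
```

In this model everybody behind the proxy keeps being sent to the dead address at noon, and only gets the new one once the proxy's own 24-hour timer runs out, whatever TTL you published.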

> What we can do however, is ensure that new users will connect to a
> functioning site. Modern browsers, in the event that communication is
> lost, will eventually (30 seconds?) connect to other IPs that can be
> found in the RR response. If the client restarts their browser, and with
> a reasonable DNS ttl, they should end up connecting to a functional site
> as well.

There are three concepts here :

But if you want to improve locality, then you cannot play with multiple RRs.

2) direct access from the browser to the RR records
Depending on your targeted clients, the browsers may not have access to the RR records simply because of intermediate proxies. And you don't know how long those proxies will cache the records (it will almost never depend on your TTL anyway).

3) TTL
As said in 2), you have very little control over what people do with the TTL, and there is even a risk of being blacklisted by some providers if you run with TTLs that are too low (eg: one minute), because they put a lot of stress on their DNS forwarders.

> The primary goal isn't so much to move people around based on load, but
> to cut out dead sites.

I agree with your goal here, I just don't agree with the use of DNS to reach it, precisely because of the cache problem.

> A site will almost always only be dead if it
> drops its Internet connectivity, backend failure will affect both sites
> equally, front end failure will be considered, for the sake of argument,
> as completely mitigated :)

ok :-)

> From the sound of things, for our secondary goal, DNS would be, in your
> opinion, a poor method of directing clients to the closest site.

With multiple RRs, the DNS will not direct your clients to the nearest site, it will just spread them across all advertised sites. Of course you may play with the number of entries for each site depending on where the client sends his request from, but then you will still send one part of them to the most distant site, while increasing the time to find the remaining one in case of a failure, due to the repetition of the failed site in the RR records.
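
To put rough numbers on this, here is a small Python sketch. It rests on assumptions that are not in the discussion itself: each record is a distinct IP labelled with the site hosting it, a client tries every record once in a uniformly random order, and each attempt on a dead site costs one connection timeout. Real resolvers and browsers order and retry records differently, so take it as an illustration only.

```python
from fractions import Fraction
from itertools import permutations

def share_sent_to(records, site):
    """Fraction of first connections landing on `site` when the first
    record is picked uniformly at random."""
    return Fraction(records.count(site), len(records))

def expected_failed_attempts(records, dead_site):
    """Average number of dead records a client tries before it reaches a
    live one, over all possible orderings of the records."""
    orders = list(permutations(range(len(records))))
    total = 0
    for order in orders:
        failed = 0
        for idx in order:
            if records[idx] != dead_site:
                break          # reached a live site, stop counting
            failed += 1
        total += failed
    return Fraction(total, len(orders))

plain = ["near", "far"]                     # one record per site
weighted = ["near", "near", "near", "far"]  # favour the near site

for name, records in (("plain", plain), ("weighted", weighted)):
    print(name,
          "share sent to far site:", share_sent_to(records, "far"),
          "failed attempts if near dies:",
          expected_failed_attempts(records, "near"))
```

Under these assumptions, weighting 3 to 1 still sends a quarter of the clients to the distant site, and the average number of timeouts before reaching the surviving site goes from 1/2 to 3/2.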

> > > > - Currently we aren't quite ready for it, but it would be very
> > > > interesting to take BGP information, and use it to refer customers to
> > > > the closest site, any ideas on this subject?
> > >
> > > Same. This would be the holy grail of scalability options for us.
> >
> > If you announce multiple IP addresses with your DNS, and if all those
> > addresses are available on all sites, BGP will ensure that your customers
> > will reach them on the closest site for them.
>
> You seem to be suggesting using TCP with an anycast IP range available
> at multiple sites. There are articles on the subject [1] that suggest
> that this is relatively low risk (1 in 10000 requests?), but I always
> fear the consequences for a client who has a route flap.

I agree with this and honestly it's not what I'd recommend the most. But you were speaking about the closest site, so... In fact, you must keep in mind that you will never have more than 2 out of the 3 following features :

  1. availability
  2. nearest site
  3. scalability

  • you may achieve 1) and 2) using BGP and "anycast" with large offsets in the weights, but then since you don't control where your clients connect from, you don't control how the load spreads.
  • you may achieve 1) and 3) using BGP + DNS round robin (using 1 RR per site) but then you will not direct clients to the closest site.
  • you may achieve 2) and 3) using DNS and adjusting replies depending on the origin of the request, but then due to DNS caching, you will not have a good availability.

In fact, you can get slightly closer to (1,2,3) by starting with 1 and 2, then slowly playing on the DNS RR entries to try to adjust the load between the sites. You will slightly degrade 2) but improve 3) without affecting 1) because your DNS will not be used for availability but only for load equalization.

> On the subject of our traffic, it's short http sessions, however
> customer connectivity and diagnoses up to customers premises is
> completely critical.

I certainly can understand that and have been facing the same requirement, hence the solution for 1+3 in my case. But there are only two sites and they are not very far apart, so that's a perfectly acceptable scenario.

> What's your position on the subject of TCP anycast? I fear it would lead
> to difficult to impossible to diagnose client side issues.

It would for long sessions (large uploads/downloads) if the offsets between your sites are not large enough. IMHO, it should be close to impossible to reach an IP address on one site as long as it's available on another one.

Also, if you can afford a dedicated (redundant) link between your front routers and set up a local preference for each IP, then you've won, because you can announce your IPs from wherever you want, and the final step will be performed by those routers for customers entering via the wrong site.

> > Now imagine that the site is replicated with a link between
> > the routers :
> >
> > [clients] --- [ cisco router ] --- [ alteon ] --- [ haproxy ] --- [ servers ]
> >                      |
> >                      |
> > [clients] --- [ cisco router ] --- [ alteon ] --- [ haproxy ] --- [ servers ]
>
> What kind of link between the two routers a dedicated point to point
> link, or would it be an encapsulated link of some type over Internet
> link, in which case, you are in a bad place once you drop the upstream
> connectivity.

It is necessary that this is a redundant link. Whether it's encapsulated or not does not matter much as long as it is redundant on each site. And in fact, even if it was not redundant and got broken, the setup would just turn into the TCP anycast scenario where I would get trouble with long sessions only.

> Of course, another approach would be a high speed redundant back end
> link, independent of your internet service. In such a case, you could
> use anycast, but have the IPs answered by HAProxy only at the local
> site. As long as you don't drop your the backend link the potential
> route flapping issues of anycast would be completely mitigated.

Yes that's true too. But sometimes it's harder to provide big backend links because data may circulate unciphered there, or might be considered more sensitive.

> You could also use heartbeat to bring up the remote sites IPs in the
> event that you completely lost the communication between the two sites.

If your sites are really far apart, you should avoid layer 2 as much as possible. The latencies you can encounter on long distance links may sometimes cause some IPs to appear for a short time on a site then disappear.

> Another question, why do you have the Alteons in that diagram, what
> benefit are they bringing into the equation?

Because they were there :-)
And because we found them pretty convenient haproxy->BGP protocol converters. In fact, it's generally better to have one dumb device to check for other ones than having the devices check their own health by themselves. I prefer to have 2 VRRP-based alteons checking two haproxies (many more in reality but that's not the point) and doing only basic layer 4 load balancing with a very low risk of dirty failure than to have two haproxies with VRRP on them pretend they are OK while the OS may have experienced a bunch of oopses, memory leaks or anything wrong with the VRRP daemon still beating.

But instead of an alteon, you can use a Linux-based PC running LVS and keepalived and doing nothing else. For the same reason, chances are very low for this box to experience dirty failures. The real concept is to separate traffic directors in stages determined by the complexity of what they achieve and the complexity of the checks they are able to perform to check the next stage.
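
As an illustration of how little the first stage needs to know, here is a minimal Python sketch of a dumb layer 4 check. It is not how an alteon or keepalived is implemented; tcp_check and live_next_stage are hypothetical names, and the only thing the check does is try to open a TCP connection to each device of the next stage.

```python
import socket

def tcp_check(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within
    `timeout` seconds, False on any socket error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def live_next_stage(candidates, check=tcp_check):
    """Keep only the (host, port) pairs of the next stage that pass the
    check. The director stays simple: it knows nothing about HTTP."""
    return [(host, port) for (host, port) in candidates if check(host, port)]
```

For example, live_next_stage([("192.0.2.21", 80), ("192.0.2.22", 80)]) would return the haproxy instances that accept connections (the addresses are placeholders). The more complex checks, such as HTTP requests to the servers, stay in the next stage.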

> > I don't know if my explanation was clear enough.
>
> Quite clear, at least for those familiar with the domain :)
>
> This is an extremely interesting discussion, I appreciate you taking the
> time to participate.

Yes it is interesting, but unfortunately it is hard to find people to discuss this subject. BTW, you may be interested in a short article I wrote last year as an introduction to load balancing. Maybe you will like to give it to some newbies who constantly ask you the same questions :-)

   http://1wt.eu/articles/2006_lb/

Best regards,
Willy