Re: Intel nehalem crc32 for URI hashing (and Keep-Alive)

From: Willy Tarreau <w#1wt.eu>
Date: Mon, 21 Dec 2009 22:28:29 +0100


Hi Dormando,

On Sun, Dec 20, 2009 at 05:16:57PM -0800, dormando wrote:
> Hey,
>
> Attached, and linked here:
> http://consoleninja.net/p/haproxy_intel_hash.diff
> ... is a patch to HAProxy 1.3.22 that uses intel's nehalem hardware CRC32
> instruction for the URI hashing.
>
> I did some tests here:
> http://dormando.livejournal.com/522027.html
> With this bench tool:
> http://consoleninja.net/p/intel_hash_bench.c
> ... which I hadn't even compiled with -O2 during my tests.

Nice benchmarks.

> A couple lines of code were used from the linux kernel (the __asm__
> command itself and part of the loop). I guess that's enough to count it as
> GPL'ed?
>
> Anyway. Thoughts?

It's OK since haproxy is GPLv2 too. However, I like imported code to have its source referenced, so please put a comment just before the code indicating where you took it from (kernel version and file).

> I have some mild CPU detection so this could switch from a compile time to
> a runtime switch, but it appears your preference for similar features is a
> compile time switch.

In fact, I don't think a runtime switch would be very useful in this case: since it's a platform-specific optimisation, the executable should simply have it or not, so a compile-time option is the right choice. However, if you can easily detect the CPU feature, it would be nice to disable it and fall back to the old code when it's not detected. That's better than crashing with an illegal instruction :-)

> '/' depth checking isn't being used because I want
> to avoid the slowdown for us, but it could potentially be put back. If it
> looks like something you want I can re-apply against 1.4

yes, if you want to replace a function, it must be a drop-in replacement with everything working as usual.

> (btw, you should repack your remote git tree;).

good idea, it also annoys me when I'm remote, but after pulling, I forget :-)

> In a mildly related note, I see that Keep-Alive support has been started?

yes, slowly, because there are a lot of complex things to process. Today I finished handling the various cases for the Connection header. Believe it or not, when you consider the HTTP version of the request, the request's Connection header, the frontend config, the backend config, the version of the response, and the response's Connection header, you get an immense set of possibilities, which are far from obvious. I spent several weeks on this precise problem, undoing and redoing my code several times.

> It'd make deployment an awful lot easier if Keep-Alive's were possible. In
> fact, the end user slowdown might prevent us from rolling this out at all.

you're not the first one to ask, believe me :-)

> Anything I can do to speed up that process?

not much I can see right now. However, I hope to issue 1.4-dev5 soon with those recent changes (no keepalive yet, just the changes to the Connection header and proper handling of Content-Length and Transfer-Encoding). What would be nice, of course, would be some tests and, if possible, some code review of randomly picked areas. You will necessarily find bugs or poorly covered corner cases, and while connection-close easily gets rid of those, keep-alive does not offer a second chance.

Now, back to your patch, let me make a few comments on your code below:

> diff -ur haproxy-1.3.22/include/proto/backend.h d/include/proto/backend.h
> --- haproxy-1.3.22/include/proto/backend.h 2009-10-14 11:43:22.000000000 -0700
> +++ d/include/proto/backend.h 2009-12-20 16:49:27.000000000 -0800
> @@ -29,6 +29,10 @@
>
> #include <proto/queue.h>
>
> +#ifdef CONFIG_HAP_INTEL_HASH
> +#include <strings.h>
> +#endif

Don't wrap standard header includes in #ifdef. If you need them, include them unconditionally.

> +
> int assign_server(struct session *s);
> int assign_server_address(struct session *s);
> int assign_server_and_queue(struct session *s);
> @@ -184,6 +188,68 @@
> return px->lbprm.map.srv[hash % px->lbprm.tot_weight];
> }

The code below should be enclosed in #if defined(__x86_64__) || defined(__i386__) or something similar. Otherwise it can still be built on other platforms, where it will fail to assemble because of the inline asm.

> +#ifdef CONFIG_HAP_INTEL_HASH
> +
> +#define SCALE_F sizeof(unsigned long)
> +
> +#if defined(__x86_64__)
> +#define REX_PRE "0x48, "
> +#else
> +#define REX_PRE
> +#endif

I think that most of the code of the function below can be shared with the original one. It may be more maintainable and more readable to swap only the function's hashing loop, so that we always call the same function. Also, you may want to run a test without inlined functions: this would reduce the caller's code size and may improve L1 cache efficiency.

> +static inline struct server *get_server_uh_intel(struct proxy *px, char *uri, int uri_len)
> +{
> + uint32_t crc = 0;
> + char *qmark;
> + unsigned int iquotient;
> + unsigned int iremainder;
> + unsigned char *pctmp;
> + unsigned long *ptmp = (unsigned long *)uri;
> +
> + if (px->lbprm.tot_weight == 0)
> + return NULL;
> +
> + if (px->lbprm.map.state & PR_MAP_RECALC)
> + recalc_server_map(px);
> +
> + qmark = index(uri, '?');
> + if (qmark) {
> + uri_len = qmark - uri;
> + }
> +
> + if (px->uri_len_limit)
> + uri_len = MIN(uri_len, px->uri_len_limit);
> +
> + /* Grossly inlining the hash code */
> + iquotient = uri_len / SCALE_F;
> + iremainder = uri_len % SCALE_F;
> + ptmp = (unsigned long *)uri;
> +
> + while (iquotient--) {
> + __asm__ __volatile__(
> + ".byte 0xf2, " REX_PRE "0xf, 0x38, 0xf1, 0xf1;"
> + :"=S"(crc)
> + :"0"(crc), "c"(*ptmp)
> + );
> + ptmp++;
> + }
> +
> + if (iremainder) {
> + pctmp = (unsigned char *)ptmp;
> + while (iremainder--) {
> + __asm__ __volatile__(
> + ".byte 0xf2, 0xf, 0x38, 0xf0, 0xf1"
> + :"=S"(crc)
> + :"0"(crc), "c"(*pctmp)
> + );
> + pctmp++;
> + }
> + }
> +
> + return px->lbprm.map.srv[crc % px->lbprm.tot_weight];
> +}
> +#endif
>
> #endif /* _PROTO_BACKEND_H */

(...)

Also, I've looked at your numbers. While they are nice, you noticed yourself that the difference should be barely noticeable with real traffic. We'd save at most a few microseconds per request (typically 3.2 in the 39-char test, going from 4 us down to 0.8 us). But it could really make sense once keep-alive is supported, because I have real hopes of crossing the 100000 req/s boundary, which means at most 10 us per request.

If you resubmit, please do it against 1.4-dev only (preferably the git version or a recent snapshot), as I don't intend to merge such changes in stable 1.3 but they're welcome in 1.4.

Thanks,
Willy

Received on 2009/12/21 22:28

This archive was generated by hypermail 2.2.0 : 2009/12/21 22:30 CET