High Performance Linux




Sunday, November 23, 2014

Fast Finite State Machine for HTTP Parsing

There is a special type of DDoS attack, application-level DDoS, which is quite hard to combat. The analyzing logic which filters this type of DDoS attack must operate at the HTTP message level. So in most cases the logic is implemented as custom modules for application-layer (nowadays usually user-space) HTTP accelerators. And surely Nginx is the most widespread platform for such solutions. However, common HTTP servers and reverse proxies were not designed for DDoS mitigation: they are simply the wrong tools for this task. One of the reasons is that they are too slow to cope with massive traffic (see my recent paper and presentation for other reasons).

If logging is switched off and all content is in the cache, then the HTTP parser becomes the hottest spot. A simplified perf output for Nginx under a simple DoS is shown below (Nginx's functions begin with the 'ngx' prefix; memcpy and recv are standard GLIBC calls):


      %            symbol name
    1.5719    ngx_http_parse_header_line
    1.0303    ngx_vslprintf
    0.6401    memcpy
    0.5807    recv
    0.5156    ngx_linux_sendfile_chain
    0.4990    ngx_http_limit_req_handler


The next hot spots are linked to complicated application logic (ngx_vslprintf) and I/O.

During Tempesta FW development we have studied several HTTP servers and proxies (Nginx, Apache Traffic Server, Cherokee, node.js, Varnish and userver) and learned that all of them use switch- and/or if-else-driven state machines.

The problem with this approach is that the HTTP parsing code is comparable in size with the L1i cache and processes one character at a time with a significant number of branches. Modern compilers optimize large switch statements into lookup tables, which minimizes the number of conditional jumps, but branch mispredictions and instruction cache misses still hurt the performance of the state machine. So the method probably has poor performance.
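
For illustration, a switch-driven parser loop looks roughly like this (a minimal sketch with made-up states, not code from any of the servers above); every input character costs a switch dispatch plus at least one conditional branch:

    enum state { S_NAME, S_VALUE, S_DONE };

    /* A sketch of a switch-driven FSM: one dispatch per input character;
     * the state survives between chunks of data. */
    static enum state
    parse_chunk(enum state st, const char *p, const char *end)
    {
        for ( ; p < end && st != S_DONE; ++p) {
            switch (st) {
            case S_NAME:
                if (*p == ':')
                    st = S_VALUE;
                break;
            case S_VALUE:
                if (*p == '\n')
                    st = S_DONE;
                break;
            default:
                break;
            }
        }
        return st;
    }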

The other well-known approach is a table-driven automaton. However, a simple HTTP parser can have more than 200 states and an alphabet cardinality of 72. That gives 200 x 72 = 14400 bytes for the table, which is about half of the L1d cache of modern microprocessors. So this approach can also be considered inefficient due to its high memory consumption.
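
A minimal sketch of the table-driven approach with sizes matching the estimate above (the table contents are omitted and all names are made up):

    #define NR_STATES 200
    #define ALPHABET  72

    /* char_class[] maps a byte to one of 72 character classes and
     * trans[][] holds the next state: 200 x 72 = 14400 bytes, about
     * half of a typical 32KB L1d cache. Contents are omitted here. */
    static const unsigned char char_class[256];
    static const unsigned char trans[NR_STATES][ALPHABET];

    static int
    run_tbl(int st, const unsigned char *p, const unsigned char *end)
    {
        while (p < end)
            st = trans[st][char_class[*p++]];
        return st;
    }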

The first obvious alternative to such state machines is the Hybrid State Machine (HSM) described in our paper, which combines a very small table with an also small switch statement. In our case we encode the outgoing transitions of a state with at most 4 ranges. If a state has more outgoing transitions, then all the transitions beyond those 4 must be encoded in the switch. All actions (like storing HTTP header names and values) must also be performed in the switch. Using this technique we can encode each state in only 16 bytes, i.e. one cache line can hold 4 states. Given this, the approach should significantly improve the data cache hit rate.

We also know that Ragel generates perfect automatons and combines case labels in a switch statement with direct goto labels (it seems the switch is used to be able to enter the FSM from any state, i.e. to be able to process chunked data). Such automatons need fewer loop iterations and are a bit faster than the traditional one-loop-iteration-per-transition approach. There was a successful attempt to generate simple HTTP parsers using Ragel, but those parsers are limited in functionality.

However, there are also several research papers which say that an automaton's states are just auxiliary information and that an automaton can be significantly accelerated if explicit state information is dropped.
So the second interesting opportunity to generate the fastest HTTP parser is just to encode the automaton directly using simple goto statements, even without any explicit loop.
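
A minimal sketch of the idea (hypothetical labels, not the real Tempesta FW parser): each state becomes a label and each transition a direct goto, so there is no state variable and no explicit loop (re-entry for chunked data is left out for brevity):

    /* Goto-driven automaton sketch: the "state" is simply the current
     * instruction pointer. */
    static int
    parse_header_sketch(const char *p, const char *end)
    {
    name:
        if (p == end)
            return 0;
        if (*p++ != ':')
            goto name;
    value:
        if (p == end)
            return 0;
        if (*p++ != '\n')
            goto value;
        return 1; /* got a complete header line */
    }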

Basically an HTTP parser just matches a string against a set of characters (e.g. [A-Za-z_-] for header names), which is what strspn(3) does. SSE 4.2 provides the PCMPxSTRx instruction family for this purpose (GLIBC since 2.16 uses an SSE 4.2 implementation for strspn()). However, these are vector instructions which don't support accept or reject sets of more than 16 characters, so they are not very usable for HTTP parsers.
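
For illustration, header-name matching boils down to something like this hypothetical helper; note that the accept set (shortened here for brevity) is already larger than the 16 characters PCMPxSTRx can handle at once:

    #include <string.h>

    /* A hypothetical helper: scanning a header name is essentially
     * strspn() against the accept set. */
    static size_t
    match_header_name(const char *buf)
    {
        static const char accept[] =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz"
            "0123456789-_";

        return strspn(buf, accept);
    }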

Results
I made a simple benchmark for the four approaches described above (http_ngx.c - the Nginx HTTP parsing routines, http_table.c - the table-driven FSM, http_hsm.c - the hybrid state machine and http_goto.c - the simple goto-driven FSM). And here are the results (routines with 'opt' or 'lw' are optimized or lightweight versions of the functions):

Haswell (i7-4650U)

    Nginx HTTP parser:
        ngx_request_line:       730ms
        ngx_header_line:        422ms
        ngx_lw_header_line:     428ms
        ngx_big_header_line:    1725ms

    HTTP Hybrid State Machine:
        hsm_header_line:        553ms

    Table-driven Automaton (DPI)
        tbl_header_line:        473ms
        tbl_big_header_line:    840ms

    Goto-driven Automaton:
        goto_request_line:      470ms
        goto_opt_request_line:  458ms
        goto_header_line:       237ms
        goto_big_header_line:   589ms

Core (Xeon E5335)

    Nginx HTTP parser:
        ngx_request_line:       909ms
        ngx_header_line:        583ms
        ngx_lw_header_line:     661ms
        ngx_big_header_line:    1938ms

    HTTP Hybrid State Machine:
        hsm_header_line:        433ms

    Table-driven Automaton (DPI)
        tbl_header_line:        562ms
        tbl_big_header_line:    1570ms

    Goto-driven Automaton:
        goto_request_line:      747ms
        goto_opt_request_line:  736ms
        goto_header_line:       375ms
        goto_big_header_line:   975ms

The goto-driven automaton shows the best performance in all the tests on both architectures. It is also much easier to implement in comparison with HSM. So in Tempesta FW we migrated from HSM to a goto-driven automaton, but with some additional optimizations.

Lessons Learned
Haswell has very good BPU
On the Core micro-architecture HSM behaves much better than the switch-driven and table-driven automatons. This is not the case for Haswell, where the approach loses to both of them. I've tried many optimization techniques to improve HSM performance, but the results above are the best and they are still worse than the simple FSM approaches.
The profiler shows that the problem (the hot spot) in HSM on Haswell is in the following code:
 
    if (likely((unsigned char)(c - RNG_CB(s, 0)) <= RNG_SUB(s, 0))) {
        st = RNG_ST(s, 0);
        continue;
    }

Here we extract the transition information and compare the current character with the range. In most cases only this one branch is taken in the test; the 3rd and 4th branches are never observed. The whole automaton was encoded in only 2 cache lines.

In the first test case, when the XTrans.x structure is dereferenced to get access to the ranges, the compiler generates 3 pointer dereferences. In fact these instructions (part of the disassembled branch)
 
    sub    0x4010c4(%rax),%bl
    cmp    0x4010c5(%rax),%bl
    movzbl 0x4010cc(%rax),%eax

produce 3 accesses to L1d, while the cache has very limited bandwidth per cycle (64 bytes for reading and 32 bytes for writing) with a minimal latency of 4 cycles on Haswell, even though only one cache line is accessed by all the instructions. So the bottleneck of this test case is L1d bandwidth.

If we instead use the XTrans.l longs (we need only l[0], which can be loaded with a single L1d access, in all the cases) and use bitwise operations to extract the data, then we get a lower number of L1d accesses (4G vs 6.7G for the previous case), but branch mispredictions increase.
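
A rough sketch of that second variant (the field layout here is an assumption, not the real XTrans encoding):

    /* The whole transition descriptor is loaded with a single 64-bit read
     * and the range fields are extracted with shifts and masks instead of
     * three separate byte loads. */
    static inline int
    next_state_packed(unsigned long tr, unsigned char c, int curr_st)
    {
        unsigned char rng_cb  = tr & 0xff;         /* range base  */
        unsigned char rng_sub = (tr >> 8) & 0xff;  /* range size  */
        unsigned char rng_st  = (tr >> 16) & 0xff; /* next state  */

        if ((unsigned char)(c - rng_cb) <= rng_sub)
            return rng_st;
        return curr_st; /* fall back to the remaining branches */
    }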

The problem is that the more complex statements in the conditions make it harder for the Branch Prediction Unit to predict branches.

However, we can see that simple branches (in the switch-driven and goto-driven automatons) show perfect performance on Haswell. So the advanced Haswell BPU handles simple automatons perfectly, making the complex HSM inadequate.

In fact HSM is the only test which is slower on Haswell in comparison with the Core Xeon. Probably this is the difference between server and mobile chips: even an old server processor beats a modern mobile CPU on complex loads...

-O3 is ambiguous
Sometimes -O3 (GCC 4.8.2) generates slower code than -O2. Also, benchmarks compiled with -O3 show very strange and unexpected results. For example, below is the result for -O2:

    goto_request_line: 470ms

However, -O3 shows worse results:

    goto_request_line: 852ms



Automata must be encoded statically whenever possible
The table-driven and HSM automatons are encoded using static constant tables (in contrast to the run-time generated tables of the current DPI parser). This was done during the HSM optimizations. Sometimes the compiler can't optimize code which uses run-time generated tables. And this is crucial for real hot spots (for HSM the table is used in the if-statement described above, which takes about 50-70% of the whole function's execution time): after moving to static data the code can get up to a 50% performance improvement (as in the HSM case).
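
As a simple illustration (the names and sizes are made up, not the real DPI tables), the difference is just this:

    #define NR_STATES  16
    #define NR_CLASSES 16

    /* Statically encoded automaton: the table is a compile-time constant
     * in .rodata, so its address can be folded directly into the hot
     * if-statement. */
    static const unsigned char trans_static[NR_STATES][NR_CLASSES] = {
        { 0 } /* transitions omitted */
    };

    /* Run-time generated variant: the same data, but allocated and filled
     * at startup, so every access goes through a pointer the compiler
     * can't prove constant. */
    static unsigned char (*trans_dynamic)[NR_CLASSES];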

Wednesday, January 15, 2014

NatSys Lock-free Queue vs LMAX Disruptor vs boost::lockfree::queue

I've been asked many times how our lock-free queue (you can read more about it here) differs from LMAX Disruptor. The last time was in a discussion at Hacker News (it seems one of the Disruptor authors was asking). In LinkedIn discussions I was also asked about boost::lockfree::queue (it appeared in Boost recently). So I'm writing this post to share the answers.

LMAX Disruptor

Its source code is available at https://github.com/redjack/varon-t.

The algorithms are simply different and solve different problems: Disruptor is a messaging queue (if a producer P0 emits message M0 into the queue, then all consumers C0..CN receive M0), while our queue is a classical work queue (if producers P0 and P1 emit messages M0 and M1, then only one consumer Ci receives M0 and only one consumer Cj receives M1, where i may be equal to j).


Our implementation competes with boost::lockfree::queue and it's much faster, since the Boost implementation uses heavier synchronization techniques. The benchmark on GitHub also includes the Boost implementation, so you can compare both queues.

Unfortunately, there is no adequate algorithm description for the Disruptor queue, only some indistinct descriptions mostly suitable for business people rather than engineers. So it was not easy to dig into its source code. However, I studied it and here are some notes about the implementation.

The implementation is a bit inaccurate: there are a lot of branches without branch prediction information available at compile time, and to avoid cache line bouncing it wastes two cache lines instead of simply aligning an item on a cache line (I mean vrt_padded_int). I didn't pay too much attention to memory barrier usage, but given that x86-64 provides relatively strict memory ordering, probably some of them could also be eliminated. Disruptor uses very clever ideas, but I believe its performance can be improved after a good code review.

One more point is that while our queue implementation is C++, it's still self-sufficient and can be easily ported to C for use in kernel space. It's doubtful (in my humble opinion) that a generic container should depend on the non-standard libcork and, moreover, on a logging library (clogger).

One more thing to note about our queue and Disruptor: Disruptor uses naive yielding logic. The problem with this logic is that if a thread has had no job for a long time, it falls asleep for a period that grows from yield call to yield call; thus, if the queue has no job for a long time but then suddenly gets a big spike, the thread typically spends some time waking up before it starts to work. We solved the issue with a lock-free condition wait.
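
For illustration, the naive backoff criticized here has roughly the following shape (have_work() is a stand-in, not Disruptor's real API):

    #include <sched.h>
    #include <time.h>

    /* Sketch of a naive consumer-side backoff: the sleep period grows on
     * every idle iteration, so after a long idle period the thread meets
     * a sudden burst of work with its longest delay. */
    static void
    consumer_wait_naive(int (*have_work)(void))
    {
        struct timespec ts = { 0, 1000 }; /* start from 1us */

        while (!have_work()) {
            sched_yield();
            nanosleep(&ts, NULL);
            if (ts.tv_nsec < 1000000)
                ts.tv_nsec *= 2; /* keeps growing while idle */
        }
    }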


Boost::lockfree::queue

It uses at least one CAS operation (see for example do_push() in boost/lockfree/queue.hpp), which even in the best case is slower than a plain RMW operation (an atomic increment in our case). You can use the benchmark for both queues at GitHub. For my Intel Core i7-4650U it gives:

    $ g++ -Wall -std=c++0x -O2 -D DCACHE1_LINESIZE=`getconf LEVEL1_DCACHE_LINESIZE` -I/opt/boost_1_55_0/include/ lockfree_rb_q.cc -lpthread
    $ ./a.out
    13969ms
    check X data...
    Passed
    69357ms
    check X data...
    Passed
    20240ms
    check X data...
    Passed


Note that I use the latest Boost version. The first result is for our queue, the second for a naive queue implementation, and the last one is for Boost. I.e. our implementation is more than 30% faster than boost::lockfree::queue.
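
To illustrate the CAS vs plain RMW point above (a sketch, not the actual code of either queue): reserving a slot with an atomic increment never fails, while a CAS-based reservation has to retry under contention:

    #include <stdatomic.h>

    /* Plain RMW: a single atomic increment that always succeeds. */
    static unsigned long
    reserve_slot_rmw(atomic_ulong *tail)
    {
        return atomic_fetch_add(tail, 1);
    }

    /* CAS: may fail under contention and has to be retried in a loop. */
    static unsigned long
    reserve_slot_cas(atomic_ulong *tail)
    {
        unsigned long t = atomic_load(tail);

        while (!atomic_compare_exchange_weak(tail, &t, t + 1))
            ; /* t was reloaded by the failed CAS, just retry */
        return t;
    }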

Tuesday, January 7, 2014

Haswell AVX2 for Simple Integer Bitwise Operations

It's well known that the vector extension instructions of modern microprocessors are very useful for complicated SIMD (Single Instruction Multiple Data) computations, e.g. matrix multiplication, floating point calculations for graphics and video applications, etc. But what about simple bitwise operations on plain integers? Do vector extensions provide any notable benefit in comparison with loop unrolling and other common optimizations at the C level?

Let's study a simple example. We have 16 32-bit bitmaps and need to check all of them for a particular bit, i.e. just say whether the bit is set in any of the bitmaps or not. The bitmaps are placed in contiguous memory. Without loss of generality I'll consider the following set of bitmap arrays, having or not having the 11th bit set (2048 = 1 << 11):

    0    2048 4096 48   5    11   8192 56   304  16384 3    204  208  60   901  208
    0    304  2048 48   5    11   8192 56   4096 16384 3    204  208  60   901  208
    0    304  4096 48   5    2048 8192 56   11   16384 3    204  208  60   901  208
    0    304  4096 48   5    11   2048 56   1024 16384 3    204  208  60   901  208
    0    304  4096 48   5    11   8192 56   2048 16384 3    204  208  60   901  208
    0    304  4096 48   5    11   8192 56   3    16384 2048 204  208  60   901  208
    0    304  4096 48   5    11   8192 56   60   16384 3    204  208  2048 901  208
    0    304  4096 48   5    11   8192 56   208  16384 3    204  208  60   901  2048
    0    304  4096 48   2000 11   8192 5    56   16384 3    204  2040 60   901  2047
    2048 2048 2048 2048 2048 2048 2048 2048 2048 2048  2048 2048 2048 2048 2048 2048


The first eight arrays of bitmaps just have the bit set in different bitmaps, the 9th array doesn't have a bitmap with the set bit, and finally the 10th array has the bit set in all its bitmaps.

Now we can write a simple C function which checks all the bitmaps for the specified bit and returns true or false (the full example can be found on GitHub):

    inline bool
    cycle_lookup_naive(unsigned int bm, volatile unsigned int *w)
    {
        for (int i = 0; i < 16; ++i)
            if (bm & w[i])
                return true;
        return false;
    }


This function accepts the desired bitmask as the first argument and the array of bitmaps as the second argument. Since the function is inlined and reads a static constant array, I passed the second argument as a volatile memory area to prevent the compiler from optimizing out the loop in which the function is called 10M times.

I want all the optimizations available in G++ 4.7 for my Intel Core i7-4650U (Haswell), so I compile the program with -O3 -march=core-avx-i -mtune=core-avx-i -mavx2. GCC has an auto-vectorization feature which is turned on at the -O3 optimization level. Running the function 10M times shows the following results for the 10 arrays:

    19ms 18ms 24ms 31ms 35ms 42ms 48ms 54ms 53ms 10ms

And there is an obvious, but still interesting, property of the implementation: it runs faster on arrays which have the desired bitmap closer to the beginning. The second thing is that the run for the 9th array, which doesn't have a bitmap with the set bit at all, is a bit faster than for the array which has the set bit in the last bitmap. This can be explained by branch prediction: the processor caches the branch in which there is no match, so it stalls a little when it finds a match in the last bitmap. There is an interesting discussion of branch prediction on StackOverflow: Why is processing a sorted array faster than an unsorted array?

The first optimization of the function we can do is simply to iterate and match the bitmaps by 64-bit values instead of 32-bit ones:

    inline bool
    cycle_lookup(unsigned int bm, volatile unsigned int *w)
    {
        unsigned long *_w = (unsigned long *)w;
        unsigned long _bm = ((unsigned long)bm << 32) | bm;

        for (int i = 0; i < 8; ++i)
            if (_bm & _w[i])
                return true;
        return false;
    }


And this gives us real speedup:

    13ms 18ms 18ms 21ms 24ms 27ms 29ms 30ms 30ms 14ms

The reason for the slight slowdown on the last array, where we find a matching bitmap on the first try, is that we perform an additional computation for the 64-bit bitmap _bm. We also see that the function's execution time depends on the input data, and this is not always desired. So since we just need to answer whether any of the bitmaps contains the desired bit, we can OR all the bitmaps and check the result against bm. This way we also eliminate the branch misprediction problem (thanks to Melzzzzz for the proposed optimization which moves the AND out of the loop):

    bool
    cycle_lookup_opt(unsigned int bm, volatile unsigned int *w)
    {
        unsigned long r = 0;
        unsigned long *_w = (unsigned long *)w;
        unsigned long _bm = ((unsigned long)bm << 32) | bm;

        for (int i = 0; i < 8; ++i)
                r |= _w[i];
        return r & _bm;
    }


And this gives us

    23ms 24ms 23ms 24ms 24ms 24ms 24ms 24ms 24ms 24ms

This is 24ms on average, which is worse than 22.4ms for the previous case. However, this holds only for this exact test set and may differ on real data (what if in most cases you find the bit only in the last bitmap?).

Now it's time to look under the hood of the compiler. In all three cases the loop has constant bounds known at compile time, so we would expect the compiler to vectorize the loop (recall that we compile the program with -O3, which enables loop vectorization). However, it doesn't. Here is the assembly code for cycle_lookup_opt:

    movl    %edi, %edi
    movq    %rdi, %rdx
    salq    $32, %rdx
    orq     %rdi, %rdx
    movq    %rdx, %rcx
    movq    %rdx, %rax
    andq    8(%rsi), %rax
    andq    16(%rsi), %rcx
    orq     %rax, %rcx
    movq    %rdx, %rax
    andq    (%rsi), %rax
    orq     %rcx, %rax
    movq    %rdx, %rcx
    andq    24(%rsi), %rcx
    orq     %rcx, %rax
    movq    %rdx, %rcx
    andq    32(%rsi), %rcx
    orq     %rcx, %rax
    movq    %rdx, %rcx
    andq    40(%rsi), %rcx
    orq     %rcx, %rax
    movq    %rdx, %rcx
    andq    56(%rsi), %rdx
    andq    48(%rsi), %rcx
    orq     %rcx, %rax
    orq     %rax, %rdx
    setne   %al
    ret


So the compiler simply unrolls the loop and also performs some grouping of operations. It decides not to use the CPU vector extension even though -mavx2 was passed as an optimization option.

AVX2, which appeared in the Haswell architecture, allows 256-bit operations, so we can handle our bitmap scan in only two steps, 32 bytes at a time. It looks promising that the CPU will check 8 bitmaps at once; however, there is a drawback: it's costly to load/store the YMM registers which are used by AVX.

First of all we need to convert our 32-bit bitmask to 256-bit value:

    __m256i m = _mm256_set_epi32(bm, bm, bm, bm, bm, bm, bm, bm);

Next, we load our 16 bitmaps to two 256-bit values:

    __m256i v0 = _mm256_set_epi32(w[0], w[1], w[2], w[3],
                                  w[4], w[5], w[6], w[7]);
    __m256i v1 = _mm256_set_epi32(w[8], w[9], w[10], w[11],
                                  w[12], w[13], w[14], w[15]);


Now we can AND v0 and v1 with m, which leaves 1 only in the 11th bit positions and gives us a0 and a1; then we can safely OR those values to get a single 256-bit value:

    __m256i o = _mm256_or_si256(a0, a1);

Unfortunately, we can't just evaluate o and return true or false directly; instead we have to unpack it:

    union {
        __m128i _128[2];
        int     _32[8];
    } mr;

    mr._128[0] = _mm256_extracti128_si256(o, 0);
    mr._128[1] = _mm256_extracti128_si256(o, 1);


and only after that evaluate the result with 8 ORs:

    #define OR8(a)  (a[0] | a[1] | a[2] | a[3] \
                     | a[4] | a[5] | a[6] | a[7])
    return !!OR8(mr._32);
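
Putting the fragments together, the whole function looks roughly like this (a sketch of avx2_lookup(); a0 and a1 are the results of the AND step, and the real code in the linked source may differ in details):

    #include <stdbool.h>
    #include <immintrin.h>

    static bool
    avx2_lookup_sketch(unsigned int bm, volatile unsigned int *w)
    {
        union {
            __m128i _128[2];
            int     _32[8];
        } mr;
        __m256i m = _mm256_set_epi32(bm, bm, bm, bm, bm, bm, bm, bm);
        __m256i v0 = _mm256_set_epi32(w[0], w[1], w[2], w[3],
                                      w[4], w[5], w[6], w[7]);
        __m256i v1 = _mm256_set_epi32(w[8], w[9], w[10], w[11],
                                      w[12], w[13], w[14], w[15]);
        __m256i a0 = _mm256_and_si256(v0, m);
        __m256i a1 = _mm256_and_si256(v1, m);
        __m256i o = _mm256_or_si256(a0, a1);

        mr._128[0] = _mm256_extracti128_si256(o, 0);
        mr._128[1] = _mm256_extracti128_si256(o, 1);

        return !!OR8(mr._32); /* OR8 as defined above */
    }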

This function (avx2_lookup() in the source code) gives us the following timings:

    160ms 161ms 160ms 160ms 161ms 160ms 160ms 161ms 160ms 160ms

Thus our naive vectorization has nasty performance. To improve it we should reduce the load/store overhead. Our optimization consists of 3 steps which reduce the initial two 256-bit values to two 64-bit ones. Also we can use _mm256_set_epi64x(), which loads 256-bit YMM registers faster than _mm256_set_epi32(). The resulting optimized function looks as below:

    bool
    avx2_lookup_opt1(unsigned int bm, volatile unsigned int *w)
    {
        union {
            __m128i _128;
            long    _64[2];
        } mr;
        __m128i m = _mm_set1_epi32(bm);
        __m256i v0 = _mm256_set_epi64x(*(long *)w,
                                       *(long *)(w + 2),
                                       *(long *)(w + 4),
                                       *(long *)(w + 6));
        __m256i v1 = _mm256_set_epi64x(*(long *)(w + 8),
                                       *(long *)(w + 10),
                                       *(long *)(w + 12),
                                       *(long *)(w + 14));
        __m256i o0 = _mm256_or_si256(v0, v1);
        __m128i h0 = _mm256_extracti128_si256(o0, 0);
        __m128i h1 = _mm256_extracti128_si256(o0, 1);
        __m128i o1 = _mm_or_si128(h0, h1);
        mr._128 = _mm_and_si128(o1, m);
        return mr._64[0] || mr._64[1];
    }


Note that we perform the first OR on v0 and v1, reducing them to one 256-bit value o0; next we load its first and second halves into the 128-bit h0 and h1 respectively. We do the next reduction by ORing h0 and h1, getting o1. Only here do we perform our AND operation. And finally, we load the two halves of o1 into two 64-bit longs and return the result of the last OR operation. This optimization gives us:

    71ms 70ms 70ms 69ms 70ms 70ms 71ms 70ms 69ms 71ms

Much better, but still worse than even our non-optimized plain C loop. Fortunately, we have the VPTEST AVX instruction, which ANDs two 256-bit operands and sets the ZF flag if the result is all zeros. Using the instruction (the appropriate compiler intrinsic, _mm256_testz_si256(), can be found in the Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z) we can rewrite the function in the following way:

    bool
    avx2_lookup_opt2(unsigned int bm, volatile unsigned int *w)
    {
        __m256i m = _mm256_set1_epi32(bm);
        __m256i v0 = _mm256_set_epi64x(*(long *)w,
                                       *(long *)(w + 2),
                                       *(long *)(w + 4),
                                       *(long *)(w + 6));
        __m256i v1 = _mm256_set_epi64x(*(long *)(w + 8),
                                       *(long *)(w + 10),
                                       *(long *)(w + 12),
                                       *(long *)(w + 14));
        __m256i o = _mm256_or_si256(v0, v1);
        return !_mm256_testz_si256(o, m);
    }


which has very short assembly code and gives us amazing results:

    20ms 20ms 20ms 20ms 20ms 21ms 20ms 20ms 20ms 20ms

These are not only faster than the plain C loop on average, but also have very stable execution time.

Thus, even though GCC doesn't vectorize this loop with known bounds for simple bitwise operations on integers, manual vectorization of the loop with the new AVX2 instruction set gives about a 17% performance benefit on average.


UPD.

Unfortunately, the measurements were wrong. The compiler unrolls the loops in the test() and test_all() macros and moves the called function's initialization code (more precisely, the loading of the wc array) out of the loop. So a call to avx2_lookup_opt2() can be depicted in pseudo code in the following way:

    load all the arrays to YMM0-YMM14 registers and make OR
    for [0,10M]:
        load bm to YMM14 register
        YMM14 AND YMM0-YMM14


So the heaviest part of avx2_lookup_opt2(), the loading of the YMM registers, was running outside of the loop.

To avoid this unwanted optimization we have to trick the compiler:

    volatile unsigned int * volatile wc;
    static unsigned int __wc[10][16] __attribute__((aligned(64)));


    wc = (volatile unsigned int * volatile)__wc;

Note the double volatile specifier in the wc declaration, which says that not only the pointer itself, but also the pointed-to memory, is volatile. Surely, we also have to adjust the code which uses wc. And now we can rerun our tests:

First naive loop:

    18ms 20ms 31ms 36ms 45ms 50ms 62ms 83ms 83ms 12ms

Naive loop with 64-bit steps:

    18ms 18ms 19ms 24ms 27ms 32ms 38ms 42ms 41ms 18ms

Loop with OR instead of conditional return:

    43ms 43ms 43ms 43ms 44ms 42ms 43ms 42ms 43ms 43ms

First naive AVX2 implementation:

    160ms 159ms 160ms 161ms 160ms 161ms 159ms 160ms 159ms 161ms

Optimized AVX2 version:

    72ms 72ms 72ms 73ms 72ms 73ms 72ms 73ms 73ms 73ms

AVX2 version with the VPTEST instruction:

    65ms 63ms 64ms 64ms 64ms 63ms 64ms 64ms 65ms 63ms

Thus the execution time of all the tests increased and my "cool" AVX2 implementation is actually much slower than the naive C implementation.

Thanks to Melzzzzz who proposed a very short implementation of the function in bare assembly at the comp.lang.asm.x86 group. So I had a look at the code generated by G++ for avx2_lookup_opt2(), and it looks messy:

    # Messy loading of the array pieces from memory to the YMM2 and YMM0 registers.
    vmovq   8(%rsi), %xmm1
    vpinsrq $1, (%rsi), %xmm1, %xmm0
    vmovq   24(%rsi), %xmm1
    vpinsrq $1, 16(%rsi), %xmm1, %xmm2
    vmovq   40(%rsi), %xmm1
    vinserti128     $0x1, %xmm0, %ymm2, %ymm2
    vpinsrq $1, 32(%rsi), %xmm1, %xmm0
    vmovq   56(%rsi), %xmm1
    vpinsrq $1, 48(%rsi), %xmm1, %xmm3
    vinserti128     $0x1, %xmm0, %ymm3, %ymm0


    # __m256i o = _mm256_or_si256(v0, v1)
    vpor    %ymm0, %ymm2, %ymm0


    #  __m256i m = _mm256_set1_epi32(bm)
    vmovd   %edi, %xmm2
    vpshufd $0, %xmm2, %xmm1
    vinserti128     $1, %xmm1, %ymm1, %ymm1
    

    # ! _mm256_testz_si256(o, m)
    vptest  %ymm1, %ymm0
    setne   %al
 

    # Zero upper halves of all YMM registers for interoperability
    # with legacy SSE code.
    vzeroupper

    ret


Instead of the messy loading into the YMM0 and YMM2 registers, Melzzzzz proposed to load only one YMM register using the VMOVUPS instruction and to perform the OR on the register and a memory operand.

In fact we don't need explicit loading into YMM0 and YMM2: the compiler can do this for us if we rewrite the function in this way (knowing that 256-bit memory operands can be used directly):

    __m256i m = _mm256_set1_epi32(bm);
    __m256i o = _mm256_or_si256(*(__m256i *)w, *(__m256i *)(w + 8));
    return !_mm256_testz_si256(o, m);


Also we don't need to be compatible with legacy SSE code, so we can use the -mno-vzeroupper compiler option to avoid emitting the vzeroupper instruction. Thus as a result we get very short assembly code, which is very close to Melzzzzz's proposal:

    # Move 256-bit at once from aligned memory to YMM0
    # (Also proposed by Joker-eph in the comments below). 
    vmovdqa (%rsi), %ymm0
 

    vmovd   %edi, %xmm2
    vpshufd $0, %xmm2, %xmm1
    vinserti128     $1, %xmm1, %ymm1, %ymm1


    # Use memory operand in OR
    vpor    32(%rsi), %ymm0, %ymm0

    vptest  %ymm1, %ymm0
    setne   %al
    ret


And now it runs with the following times:

    21ms 21ms 21ms 20ms 21ms 21ms 20ms 21ms 21ms 21ms

Given that our fastest C implementation (the naive loop with 64-bit steps) runs for 26ms on average after the benchmark fixes, we get a 19% performance improvement on average and 100% in the worst case!