High Performance Linux





Tuesday, January 7, 2014

Haswell AVX2 for Simple Integer Bitwise Operations

It's well known that the vector extension instructions of modern microprocessors are very useful for complicated SIMD (Single Instruction Multiple Data) computations, e.g. matrix multiplication, floating point calculations for graphics and video applications, etc. However, what about simple bitwise operations on plain integers? Do vector extensions provide any notable benefit in comparison with loop unrolling and other common optimizations at the C level?

Let's study a simple example. We have 16 32-bit bitmaps and need to check them all for a particular bit, i.e. just say whether the bit is set in any of the bitmaps or not. The bitmaps are placed in contiguous memory. Without loss of generality I'll consider the following sets of bitmaps, which may or may not have the 11th bit set (2048 = 1 << 11):

    0    2048 4096 48   5    11   8192 56   304  16384 3    204  208  60   901  208
    0    304  2048 48   5    11   8192 56   4096 16384 3    204  208  60   901  208
    0    304  4096 48   5    2048 8192 56   11   16384 3    204  208  60   901  208
    0    304  4096 48   5    11   2048 56   1024 16384 3    204  208  60   901  208
    0    304  4096 48   5    11   8192 56   2048 16384 3    204  208  60   901  208
    0    304  4096 48   5    11   8192 56   3    16384 2048 204  208  60   901  208
    0    304  4096 48   5    11   8192 56   60   16384 3    204  208  2048 901  208
    0    304  4096 48   5    11   8192 56   208  16384 3    204  208  60   901  2048
    0    304  4096 48   2000 11   8192 5    56   16384 3    204  2040 60   901  2047
    2048 2048 2048 2048 2048 2048 2048 2048 2048 2048  2048 2048 2048 2048 2048 2048


The first eight arrays have the bit set in different bitmaps, the 9th array doesn't contain a bitmap with the bit set at all, and finally the 10th array has the bit set in every bitmap.

Now we can write a simple C function which checks all the bitmaps for the specified bit and returns true or false (the full example can be found on GitHub):

    inline bool
    cycle_lookup_naive(unsigned int bm, volatile unsigned int *w)
    {
        for (int i = 0; i < 16; ++i)
            if (bm & w[i])
                return true;
        return false;
    }


This function accepts the desired bitmask as the first argument and the array of bitmaps as the second. Since the function is inlined and operates on a static constant array, I pass the second argument as a volatile memory area to prevent the compiler from optimizing away the loop in which the function is called 10M times.
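
For reference, a timing harness can be as simple as the sketch below. This is only an illustration under my assumptions, not the exact test()/test_all() macros from the GitHub code (those macros are discussed in the update at the end of the post); it just calls a lookup function 10M times for one array and prints the elapsed milliseconds:

    #include <chrono>
    #include <iostream>

    static const int RUNS = 10 * 1000 * 1000; /* 10M calls, as in the measurements */

    /* Hypothetical helper: time RUNS calls of a lookup function against
     * one array of 16 bitmaps and print the elapsed wall-clock time. */
    template<typename F>
    void
    benchmark(F lookup, unsigned int bm, volatile unsigned int *w)
    {
        auto start = std::chrono::steady_clock::now();
        bool found = false;
        for (int i = 0; i < RUNS; ++i)
            found |= lookup(bm, w);
        auto end = std::chrono::steady_clock::now();
        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(
                         end - start).count()
                  << "ms (found=" << found << ")" << std::endl;
    }

Each of the 10 test arrays is then timed with one such call per lookup variant.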

I want all the optimizations available in G++ 4.7 for my Intel Core i7-4650U (Haswell), so I compile the program with -O3 -march=core-avx-i -mtune=core-avx-i -mavx2. GCC has an auto-vectorization feature which is enabled at the -O3 optimization level. Running the function 10M times shows the following results for the 10 arrays:

    19ms 18ms 24ms 31ms 35ms 42ms 48ms 54ms 53ms 10ms

There is an obvious, but still interesting, property of the implementation: it runs faster on arrays which have the desired bit closer to the beginning. The second thing is that the run for the 9th array, which doesn't have a bitmap with the bit set at all, is a bit faster than for the array which has the bit set in the last bitmap. This can be explained by branch prediction: the processor learns the non-matching branch, so it stalls a little when it finally finds the match in the last bitmap. There is an interesting discussion of branch prediction on StackOverflow: Why is processing a sorted array faster than an unsorted array?

The first optimization we can make is to simply iterate over and match the bitmaps as 64-bit values instead of 32-bit ones:

    inline bool
    cycle_lookup(unsigned int bm, volatile unsigned int *w)
    {
        unsigned long *_w = (unsigned long *)w;
        unsigned long _bm = ((unsigned long)bm << 32) | bm;

        for (int i = 0; i < 8; ++i)
            if (_bm & _w[i])
                return true;
        return false;
    }


And this gives us a real speedup:

    13ms 18ms 18ms 21ms 24ms 27ms 29ms 30ms 30ms 14ms

The reason for the slight slowdown on the last array, where we find the bit on the very first try, is that we perform an additional computation for the 64-bit bitmask _bm. We also see that the function's execution time depends on the input data, and this is not always desirable. Since we just need to answer whether any of the bitmaps contains the desired bit, we can OR all the bitmaps together and check the result against bm. This way we also eliminate the branch misprediction problem (thanks to Melzzzzz for the proposed optimization which moves the AND out of the loop):

    bool
    cycle_lookup_opt(unsigned int bm, volatile unsigned int *w)
    {
        unsigned long r = 0;
        unsigned long *_w = (unsigned long *)w;
        unsigned long _bm = ((unsigned long)bm << 32) | bm;

        for (int i = 0; i < 8; ++i)
                r |= _w[i];
        return r & _bm;
    }


And this gives us:

    23ms 24ms 23ms 24ms 24ms 24ms 24ms 24ms 24ms 24ms

This is 24ms on average, which is worse than the 22.4ms of the previous case. However, this holds only for this exact test set; real data can differ (what if in most cases you find the bit only in the last bitmap?).

Now it's time to look under the hood of the compiler. In all three cases the loop has constant bounds known at compile time, so one would expect the compiler to vectorize the loop (recall that we compile the program with -O3, which enables loop vectorization). However, it isn't so. Here is the assembly code for cycle_lookup_opt:

    movl    %edi, %edi
    movq    %rdi, %rdx
    salq    $32, %rdx
    orq     %rdi, %rdx
    movq    %rdx, %rcx
    movq    %rdx, %rax
    andq    8(%rsi), %rax
    andq    16(%rsi), %rcx
    orq     %rax, %rcx
    movq    %rdx, %rax
    andq    (%rsi), %rax
    orq     %rcx, %rax
    movq    %rdx, %rcx
    andq    24(%rsi), %rcx
    orq     %rcx, %rax
    movq    %rdx, %rcx
    andq    32(%rsi), %rcx
    orq     %rcx, %rax
    movq    %rdx, %rcx
    andq    40(%rsi), %rcx
    orq     %rcx, %rax
    movq    %rdx, %rcx
    andq    56(%rsi), %rdx
    andq    48(%rsi), %rcx
    orq     %rcx, %rax
    orq     %rax, %rdx
    setne   %al
    ret


So the compiler simply unrolls the loop and also does some grouping of operations. It decides not to use the CPU vector extensions even though -mavx2 was passed as an optimization option.

AVX2, which appeared in the Haswell architecture, allows 256-bit operations, so we can scan our bitmaps in only two steps, 32 bytes at a time. It looks promising that the CPU will check 8 bitmaps at once; however, there is a drawback: it's costly to load/store the YMM registers used by AVX.

First of all we need to broadcast our 32-bit bitmask into a 256-bit value:

    __m256i m = _mm256_set_epi32(bm, bm, bm, bm, bm, bm, bm, bm);

Next, we load our 16 bitmaps into two 256-bit values:

    __m256i v0 = _mm256_set_epi32(w[0], w[1], w[2], w[3],
                                  w[4], w[5], w[6], w[7]);
    __m256i v1 = _mm256_set_epi32(w[8], w[9], w[10], w[11],
                                  w[12], w[13], w[14], w[15]);


Now we can AND v0 and v1 with m, producing a0 and a1, which can have bits set only in the 11th position of each 32-bit lane, so we can safely OR them into a single 256-bit value:

    __m256i o = _mm256_or_si256(a0, a1);

Unfortunately, we can't just test o and return true or false; instead we have to unpack it:

    union {
        __m128i _128[2];
        int     _32[8];
    } mr;

    mr._128[0] = _mm256_extracti128_si256(o, 0);
    mr._128[1] = _mm256_extracti128_si256(o, 1);


and only after that evaluate the result with 8 ORs:

    #define OR8(a)  (a[0] | a[1] | a[2] | a[3] \
                     | a[4] | a[5] | a[6] | a[7])
    return !!OR8(mr._32);
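
Putting the pieces together, the whole naive AVX2 lookup looks roughly as follows. This is a sketch assembled from the snippets above (the exact avx2_lookup() is in the GitHub sources); the intrinsics come from the immintrin.h header:

    #include <immintrin.h>

    #define OR8(a)  (a[0] | a[1] | a[2] | a[3] \
                     | a[4] | a[5] | a[6] | a[7])

    bool
    avx2_lookup(unsigned int bm, volatile unsigned int *w)
    {
        union {
            __m128i _128[2];
            int     _32[8];
        } mr;
        /* Broadcast the 32-bit mask to all eight 32-bit lanes. */
        __m256i m = _mm256_set_epi32(bm, bm, bm, bm, bm, bm, bm, bm);
        /* Load the 16 bitmaps into two 256-bit values. */
        __m256i v0 = _mm256_set_epi32(w[0], w[1], w[2], w[3],
                                      w[4], w[5], w[6], w[7]);
        __m256i v1 = _mm256_set_epi32(w[8], w[9], w[10], w[11],
                                      w[12], w[13], w[14], w[15]);
        /* AND with the mask, then reduce the two halves with OR. */
        __m256i a0 = _mm256_and_si256(v0, m);
        __m256i a1 = _mm256_and_si256(v1, m);
        __m256i o = _mm256_or_si256(a0, a1);

        /* Unpack the 256-bit result and OR its eight 32-bit words. */
        mr._128[0] = _mm256_extracti128_si256(o, 0);
        mr._128[1] = _mm256_extracti128_si256(o, 1);
        return !!OR8(mr._32);
    }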

This function (avx2_lookup() in the source code) gives us the following timings:

    160ms 161ms 160ms 160ms 161ms 160ms 160ms 161ms 160ms 160ms

Thus our naive vectorization shows nasty performance. To improve the function's performance we have to reduce the load/store overhead. Our optimization consists of 3 reduction steps which narrow the initial two 256-bit values down to two 64-bit ones. Also we can use _mm256_set_epi64x(), which loads the 256-bit YMM registers faster than _mm256_set_epi32(). The resulting optimized function looks as follows:

    bool
    avx2_lookup_opt1(unsigned int bm, volatile unsigned int *w)
    {
        union {
            __m128i _128;
            long    _64[2];
        } mr;
        __m128i m = _mm_set1_epi32(bm);
        __m256i v0 = _mm256_set_epi64x(*(long *)w,
                                       *(long *)(w + 2),
                                       *(long *)(w + 4),
                                       *(long *)(w + 6));
        __m256i v1 = _mm256_set_epi64x(*(long *)(w + 8),
                                       *(long *)(w + 10),
                                       *(long *)(w + 12),
                                       *(long *)(w + 14));
        __m256i o0 = _mm256_or_si256(v0, v1);
        __m128i h0 = _mm256_extracti128_si256(o0, 0);
        __m128i h1 = _mm256_extracti128_si256(o0, 1);
        __m128i o1 = _mm_or_si128(h0, h1);
        mr._128 = _mm_and_si128(o1, m);
        return mr._64[0] || mr._64[1];
    }


Note that we first OR v0 and v1, reducing them to one 256-bit value o0; next we load its first and second halves into the 128-bit h0 and h1 respectively. We do the next reduction by ORing h0 and h1 into o1. Only here do we perform our AND operation. Finally, we load the two halves of o1 into two 64-bit longs and return the result of the last OR. This optimization gives us:

    71ms 70ms 70ms 69ms 70ms 70ms 71ms 70ms 69ms 71ms

Much better, but still worse than even our non-optimized plain C loop. Fortunately, there is the VPTEST AVX instruction, emitted for the _mm256_testz_si256() intrinsic (see the Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z), which ANDs two 256-bit operands and sets the ZF flag if the result is all zeros. Using it we can rewrite the function in the following way:

    bool
    avx2_lookup_opt2(unsigned int bm, volatile unsigned int *w)
    {
        __m256i m = _mm256_set1_epi32(bm);
        __m256i v0 = _mm256_set_epi64x(*(long *)w,
                                       *(long *)(w + 2),
                                       *(long *)(w + 4),
                                       *(long *)(w + 6));
        __m256i v1 = _mm256_set_epi64x(*(long *)(w + 8),
                                       *(long *)(w + 10),
                                       *(long *)(w + 12),
                                       *(long *)(w + 14));
        __m256i o = _mm256_or_si256(v0, v1);
        return !_mm256_testz_si256(o, m);
    }


which compiles to very short assembly code and gives us amazing results:

    20ms 20ms 20ms 20ms 20ms 21ms 20ms 20ms 20ms 20ms

These timings are not only faster than the plain C loop on average, but also very stable.

Thus, although GCC doesn't vectorize this loop with known bounds for simple bitwise operations on integers, manual vectorization of the loop with the new AVX2 instruction set gives about a 17% performance benefit on average.


 UPD.

Unfortunately, the measurements were wrong. The compiler unrolls the loops in the test() and test_all() macros and moves the called function's initialization code (more precisely, the loading of the wc array) out of the loop. So a call to avx2_lookup_opt2() can be depicted in pseudo code in the following way:

    load all the arrays into YMM registers and OR them together
    for [0, 10M]:
        load bm into another YMM register
        AND it with the ORed arrays


So the heaviest part of avx2_lookup_opt2(), the YMM register loading, was executed outside the loop.

To avoid this unwanted optimization we have to trick the compiler:

    volatile unsigned int * volatile wc;
    static unsigned int __wc[10][16] __attribute__((aligned(64)));

    wc = (volatile unsigned int * volatile)__wc;

Note the double volatile specifier in the wc declaration: not only the pointer itself, but also the memory it points to, is volatile. Of course, the code which uses wc also has to be adjusted accordingly.
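
As an illustration only (the real test() and test_all() macros are in the GitHub code), the measured loop might access the arrays through the doubly-volatile pointer like this:

    /* Hypothetical usage sketch: every access goes through wc (set up as
     * above), so the compiler can hoist neither the pointer nor the array
     * loads out of the measured loop. */
    for (int a = 0; a < 10; ++a)
        printf("array %d: found=%d\n", a,
               (int)cycle_lookup_naive(2048, wc + a * 16));

Now we can rerun the tests: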

First naive loop:

    18ms 20ms 31ms 36ms 45ms 50ms 62ms 83ms 83ms 12ms

Naive loop with 64-bit steps:

    18ms 18ms 19ms 24ms 27ms 32ms 38ms 42ms 41ms 18ms

Loop with OR instead of conditional return:

    43ms 43ms 43ms 43ms 44ms 42ms 43ms 42ms 43ms 43ms

First naive AVX2 implementation:

    160ms 159ms 160ms 161ms 160ms 161ms 159ms 160ms 159ms 161ms

Optimized AVX2 version:

    72ms 72ms 72ms 73ms 72ms 73ms 72ms 73ms 73ms 73ms

AVX2 version with the VPTEST instruction:

    65ms 63ms 64ms 64ms 64ms 63ms 64ms 64ms 65ms 63ms

Thus the execution time of all the tests increased, and my "cool" AVX2 implementation is actually much slower than the naive C implementation.

Thanks to Melzzzzz, who proposed a very short implementation of the function in bare assembly in the comp.lang.asm.x86 group. So I had a look at the code generated by G++ for avx2_lookup_opt2(), and it looks messy:

    # Messy loading of the array pieces from memory into the YMM2 and YMM0 registers.
    vmovq   8(%rsi), %xmm1
    vpinsrq $1, (%rsi), %xmm1, %xmm0
    vmovq   24(%rsi), %xmm1
    vpinsrq $1, 16(%rsi), %xmm1, %xmm2
    vmovq   40(%rsi), %xmm1
    vinserti128     $0x1, %xmm0, %ymm2, %ymm2
    vpinsrq $1, 32(%rsi), %xmm1, %xmm0
    vmovq   56(%rsi), %xmm1
    vpinsrq $1, 48(%rsi), %xmm1, %xmm3
    vinserti128     $0x1, %xmm0, %ymm3, %ymm0


    # __m256i o = _mm256_or_si256(v0, v1)
    vpor    %ymm0, %ymm2, %ymm0


    #  __m256i m = _mm256_set1_epi32(bm)
    vmovd   %edi, %xmm2
    vpshufd $0, %xmm2, %xmm1
    vinserti128     $1, %xmm1, %ymm1, %ymm1
    

    # ! _mm256_testz_si256(o, m)
    vptest  %ymm1, %ymm0
    setne   %al
 

    # Zero upper halves of all YMM registers for interoperability
    # with legacy SSE code.
    vzeroupper

    ret


Instead of the messy loading into the YMM0 and YMM2 registers, Melzzzzz proposed to load only one YMM register using the VMOVUPS instruction and to perform the OR with a memory operand.

In fact we don't need explicit loading into YMM0 and YMM2 at all: the compiler can do it for us if we rewrite the function this way (knowing that 256-bit memory operands can be used directly):

    __m256i m = _mm256_set1_epi32(bm);
    __m256i o = _mm256_or_si256(*(__m256i *)w, *(__m256i *)(w + 8));
    return !_mm256_testz_si256(o, m);


Also we don't need to be compatible with legacy SSE code, so we can use the -mno-vzeroupper compiler option to avoid emitting the vzeroupper instruction. As a result we get very short assembly code, which is very close to Melzzzzz's proposal:

    # Move 256 bits at once from aligned memory into YMM0
    # (also proposed by Joker-eph in the comments below).
    vmovdqa (%rsi), %ymm0

    vmovd   %edi, %xmm2
    vpshufd $0, %xmm2, %xmm1
    vinserti128     $1, %xmm1, %ymm1, %ymm1

    # Use a memory operand in the OR
    vpor    32(%rsi), %ymm0, %ymm0

    vptest  %ymm1, %ymm0
    setne   %al
    ret


And now it runs with the following times:

    21ms 21ms 21ms 20ms 21ms 21ms 20ms 21ms 21ms 21ms

Given that our fastest C implementation (the naive loop with 64-bit steps) runs for 26ms on average after the benchmark fixes, we get a 19% performance improvement on average and 100% in the worst case!

5 comments:

  1. hmm, i can't compile code from github :( [error]
    g++ -O3 -march=core-avx-i -mtune=core-avx-i -mavx2 test.cpp
    test.cpp: In function ‘bool cycle_lookup(unsigned int, volatile unsigned int*)’:
    test.cpp:42:51: warning: left shift count >= width of type [enabled by default]
    unsigned long _bm = ((unsigned long)bm << 32) | bm;
    ^
    test.cpp: In function ‘bool cycle_lookup_opt(unsigned int, volatile unsigned int*)’:
    test.cpp:55:51: warning: left shift count >= width of type [enabled by default]
    unsigned long _bm = ((unsigned long)bm << 32) | bm;
    ^
    [niemi@localhost ~]$ g++ --version
    g++ (GCC) 4.8.2 20131212 (Red Hat 4.8.2-7)
    $uname -a
    Linux localhost.localdomain 3.11.10-200.fc19.i686 #1 SMP Mon Dec 2 20:48:22 UTC 2013 i686 i686 i386 GNU/Linux

  2. Hi Aleksei!

    Thank you for the report. The problem is that the benchmark was written for modern Haswell CPUs: it uses AVX2, which appeared in that architecture, and it also relies on 64-bit mode, while I see that you use a 32-bit machine.

  3. Why aren't you using _mm256_loadu_si256() instead of _mm256_set_epi64x()?

  4. You may also want to try _mm256_lddqu_si256(), and if you can align your array on a 32-byte boundary, definitely _mm256_load_si256() :)
    I read somewhere that the new Haswell relaxed the restriction on aligned loads, but I can't find it in the intrinsics reference.

  5. Hi Joker-eph,

    yes, I've updated the post. My fault that I didn't pay enough attention to the generated assembly code for the benchmark, both for the loop optimizations and for the AVX implementation.

    I also updated the code on GitHub. And yes, I aligned the arrays and used vmovdqa.

    Thank you for the proposal!

