High Performance Linux





Wednesday, January 15, 2014

NatSys Lock-free Queue vs LMAX Disruptor vs boost::lockfree::queue

I've been asked many times how our lock-free queue (you can read more about it here) differs from LMAX Disruptor. Most recently it came up in a discussion at Hacker News (it seems one of the Disruptor authors was asking). In LinkedIn discussions I was also asked about boost::lockfree::queue (it appeared in Boost recently). So I'm writing this post to share the answers.

LMAX Disruptor

Its source code is available at https://github.com/redjack/varon-t.

The algorithms are simply different and solve different problems: Disruptor is a messaging queue (if a producer P0 emits message M0 into the queue, then all consumers C0..CN receive M0), while our queue is a classical work queue (if P0 and P1 emit messages M0 and M1, then only one consumer Ci receives M0 and only one consumer Cj receives M1, where i may be equal to j).


Our implementation competes with boost::lockfree::queue, and it's much faster since the Boost implementation uses heavier synchronization techniques. The benchmark on GitHub also includes the Boost implementation, so you can compare the two queues.

Unfortunately, there is no adequate algorithm description for the Disruptor queue, only some indistinct descriptions mostly suitable for business people rather than engineers. So it was not easy to dig into its source code. However, I studied it, and here are some notes about the implementation.

The implementation is a bit inaccurate: there are a lot of branches without branch prediction information available at compile time, and to avoid cache line bouncing it wastes two cache lines instead of simply aligning an item on a cache line boundary (I mean vrt_padded_int). I didn't pay too much attention to the memory barriers usage, but given that x86-64 provides relatively strict memory ordering, probably some of them could also be eliminated. Disruptor uses very clever ideas, but I believe its performance can be improved by a good code review.

One more point: while our queue implementation is C++, it's still self-sufficient and can be easily ported to C for use in kernel space. It's doubtful (in my humble opinion) that a generic container should depend on the non-standard libcork and, moreover, on a logging library (clogger).

One more thing to note about our queue and Disruptor: Disruptor uses naive yielding logic. The problem with that logic is that if a thread has no work for a long time, it falls asleep for a period that grows from one yield call to the next. So if the queue is idle for a long time and then suddenly gets a big spike, the consumers typically spend some time waking up before they start to work. We solved the issue with a lock-free condition wait.


boost::lockfree::queue

It uses at least one CAS operation (see for example do_push() in boost/lockfree/queue.hpp), which even in the best case is slower than a plain RMW operation (an atomic increment in our case), so it's slower overall. You can run the benchmark for both queues from GitHub. For my Intel Core i7-4650U it gives:

    $ g++ -Wall -std=c++0x -O2 -D DCACHE1_LINESIZE=`getconf LEVEL1_DCACHE_LINESIZE` -I/opt/boost_1_55_0/include/ lockfree_rb_q.cc -lpthread
    $ ./a.out
    13969ms
    check X data...
    Passed
    69357ms
    check X data...
    Passed
    20240ms
    check X data...
    Passed


Note that I used the latest Boost version. The first result is for our queue, the second is for a naive queue implementation, and the last one is for Boost. That is, our implementation is more than 30% faster than boost::lockfree::queue.

6 comments:

  1. Interesting article!

    Just have a side comment. I noticed you use the words "cache line". However, I think this should be "cache block". This is a general misconception. Cache line and cache block are different. The cache line is the physical placeholder inside a cache, however, a cache block is the payload inside that physical placeholder. People tend to confuse them together since they are both of the same size. Hence, movement is supposed to be for cache blocks and not for cache lines :)

    1. Hi Ahmad,

probably so; I've never come across "cache block" in the literature, and "cache line" can be misleading, but in most cases it's clear from context whether an author means the physical placeholder in a cache or the payload. So I just prefer to stick to common practice.

  2. This comment has been removed by the author.

  3. Alexander, the code review is great as usual, but the article has the following flaw: you seemingly attribute varon-t's C implementation as the original LMAX Java Disruptor ("LMAX Disruptor .. Its source code available"), which isn't correct.

  4. Hi Igor,

    it was not my idea to present varon-t's C implementation as LMAX Disruptor. Firstly, I was asked many times how our approach (the algorithm, not a particular C or Java implementation) differs from the Disruptor design. And secondly, akadien referred to the C implementation as a plug for C at Hacker News (https://news.ycombinator.com/item?id=7042771).

  5. This comment has been removed by a blog administrator.

