Monday, July 10, 2017

Left-Right and C-RW-WP with Flat Combining

Flat Combining is a cool technique that allows multiple threads to help each other.
http://dl.acm.org/citation.cfm?id=1810540
It was designed to be used with mutual exclusion locks but it has been used with other things,
such as Hardware Transactional Memory (see "Transactional Lock Elision Meets Combining" in PODC 2017).
We have used it to improve the throughput of C-RW-WP and Left-Right.

C-RW-WP can be seen here:
http://dl.acm.org/citation.cfm?id=2442532
and Left-Right here:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/left-right-2014.pdf

Here is how it works for C-RW-WP:
- Writers publish their operation (a std::function) using a publishing mechanism, in our case it's just an array of atomic pointers to std::function. Each std::function typically stores a lambda on which we captured whatever parameters are needed to make things simpler;
- A writer spins until it gets the exclusive lock or until its entry in the publishing array goes to nullptr, thus meaning that some other writer has completed its operation and has placed the result in the corresponding entry on the array of results;
- If the writer gets the exclusive lock, then it is his responsibility to execute the other thread's operations that are in the publishing array, save the results in the results array, and set the pointer in the publishing array back to nullptr;
- Readers don't need to publish its operation in the publishing array, unless the lock is already in exclusive mode. In that case, they publish the operation and wait for the lock to be released, or for the entry in the publishing array to become nullptr;

Example code can be seen here:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/locks/CRWWPFlatCombining.hpp


There are a couple of minor tricks:
1st, the writer holding the lock will execute each operation at a time
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/locks/CRWWPFlatCombining.hpp#L121
and only then will it set the entry to nullptr with a store release
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/locks/CRWWPFlatCombining.hpp#L122
this way, another thread is guaranteed to see the result there if it sees its own entry at nullptr
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/locks/CRWWPFlatCombining.hpp#L102

2nd, the writers don't have a starvation-free cohort lock, it's a simple spin-lock, but because the Flat Combining technique is starvation-free, the writers end up being starvation-free (among themselves), a very good property to have without having to make a more complex or strick-fairness lock, such as a ticket lock.
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/locks/CRWWPFlatCombining.hpp#L100

3rd, the readers attempt to get the lock in shared mode first, and only if it is not available (a writer is already there) then they publish their operations
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/locks/CRWWPFlatCombining.hpp#L100
A reader's operation is not mutative (by definition) but the result needs to be known, so it's up to the writer (who doesn't care if the operation is mutative or not) to apply the operation and place the result in the results array.
The consequence of this is that in low contention, C-RW-WP works as usual (which is blazingly fast), but when there is more contention and some threads are writers, then it goes on the "flat combining path" which is starvation-free.

This means that the end result is a technique that is fully starvation-free, i.e. readers don't starve writers, writers don't starve readers, readers don't starve other readers, and writers don't starve other writers.
I don't know about you, but I think it's pretty cool to take something that has a large propensity for starvation like C-RW-WP (writers starve readers and writers will starve other writers if the cohort is a spinlock), apply Flat Combining to it, and get something that is fully starvation-free, and on top of that has lower tail latency and higher throughput on most workloads  :-O


So how about applying Flat Combining to Left-Right ?
This was so obvious that even back in 2012 Andreia was speaking of doing this (before we even read the Flat combining paper) but it was too simplistic and we wanted full wait-freedom so she dropped it. But now, here it comes again  :)
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/leftright/LeftRightFlatCombining.hpp

Adding Flat Combining to Left-Right is a bit trickier than adding Flat Combining to C-RW-WP, but not by much.

The first difference is that readers are wait-free in Left-Right, so there is no point for them to add their operations to the publishing array. Flat Combining is great but it provides at best starvation-free progress, while Left-Right provides wait-freedom for readers.
In Left-Right neither readers starve writers nor writers starve readers, so there isn't much of an advantage in progress of adding FC to Left-Right likes there was for C-RW-WP, but there is still the advantage of using a spin lock as the writersMutex and still providing starvation-freedom amongst writers, so we did that
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/leftright/LeftRightFlatCombining.hpp#L124

The second difference is that we don't want the writer currently holding the writersMutex lock, to apply operations from other writers in one of the instances but not the other, because it would leave the instances in an inconsistent state. To solve this, we first copy the entire publishing array (it's just pointers) to a local (stack-allocated) array:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/leftright/LeftRightFlatCombining.hpp#L135
and only then do we start applying the mutations from the local array
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/leftright/LeftRightFlatCombining.hpp#L143
After the toggle, we apply the same mutations on the opposite instances
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/leftright/LeftRightFlatCombining.hpp#L153
and only then do we set the corresponding entries in the publishing array to nullptr
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/CPP/leftright/LeftRightFlatCombining.hpp#L156


Just for fun, here are some plots.
Unfortunately I don't have the same plots with/without Flat Combining, but with FC is generally better:



For mutative only workloads (100% writes) using a global lock is better than Left-Right because on Left-Right we have to apply the same mutation twice (once on each instance). As soon as the number of readers increases, Left-Right takes the upper hand (regardless of using Flat Combining or not) and start to be better.

This kind of benchmark uses "mixed-ratio" threads and without over-subscription, so we don't really see the effects of starvation-freedom or wait-freedom, but it's good to know that Left-Right can be combined with Flat Combining and still come out ahead  :)

No comments:

Post a Comment