Monthly Archives: March 2014

An update on STM speed

This continues on the previous post, Benchmarking an STM.

After the last post and those benchmark results, I’ve been busy lately doing various optimizations on Shielded. In the post I announced that I would be changing the Shielded<T>.Assign method to no longer be commutable, due to the cost it incurs. But I’ve also managed to find many other improvements, which you can check out in the project commit log. By far the biggest improvement came from optimizing the process of triggering conditional transactions.

A small but interesting optimization was achieved by just simplifying the commit code. It used to copy the enlisted fields HashSet into a List, instead of iterating over it directly, so that in case of conflict it can iterate backwards, rolling back only the potentially locked fields, and then completing the roll-back out of lock. However, the list allocation and copying is actually slower than any gain there, particularly given that conflicts are not that common. (Seriously, they are quite uncommon. A test launching 10,000 Tasks each trying to increment one of 100 int fields completes with just ~10-15% repetitions due to conflicts.) A fine example of premature optimization.

Here are the new results, with some new data points not collected before (and again, Mono 2.10, Ubuntu 13.10 64 bit, i5-2430M @ 2.4 GHz, 4 GB RAM):

cost of empty transaction = 1.050 us
cost of the closure in InTransaction<T> = 0.190 us
cost of the first read = 0.720 us
cost of an additional read = 0.104 us
cost of Modify after read = 0.970 us
cost of Assign after read = 1.005 us
cost of the first Modify = 1.790 us
cost of an additional Modify = 0.051 us
cost of a Read after Modify = 0.042 us
cost of the first Assign = 1.635 us
cost of an additional Assign = 0.048 us
cost of the first commute = 3.975 us
cost of an additional commute = 0.904 us

Performance is much better on almost all points. Read-transactions are more or less the same, but reducing the calls to Shield.Enlist has reduced repeated access to an already modified field down to ~50 ns, and the cost of the write operations themselves is ~3x smaller. Even the commute is faster, although it is still more expensive.

For comparison, here are the results of the same test, on the same machine, but executed in a virtual Windows machine on Microsoft .NET 4.5:

cost of empty transaction = 0.485 us
cost of the closure in InTransaction<T> = 0.030 us
cost of the first read = 0.210 us
cost of an additional read = 0.120 us
cost of Modify after read = 1.135 us
cost of Assign after read = 1.235 us
cost of the first Modify = 1.320 us
cost of an additional Modify = 0.058 us
cost of a Read after Modify = 0.039 us
cost of the first Assign = 1.245 us
cost of an additional Assign = 0.055 us
cost of the first commute = 1.865 us
cost of an additional commute = 0.235 us

Some numbers are pretty similar, but there are also some striking differences. Empty transactions are twice as fast. Cost of the closure is ridiculously small compared to the same on Mono. And you may notice that every “first” operation on a field is much faster on MS.NET, while any second operation on the same field is roughly equally fast. It matters little whether the second operation is a read or a write, which indicates that the bookkeeping is probably causing the expense. Also interesting is that writes after a read are persistently a little slower on MS.NET, which I can’t explain (but note that the sum of empty trans + first read + a modification, the total time of one simple read-then-write transaction, is still better).

Bookkeeping and the closure are operations that involve allocating objects, so I presume that the fault for the slower score lies in Mono 2.10’s conservative garbage collector. I’m looking forward to Ubuntu 14.04, which will be packing Mono 3.2 with his generational garbage collector (which is included in 2.10 as an option, but is just as slow there, and in some tests much slower) and hopefully fixes for the bugs in concurrent collections.

So, a million empty transactions per second, several hundreds of thousands of simpler transactions per second, and the most complex test, the BetShop (note that it uses structs, it is older than the class proxy-generator feature), running at several tens of thousands. Plus, repeated access to a field is close to negligible. Given the benefits that Software Transactional Memory can bring to more complicates projects, I think this is OK. For now, at least.

Benchmarking an STM

While recently discussing with a colleague about a project we’re working on, I mentioned how it might be a good fit for using an STM. I actually think it is a perfect fit. Working on that project is what spurred my interest in the STM concept itself and inspired me to write Shielded. However, the colleague disagreed, and made a remark which struck me as significant. He’s a talented guy, very pragmatic, with a keen gift for smelling problems way before most programmers. His Titanic would see that fatal iceberg even before leaving Southampton. So, the remark?

“If we use it, every access to a field has to go through it.”

Now, granted, this is not big news. The fact that STM slows things down is well known and is probably what makes the concept unappealing to a lot of programmers. But what particularly struck me is the phrase “every access”. A complex system is sure to make a lot of reads and writes within one transaction, many of those hitting the same fields. Cost of the first access in STM should compare well with the cost of taking an ordinary lock, because an STM is roughly equivalent to perfectly granular locking. (And consider how difficult it is to do perfectly granular locking…) But in locking solutions, additional access to a field is negligible.

So, I decided I should measure and see how much it costs to access a field once, and how much to access it after that, in the same transaction. The test I wrote can be seen here. (Please forgive the mess, the ConsoleTests project is a quick test bed, I don’t bother much with keeping it tidy.) Here are the cost calculation results:

cost of empty transaction = 0.890 us
cost of the closure in InTransaction<T> = 0.270 us
cost of the first read = 0.920 us
cost of an additional read = 0.100 us
cost of the first Modify = 5.470 us
cost of an additional Modify = 0.084 us
cost of a Read after Modify = 0.068 us
cost of the first Assign = 8.370 us
cost of an additional Assign = 0.648 us
cost of the first degenerated Assign = 5.280 us
cost of an additional degenerated Assign = 0.563 us

This is all in microseconds. Only the first non-zero digit should be considered relevant, due to variations between test runs. The test was executed on my laptop – an i5-2430M processor ticking at 2.4 GHz, with 4 GB RAM, running Ubuntu 13.10 (64 bit) and Mono

Empty transactions is a simple measure – just a NOP delegate passed into the InTransaction method. The “cost of the closure in InTransaction<T>” refers to a handy overload of the same method which takes a Func<T> instead of an Action, runs it transactionally and returns it’s final result. This comes in very handy. To do this, it has to create a closure, which it passes to the non-generic InTransaction. The measure indicates the additional cost of creating that closure. I’m not very happy with that score, but there’s nothing I can do about it, it is the result of how this is implemented in Mono. On the plus side, it means some of the key Shielded operations are close to the same order of magnitude as allocating a simple closure.

Further results indicate how much it costs to perform certain operations for the first time, and for every additional time, within a transaction. Note that this is not just the running time of the particular method call – adding a field to a transaction’s footprint means that some pre-commit checking will now also be done, depending on the operation.

The cost of an additional access is roughly an order of magnitude smaller than the cost of the first access. After a Modify call it is even better. This might surprise you, but bear in mind that a write prepares a field’s thread-local storage. This is probably the most expensive part of it, but any future read or write is then executed directly against that local storage, making it faster.

Overall, the results seem OK. Something, however, completely sticks out, and that’s the performance of Assign. As a commutable operation, introducing it into a transaction includes the commute sub-transaction mechanism, and the hit this introduced is big. Also, note how, since commutes involve delegates and the creation of closures (specifically, at least two closures will be allocated), additional Assign calls also remain slow when compared to additional Modify calls, whose delegates in the test were not closing over any local variables. It cannot go below the cost of a closure. When degenerated (when the field was read in the transaction, which disables commuting over it) it executes directly, and it’s first call performance is roughly equal to Modify, as would be expected. But additional calls again remain too slow – again, most likely, due to the closures.

To address this, Assign will be made non-commutable, an ordinary op, which should make it faster than Modify. The current method will remain, only bearing a different name, for use when commutability is actually needed. Commutables are very useful when they don’t degenerate, since the price of a needless conflict is far bigger than these microseconds here.

Let me know in the comments what you think about these results, how you feel about the costs, and if you see something here that I have missed. Would you consider using Shielded?

Why I still like obstruction freedom

Recently I’ve tried to change Shielded to use encounter-time locking. As opposed to the current implementation, which employs commit-time locking, this new version would lock an object on every write attempt. This would certainly speed it up, I figured. No IShielded would have to have thread-local storage for it’s temporary data – if a thread locks a field, it can write directly into some private members*. Commit time checking simplifies, knowing that every changed field must already be locked. Sounds great.

The first thing that broke was a test. Some of them would create and wait for a thread which is in deliberate conflict with them. After changing to encounter-time locking, this produced a deadlock. This is, of course, not a problem, since no thread should do that 😉

But, one thing would not fit into this concept well – the commutes. For as long as a thread keeps a field locked, no other thread can do anything with it, not even a commute. This seriously undermines the concept. The results of the tests might have indicated this, with performance becoming less smooth, the speed varying more during a longer test. (Shielded is full of commutable ops, most notably on Count fields of the included collections.)

Locking a field to make sure a longer running transaction succeeds is easy to do. A Changed event handler which throws (or, I don’t know, Monitor.Wait-s) for all but one thread would do the trick just fine. The encounter-time version was, basically, doing this to EVERY field, and producing very little gain otherwise.

Ennals argues that obstruction freedom is an unnecessary requirement for an STM, and that it could be made faster by removing it. Although much of the argument is true, and an encounter-time solution would be faster, the paper creates another problem. It requires that the value of a field be available without any indirection, in order to reduce cache misses. This makes it impossible to have MVCC, which in turn means that reader progress cannot be guaranteed. A thread which only reads some data will, in the commit-time locking version, proceed without even trying to enter the global lock. With encounter-time locking and the no-indirection requirement, it must block on conflict with a writer, and then be restarted!

Another claim in the paper is made, which seems almost ridiculous. The paper says that obstruction freedom means we are unable to control the number of concurrently executing transactions, and must allow for all N of them to execute in parallel. In all practical settings, nobody would be prevented from counting and limiting the number of parallel transactions simply to be so thoroughly obstruction-free. You are free to start and run them at any pace you please. (See Queue.cs for a simple example.)

All in all, commutes and reader progress are the reasons why Shielded still employs commit-time locking, and is obstruction-free during a transaction run. If needed, locking and parallelism control can easily be added to ensure proper prioritization.

* Something similar was implemented in LocalStorage recently. If there’s only one thread changing a field, it will use a private member for storage.