An update on STM speed

This is a follow-up to the previous post, Benchmarking an STM.

Since the last post and its benchmark results, I've been busy with various optimizations on Shielded. In that post I announced that I would be changing the Shielded&lt;T&gt;.Assign method to no longer be commutable, due to the cost this incurs, but I've also managed to find many other improvements, which you can check out in the project commit log. By far the biggest improvement came from optimizing the process of triggering conditional transactions.

A small but interesting optimization came from simply simplifying the commit code. It used to copy the HashSet of enlisted fields into a List instead of iterating over it directly, so that in case of a conflict it could iterate backwards, rolling back only the potentially locked fields first and completing the rest of the roll-back outside of the lock. However, the list allocation and copying actually cost more than that ever gained, particularly given that conflicts are not that common. (Seriously, they are quite uncommon: a test launching 10,000 Tasks, each trying to increment one of 100 int fields, completes with only ~10-15% of transactions repeating due to conflicts.) A fine example of premature optimization.
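
For illustration, here is a minimal sketch of such a conflict test, assuming the API names used in this post (Shield.InTransaction, Shielded&lt;T&gt;.Modify); the exact delegate signatures may differ between versions of the library:

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Shielded;

class ConflictTest
{
    static void Main()
    {
        // 100 transactional int fields.
        var fields = Enumerable.Range(0, 100)
            .Select(_ => new Shielded<int>())
            .ToArray();
        int repetitions = 0;

        // 10,000 tasks, each incrementing one of the fields inside a transaction.
        Task.WaitAll(Enumerable.Range(0, 10000)
            .Select(i => Task.Run(() =>
            {
                bool firstRun = true;
                Shield.InTransaction(() =>
                {
                    // The transaction delegate re-runs on conflict, so any
                    // run after the first counts as a repetition.
                    if (!firstRun)
                        Interlocked.Increment(ref repetitions);
                    firstRun = false;
                    fields[i % 100].Modify((ref int x) => x++);
                });
            }))
            .ToArray());

        Console.WriteLine("repetitions: {0} (~{1:P0} of transactions)",
            repetitions, repetitions / 10000.0);
    }
}
```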

Here are the new results, including some data points not collected before (again: Mono 2.10, Ubuntu 13.10 64-bit, i5-2430M @ 2.4 GHz, 4 GB RAM):

cost of empty transaction = 1.050 us
cost of the closure in InTransaction<T> = 0.190 us
cost of the first read = 0.720 us
cost of an additional read = 0.104 us
cost of Modify after read = 0.970 us
cost of Assign after read = 1.005 us
cost of the first Modify = 1.790 us
cost of an additional Modify = 0.051 us
cost of a Read after Modify = 0.042 us
cost of the first Assign = 1.635 us
cost of an additional Assign = 0.048 us
cost of the first commute = 3.975 us
cost of an additional commute = 0.904 us

Performance is much better on almost all points. Read-only transactions are more or less the same, but reducing the calls to Shield.Enlist has brought repeated access to an already modified field down to ~50 ns, and the write operations themselves are ~3x cheaper. Even the commute is faster, although it remains the most expensive operation.
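
For readers who skipped the previous post: a commute is a write whose delegate is deferred and executed at commit time, which is why it carries extra bookkeeping. Here is a minimal sketch of the difference, assuming Shielded&lt;T&gt;.Commute takes the same delegate shape as Modify (the exact signature may differ between versions):

```csharp
using Shielded;

static class CommuteExample
{
    static readonly Shielded<int> _counter = new Shielded<int>();

    static void OrdinaryIncrement()
    {
        // A regular write: the field is read and written in place, so
        // concurrent increments of the same field conflict and repeat.
        Shield.InTransaction(() =>
            _counter.Modify((ref int c) => c++));
    }

    static void CommutedIncrement()
    {
        // A commuted write: the delegate is deferred to commit time, so two
        // transactions incrementing the same counter need not conflict.
        Shield.InTransaction(() =>
            _counter.Commute((ref int c) => c++));
    }
}
```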

For comparison, here are the results of the same test, run on the same machine but in a Windows virtual machine on Microsoft .NET 4.5:

cost of empty transaction = 0.485 us
cost of the closure in InTransaction<T> = 0.030 us
cost of the first read = 0.210 us
cost of an additional read = 0.120 us
cost of Modify after read = 1.135 us
cost of Assign after read = 1.235 us
cost of the first Modify = 1.320 us
cost of an additional Modify = 0.058 us
cost of a Read after Modify = 0.039 us
cost of the first Assign = 1.245 us
cost of an additional Assign = 0.055 us
cost of the first commute = 1.865 us
cost of an additional commute = 0.235 us

Some numbers are pretty similar, but there are also some striking differences. Empty transactions are twice as fast. The cost of the closure is ridiculously small compared to the same on Mono. And you may notice that every “first” operation on a field is much faster on MS.NET, while any second operation on the same field is roughly equally fast on both, and it matters little whether that second operation is a read or a write, which indicates that the bookkeeping done on first access is probably what costs. Also interesting is that writes after a read are consistently a little slower on MS.NET, which I can't explain. (But note that the sum of empty transaction + first read + Modify after read, i.e. the total time of one simple read-then-write transaction, is still better: 0.485 + 0.210 + 1.135 = 1.830 us on MS.NET vs. 1.050 + 0.720 + 0.970 = 2.740 us on Mono.)

Bookkeeping and the closure are operations that involve allocating objects, so I presume the fault for the slower score lies with Mono 2.10's conservative garbage collector. I'm looking forward to Ubuntu 14.04, which will be packing Mono 3.2 with its generational garbage collector (included in 2.10 as an option, but just as slow there, and in some tests much slower), and hopefully fixes for the bugs in the concurrent collections.

So: a million empty transactions per second, several hundred thousand simpler transactions per second, and the most complex test, the BetShop (note that it uses structs; it is older than the class proxy-generator feature), running at several tens of thousands per second. Plus, repeated access to a field is close to negligible. Given the benefits that Software Transactional Memory can bring to more complicated projects, I think this is OK. For now, at least.
