
An Elegant Hack

What is the ideal piece of code? An Elegant Hack.

Elegant code is clean, nicely organized code which separates all concerns, properly names everything, has methods that don’t exceed a screen in length, uses dependency injection everywhere… It can, however, get repetitive, bloated, and costly to maintain. A hack, on the other hand, is a quick and dirty solution – something that gets the job done, but looks terrible, and maybe isn’t quite stable. It is, however, short and to the point.

Striking a balance between these two is the ideal. Code that can be written quickly, yet is very clean and neat. It is short, but reads like sentences of English. If anything needs changing, all concerns are neatly separated. New developers can quickly grasp what is going on and how to do things. Nothing is too complicated, or over designed. Any kind of API developed has a small surface and nudges, or even forces you into the right direction. Components are tiny and well isolated.

A thing worth considering when making design decisions is that, of the many choices available, some may be easily upgradeable. A simple solution might serve well in the beginning, and if it is well isolated, a later upgrade to a more robust choice is relatively painless. In type-checked languages the compiler can help you quickly locate the places that need changing. Using a light wrapper from the get-go may be even better. A common complaint against static classes and methods in C# is that they don’t allow dependency injection, thus making testing more difficult. But a static class acting as a facade over a mechanism which makes sense to have globally accessible can easily have its implementation changed and delegated to injected dependencies. At the same time it can make the code elsewhere more readable, and every dependency on the mechanism easy to locate. Always question the dogma! Considering such options is worth it – it combines development efficiency with the agile principle of delaying commitment: don’t make any decisions too soon which might limit your options in the future.
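
As a sketch of that facade idea – all names here are hypothetical, not from any particular codebase – the static class stays a thin, readable entry point, while the real work is delegated to a swappable implementation:

using System;

// A hypothetical example: a globally accessible clock behind a static facade.
// Production code reads Clock.UtcNow; tests inject a fake implementation.
public interface IClock
{
    DateTime UtcNow { get; }
}

public static class Clock
{
    private static IClock _impl = new SystemClock();

    // The seam for tests, or for a later upgrade to a more robust implementation.
    public static void SetImplementation(IClock impl)
    {
        _impl = impl;
    }

    public static DateTime UtcNow
    {
        get { return _impl.UtcNow; }
    }

    private class SystemClock : IClock
    {
        public DateTime UtcNow
        {
            get { return DateTime.UtcNow; }
        }
    }
}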

Elegance should always be the goal. As Einstein said, things should be made as simple as possible, but no simpler. Making code maintainable – easy to understand, change, and upgrade – is exactly the same as making code that developers will love to read: code that is interesting, systems that don’t require much mental effort to comprehend. Never too simple, nor too complex.

The term “an elegant hack” is a contradiction, which it must be if it is to come even close to capturing the frustration involved in writing good, clean code. Finding the right design to capture the essence of the problem in an understandable and manageable way is difficult, particularly given the hectic nature of the software development process, with requirements constantly changing and deadlines always coming too soon. Developing a piece of software is a bit like conducting research. You cannot predict it fully. You discover things along the way which take you down wholly unexpected paths. Many of your assumptions turn out to be wrong, and many design decisions later cause headaches. To avoid creating a codebase that developers will complain needs to be written from scratch, remember this principle – the ideal code is an Elegant Hack.


Comparing STM with optimal locking

Software Transactional Memory (STM) is a useful, but expensive tool. It eliminates whole classes of problems inherent to multi-threaded programming, making the developer’s work easier, but its overhead is too much for many projects. The cost of keeping track of access to transactional fields is high, particularly for performance-critical systems, or for all those whose safety can be ensured by simple locking. To better understand its potential applicability, it may be worthwhile to look more closely at how STM compares to a lock-based solution. This post will try to do that, specifically looking at the behavior of the Shielded STM library.

The run of an STM-based program will, in effect, be equivalent to an optimally granular locking solution. Consider the simplest case: two threads “simultaneously” writing into one field. Only one will succeed on the first try – the one that enters the commit check first. (This is specific to commit-time locking variants of STM, like Shielded.) If we observe only those transaction repetitions that end successfully, we can see that they must execute serially. If their runtimes overlapped while writing into the same field, the STM would detect the conflict and discard one of the overlapping runs. The runs that leave an effect are thus clearly serialized.
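
To make that simplest case concrete, here is a minimal sketch using the library’s API as it appears in these posts (Shield.InTransaction and the Shielded<T> Value property):

using System.Threading.Tasks;
using Shielded;

var counter = new Shielded<int>();

// Both transaction bodies run in parallel; the one that enters the commit
// check first wins, the other detects the conflict on counter, discards
// its run and repeats. The successful runs are serialized.
Task.WaitAll(
    Task.Run(() => Shield.InTransaction(() => { counter.Value = counter.Value + 1; })),
    Task.Run(() => Shield.InTransaction(() => { counter.Value = counter.Value + 1; })));

// counter.Value is now 2 – no increment is ever lost.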

How do we know STM is so optimal? It automatically detects conflicts, and discards only those runs that actually conflict with another, successful one. If two transactions access different fields, they will run with only minimal interaction, and commit on the first go. To achieve this with locking, you must lock with optimal granularity. If you lock too broadly, you cause needless waiting.

The main comparative costs of the STM solution are the need for tracking access to fields, and, of course, the cost of the runs that end up discarded. Both of these are non-existent in a lock-based solution. Taking and releasing locks is generally fast, after which any access is roughly as fast as in non-concurrent code. And the discarded runs are gone, replaced with sleeping threads – a lock based solution allows just one thread to run at a time, while others wait for the locks to be released.

Since Shielded does MVCC and commit-time locking, it is best compared to an optimal locking solution that employs read-write locks. This follows from the fact that reads generally proceed unobstructed (in MVCC, writes just create new versions…), and that read-only transactions always complete without repetitions. A locking solution can only come close to this by using read-write locks, allowing multiple readers to run in parallel, and blocking only when one of them wants to write. This will, of course, further complicate things. And it’s important to note that a read-write lock puts everyone to sleep when one thread wants to write, while Shielded can still allow parallel read-only transactions to proceed, with only minor blocking under some circumstances, necessary to guarantee consistent reads.
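
For illustration, the closest lock-based counterpart in .NET would be built on ReaderWriterLockSlim, along these lines:

using System;
using System.Threading;

var rwLock = new ReaderWriterLockSlim();
int balance = 0;

// Readers may run in parallel with each other...
rwLock.EnterReadLock();
try { Console.WriteLine(balance); }
finally { rwLock.ExitReadLock(); }

// ...but a writer blocks everyone, readers included, for its whole
// duration – whereas an MVCC STM lets read-only transactions proceed
// against the old version while the write is in flight.
rwLock.EnterWriteLock();
try { balance += 100; }
finally { rwLock.ExitWriteLock(); }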

But the equivalence between STM and optimally granular locking does not actually hold, because a locking solution can at best match an STM solution, and will often be worse. Naturally, I’m ignoring the cost of needless runs here, but bear with me – it is still interesting to note that achieving optimally granular locking is not only hard, it is sometimes impossible. The simplest example: a transaction that reads one field, and then decides, based on the first field’s value, whether to access another field. Locking at access time cannot safely be done “manually” – if another transaction does the same, but accesses the fields in the opposite order, the result is deadlock. In practice, you will lock both fields up front, needlessly blocking some runs of these transactions which would not have actually conflicted, or perhaps resort to some optimistic approach, retrying the whole thing if you end up needing more locks than you took. Neither approach is as good at correctly, minimally serializing these transactions as an STM would be.
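
A sketch of that example with manual, access-time locking (the fields and locks are hypothetical); thread A runs this code while thread B runs the mirror image, taking lockB first:

object lockA = new object(), lockB = new object();
int fieldA = 0, fieldB = 0;

// Thread A:
lock (lockA)
{
    if (fieldA > 0)          // the decision depends on the first field,
        lock (lockB)         // so lockB can only be taken now – if thread B
        {                    // already holds lockB and wants lockA, both
            fieldB = fieldA; // threads wait on each other forever: deadlock
        }
}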

To top it off, both of the above approaches suffer from the classic problems of lock based solutions – weak controls on developer behavior, allowing access to fields without taking all the needed locks, and the inability to easily compose smaller transactions into bigger ones, which is probably the most touted capability of STM. No matter how many smaller operations you stuff together in a big one, with an STM conflicts will be correctly determined and the transactions will run as parallel as possible.
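
A small sketch of that composability, relying on the nesting of InTransaction calls (mentioned elsewhere in these posts); the account fields and amounts are hypothetical:

using Shielded;

// Each operation is atomic on its own...
static void Withdraw(Shielded<int> acc, int amount)
{
    Shield.InTransaction(() => { acc.Value = acc.Value - amount; });
}

static void Deposit(Shielded<int> acc, int amount)
{
    Shield.InTransaction(() => { acc.Value = acc.Value + amount; });
}

// ...and composing them is just a matter of nesting them in an outer
// transaction. No lock ordering to design, and the whole transfer
// commits or retries as a single atomic unit.
static void Transfer(Shielded<int> from, Shielded<int> to, int amount)
{
    Shield.InTransaction(() =>
    {
        Withdraw(from, amount);
        Deposit(to, amount);
    });
}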

The issue of needless runs is important, but locks have an issue of their own in sleeping threads. An interesting difference between Shielded transactions and ordinary lock-based ones shows up when you consider several threads trying to run the same transaction in parallel. Since Shielded locks only at commit time, they will all get to run; the one that enters the commit check first wins. A lock-based solution takes locks for a longer span of time, typically taking them up front and holding them until fully complete. Should the thread that holds the locks be blocked, all of them are blocked, and the likelihood of this happening is much higher with a regular lock-based solution. Shielded does use locks (one global regular lock, and individual, version-based partial locks per field), but for a shorter time, and never during a transaction run. So most of the time, exactly by allowing useless runs, STM gives the advantage to threads that are currently running, which may increase overall performance. But this gain is significantly offset by the very time spent on the needless runs.

Highly contended systems will utilize CPU time better with locks than with Shielded transactions. This should be obvious, since the needless runs, though maybe speeding up the resolution of one conflict, are slowing down the resolution of another if they crowd out non-conflicting, “useful” threads from the CPU. Other STM implementations, specifically the ones using encounter-time locking, do not suffer as much from this problem, but they cannot provide the same level of unobstructed reading, which directly depends on not holding locks for any longer than needed.

In conclusion, the overall picture that all of this paints is not very favorable for STM. It isn’t the concurrent programming panacea it was once believed to be. But I hope this text illustrates some of the specifics of its behavior well enough to make clear the trade-offs that come with the decision to use STM. The advantages are clear – much easier writing of correct concurrent code, effortlessly achieving optimal parallelization. Complex multi-threaded systems may reap benefits in reduced development effort. Yet the cost is there – transactional code is slower, and the useless runs hurt performance under high contention. The sweet spot for STM is primarily systems which can afford a performance hit in exchange for easier and safer multi-threading, and also systems with complex transactions whose varying access patterns are difficult to predict. Perhaps any time you find yourself doing canonical lock ordering, you might consider using an STM instead. All in all, it is a very useful tool, but one that should be applied only where it makes sense.

Is Functional Programming really slow?

There is a widespread understanding that functional programming produces lower performance. After all, several factors key to the concept have indubitable costs – garbage collection is a must, and the commonly used persistent data structures are slower, to name a few. This probably led Alan Perlis to comment that “LISP programmers know the value of everything, and the cost of nothing.”

It is certainly true that imperative code can be faster. Anything done functionally can be mapped to an equivalent imperative program, while many well optimized imperative programs probably do not map to anything that could pass as idiomatic functional programming. But the interesting question is: do we always write that well optimized imperative code?

I have recently been doing a lot of data conversion, from one format to another. Since the logic is relatively complex, you write many classes, each in charge of some part of the format. Nothing surprising. I have seen several approaches, used by different programmers, and the interesting bit is that an approach which can be characterized as functional has been the most performant in practice.

The standard, imperative and object-oriented approach, though probably much closer to optimal at the level of a single method, ends up doing a lot of allocations. Many small converter objects get created. (This can be avoided by using static methods, though that leads to an unpleasant explosion of function parameters everywhere.) The results are typically passed in lists or arrays, which leads to a lot of collection allocations. The various collections do not live long; they are soon merged into a larger collection, which may expect a similar fate further up the call chain. To avoid this obvious cost, a single collection could be allocated once and passed around, getting filled along the way. Yet this tends to break encapsulation, so it will often be avoided. All of this can be worked out and an optimal solution achieved, but it requires careful thought, and it would possibly violate some basic principles of object-oriented programming.

The more functional approach was to rely on laziness. C# has for a long time had generator functions. If the function you’re writing returns an IEnumerable<T> (the standard .NET interface for a sequence, implemented by all collection classes), the keywords “yield return” allow you to return items one by one, as they get requested, when the resulting enumerable is iterated. Care is required, since enumerating such a sequence twice will execute the whole logic twice. The result does not get stored. In fact, it can easily be different! However, if used correctly, generators allow you to avoid allocating temporary collections.
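
A sketch of the difference; Record and ConvertOne are hypothetical stand-ins for the actual format types:

using System.Collections.Generic;

// Eager version: allocates a list at every level of the call chain,
// each soon merged into a bigger one and discarded.
static List<string> ConvertEager(IEnumerable<Record> records)
{
    var result = new List<string>();
    foreach (var r in records)
        result.Add(ConvertOne(r));
    return result;
}

// Lazy version: items are produced one by one, as the final consumer
// iterates – no intermediate collections at all. But beware: iterating
// the result twice executes the whole conversion twice.
static IEnumerable<string> ConvertLazy(IEnumerable<Record> records)
{
    foreach (var r in records)
        yield return ConvertOne(r);
}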

The converters written with the functional approach were generally much faster than their more imperative counterparts, and consumed less RAM. All of them had parts written in both styles, and this is all a shaky estimate, of course. Yet it still indicates that just following FP principles may sometimes produce faster programs.

The OOP/imperative program could have been much faster, but the best practices of OOP do not lead to such a solution. They hardly lead at all. Rather, the best practices offer many paths to take. Some will be optimal, some will not. A professional programmer working on a project will probably not spend too much time bothering with this, nor is it in the interest of the project for them to produce an overly complex solution – other people need to be able to understand and maintain the code. So the tendency toward simple, safe solutions which duly follow the SOLID principles will be strong, and optimality is less likely to be reached.

The functional approach on the other hand tends to push the programmer towards a certain type of code. Static methods (in OOP terms), returning lazy collections, striving for functional purity – these are much stricter rules, leaving less choice, and leading towards a certain type of solution. With simpler problems like these data conversions, it can feel almost like the program writes itself. And though many individual methods seem slow, the overall program is faster, without nearly any effort put into optimizing it.

It is interesting to note that generator methods in C# have an old reputation for being slow. A state object is allocated every time an enumerable is iterated over, which compares badly to, say, a for loop over an array. I have personally felt driven away from using too many enumerables just by this piece of knowledge hanging on in my memory, more like a silly prejudice than a well-thought-out opinion. Obviously, in situations where they enable you to avoid allocating that array, generators must be faster.

This is all based on a small set of examples, all of them programs of the same, specific kind. It’s impossible to draw any strong, general conclusions. Yet they raise an interesting question. Yes, imperative programming can be much faster, but, honestly – how often will we actually write that optimal imperative program?

In many areas, where every last bit of performance is needed, the imperative approach will win easily. (Though, reading John Carmack on the topic of FP, with his background in game programming, makes the question look much more complicated.) But for the everyday programmer, working on everyday problems in that typical enterprise(y) setting, the functional approach may even lead us to produce more performant code, despite not entirely unreasonable expectations to the contrary. Given all the other benefits functional programming brings, which have been written about extensively (“Why Functional Programming Matters” is one of the more famous texts on the matter, it’s a very good read), being more functional in our work seems a no-brainer. Not because we couldn’t write the faster imperative program, but rather, let’s face it – we probably won’t.

Parallel committing

If you were to use transactional memory in a project, you would likely also need some kind of backing, persistent storage for the data held by it. An STM can be used in memory only, coordinating access to some temporary shared data, but adding the ability to synchronize it with a backing storage makes it much more useful. This is what the new version of Shielded is about.

The new development was actually inspired by trying to find the simplest way to implement data distribution in Shielded – to connect multiple servers in a cluster, and have distributed in-memory transactions. This is something I have been asked about a couple of times. It is then only natural to also think about connecting to an external database. This could be an SQL or NoSQL database, or something like Redis. It seemed best to add a general mechanism allowing different kinds of distribution implementations to be plugged in easily. This distilled further down to one crucial feature – plugging arbitrary code into the commit process.

The Shield static class now has new methods, called WhenCommitting, which enable just that. You can now subscribe a method to be called from within the commit process. This method gets called after a commit is checked and allowed to proceed, but before any write actually occurs. During this time, the individual shielded fields are locked. Whatever your method is doing, you are guaranteed that other transactions, which depend on the outcome of your transaction, are waiting for you to complete. You are safe to make changes in an external database, or to publish changes to other servers, whatever you wish. You are also allowed to roll the current transaction back, causing a retry, and to make further changes to the involved fields – but only to those which were already changed in the main transaction (since only they were checked and are allowed to commit).
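
As a hedged sketch of how such a subscription might look – the exact delegate signature here is my paraphrase of the description above, and SaveToDatabase is a hypothetical method, so consult the library itself for the precise API:

using Shielded;

Shield.WhenCommitting(fields =>
{
    // Runs after the commit check passes, while the written fields are
    // still locked – the external store sees changes in exactly the
    // order in which they commit in memory. Throwing here, or rolling
    // back, aborts the transaction and causes a retry.
    SaveToDatabase(fields);
});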

Previously, one would have to use Conditional subscriptions to achieve something like this. The conditional could then use the SideEffect method to execute external changes. But between the transaction, the conditional triggered by it, and the side effect, other threads could jump in, see the new changes, and even make further ones. Although a lot can still be done like this, it lacks the simplicity and strong ordering guarantees provided to WhenCommitting subscriptions. As part of the commit process, you are guaranteed that your external changes are executed in the exact order as in the shielded fields themselves. No additional methods of synchronizing are needed.

Of course, executing arbitrary code during a commit is a dangerous game. These methods could easily take a lot of time to complete, keeping some shielded fields blocked. Up to this point, Shielded has by default used spin-waiting while fields are locked in a commit. That is not possible any more. It could be changed with a compiler symbol, SERVER, to use Monitor.Wait and PulseAll to wait for the fields to unlock, but this is too expensive when not needed. So an important change was made under the hood. The new StampLocker class implements an adaptive locker, which spin-waits first, and after a couple of yields switches to Monitor.Wait/PulseAll. With quick, in-memory transactions it will mostly spin-wait, giving better performance, but a longer wait will cause it to switch to Monitor waiting, which saves us from wasting CPU time. In tests which involve thread sleeps during a commit, using Monitor waiting also increases throughput.
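
This is not the actual StampLocker source, just a minimal sketch of the spin-then-block idea it implements:

using System;
using System.Threading;

class AdaptiveLocker
{
    private readonly object _sync = new object();

    // Spin first – cheap for quick, in-memory commits – and after a couple
    // of yields switch to Monitor.Wait, so a long wait does not burn CPU.
    public void WaitUntil(Func<bool> unlocked)
    {
        var spin = new SpinWait();
        int yields = 0;
        while (!unlocked())
        {
            if (!spin.NextSpinWillYield || ++yields <= 2)
            {
                spin.SpinOnce();
                continue;
            }
            lock (_sync)
                while (!unlocked())
                    Monitor.Wait(_sync);
            return;
        }
    }

    // Whoever unlocks the fields must wake any blocked waiters.
    public void Release()
    {
        lock (_sync)
            Monitor.PulseAll(_sync);
    }
}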

I believe this could be the most significant feature in Shielded. The library can now interop with arbitrary other systems, greatly expanding its possible uses. It can probably do a great job as a wrapping layer around a database, a kind of active caching, allowing faster execution of changes in the data, and particularly faster reads. But with a simple, general mechanism like this, there are many possibilities. Like for example combining Shielded with a distributed consensus implementation, and achieving data distribution. Or even combining both.

I hope you will find the library more useful now. If anyone does anything interesting with it, and is willing to share, please do!

Thank you, Mono!

This post is a long overdue homage to the wonderful work done recently by The Mono Project team.

The first thing I want to commend is the fact that installs have been made easier. Ironically, installing Mono on Linux used to be a bit more difficult than installing it on Windows, with the Mono site simply informing you what ancient version your distribution has in its official packages. Good luck waiting for an upgrade! I wanted a working, stable but up-to-date version, which involved finding the appropriate Ubuntu PPA myself… Now the site gives instructions for adding an official PPA, and they work wonderfully. The PPA also upgraded MonoDevelop, and I’m really enjoying version 5 of the IDE. The new GUI put me off back when I tried it out for the first time. Now I feel it’s actually quite distraction-free, elegant and to the point.

So that’s wonderful. But the real reason for this post are the improvements to the performance of the Mono platform.

This is how the Shielded library performs when doing serial micro-transactions now:

cost of empty transaction = 0.180 us
cost of the closure in InTransaction<T> = 0.025 us
cost of the first read = 0.145 us
cost of an additional read = 0.048 us
cost of Modify after read = 0.765 us
cost of Assign after read = 0.680 us
cost of the first Modify = 0.800 us
cost of an additional Modify = 0.045 us
cost of a Value after Modify = 0.038 us
cost of the first Assign = 0.840 us
cost of an additional Assign = 0.041 us
cost of the first commute = 1.470 us
cost of an additional commute = 0.388 us

Small reminder: the purpose of the serial test is to roughly determine the overhead of STM in non-conflicting situations. The test runs a large number of transactions of various kinds, measuring their time, and based on the differences estimates the total cost (not method running time!) of using a certain operation in a transaction.

Now, it’s been some time since the last post on the performance of Shielded, and the gains here do not come just from the Mono upgrade. The most notable improvement in the library itself is the new lock-free version tracking mechanism, which naturally unified the tracking of versions still being read with the trimming of unnecessary old copies. Along with that, a custom hash set implementation (faster than the default HashSet<T>, since Shielded does not need removal) and changing some foreach loops to for loops have also brought performance gains. (BTW, the test output above still mentions the method Assign, which was removed some time ago. Shielded<T> now has a property Value with a getter and setter.)

But a lot of the difference is just due to Mono. I don’t have the old results just before the upgrade to let the difference speak for itself, but I can tell you I was blown away. After that upgrade, the performance on this test is now on par with its performance on the Microsoft .NET Framework. And actually – the Mono version is faster! I mean, it runs a million simple one-write transactions per second. I test it on an Ubuntu laptop, with an older i5 CPU, and compare it with a relatively new i7 machine running Win8, at a higher CPU frequency, and the difference is there. Win8/.NET is about 30% slower, despite the stronger CPU.

Not all tests are like this, though. The ConcurrentDictionary in Mono is still very slow compared to the MS.NET one, but apparently the Mono Project is now using the reference source of concurrent collections, and I presume the next version to come out will fix this, and perhaps beat MS.NET in all the tests.

A part of the reason for this win is probably in the Linux kernel. But nonetheless, the latest versions of Mono are impressive! With this platform now sporting a healthy generational garbage collector and performance like this, I think it deserves to be taken very seriously.

Kudos, Mono team!

The power of STM semantics

Software Transactional Memory is a concept which offers many advantages when writing concurrent code. Most importantly, it removes the burden of writing correct multi-threaded code from the programmer, allowing her to focus more on the logic of the problem domain. STM will automatically take care of correctness. However, this comes with a cost – the bookkeeping. Tracking every access to a transactional field in order to properly detect and resolve conflicts carries a performance penalty. But, at the same time, it also introduces the possibility to create and expose to the programmer some very interesting new semantics.

Shielded has for some time already contained one such semantic addition – the Shield.Conditional static method. With it one may define a transaction which should run whenever a certain condition is satisfied by the current, committed “state of the world”. By virtue of its bookkeeping, Shielded can determine the exact fields your condition depends on. This way it easily knows, when another transaction commits into one of those fields, that your condition might have changed, and will re-evaluate it. (It will even track your condition for any changes in its access pattern between calls. This is important since, due to boolean expressions’ short-circuit evaluation, it is very easy for a condition to change its access pattern, and that must be taken into account.)
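
A minimal sketch – the exact overload may differ, and Read/Assign are the accessors as they appear in this post’s snippets:

using Shielded;

var queueCount = new Shielded<int>();

// The body runs, as its own transaction, whenever a commit makes the
// condition true. Shielded tracks that the condition reads only
// queueCount, and re-evaluates it when a commit changes that field.
Shield.Conditional(() => queueCount.Read > 100, () =>
{
    // react to the queue growing too long, e.g. by resetting the counter
    queueCount.Assign(0);
});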

This feature was inspired by the Haskell STM’s “retry” command, but it does not go as far. The Haskell “retry” can be called anywhere in a transaction, and it will block the transaction until any of the fields it accessed before calling retry changes. Shielded does nothing of the sort; it is a deliberate design decision not to include any threading or blocking constructs in the library. The idea is to allow Shielded to easily be combined with any such construct, depending on what the programmer wants to use, and not lock the programmer into a decision made by this library. And so, the Conditional does not even try executing straight away, but merely subscribes a transaction for future execution.

The power of Haskell’s retry and orElse can be achieved by using Reactive Extensions. Combining Shielded with Rx.NET should not be a difficult task – create an IObservable by creating a Conditional which will call methods on an IObserver (tip – make such calls to IObserver as SideEffects of the Conditional transaction, otherwise you might be triggering observers repeatedly, and with non-committed data!). Once you have such an IObservable, the powerful Rx.NET library allows you to easily create very complex time-based logic, which can easily achieve what Haskell’s “retry” and “orElse” offer, and much more. We might call the combination of these two libraries a Reactive Transactional Memory system!
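
A hedged sketch of that bridge, assuming Rx.NET’s Subject<int> as the IObserver/IObservable pair:

using System;
using System.Reactive.Linq;
using System.Reactive.Subjects;
using Shielded;

var x = new Shielded<int>();
var subject = new Subject<int>();

// The Conditional reads x and always returns true, so it fires on every
// commit into x; the OnNext call is wrapped in a SideEffect, so observers
// see only committed values, exactly once per commit.
Shield.Conditional(() => { int a = x.Read; return true; }, () =>
{
    int committed = x.Read;
    Shield.SideEffect(() => subject.OnNext(committed));
});

// From here on, all of Rx.NET is available:
subject.Where(v => v > 10).Subscribe(v => Console.WriteLine(v));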

A recent addition to the Shielded library is yet another powerful semantic tool – the Shield.PreCommit static method. It is very similar to Conditional, but with one very important difference – it will execute within a transaction that plans to affect its condition, and it will execute just before that transaction attempts to commit.

The PreCommit method creates a lot of interesting possibilities. For a couple of examples, have a look at the PreCommit tests. Most importantly, you get the ability to prevent the other transaction from committing. Like in the NoOdds test, for example, where a PreCommit check looks like this:

// Runs inside any transaction that wants to commit an odd number
// into x, just before it commits, and quietly aborts it.
Shield.PreCommit(() => (x.Read & 1) == 1, () => {
    Shield.Rollback(false);
});

This will cause any transaction, that tries to commit an odd number into the field x (which is a Shielded<int>, of course), to roll back and fail.

The Rollback(false) method will quietly abort the transaction, which the code that started the transaction might not expect. It would probably be better to throw an exception, like in the Validation test in the same file. It throws an exception whenever a transaction tries to commit incorrect state, thus guaranteeing that an invariant will always hold!

Last, but by no means least, this introduces the ability to easily and arbitrarily define what Shielded has so far been missing the most – prioritization. It is often pointed out that STM allows easy prioritization schemes to be grafted onto it, completely eliminating the so-called priority inversion problem. To achieve this before, you would have to use the Changed event of a Shielded<> object, and prevent lower-priority transactions from committing from that event’s handler. However, this would require Shielded to expose events for every single piece of information which may be interesting. Instead, by using Shield.PreCommit, you define conditions which can depend on any piece of transaction-aware information. Then, in the PreCommit body, you may check the priority and perhaps block a transaction from committing. Here is a snippet from the above-linked tests:

var x = new Shielded<int>();
int ownerThreadId = -1;
// The condition just reads x and returns true, so the body runs in every
// transaction that writes into x. If a thread has claimed ownership, only
// that thread may commit; all others are rolled back and retried.
Shield.PreCommit(() => { int a = x; return true; }, () => {
    var threadId = ownerThreadId;
    if (threadId > -1 && threadId != Thread.CurrentThread.ManagedThreadId)
        Shield.Rollback(true);
});

This is perhaps the simplest possible prioritization scheme. The condition of the PreCommit just reads the x field and returns true, meaning it will run in every transaction which makes any kind of write into x. It then checks the ownerThreadId, and if any thread has put its ManagedThreadId into that variable, then only that thread will be able to commit. All others will be repeatedly retried, until the field is released (or they stop trying to write into it). It would be possible to insert a SideEffect call before the Rollback, and have an onRollback lambda in it which would cause the thread to block until a certain signal is issued by the thread that owns the field. That would improve this simple solution to eliminate the busy waiting on non-privileged threads.
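
Continuing the snippet above, a sketch of that improvement – assuming the SideEffect overload accepts a null on-commit action alongside the onRollback lambda, and some signaling object shared with the owning thread:

using System.Threading;
using Shielded;

var released = new ManualResetEventSlim(false);

Shield.PreCommit(() => { int a = x; return true; }, () =>
{
    var threadId = ownerThreadId;
    if (threadId > -1 && threadId != Thread.CurrentThread.ManagedThreadId)
    {
        // On rollback, sleep until the owner signals that it has released
        // the field, instead of retrying immediately in a busy loop.
        Shield.SideEffect(null, () => released.Wait());
        Shield.Rollback(true);
    }
});

// The owning thread, when done: ownerThreadId = -1; released.Set();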

The beauty of these schemes is that, both with Conditional and PreCommit, the “other” transactions do not need to know anything about them. Their code is the same whether or not a conditional or pre-commit is defined. In the prioritization test you can see, for example, that the lower priority transactions have no knowledge of the prioritization scheme baked into them, and yet they fully comply with it.

These powerful semantics, the easily achieved composition and correctness, are arguably the biggest advantage of an STM. I believe that in a project of sufficient complexity, paying the price of STM bookkeeping is worth it.

UPDATE – After reading this post again, the problems with Shield.Rollback(false) calls started to seem a little too serious to allow, so the library no longer supports this method. Rollback receives no arguments, and will always do an immediate retry. This way, anyone can be sure that if the InTransaction method exits normally, the transaction has completed. (Of course, just because your InTransaction call returned does not mean you are necessarily out of a transaction – nesting…)

An update on STM speed

This continues on the previous post, Benchmarking an STM.

After the last post and those benchmark results, I’ve been busy lately doing various optimizations on Shielded. In the post I announced that I would be changing the Shielded<T>.Assign method to no longer be commutable, due to the cost it incurs. But I’ve also managed to find many other improvements, which you can check out in the project commit log. By far the biggest improvement came from optimizing the process of triggering conditional transactions.

A small but interesting optimization was achieved by just simplifying the commit code. It used to copy the HashSet of enlisted fields into a List, instead of iterating over it directly, so that in case of conflict it could iterate backwards, rolling back only the potentially locked fields, and then completing the roll-back outside the lock. However, the list allocation and copying is actually slower than any gain there, particularly given that conflicts are not that common. (Seriously, they are quite uncommon. A test launching 10,000 Tasks, each trying to increment one of 100 int fields, completes with just ~10-15% repetitions due to conflicts.) A fine example of premature optimization.

Here are the new results, with some new data points not collected before (and again, Mono 2.10, Ubuntu 13.10 64 bit, i5-2430M @ 2.4 GHz, 4 GB RAM):

cost of empty transaction = 1.050 us
cost of the closure in InTransaction<T> = 0.190 us
cost of the first read = 0.720 us
cost of an additional read = 0.104 us
cost of Modify after read = 0.970 us
cost of Assign after read = 1.005 us
cost of the first Modify = 1.790 us
cost of an additional Modify = 0.051 us
cost of a Read after Modify = 0.042 us
cost of the first Assign = 1.635 us
cost of an additional Assign = 0.048 us
cost of the first commute = 3.975 us
cost of an additional commute = 0.904 us

Performance is much better on almost all points. Read-transactions are more or less the same, but reducing the calls to Shield.Enlist has reduced repeated access to an already modified field down to ~50 ns, and the cost of the write operations themselves is ~3x smaller. Even the commute is faster, although it is still more expensive.

For comparison, here are the results of the same test, on the same machine, but executed in a virtual Windows machine on Microsoft .NET 4.5:

cost of empty transaction = 0.485 us
cost of the closure in InTransaction<T> = 0.030 us
cost of the first read = 0.210 us
cost of an additional read = 0.120 us
cost of Modify after read = 1.135 us
cost of Assign after read = 1.235 us
cost of the first Modify = 1.320 us
cost of an additional Modify = 0.058 us
cost of a Read after Modify = 0.039 us
cost of the first Assign = 1.245 us
cost of an additional Assign = 0.055 us
cost of the first commute = 1.865 us
cost of an additional commute = 0.235 us

Some numbers are pretty similar, but there are also some striking differences. Empty transactions are twice as fast. The cost of the closure is ridiculously small compared to the same on Mono. And you may notice that every “first” operation on a field is much faster on MS.NET, while any second operation on the same field is roughly equally fast. It matters little whether the second operation is a read or a write, which indicates that the bookkeeping is probably causing the expense. Also interesting is that writes after a read are consistently a little slower on MS.NET, which I can’t explain (but note that the sum of empty trans + first read + a modification, i.e. the total time of one simple read-then-write transaction, is still better).

Bookkeeping and the closure are operations that involve allocating objects, so I presume that the fault for the slower score lies in Mono 2.10’s conservative garbage collector. I’m looking forward to Ubuntu 14.04, which will be packing Mono 3.2 with its generational garbage collector (which is included in 2.10 as an option, but is just as slow there, and in some tests much slower) and hopefully fixes for the bugs in the concurrent collections.

So, a million empty transactions per second, several hundred thousand simpler transactions per second, and the most complex test, the BetShop (note that it uses structs – it is older than the class proxy-generator feature), running at several tens of thousands. Plus, the cost of repeated access to a field is close to negligible. Given the benefits that Software Transactional Memory can bring to more complicated projects, I think this is OK. For now, at least.