When it comes to multi-threading you have to know exactly what you are doing. I mean explore all the possible scenarios/cases that might occur when you are working in a multi-threaded environment. Lock-free multithreading is not a library or a class which we incorporate, its a knowledge/experience that we earn during our journey on threads.
I'll agree with John Skeet on this one; lock-free threading is the devil's playground, and best left up to people who know that they know what they need to know.
There is no such thing as "lock-free threading" these days. It was an interesting playground for academia and the like, back in the end of the last century when computer hardware was slow and expensive. Dekker's algorithm was always my favorite, modern hardware has put it out to pasture. It doesn't work anymore.
Two developments have ended this: the growing disparity between the speed of RAM and the CPU. And the ability of chip manufacturers to put more than one CPU core on a chip.
The RAM speed problem required the chip designers to put a buffer on the CPU chip. The buffer stores code and data, quickly accessible by the CPU core. And can be read and written from/to RAM at a much slower rate. This buffer is called the CPU cache, most CPUs have at least two of them. The 1st level cache is small and fast, the 2nd is big and slower. As long as the CPU can read data and instructions from the 1st level cache it will run fast. A cache miss is really expensive, it puts the CPU to sleep for as many as 10 cycles if the data is not in the 1st cache, as many as 200 cycles if it isn't in the 2nd cache and it needs to be read from RAM.
Every CPU core has its own cache, they store their own "view" of RAM. When the CPU writes data the write is made to cache which is then, slowly, flushed to RAM. Inevitable, each core will now have a different view of the RAM contents. In other words, one CPU doesn't know what another CPU has written until that RAM write cycle completed and the CPU refreshes its own view.
That is dramatically incompatible with threading. You always really care what the state of another thread is when you must read data that was written by another thread. To ensure this, you need to explicitly program a so-called memory barrier. It is a low-level CPU primitive that ensures that all CPU caches are in a consistent state and have an up to date view of RAM. All pending writes have to flushed to RAM, the caches then need to be refreshed.
This is available in .NET, the Thread.MemoryBarrier() method implements one. Given that this is 90% of the job that the lock statement does (and 95+% of the execution time), you are simply not ahead by avoiding the tools that .NET gives you and trying to implement your own.
The trick to getting low-lock programs right is to understand at a deep level precisely what the rules of the memory model are on your particular combination of hardware, operating system, and runtime environment.
I personally am not anywhere near smart enough to do correct low-lock programming beyond InterlockedIncrement, but if you are, great, go for it. Just make sure that you leave lots of documentation in the code so that people who are not as smart as you don't accidentally break one of your memory model invariants and introduce an impossible-to-find bug.
What you gain however with a data structure that is "lock-free" is that your "locks" are very fine grained. This decreases the chance that two concurrent threads access the same "lock" (memory location).
The trick most of the time is that you do not have dedicated locks - instead you treat e.g. all elements in an array or all nodes in a linked list as a "spin-lock". You read, modify and try to update if there was no update since your last read. If there was, you retry.
This makes your "locking" (oh, sorry, non-locking :) very fine grained, without introducing additional memory or resource requirements.
Making it more fine-grained decreases the probability of waits. Making it as fine-grained as possible without introducing additional resource requirements sounds great, doesn't it?
Most of the fun however can come from ensuring correct load/store ordering.
Contrary to one's intuitions, CPUs are free to reorder memory reads/writes - they are very smart, by the way: you will have a hard time observing this from a single thread. You will, however run into issues when you start to do multi-threading on multiple cores. Your intuitions will break down: just because an instruction is earlier in your code, it does not mean that it will actually happen earlier. CPUs can process instructions out of order: and they especially like to do this to instructions with memory accesses, to hide main memory latency and make better use of their cache.
Now, it is sure against intuition that a sequence of code does not flow "top-down", instead it runs as if there was no sequence at all - and may be called "devil's playground". I believe it is infeasible to give an exact answer as to what load/store re-orderings will take place. Instead, one always speaks in terms of mays and mights and cans and prepare for the worst. "Oh, the CPU might reorder this read to come before that write, so it is best to put a memory barrier right here, on this spot."
Matters are complicated by the fact that even these mays and mights can differ across CPU architectures. It might be the case, for example, that something that is guaranteed to not happen in one architecture might happen on another.
Locks in .NET result in an implicit memory barrier, so you are safe using them (most of the time, that is... see for example this Joe Duffy - Brad Abrams - Vance Morrison greatness on lazy initialization, locks, volatiles and memory barriers. :) (Be sure to follow the links on that page.)
...and of course, as @Eric mentioned, Joe Duffy is a definitive read on the subject.
A good STM can get as close to fine-grained locking as it gets and will probably provide a performance that is close to or on par with a hand-made implementation.
One of them is STM.NET from the DevLabs projects of MS.
If you are not a .NET-only zealot, Doug Lea did some great work in JSR-166. Cliff Click has an interesting take on hash tables that does not rely on lock-striping - as the Java and .NET concurrent hash tables do - and seem to scale well to 750 CPUs.
If you are not afraid to venture into Linux territory, the following article provides more insight into the internals of current memory architectures and how cache-line sharing can destroy performance: What every programmer should know about memory.
@Ben made many comments about MPI: I sincerely agree that MPI may shine in some areas. An MPI based solution can be easier to reason about, easier to implement and less error-prone than a half-baked locking implementation that tries to be smart. (It is however - subjectively - also true for an STM based solution.) I would also bet that it is light-years easier to correctly write a decent distributed application in e.g. Erlang, as many successful examples suggest.
MPI, however has its own costs and its own troubles when it is being run on a single, multi-core system. E.g. in Erlang, there are issues to be solved around the synchronization of process scheduling and message queues.
Also, at their core, MPI systems usually implement a kind of cooperative N:M scheduling for "lightweight processes". This for example means that there is an inevitable context switch between lightweight processes. It is true that it is not a "classic context switch" but mostly a user space operation and it can be made fast - however I sincerely doubt that it can be brought under the 20-200 cycles an interlocked operation takes. User-mode context switching is certainly slower even in the the Intel McRT library.
N:M scheduling with light-weight processes is not new. LWPs were there in Solaris for a long time. They were abandoned. There were fibers in NT. They are mostly a relic now. There were "activations" in NetBSD. They were abandoned. Linux had its own take on the subject of N:M threading. It seems to be somewhat dead by now.
From time to time, there are new contenders: for example McRT from Intel, or most recently User-Mode Scheduling together with ConCRT from Microsoft.
At the lowest level, they do what an N:M MPI scheduler does. Erlang - or any MPI system -, might benefit greatly on SMP systems by exploiting the new UMS.
I guess the OP's question is not about the merits of and subjective arguments for/against any solution, but if I had to answer that, I guess it depends on the task: for building low level, high performance basic data structures that run on a single system with many cores, either low-lock/"lock-free" techniques or an STM will yield the best results in terms of performance and would probably beat an MPI solution any time performance-wise, even if the above wrinkles are ironed out e.g. in Erlang.
For building anything moderately more complex that runs on a single system, I would perhaps choose classic coarse-grained locking or if performance is of great concern, an STM.
For building a distributed system, an MPI system would probably make a natural choice.
Note that there are MPI implementations for .NET as well (though they seem to be not as active).
Even though lock-free threading may be difficult in .NET, often you can make significant improvements when using a lock by studying exactly what needs to be locked, and minimizing the locked section... this is also known as minimizing the lock granularity.
As an example, just say you need to make a collection thread safe. Don't just blindly throw a lock around a method iterating over the collection if it performs some CPU-intensive task on each item. You might only need to put a lock around creating a shallow copy of the collection. Iterating over the copy could then work without a lock. Of course this is highly dependent on the specifics of your code, but I have been able to fix a lock convoy issue with this approach.