Archive for the ‘parallel’ Category

February 27, 2013

Stupid RCU Tricks: Read-Side Ordering Constraints

Suppose that you have an initially empty RCU-protected hash table with per-bucket-locked updates. Suppose that one thread concurrently inserts items A and B in that order (but into different buckets) while a second thread concurrently looks up item A then item B—and while yet a third thread concurrently looks up these two items in the opposite order. The code for these three threads might look something like the following:

 1 void t1(void)
 2 {
 3   spin_lock(chain(A));
 4   insert(A);
 5   spin_unlock(chain(A));
 6   spin_lock(chain(B));
 7   insert(B);
 8   spin_unlock(chain(B));
 9 }
10 
11 void t2(void)
12 {
13   rcu_read_lock();
14   l1 = lookup(A);
15   rcu_read_unlock();
16   rcu_read_lock();
17   l2 = lookup(B);
18   rcu_read_unlock();
19 }
20 
21 void t3(void)
22 {
23   rcu_read_lock();
24   l3 = lookup(B);
25   rcu_read_unlock();
26   rcu_read_lock();
27   l4 = lookup(A);
28   rcu_read_unlock();
29 }

Because there is absolutely nothing guaranteeing the order of t2()'s and t3()'s lookups, it is quite possible that we could end up with l1==1&&l2==0&&l3==1&&l4==0, which would mean that t2() and t3() disagree on the order of insertion. However, this outcome is excluded if we place a synchronize_rcu() between lines 5 and 6 above.
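
For concreteness, here is a sketch of what t1() might look like with that grace period added (chain(), insert(), and the spinlock calls are the same illustrative helpers used above):

	void t1(void)
	{
		spin_lock(chain(A));
		insert(A);
		spin_unlock(chain(A));
		synchronize_rcu();  /* Wait for all pre-existing RCU readers. */
		spin_lock(chain(B));
		insert(B);
		spin_unlock(chain(B));
	}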

But how can synchronize_rcu() possibly exclude this outcome given that t2()'s and t3()'s lookups can still be reordered?


Posted in parallel, stupid rcu tricks

November 30, 2012

RCU and Crowds

“A Read-Copy Update based parallel server for distributed crowd simulations” (paywalled) recently appeared in the Journal of Supercomputing. Vigueras et al. apply RCU to crowd simulation (as you might guess from the title), which combines throughput requirements (“faster is better”) with response-time requirements (“never trust any result older than 250 milliseconds”). Use of RCU instead of straight locking allows them to support crowds with three times the number of agents while still meeting their response-time requirements.

It is good to see academics experimenting with RCU. Even better, I learned of this only after seeing the paper. ;-)


Posted in parallel, performance, rcu

Off to the races!

I was fortunate enough to be able to attend the first workshop on Relaxing Synchronization for Multicore and Manycore Scalability (also known as “RACES’12”), which drew concurrency researchers from across academia and industry. The topic was somewhat controversial, quoting from the call for participation:

How can we cast aside the security of correctness, the logic of a proof, and adopt a new way of thinking, where answers are good enough but not certain, and where many processors work together in parallel without quite knowing the states that the others are in? We may need some amount of synchronization, but how much? Or better yet, how little? What mental tools and linguistic devices can we give programmers to help them adapt to this challenge?

The workshop’s presenters expressed a number of views on this matter.

One group of talks covered applications using approximate methods, where the approximations might be due to inaccuracies in the input data, convergence criteria in iterative algorithms, use of discrete rather than continuous time intervals, and so on. The key point was that all of these applications maintained an error budget, and that this budget could just as well include a component for errors due to weak synchronization, or even due to entirely omitted synchronization. Applications included the Barnes-Hut gravitational multi-body solver (Martin Rinard); k-means clustering and breadth-first search of large graphs (Ravi Nair et al.); and web-based analytics (Michael Carbin et al.). Of course, synchronization-free numerical algorithms have been around for some decades for dense arrays, but these talks used more complex data structures. The talks were quite interesting, though I would have liked to see more consistent comparisons against unsynchronized single-threaded implementations. After all, eliminating synchronization does not necessarily eliminate synchronization overhead in the form of communications cache misses. It nevertheless felt quite good to see people taking even more aggressive approaches to parallelism than I normally take. ;-)

There was considerable discussion as to when inaccuracy due to relaxed synchronization could be tolerated. Martin Rinard pointed out that the relaxed Barnes-Hut program actually provided greater accuracy than did some of the more heavily synchronized variants, but the fact that the unsynchronized version could omit objects completely caused some consternation. Of course, Barnes-Hut's use of centroids to represent large groups of bodies is inherently approximate in any case: Reasonable accuracy is achieved only when the ratio of the distance to the nearest object in a group to the distance to that group's centroid is reasonably close to 1.0. Doug Lea eventually closed this discussion by suggesting that: (1) Any algorithm willing to tolerate the approximations inherent in floating-point arithmetic should also be willing to tolerate the approximations inherent in relaxed or omitted synchronization, but that (2) algorithms using exact computations might wish to think twice before omitting quite so much synchronization. I suspect that important future work will include classifying what sorts of errors and non-determinism are acceptable in a given situation.

Hans-J. Boehm’s talk on the evils of data races exposed a sharp difference in the definition of “data race”. A common definition is “at least two concurrent unsynchronized accesses, at least one of which is a write,” but it turned out that there was little agreement on what constitutes “unsynchronized.” Martin Rinard argued that a synchronized access should involve locking, atomic instructions, or at the very least a memory barrier, while Hans asserted that as long as the developer informed the compiler that the memory location in question was subject to concurrent conflicting accesses, the corresponding accesses were in fact synchronized. Hans’s version of the definition has the interesting property that exactly the same assembly code might be emitted for a data-race-free program as for the analogous program having data races, but on the other hand, Hans’s definition allows Martin to write his racy algorithms in C++11 without having to drop into assembly language. I doubt that these two groups will come to agreement on this point any time soon, but such is life in computing. Therefore, if someone talks to you about data races, you would be wise to ask them exactly what definition they are using.
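
For illustration, here is a minimal C11 sketch of the distinction (the variable and function names are mine). Under roughly the definition Hans was advocating, the accesses to informed_flag are not data races, even though the relaxed atomic operations might compile down to exactly the same loads and stores as the racy accesses to plain_flag:

	#include <stdatomic.h>

	int plain_flag;			/* Concurrent plain accesses: a data race. */
	atomic_int informed_flag;	/* Compiler informed of concurrent accesses. */

	void writer(void)
	{
		plain_flag = 1;		/* Racy if read concurrently. */
		atomic_store_explicit(&informed_flag, 1,
				      memory_order_relaxed);	/* Race-free, yet no barrier implied. */
	}

	int reader(void)
	{
		int a = plain_flag;	/* Racy read. */
		int b = atomic_load_explicit(&informed_flag,
					     memory_order_relaxed);	/* Race-free read, possibly a plain load in assembly. */
		return a + b;
	}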

The next presentation examined the meaning of “FIFO” in concurrent systems. Andreas Haas et al. noted that the usual definitions of “perfect FIFO queues” rely on “linearization points” that are not visible to the caller, who can really only observe when the enqueue and dequeue operations begin and end. The authors therefore measured the deviation of the order in the queue from the order in which the enqueue operations began. They found that even strict FIFO-ordered queues suffered significant misordering. Interestingly enough, queues designed to emphasize performance over ordering actually produced more strongly ordered results than did the “perfect FIFO” queues, mainly because the shorter durations of their enqueue and dequeue operations exposed less opportunity for misordering. I found this to be the most interesting talk, as it gave yet another reason why it is a bad idea to push your parallel application’s entire data set through a single queue.

Philip Howard presented a talk on relativistic programming, which sparked a debate as to whether such techniques were actually usable in practice. I eventually squelched the debate by pointing out that this technique is heavily used in the Linux kernel. Trey Cain discussed hardware techniques for gaining more performance and scalability from weakly ordered hardware, David Ungar raised the question of whether increased scalability and throughput always implied degraded latency, and Max Orhai presented on parallel sort on a spatial computer, that is, a computer whose computing elements are arranged in a linear array or a two-dimensional grid. Finally, Sasa Misailovic discussed a system that automatically removed synchronization from a program and then tested the degree of inaccuracy that this removal introduced. I am not so sure I would want to bet my life on a program produced by Sasa’s system, but random removal of synchronization does sound like a good way to evaluate the effectiveness of test/validation suites for parallel programs.

My presentation made the claim that parallel programming could in fact be learned by large groups of people (paper, presentation). In keeping with most of the other presentations, this claim proved to be somewhat controversial.

All in all, this was an interesting and worthwhile workshop, and I look forward to similar gatherings in the future.


Posted in is parallel programming hard, parallel, readings, stupid smp tricks

June 10, 2012

Transactional Memory Everywhere: HTM and Cache Geometry

The previous post described some potential pitfalls in blindly applying hardware lock elision to legacy software. This post looks instead at how changes in CPU cache geometry have increased the cache footprint that hardware transactional memory (HTM) can support, at least from a theoretical standpoint. This is yet another example of how hardware giveth and software taketh away, at least according to my colleagues working on hardware.

Consider cache associativity, which has increased over the past couple of decades: Sequent systems of 20 years ago had two-way set-associative caches, while the laptop I am typing this on has an eight-way set-associative cache. The following graph, which models a 4KB cache with 64-byte cache lines, shows why this increase is important to HTM:

HTM success probability for 4K cache

To use this graph, select the transaction size in cache lines on the x-axis, then read the y-axis to find the overflow (failure) probability for the desired associativity. For example, a direct-mapped (AKA one-way set-associative) cache has about a 50% failure probability for a 10-cacheline transaction, while an eight-way set-associative cache has less than a one-in-a-million failure probability at this same size. Clearly, the increases in microprocessor cache associativity over the past 20 years have profoundly reduced the failure probability of modest-sized transactions.
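
As a quick sanity check of the direct-mapped curve (a back-of-the-envelope model of my own, which may or may not exactly match the model behind the figure): if each of a transaction's cache lines maps uniformly at random to one of the cache's 64 sets, a direct-mapped cache overflows as soon as any two lines share a set, which is just the birthday problem.

	#include <stdio.h>

	/* Overflow probability for a direct-mapped cache: a transaction of
	 * txn_size lines overflows as soon as two of them map to the same
	 * one of nsets sets (the birthday problem). */
	static double direct_mapped_overflow(int nsets, int txn_size)
	{
		double p_all_distinct = 1.0;
		int i;

		for (i = 0; i < txn_size; i++)
			p_all_distinct *= (double)(nsets - i) / nsets;
		return 1.0 - p_all_distinct;
	}

	int main(void)
	{
		/* 4KB cache, 64-byte lines: 64 sets when direct mapped. */
		printf("%f\n", direct_mapped_overflow(64, 10));	/* Roughly 0.52. */
		return 0;
	}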

But a 4KB cache with 64-byte cache lines cannot support a transaction of more than 64 cache lines, no matter how large the associativity. Fortunately, on-chip cache sizes have also increased, with my laptop’s L0 cache being 32KB rather than 4KB. Unfortunately, the standard closed-form expression for success probability is severely stressed by this increase in size. For example, the expression for 64 references into a 32KB cache having 64-byte cache lines and eight-way set associativity weighs in at more than 12 megabytes. Because this expression is a sum of a very large number of terms having very small values, arbitrary-precision arithmetic is a must, which makes the analytic approach quite slow for modest-sized transactions (though it is quite efficient for the smallest transactions as well as for the largest transactions that have some chance of success).

Fortunately, there is another approach that works quite well for modest failure probabilities, for example, down to 0.0001. This approach is Monte Carlo simulation, in which we randomly generate a long series of sequences of references and estimate the failure probability from the results; a minimal sketch appears below. The figure following the sketch shows the same analytic data as the earlier figure, but overlays it with billion-trial Monte Carlo simulations.
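
The sketch (my code, not the program used to generate the figures) simply assumes that each referenced cache line lands in one of the cache's sets uniformly at random, and declares failure as soon as any set holds more lines than the associativity allows:

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	/* Estimate the probability that a transaction of txn_size distinct
	 * cache lines overflows a cache with nsets sets of assoc ways each,
	 * assuming each line maps to a uniformly random set. */
	static double overflow_probability(int nsets, int assoc, int txn_size,
					   long trials)
	{
		int *count = malloc(nsets * sizeof(*count));
		long failures = 0;
		long t;
		int i;

		for (t = 0; t < trials; t++) {
			memset(count, 0, nsets * sizeof(*count));
			for (i = 0; i < txn_size; i++) {
				if (++count[rand() % nsets] > assoc) {
					failures++;
					break;
				}
			}
		}
		free(count);
		return (double)failures / trials;
	}

	int main(void)
	{
		/* 4KB cache, 64-byte lines: 64x1 direct mapped, 8x8 eight-way. */
		printf("direct mapped: %f\n", overflow_probability(64, 1, 10, 1000000));
		printf("eight-way:     %f\n", overflow_probability(8, 8, 10, 1000000));
		/* 32KB eight-way cache with 64-byte lines: 64 sets of 8 ways. */
		printf("32KB 8-way:    %f\n", overflow_probability(64, 8, 60, 1000000));
		return 0;
	}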

HTM success probability for 4K cache plus Monte Carlo

The Monte Carlo trials agree quite well with the analytic results, especially for non-infinitesimal failure probabilities, which gives us some confidence in the Monte Carlo results for a 32KB cache:

HTM success probability for 32K cache

As you can see, moving from a 4KB to a 32KB cache significantly decreases the failure probability for a given HTM transaction. But the x86 family normally increases L0 cache size in tandem with associativity, because keeping the set index entirely within the 4KB page offset allows the L0 cache lookup to be carried out concurrently with the TLB lookup. Thus, an x86 4KB cache will be direct mapped, an 8KB cache will be two-way set associative, and so on:

HTM success probability for x86 cache

The newer x86 systems are more friendly to HTM than are the older systems, and might be even more friendly if transactions are allowed to overflow from the L0 cache into the L1 or L2 caches.
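
To make the size/associativity constraint described above concrete: if the set index must fit within the 4KB page offset, then the number of sets times the line size can be at most 4KB, so the minimum associativity grows in direct proportion to cache size. A quick arithmetic check (my sketch, not a description of any particular CPU's cache):

	#include <stdio.h>

	int main(void)
	{
		int page_size = 4096;	/* Set index must come from the page offset. */
		int cache_size;

		/* Minimum associativity such that sets * line_size <= page_size. */
		for (cache_size = 4096; cache_size <= 32768; cache_size *= 2)
			printf("%2d KB cache needs at least %d-way associativity\n",
			       cache_size / 1024, cache_size / page_size);
		return 0;
	}

This prints 1-way for 4KB, 2-way for 8KB, 4-way for 16KB, and 8-way for 32KB, matching the progression described above.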

But not all hardware advances have been helpful. The increase in cache-line size from 32 bytes to 64 bytes reduces the number of cache lines, which, if the addresses are randomly selected from a large memory, increases overflow probability as shown in the following figure:

HTM success probability for different cache-line sizes

In short, HTM failure probabilities have decreased with increasing hardware capabilities, which has interestingly enough made analysis more difficult. But personally, I have always preferred improved performance over easy analysis.


Posted in parallel, transactional memory everywhere

May 19, 2012

Transactional Memory Everywhere: Hardware Transactional Lock Elision

My earlier posting on hardware transactional memory (http://paulmck.livejournal.com/31853.html) brought an unusual amount of private email, especially on my discussion of the hazards of eliding empty critical sections (http://kernel.org/pub/linux/kernel/people/paulmck/Answers/TransactionalMemoryEverywhere/WhyEmptyLock.html). Several people made the interesting claim that any correct code involving empty lock-based critical sections that relies entirely on memory references is guaranteed either to: (1) operate correctly under HTM or (2) result in an abort, causing execution to fall back to the lock-based code, again operating correctly. I invested more time than I care to admit unsuccessfully attempting to generate a counter-example, so I am becoming more inclined to believe them, despite the lack of a formal proof.

There is a straightforward rationale for their claim, namely that in strongly atomic HTM implementations, the results of a given transaction are not visible until after the transaction completes. So if you can see that the transaction started, it has already finished, at which point a no-op will successfully wait for it. So any sequence of code that tests variables to determine when a given critical section has started, and then uses an empty lock-based critical section to wait for it to complete will operate correctly even when the empty critical section becomes essentially a no-op under hardware transactional lock elision.
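
For concreteness, the memory-reference-only pattern under discussion looks something like the following sketch (started, do_update(), and use_update() are names I made up for illustration, not taken from any real application):

	#include <pthread.h>

	void do_update(void);		/* The real work (illustrative). */
	void use_update(void);		/* Consume the results (illustrative). */

	static pthread_mutex_t mylock = PTHREAD_MUTEX_INITIALIZER;
	static volatile int started;	/* Set once the critical section has begun. */

	/* The store to "started" is inside the critical section, so under
	 * strongly atomic HTM it becomes visible only when the elided
	 * transaction commits. */
	void updater(void)
	{
		pthread_mutex_lock(&mylock);
		started = 1;
		do_update();
		pthread_mutex_unlock(&mylock);
	}

	/* Once the waiter observes started == 1, it uses an empty critical
	 * section to wait for the updater to finish.  If the updater's lock
	 * was elided, seeing started == 1 means the transaction has already
	 * committed, so the empty critical section degenerating into a no-op
	 * is still correct.  If the updater instead actually holds the lock,
	 * the empty critical section blocks until the lock is released. */
	void waiter(void)
	{
		while (!started)
			continue;
		pthread_mutex_lock(&mylock);	/* Empty critical section. */
		pthread_mutex_unlock(&mylock);
		use_update();
	}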

Of course, this would not be true of a weakly atomic HTM implementation, but as far as I know all the currently proposed production HTM implementations are strongly atomic.

But there are other ways to communicate than relying on memory references: for example, you could instead rely on the passage of time. Please note that relying on the passage of time is very difficult to get right, especially on today’s superscalar hardware, which relies so heavily on speculative execution. And if your code is running on top of a hypervisor, relying on timing is even more risky. And if I catch you trying to sneak a timing-based patch into the Linux kernel, I will almost certainly NAK it. However, older computer systems were much more amenable to use of timing, so it would be no surprise if large legacy software made use of such techniques. The following is an example of what a timing-based algorithm might look like.

Each worker thread corresponds to a data feed out to some receiver. Each worker thread records the current time (e.g., from clock_gettime()) to a per-thread timestamp after executing each work unit. Because this is a per-thread timestamp, a normal write suffices. Because this is a real-time application, it is a fatal error for a given thread to fail to update its per-thread timestamp for more than (say) 100 microseconds. Results are placed into a shared-memory buffer, which is drained by a separate set of threads that transmit the results out over the network. Locks are used sparingly to access and update global state, and, again because this is a real-time application, locks are granted in FIFO order within each priority level.

Worker threads can finish, and when they do they must disentangle themselves from the rest of the application. Worker threads also indicate their status in a thread-local variable my_status that is initialized to -1. However, worker threads do not exit: They instead go to a thread pool to accommodate later processing requirements.

There is also a control thread that monitors the worker threads. In particular, it maintains a histogram of thread statuses for threads that have finished. This same thread is responsible for assigning new threads or reassigning old ones, and runs at a priority no higher than that of the worker threads.

Worker threads do something like the following:

	int my_status = -1;  /* Thread-local variable. */

	while (continue_working()) {
		enqueue_any_new_work(); /* Might include locks. */
		wp = dequeue_work();  /* Thread-local, no locking. */
		do_work(wp);  /* Might include locks. */
		my_timestamp = clock_gettime(...); /* Thread-local. */
	}

	acquire_lock(&departing_thread_lock);

	/*
	 * disentangle from application, might acquire other locks,
	 * can take much longer than MAX_LOOP_TIME, especially if
	 * many threads exit concurrently.
	 */

	my_status = get_return_status();

	release_lock(&departing_thread_lock);

	/* thread awaits repurposing. */

The control thread might do the following:

	for (;;) {
		for_each_thread(t) {
			ct = clock_gettime(...);
			d = ct - per_thread(my_timestamp, t);
			if (d >= MAX_LOOP_TIME) {
				/* thread departing. */
				acquire_lock(&departing_thread_lock);
				release_lock(&departing_thread_lock);
				i = per_thread(my_status, t);
				status_hist[i]++; /* Bad idea if TLE! */
			}
		}
		/* Repurpose threads as needed. */
	}

Here the passage of time is used to ensure that the empty lock-based critical section happens after the departing worker thread has acquired the departing_thread_lock. If the empty lock is allowed to become a no-op, then the per_thread(my_status, t) access can abort the departing thread’s transaction, so that the result is the initial value of -1. This cannot happen with locking.

Note that enclosing the access to my_status in a lock/transaction does not change things. The problem is not the extra-transactional access, but rather the structure of the application.

Again, please don’t try this sort of thing on superscalar CPUs (which is almost all of them these days), and especially please don’t try it in the Linux kernel. In fact, you should be very cautious about doing this sort of thing even on systems specially designed for hard real-time response. But the key point is that if you have a large application that is not well understood, you might want to be careful about applying HTM-based lock elision to it. This should not be a surprise: What you have is a real-time program, and there has not yet been much work on TM in real-time environments.

Speaking of real-time environments, how does priority boosting interact with HTM-based lock elision?

Acknowledgments: Many thanks to Josh Triplett, Mathieu Desnoyers, Jon Walpole, Phil Howard, and Stephan Diestelhorst for many illuminating discussions. Please note that my acknowledgment of these people’s contributions to my understanding of transactional lock elision should in no way be interpreted to imply their agreement with anything I say here. ;-)


Posted in parallel, transactional memory everywhere