I’ve been programming for almost 3 weeks, and today I was excited to see the following output in my terminal:
thread 0: starting with d1.data = 0 d2.data = 0
thread 1: starting with d1.data = 0 d2.data = 0
thread 2: starting with d1.data = 0 d2.data = 0
thread 3: starting with d1.data = 0 d2.data = 0
thread 3: ending with d1.data = 4 d2.data = 4
thread 1: ending with d1.data = 4 d2.data = 4
thread 0: ending with d1.data = 4 d2.data = 4
thread 2: ending with d1.data = 4 d2.data = 4
Doesn’t look significant, does it :) ? The program that creates this output is a simple Java program that creates several threads that each increment a pair of shared counters in tandem. Here’s the code for the threads:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
| public class SynchronizedCount extends Thread
{
Object lock;
Data d1, d2; // Data is a wrapper class around an int.
int id;
Barrier b; // A barrier synchronization class.
int num_iters;
// I have omitted the constructor, which just sets the
// instance variables to the constructor parameters.
public void run()
{
synchronized(lock)
{
System.out.println("thread " + id +
": starting with d1.data = " +
d1.data + " d2.data = " + d2.data);
assert(d1.data == d2.data);
}
for(int x = 0; x < num_iters; x++)
{
synchronized(lock)
{
d1.data++;
d2.data++;
assert(d1.data == d2.data);
}
}
b.block();
// No thread will reach this point until all threads
// have called b.block().
synchronized(lock)
{
System.out.println("thread " + id +
": ending with d1.data = " +
d1.data + " d2.data = " + d2.data);
assert(d1.data == d2.data);
}
}
} |
Each thread should have exclusive access to the counters when it accesses them—each access of the counters is a critical section.
Notice the assertions sprinkled throughout the code. I expect the values of both counters (d1 and d2) to always be the same, and so I check this at several places. If that condition is ever false, the program will crash and tell me which assertions failed. During development, assertions are a HUGE help in debugging and in writing bug-free code in the first place, as I very quickly learned. When releasing a production version of a program, assertions can be compiled out (in the case of C/C++) or disabled at runtime (in Java), so it is perfectly fine to leave them in source code.
So why did I write this asinine program? It’s a test to detect if the optimizations I’m making to Ronald Veldema’s distributed shared memory virtual machine for Java, which is designed in particular for programs that require massive amounts of memory (on the order of many terabytes) in a huge number of objects.
A Distributed Shared Memory (DSM) provides the application programmer with a single address space that is mapped onto the address space of multiple machines. This lets the application programmer write code as he (she?) would for a single computer, but the data and threads can be distributed automatically onto the available resources of machines the DSM is running on. Using a DSM lets one write a program that uses the resources of a computer cluster without having write message passing code or map shared memory buffers among machines. You just need to write multithreaded code.
Although I don’t like Java very much as a programming language (it has been hijacked by buzzword-loving enterprise weenies, and is kind of a dumbed-down language to begin with), it is an ideal target for a DSM for a few reasons:
- The language is designed to run on a virtual machine (VM). All DSM stuff can be implemented in the VM. Java programmers/users are accustomed to running a VM, so running on our VM that implements the DSM doesn’t really add complexity—programmers don’t have to link with special DSM libraries, and users don’t have to start an extra process, both of which would be needed for a DSM in a language without a VM.
- Threads and a memory model designed for concurrent programming are in the Java spec.
Issues arise when parallelizing an application. Unless the problem is embarrassingly parallel, there will be a possibly significant amount of data that must be shared. So, considering the shared counters program I wrote about above, when running on our DSM, the counters could be allocated on any machine running the DSM (even on different machines), but each thread needs access. What is typically done in distributed applications is to move the data where it is needed. In Ronald’s DSM, instead of moving data to threads, he moves threads to data. The fact that Java has threads as part of the language specification makes implementing thread migration a simpler than it would be otherwise. Thread migration is something that is usually only done for redundancy and for compensating for system failure. Doing it for performance is, as far as we know, a largely unexplored question.
If the program being parallelized has more reads of shared data than writes, performance can sometimes (dramatically) be improved by caching the shared data on the machines where it is often accessed. Then, data/thread migration can be bypassed for the cached data. This is what I am implementing this summer—an object replication mechanism for Ronald’s DSM that will (hopefully) offer big performance gains for many programs run on it. But caching requires synchronization when writes are performed for program behavior to be intelligible. The cache coherency protocol gets tricky fast. THAT is why I was excited when I saw the output from the counter program—it means that my synchronization protocol and implementation were correct for that run.
There are still likely race conditions present, which can be very hard to debug. An often-used strategy for detecting race conditions in a concurrent program is to run it for long periods of time (days, possibly) with a high level of concurrency—lots and lots of threads running on lots and lots of physical machines. Then you wait for your program to do the wrong thing. Before I can do this kind of testing, I need to make my caching scheme interface with Ronald’s garbage collector (which may turn out to be as complicated as the cache coherency protocol and implementation, which has taken me a couple weeks, and probably still has bugs).
After interfacing with the garbage collector and becoming confident that there aren’t in my code, I need to make sure my object replication mechanism uses a bounded amount of memory—right now, there is no limit on the amount of space the cache uses. This is bad, especially for a VM that is supposed to be very memory-efficient! What we did in Operating Systems class might actually be useful to me now—I’ll need to worry about replacement policies.
After implementing bounded memory usage for object replicas, I need to experiment with different replication heuristics, which will determine which objects should be replicated to which machines. Currently, replication is only done via a special Java method implemented in the VM that forces replication of an object to every machine.
Finally, I will need to do performance tuning and benchmarking, i.e., with my replication mechanism enabled and disabled, with real applications! Ronald has a few applications he has written or he uses to test, including a model checker (essentially, a nondeterministic Turing machine simulator), an n-body simulation, and a program that finds subgraphs in a larger graph.
Between all this work, I’ll be presenting to the Computer Science faculty here in about 3 weeks. Also, Ronald and I are planning on submitting a joint paper on his DSM with my object replication mechanism to VEE by September. If I get published it will be fantastic.