The Bad Yogurt
The other day my parents got a new fridge. I took some photos of the switch, and got these. My brother decided to have a Stonyfield Oikos Greek yogurt, which had gone bad, unrelated to the refrigerator stuff.
``Something's not right here...''
``I think this yogurt is bad.''
My mom didn't believe him at first. ``Give me a taste!'' she shouted.
Inquiry 2009!
The fifth annual issue of the Inquiry magazine from the University of New Hampshire is online now. I wrote a commentary for the magazine, discussing my research experience in Germany in the summer of 2007, and some of the misconceptions people have about computer science. You can read it online here.
Subtle laziness issues.
I am using heuristic search for a combinatorial optimization problem I'm playing with. The search space is enormous, far too large to navigate with traditional A* (unless I can think up a ridiculously good heuristic). Because I don't have a machine with unlimited memory, I am using Iterative Deepening A*, described by Richard Korf in a 1985 paper. With a reasonable heuristic, iterative deepening A* has memory complexity linear with the depth of the shallowest solution.
I coded up IDA* in Haskell, hoping I could then solve my problem on my laptop, which has a paltry 1GB of memory. My first attempt used the State monad from Control.Monad.State.Strict to keep track of search statistics, e.g. number of nodes expanded. This is cleaner and less error-prone than threading the state manually as an extra parameter.
{-# LANGUAGE BangPatterns #-} module Search where import Control.Monad.State.Strict import Data.List (foldl') data Node s e c = Node { fVal :: {-# UNPACK #-} !c -- f = g + h , gVal :: {-# UNPACK #-} !c -- cost so far , hVal :: {-# UNPACK #-} !c -- est. cost to go , state :: !s , pathToStart :: ![e] -- path from start, in reverse } deriving (Show) path :: Node s e c -> [e] path = reverse . pathToStart -- Constructs a start node from the given heuristic function and state. mkStart :: (Num c) => (s -> c) -> s -> Node s e c mkStart h s = Node { fVal = hv , gVal = 0 , hVal = hv , state = s , pathToStart = [] } where hv = h s -- Expands a node with the given heuristic and state expansion function. expandNode :: (Num c) => (s -> c) -- heuristic function -> (s -> [(s, e, c)]) -- state expansion function -> Node s e c -- node to expand -> [Node s e c] -- list of children expandNode !heur !exp !n = map mkNode $ exp $ state n where mkNode !(s,e,c) = let g = gVal n + c h = heur s in n { state = s , pathToStart = e : pathToStart n , fVal = g + h , gVal = g , hVal = h } -- Search statistics. data Stats = Stats { numExpanded :: {-# UNPACK #-} !Int , numGenerated :: {-# UNPACK #-} !Int , numPruned :: {-# UNPACK #-} !Int } -- Initial search statistics. mkStats :: Stats mkStats = Stats { numExpanded = 1, numGenerated = 0, numPruned = 0 } -- For the result of depth-limited search. data DLS s e c = Solution {-# UNPACK #-} !(Node s e c) -- solution found | Cutoff {-# UNPACK #-} !c -- more nodes available | DeadEnd -- no solution; no more -- nodes -- Iterative-deepening A* search. Supposed to have worst-case B*D -- memory complexity, where B is the maximum branching factor in the -- search space, and D is the solution depth. idaStar :: (Num c, Ord s, Ord e, Ord c) => s -- start state -> (s -> Bool) -- goal predicate -> (s -> [(s, e, c)]) -- expand function -> (s -> c) -- heuristic function -> (Maybe (Node s e c), Stats) -- possible solution, and statistics idaStar start isGoal expand heur = case runState (go $ heur start) mkStats of (Solution n, s) -> (Just n, s) (DeadEnd, s) -> (Nothing, s) where startNode = mkStart heur start go !c = do res <- dls c startNode case res of Cutoff c' -> go c' other -> return other dls !c !n | c < fVal n = return $ Cutoff $ fVal n | isGoal (state n) = return $ Solution n | otherwise = do let kids = expandNode heur expand n modify $ \s -> s { numExpanded = numExpanded s + 1 , numGenerated = numGenerated s + length kids } let f DeadEnd n = dls c n f s@(Solution _) _ = return s f a@(Cutoff v) n = do r <- dls c n case r of DeadEnd -> return a Cutoff v' -> return $ Cutoff $ min v v' s@(Solution _) -> return s foldM f DeadEnd kids
However, this version showed higher memory usage than normal A*! Within seconds my poor laptop would be swapping furiously. There is a space leak. I tried heap profiling without success (I probably just don't know how to use the heap profiler properly), and asked about space leaks with the State monads on #haskell, to which I heard just the crickets.
Frustrated, I rewrote the code without use of the State monad, by threading the Stats parameter by hand. That version showed the linear memory usage that was expected, resident process size less than 20MB.
I searched through the Haskell Cafe archives and came across an old thread where Dean Herington asked about the State monad and strictness. He was running into similar problems.
It turns out that laziness and strictness are subtle things. Heinrich Apfelmus wrote an excellent response in the thread. In summary, the State monad in Control.Monad.State.Strict is strict in the (result, state) tuple, but not strict in the components of the tuple. Hence, in my first attempt at IDA*, where I use the modify function, rather than updating the state strictly, a thunk is created, causing the space leak I was experiencing.
In my first attempt I updated the state in dls using modify like so:
modify $ \s -> s { numExpanded = numExpanded s + 1 , numGenerated = numGenerated s + length kids }
Rather than doing that, if I force the state before updating it, the space leak disappears, and the performance is comparable to the hand-threaded version:
s <- get put $! s { numExpanded = numExpanded s + 1 , numGenerated = numGenerated s + length kids }
Tricky.
Difficulties with GHC's Rewrite Rules
GHC's rewrite rules facility allows a user to easily add peephole optimizations. It's used to implement deforestation, where intermediate data structures are completely eliminated.
A trivial example involves mapping over a list twice. Consider:
map (*2) (map (+10) [1..10])
Evaluated as written, a temporary list [11..21] would be created by the inner map, which would then be consumed by the outer map, resulting in [22,24..40]. One could fuse these two traverals by hand, resulting in
map ((*2) . (+10)) [1..10]
which does not compute an intermediate list. A rewrite rule that performs this fusion automatically is defined in the rewrite rules documentation.
I was hoping to use rewrite rules to fuse more complicated expressions, such as the naive implementation of arithmetic mean of a list:
mylen xs = foldr (\_ a -> a+1) 0 xs mysum xs = foldr (+) 0 xs avg xs = mysum xs / mylen xs
Here, two traversals are made over the list, but the two traversals are not nested.
Fusing by hand, we get the following uglier code:
avg2 xs = sum / len where (sum, len) = foldr (\x (s,l) -> (s+x,l+1)) (0,0) xs
I tried the following rewrite rule in an attempt to fuse things like avg automatically:
{-# RULES
"foldr-fuse" forall c f g fs gs xs.
c (foldr f fs xs) (foldr g gs xs) =
let (v1,v2) = foldr (\x (a1,a2) -> (f x a1, g x a2)) (fs, gs) xs in c v1 v2
#-}GHC doesn't like this:
$ ghc -O --make -ddump-simpl-stats -fforce-recomp list_fusion
[1 of 1] Compiling Main ( list_fusion.hs, list_fusion.o )
list_fusion.hs:16:0:
Rule foldr-fuse:
Illegal expression: c
in left-hand side: c (foldr f fs xs) (foldr g gs xs)
LHS must be of form (f e1 .. en) where f is not forall'd
On #haskell, I was told that a general RULES pragma for doing this sort of fusion is probably not possible. Of course, I could specialize the rule for division, and this does indeed work, but it is not general-purpose. Specializing for division looks like this:
{-# RULES "foldr-fuse" forall f g fs gs xs. (/) (foldr f fs xs) (foldr g gs xs) = let (v1,v2) = foldr (\x (a1,a2) -> (f x a1, g x a2)) (fs, gs) xs in (/) v1 v2 #-} {-# INLINE mysum #-} mysum xs = foldr (+) 0 xs {-# INLINE mylen #-} mylen xs = foldr (\_ a -> a+1) 0 xs avg2 xs = mysum xs / mylen xs main = do print $ avg2 [1..10]
The INLINE pragmas are necessary to get the rewrite rule applied, since inlining and rewriting happen at the same time. The output from GHC indicates that foldr-fuse was used to fuse the definition of avg2.
Is there some way to make the more general fusion rewrite rule work? If not, Template Haskell could probably be used, but then the fusion would not be transparent. One would have to write something like
avg2 xs = $(fuse [|mysum xs / mylen xs|])
with a suitably defined fuse function. Hmmpf.
The perl one-liner I use the most.
Occasionally I find I need to perform a search-and-replace over many files in a directory hierarchy. I could open each such file, do a search-and-replace with my editor, then save, but this would be tedious.
On the other hand, writing a full-fledged script to automate the process would be silly, since I know about perl's -i option.
An example: poor man's refactoring. I find all .C files in or beneath the current directory, and change toString to to_string in those files:
$ perl -pi -e 's/toString/to_string/g' `find -name '*.C'`
Basically, the -p option coupled with the -e EXPRESSION bit causes perl to execute the given expression on each line of each file specified on the command line. The expression I use here is a regular expression replacement.
Normally, -p causes perl to print the results to stdout, but combined with -i and no argument, perl effectively edits the files in-place. Let's hope you use version control, in case you make a mistake. ;-)
An argument can be given to the -i option to cause perl to backup the original files with the specified extension. For more info, try perldoc perlrun.
Transmute `lead' into `gold'.
Doublets is a game described by Lewis Carroll, in which one attempts to transform the start word into the end word by changing one letter at a time, such that each intermediate step results in another valid word.
For example, a doublet between 'head' and 'tail' is
> head > heal > teal > tell > tall > tail
Finding doublets without assistance is satisfying and tests your vocabulary, but I knew I could have a computer find them through brute force. I wrote a Haskell program that, given a word list file, finds the shortest doublet between two words. Naturally, the quality of the results depends on the word list given to the program.
This program builds in-memory graphs whose nodes are labeled with words of the same length. That is, there will be one graph for words of length 0, one graph for words of length 1, one graph for words of length 2...
In each of these graphs, two nodes are connected if the two labels differ by only one letter. So, an edge would connect `book' and `look', but not `book' and `beak'. The shortest doublet is found by computing the shortest path between corresponding nodes.
This program can utilize multiple cores and has a simple REPL using the Haskell bindings to GNU readline. To build and run this program you the readline bindings and Martin Erwig's graph library, both available on hackage.
The source is available as a gzipped tar: doublets.tgz.
Thoughts on Version Control Systems
I have experience with four version control systems. Let's look at the pros and cons of each.
CVS
CVS may have been good 15 or 20 years ago. But today it is fragile and has a weak feature set:
- It does not have atomic commits or even ``commit sets''---commits consisting of changes to multiple files or directories---and so without meticulous coordination between committers it is easy to get a corrupted repository.
- It does not handle binary files well. By default CVS messes with newlines in the files it stores, and performs substitution for things that look like keywords. These are problems for non-text files. CVS was designed for storing text files and because of this extra steps have to be taken to allow proper treatment of binary files. Also, delta compression is not used for binary files, so repository size may balloon rapidly when tracking binary files.
- Administering a CVS repository is painful at best. pserver mode requires running a daemon or messing with xinetd, and one has to go through extra hoops to ensure security. One can also use ssh, but this requires new user accounts and a common group for the developers.
- Using CVS is painful. One must either set an environment variable for the current repository being worked on, or must pass extra parameters to the CVS commands.
- It doesn't handle renames well. History is tracked on a per-name basis, so when you rename, history becomes very difficult to work with (you will probably have to poke around ``attic'').
- Directories cannot be entirely deleted from a CVS repository without mucking in its internals.
- CVS does not offer any kind of merge tracking.
It's been a very long time since I used CVS and it was also the very first version control system I used, so I don't have more specific complaints. It is at least better than nothing at all.
Subversion
Subversion is referred to by its developers as ``CVS done right''. It offers many improvements:
- Atomic commits and ``commit sets'' (I avoid calling them changesets because this has a more precise meaning regarding storage model, which svn does not have). Changes to multiple paths can be bundled up as one commit. If for some reason the commit fails midway through (e.g. network troubles), the changes are rolled back---the repository has much less chance of becoming corrupted.
- Revisions are stored compactly, even for binary files: instead of storing a complete copy of a versioned object for each revision, only a delta is stored (this might not be entirely accurate, e.g. for performance reasons each nth revision may be stored in its entirity---but it is mostly accurate).
- Administration is slightly easier. There is an svn protocol which has most of the same issues as pserver for CVS. There is also HTTP(S) mode; svn can integrate with a webserver. It can also be set up to use SSH, but this is as much a pain in the ass as with CVS, and runs a greater risk of repository corruption than any of svn's web-based modes.
- Using Subversion is nicer than using CVS. Instead of having to set an environment variable or specify an extra repository parameter, most commands can figure out what to do when run within a working copy of a repository.
- Renames are handled more elegantly than with CVS---`svn mv' is implemented as a copy and delete.
Other advantages:
- Good integration with IDEs. CVS has this too. I don't use IDEs, so I don't care about this very much.
- Many people are familiar with Subversion already, so if you have to work with others who are not willing to learn new tools, it may be a reasonable choice.
Subversion has many issues, however:
- No support for merge tracking. Merging is hokey and painful.
- It's very slow! When it was young it was at least an order of magnatude slower than CVS for most operations. That has surely improved by now, but it is still comparatively quite slow. Mercurial is much faster, and git is supposedly even faster than Mercurial for many operations. Perhaps I'll benchmark version control systems one day.
- It is difficult to identify the differences between branches. One has to resort to scripting this or performing a pretend merge.
- It is more difficult than it should be to revert to older versions. One has to either merge with the old revision (which svn doesn't make easy), or cat the changes to each file you want to roll back and then commit.
- The supplied script for sending commit messages lacks desired features---like putting the first line of the commit message as the email subject.
- It is not too difficult to get a working copy completely screwed up. If you manage to delete the .svn directory in a versioned directory, you are toast---probably the only way to recover is to do a brand new working-copy checkout. I hope you commit your changes before you make this mistake.
- Did I mention that it is slow?
- Many commands are not available from the main svn program, and instead one must execute svn-PROGNAME or svnPROGNAME. These commands also do not all accept arguments in a uniform, coherent way (I ran into this the other day but don't remember the specific case).
- The commands that operate on a repository (rather than a working copy) do not accept raw paths: for example, if the repository is at ~/repos/silly_svn_repo, one cannot use that as an argument. One must type instead file:///PATH-TO-HOME/repos/silly_svn_repo. This is stupid. Paths without ``file://'' should be treated as though they do by default.
- Shitty/nonexistent man pages. Instead one must type ``COMMAND help''. It also bothers me that when one enters a command with bad parameters, you get just a pity error message followed by ``Type 'svn help' for usage''. I would rather have it print help by default when a bad command/bad parameters are entered.
- It is difficult or impossible to split up an existing repository or combine multiple repositories using the supplied tools. For example, for splitting, the svndumpfilter tool has issues with renames/deletions. This is a known flaw in the tool. Combining repositories only works properly if the histories do not overlap in time at all. Otherwise the history in the resulting repo will be borked.
- Tags and branches are not entities that Subversion knows about; instead they are merely convention of the users.
- Its commands don't pipe into a pager by default if they output more than one terminal screen. This is an annoyance.
I sympathize with Linus Torvalds' opinion of Subversion---it is broken by design.
Rational ClearCase
This is a classic example of overengineering, feature accretion, and poor design. I can't say anything good about ClearCase. Unfortunately it seems to be used in many large corporations (this is one reason not to work for such a company). I would rather use CVS or even no version control than ClearCase.
- It is much more bloated and probably even slower than Subversion, especially if you use dynamic views.
- Integration between operating systems is very painful.
- There are different UIs for each OS.
- Its reference manual is a book of over 1000 pages. That's bigger than most Robert Jordan novels. And that is only the users manual; there is an administrator's manual and others. WTF. And these aren't even freely available. I'm not sure if they are even included with each user license of ClearCase.
- It is absurdly expensive. Each license is over $4000 per year.
- There seems to be no way from the command line to list all files that you have made changes to with a single command. Instead you must resort to writing a shell script or piping between programs and using backticks or shell loops. So instead of saying something like ``ct status'' and getting a list of all modified files, you must type something like ``ct lsco -r -cview|xargs ct diff -pred -quiet''.
- You have to explicitly list files when you check in using the command line tools.
- Hardly anyone (or perhaps no one) fully understands it. It's too massive and poorly designed.
- The GUI tools on Unix/Linux are ancient---no tooltips, no mousewheel support, etc.
- Did I mention that it is incredibly slow?
- It is difficult to import a directory tree. On the first occasion I needed to do this, I didn't know of a command to do it, so I ended up scripting `clearcase add' for each file in a tree then waiting for a couple hours. When I had to do this again later, I tried to use the single command that adds a tree, but it required ClearCase administrator privileges. Go figure.
- The command line program's name is ``cleartool''. That's an awful lot of typing.
- With dynamic views, there are a couple different syntaxes for specifying a window of time to look at. These have undocumented differing semantics (I scoured the Jordanesque documentation for this and saw no mention. So it's probably a bug). For the curious, see my previous post.
- Coders spend lots of time fighting with the VCS rather than coding.
- Google ``clearcase evil twin''.
- It tends to require full-time administrators. That's a lot of overhead.
- Atomic commits and ``commit sets'' are not supported (?). Only single files can be checked in at a time. This is unacceptable in a modern version control system.
- It's not open source.
Common Limitations
There are also common problems to all three of the previously mentioned VCS systems due to their centralized nature.
- One must have access to the repository server to view or diff older revisions or (probably) view commit messages.
- Only privileged people have the ability to commit.
- If the system uses a lock/unlock concurrency model rather than an edit/merge model, you cannot do any development unless you have access to the repository server. ClearCase is this way, or at least was set up in such a way when I used it.
- They are slower than decentralized VCSs because most operations are non-local.
- A centralized repository is a single point of failure: if the server dies, you better have good backups; if the server goes down for a day, development will come to a standstill.
- Centralized development does not scale well. As the number of developers increases, so do lock contention (in a lock/unlock system) and merge necessity (in an edit/merge system).
Mercurial
I have been using Mercurial for my own projects for several months now. Because I have only been using it for my own stuff, I have not exercised its merge capabilities very well. Nevertheless, I still identify many advantages:
- Being decentralized, every operation save pushing and pulling from other repositories is local. You have access to all the capabilities of the system even without a network connection.
- Mercurial is fast (probably second fastest on most operations, with only git being faster). For example... need some benchmarks.
- By its decentralized nature, history is nonlinear and merges are tracked.
- It ships with tools to convert subversion, git, darcs, and CVS repositories.
- A Mercurial repository is typically more compact than a Subversion repository of the same stuff.
- It is easy to split or combine existing Mercurial repositories, and it ships with tools to do this (and they actually work).
- It is easy to write extensions. Much of the program is written in Python.
- It has good support for email. Revisions can be emailed directly and emailed revisions can be imported without much trouble.
- By its distributed nature, one can do work as normal on one feature, and once that feature is ready to be merged into another repository, the changesets can be rewritten into one ``final draft''. How often do you commit a bunch of changes only to immediately after realize you forgot one file? Or you made a typo in what you just checked in? With Mercurial you can merge those revisions into one (at least before you have merged with other repositories, at which point it gets messy) and make sure no ``garbage'' revisions appear in the project's history.
- Although it is decentralized, it can be used in a centralized fashion (one ``master'' repository that everyone checks into). This is how I use it with my own projects.
- It ships with some graphical tools for viewing history and such.
- It can be easily set up to use graphical merge tools if they are available.
- It has built-in patch capabilities, a la quilt.
- ``Pulling'' between repositories scales well. I understand the Linux kernel is developed in this way: Linus at the top, who pulls from a small number of committers he trusts, who pull from a small number of committers they trust... and so on. Merging changes gets distributed throughout this tree of contributors. The Linux kernel currently has over 1000 contributors.
- All contributors have access to all the features of the version control system. There is no technical distinction between those who can commit and those who cannot. Instead, the distinction is social and just defines whose changes get incorporated into the ``official'' project.
- By its decentralized nature, each working copy of a repository is in itself a repository. So there is not a single point of failure. Explicit backups are less important than with a centralized VCS.
- It can supposedly interoperate with git repositories.
One slight disadvantage of Mercurial compared to Subversion is that in the former, one cannot check out just a piece of a repository---one must check out the entire thing.
Here's how I rank these version control systems that I have experience with:
- Mercurial
- Subversion
- CVS
- ClearCase
Understand that the difference for me between ranks 1 and 2 is immense---I quite dislike the three besides Mercurial that I have used. Many of its benefits come from its decentralized nature.
I would almost always recommend Mercurial or probably any other free, distributed VCS (git and bazaar come to mind) over Subversion, ClearCase, or CVS---but it depends whom you will be working with, development platform, and how hard it would be for them to learn new concepts and new tools.
ClearCase: Version Control for Masochists
At this place I used to work, they had ClearCase set up to use dynamic views---in that mode it integrates with the filesystem and uses a ``config spec'' to determine which versions of files to show. (I'm not sure why this is considered a good idea. Sure, only one copy of the repository is needed this way, but then a very fast network and repository server are needed. This ``config spec'' strategy also allows mixing and matching of versions, which really seems like a bad idea---your development process can become _very_ tightly coupled to the version control system, so you could switch to something else only with great difficulty.)
Anyways, I needed to branch off of an old version of ``mainline'', or ``trunk'', or whatever you will call it. Here is my first attempt at a config spec:
element * CHECKEDOUT element /vobs/project/... /main/some_project_branch/LATEST element /vobs/project/... /main/LATEST -time 13-Jan-2006 -mkbranch some_project_branch element * /main/LATEST
Basically, first look at checked out versions of elements. Next, for everything under project, look at the version in some_project_branch. If there isn't a version on some_project_branch, look at the version on mainline from January 13, 2006 and create a version on the branch. For everything else, look at the version on mainline.
After creating new directories on the branch using this config spec, I would see ``[no version selected]'' from ``ct ls'' and ``No such file or directory'' when doing a normal ls. Very odd. I scoured the documentation and came up with my second attempt:
element * CHECKEDOUT element /vobs/project/... /main/some_project_branch/LATEST time 13-Jan-2006 element /vobs/project/... /main/LATEST -mkbranch some_project_branch end time element * /main/LATEST
This should be exactly the same as the previous, just using different syntax. This config spec didn't cause any of the aforementioned errors, but still didn't work completely. New elements created in this view would not be created on the branch. Now why are the two syntaxes for time semantically different? No one at the company knew. Bug or feature, it's definitely a ClearCase wart that doesn't seem to be documented anywhere. My third attempt:
element * CHECKEDOUT element /vobs/project/... /main/some_project_branch/LATEST time 13-Jan-2006 element /vobs/project/... /main/LATEST -mkbranch some_project_branch end time element /vobs/project/... /main/LATEST -mkbranch some_project_branch element * /main/LATEST
This time it finally did what I wanted---branched off of an old version of main, and put new files and directories on the branch. This task took me about an hour to figure out. It should not have been so difficult.
It is prudent never to trust completely those who have deceived us even once.
I spent nearly 3 days hunting for a bug in a queue data structure we are using in LVM. The queue is used to store a worklist of references to be marked during garbage collection. An LVM object reference (object_reference_t) is a 64-bit unsigned integer.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | class TCircularArray{ private: // Backing array. object_reference_t* array; // Size of the array. u_int32_t array_size; // Index of first used spot in the queue. u_int32_t start; // Index of first free spot in the queue. u_int32_t end; // ...snip... public: // Constructor. TCircularArray(u_int32_t start_size); // ...snip... /** Adds an item to the end of the queue. */ void Append(object_reference_t a); /** Removes the item at the front of the queue. */ object_reference_t GetAndRemove(); }; |
The queue is implemented as an expandable circular array---it has a backing array, a start index and an end index, and the backing array grows as needed. The value 0 is used to signify an unused spot in the queue. Pretty simple.
Here is the implementation of the interesting functions, sanitized for comprehension:
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 | // We allocate enough space for the requested size // using calloc, which zeroes the memory it provides. TCircularArray::TCircularArray(u_int32_t start_size){ array_size = start_size; array = (object_reference_t*) calloc(array_size, sizeof(object_reference_t)); assert(array); // Make sure we actually got memory. start = end = 0; } void TCircularArray::Append(object_reference_t a){ // The queue should never be completely full---we grow // when needed to preserve this invariant. assert(array[end] == 0); // Make sure we're not inserting the sentinel value by // mistake. assert(a != 0); // ...snip code to enlarge the array if needed... // Insert to end of queue and increment end index. array[end] = a; end = (end + 1) % array_size; // Again, shouldn't ever be completely full. assert(array[end] == 0); } object_reference_t TCircularArray::GetAndRemove(){ // Return sentinel if empty. if (is_empty()) return 0L; object_reference_t current = array[start]; // Empty the front of the queue. array[start] = 0L; // Increment the start index. start = (start + 1) % array_size; return current; } |
Upon construction, the backing array is allocated using calloc. So, the backing array should be initialized to all 0s---our sentinel value---the queue starts empty. Append adds an item to the queue, which stores the item in the backing array at the end index, then increments the end index. GetAndRemove sets the array at the start index to 0, increments the start index, and returns the item at was at the old start index. Again, pretty simple.
When running JCheck, a model checker we are using for benchmarks, I sometimes saw the assertion at the end of Append fail: the index to the end of the queue pointed to a non-empty spot in the array. Sometimes the entire array was non-empty. Not good.
This assertion failure was seemingly nondeterministic. Sometimes the program ran for a few minutes before aborting, and other times it happened right away, with the exact same input. I saw this behavior only on machines in our older cluster. ``Data race!'' was my first thought---LVM is heavily multithreaded. Improper synchronization is often the cause of nondeterminism in programs.
I looked very closely at the thread synchronization code, added many asserts, and found a few other bugs, but nothing that would explain the nondeterminism. It looks like a bug in calloc---it is supposed to return memory that has been zeroed, but sometimes on those certain machines, this wasn't happening.
I switched to a normal malloc + memset, and have not seen any assertion failures from TCircularArray since. I searched briefly for other people experiencing calloc bugs, but only found out that many implementations of calloc do some complex things with the MMU (when I told Ronald about my suspected calloc bug, he said very seriously, ``calloc does magic. Don't you like magic?''). Very tricky. I've so far been unable to reproduce this suspected calloc bug outside of LVM.
The moral of this story: don't trust the OS facilities to actually do what they say they do.



