Does perfect code exist? (Abstractions, Part 1)

Bryan Cantrill recently wrote a blog entry where, among other things, he philosophized on the concept of “perfect code”. He compares software to math, arguing that Euclid’s greatest common divisor algorithm shows no sign of wearing out, and that when code achieves perfection (or gets close to perfection), “it sediments into the information infrastructure” and the abstractions defined by that code become “the bedrock that future generations may build upon”. Later, in the comments of his blog, when pressed to give some examples of such perfection, he cites a clever algorithm coded by his mentor to divide a high resolution timestamp by a billion extremely efficiently, and Solaris’s “cyclic subsystem”, a timer dispatch function.
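
For what it’s worth, Euclid’s algorithm really is tiny; a minimal sketch in C (my own illustration here, not anything from Bryan’s post) looks like this:

    /* Euclid's algorithm: greatest common divisor of two non-negative integers. */
    unsigned long gcd(unsigned long a, unsigned long b)
    {
        while (b != 0) {
            unsigned long t = b;
            b = a % b;  /* the remainder preserves every common divisor of a and b */
            a = t;
        }
        return a;
    }

It is hard to imagine that function ever needing a version 2.0.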

Watching his talk at Google, with its introduction and sidebar book review of Scott Rosenberg’s “Dreaming in Code”, it’s clear that he very passionately believes that it is possible to write perfect code, and that one should strive for that at all times. Perhaps that’s because he mostly writes code for operating systems, where the requirements change slowly, and for one OS in particular, Solaris, which tries far harder than most software projects to keep published interfaces stable for as long as possible. In contrast, the OS I’ve spent a lot of time hacking on, Linux, quite proudly states that at least inside the kernel, interfaces cannot and should not be stable. Greg Kroah-Hartman’s “Stable API Nonsense” is perhaps one of the strongest and most passionate expositions of that philosophy.

I can see both sides of the argument, and in their place, both have something to offer. To Bryan’s first point, it is absolutely true that interfaces can become “bedrock” upon which entire ecosystems are built. Perhaps one of the most enduring and impactful examples is the Unix programming interface, which has since become enshrined by POSIX.1 and its successor standards. I would argue, though, that it is the interface that is important, and not the code which initially implemented it. If the interface is powerful enough, if it appears at the right time, and if the initial implementation is good enough (not perfect!), then it can establish itself by virtue of the software which uses it becoming widespread enough that it assumes an importance all out of scale with its original intention.

Of course, sometimes such an interface is not perfect. There is an apocryphal story that when Stuart Feldman at Bell Labs first wrote the ‘make’ utility, he did so rather quickly, made it available for his fellow lab members to use, and then went home to sleep. In some versions of the story he had stayed up late and/or pulled an all-nighter to write it, so he slept a long time. When he came back to work, he had come up with a number of ways to improve the syntax of the Makefile. Unfortunately (so goes the story), too many teams were already using the ‘make’ utility, so he didn’t feel he could change the Makefile syntax. I have no evidence that this ever took place, and I suspect it is an urban myth that was invented to explain why Makefiles have a rather ugly and unfortunate syntax, with quirks that many would call defects, including the use of syntactically significant tab characters which are indistinguishable from other forms of leading whitespace.

Another example, which is the bane of filesystem designers everywhere, is the set of Unix readdir(2), telldir(2), and seekdir(2) interfaces. These interfaces fundamentally assume that directories are stored as linear linked lists, and filesystems that wish to use more sophisticated data structures, such as b-trees, have to go to extraordinary lengths in order to support these interfaces. Very few programs use telldir(2) and seekdir(2), but some filesystems, such as JFS, maintain two b-trees instead of one just to cater to telldir/seekdir.
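
To make the shape of the problem concrete, here is a minimal sketch (my own illustration, with error handling mostly omitted) of the promise these interfaces make: telldir() hands back an opaque position in the directory stream, and seekdir() must later return the stream to exactly that point, no matter how the filesystem actually stores its directories.

    #include <dirent.h>
    #include <stdio.h>

    int main(void)
    {
        DIR *dir = opendir(".");
        struct dirent *de;
        long pos;

        if (!dir)
            return 1;
        readdir(dir);                /* consume the first entry...           */
        pos = telldir(dir);          /* ...and remember "where we are"       */
        while (readdir(dir) != NULL)
            ;                        /* read the rest of the directory       */
        seekdir(dir, pos);           /* the filesystem must get us back here */
        de = readdir(dir);
        if (de)
            printf("resumed at: %s\n", de->d_name);
        closedir(dir);
        return 0;
    }

An offset into a linear list makes a perfectly good cookie for telldir() to return; a stable position in a b-tree that may be rebalanced underneath you does not.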

And yet, it is absolutely true that interfaces can be the bedrock for an entire industry. Certainly, whatever its warts, the Unix/POSIX interface has stood the test of time, and it has been responsible for the success of many a company and many billions of dollars of market capitalization. But is this the same as perfect code? No, but if billions of dollars of user applications are going to depend on that code, it’s best if the code which implements such an interface is high quality, and it should attempt to achieve perfection.

But what does it mean for code to be perfect? For a particular environment, if the requirements can be articulated clearly, I can accept that code can reach perfection, in that it becomes as fast as possible (for the given computer platform), it handles all exception cases, etc., etc. Unfortunately, in the real world, the environment and the requirements inevitably change over time. Take, for example, Bryan’s cyclic subsystem, which he proudly touts as being, if not perfect, then almost so, and which executes at least 100 times a second on every Solaris system in the world. I haven’t looked at the cyclic subsystem in any detail, since I don’t want to get myself contaminated (the CDDL and GPLv2 licenses are intentionally incompatible, and given that companies — including Sun — have sued over IPR issues, well, one can’t be too careful), but waking up the CPU from its low-power state 100 times a second isn’t a good thing at all if you are worried about energy conservation in data centers — or in laptops.

For example, on my laptop it is possible to keep wakeups down to no more than 30-35 times a second, and it would be possible to do better but for an abstraction limitation. Suppose, for example, a process wants to sleep and then receive a wakeup 1 millisecond later, and so requests this via usleep(). At the same time, another application wants to sleep until some file descriptor activity takes place, or until 1.2 milliseconds have elapsed. 0.3 milliseconds later, a third process requests a sleep, this time for 0.8 milliseconds. Now, it could be that in all of the above cases, the applications don’t actually need exact timing; if they all get their wakeups plus or minus some fraction of a millisecond, they would be quite cool with that. Unfortunately, the standard timer interfaces have no way of expressing this, and so the OS can’t combine the three wakeups at T+1.0, T+1.1, and T+1.2 milliseconds into one wakeup at T+1.1ms.
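
To make the arithmetic concrete, here is a toy sketch of what a timer request carrying an explicit tolerance might look like, and how the three requests above could then be collapsed into a single wakeup. This is purely hypothetical code for illustration; it is not any real kernel or libc interface.

    #include <stdio.h>

    /* Hypothetical: a timer request that carries its own accuracy requirement. */
    struct fuzzy_timer {
        double expires_ms;      /* desired wakeup time, relative to T */
        double tolerance_ms;    /* how far the wakeup may be moved    */
    };

    /* Return 1 if a single wakeup at 'when' satisfies every request. */
    static int satisfies_all(const struct fuzzy_timer *t, int n, double when)
    {
        for (int i = 0; i < n; i++)
            if (when < t[i].expires_ms - t[i].tolerance_ms ||
                when > t[i].expires_ms + t[i].tolerance_ms)
                return 0;
        return 1;
    }

    int main(void)
    {
        /* The three sleepers from the example, each tolerating +/- 0.2ms. */
        struct fuzzy_timer timers[] = {
            { 1.0, 0.2 }, { 1.2, 0.2 }, { 1.1, 0.2 },
        };

        if (satisfies_all(timers, 3, 1.1))
            printf("one wakeup at T+1.1ms serves all three sleepers\n");
        return 0;
    }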

So this is where Greg K-H’s “Stable API Nonsense” comes into play. We may not be able to solve this problem at the userspace level, but we darn well can solve it inside the kernel. Inside the kernel, we can change the timer abstraction to allow device drivers and kernel routines to provide a timer delta plus a notion of how much accuracy is required for a particular timer request. Doing so might change a timer structure that external device drivers had previously depended upon — but too bad; that’s why stable ABIs/APIs are not supported for internal kernel interfaces. Could the interface have been extended instead? Well, perhaps, and perhaps not; if an interface is well designed, it is possible it can be extended in an API- and/or ABI-compatible way, as sketched below. There is usually a performance cost to doing so, and sometimes it may make sense to pay that cost, and sometimes it may not. I’ll talk more about that in a future essay.
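
Here is one way such a compatible extension might look, sketched with hypothetical names (this is not the actual Linux timer API): the new entry point takes an accuracy hint, and the old entry point survives as a thin wrapper that asks for exact behaviour, so existing callers keep working but simply never benefit from wakeup batching.

    /* Hypothetical internal timer interface, for illustration only. */
    struct ktimer {
        unsigned long expires;  /* desired expiration, in jiffies  */
        unsigned long slack;    /* acceptable lateness, in jiffies */
    };

    /* New interface: the caller states how late the timer may fire, so the
     * core can batch nearby expirations into a single CPU wakeup. */
    void ktimer_add_slack(struct ktimer *t, unsigned long expires,
                          unsigned long slack)
    {
        t->expires = expires;
        t->slack = slack;
        /* ...queue the timer; nearby timers may now share one wakeup... */
    }

    /* Old interface preserved for existing callers: a slack of zero means
     * "wake me exactly on time", i.e. no batching. */
    void ktimer_add(struct ktimer *t, unsigned long expires)
    {
        ktimer_add_slack(t, expires, 0);
    }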

Yet note what happened to the timer implementation. We have a pretty sophisticated timer implementation inside Linux, one that uses heap data structures and buckets of timers for efficiency. So while I might not call it perfect, it is pretty good. But, oops! Thanks to this new requirement of energy efficiency, it will likely need to be changed to support variable levels of accuracy and the ability to fire multiple timers that are (more or less) coming due in a single CPU wakeup cycle. Does that make it no longer perfect, or no longer pretty good? Well, it just had a new requirement impact the code, and if the criterion for perfection is for the abstraction defined by the code to be “bedrock” and never-changing, with no need to make any changes to said code over multiple years, then I would argue that little to no code, even OS code, can ever achieve perfection by that definition — unless that code is no longer being used.

(This is the first of a multi-part series of essays on abstractions that I have planned. The next essay which I plan to write will be entitled “Layer surfing, or jumping between layers”, and will be coming soon to a blog near you….)

15 thoughts on “Does perfect code exist? (Abstractions, Part 1)”

  1. funny. well-known rename race survived years in solaris code. indeed, that’s because of perfect and stable VFS.

  2. Interesting post. Even in our so-called “science”, what’s “right” is contextual and often changes over time. Aesthetic considerations are important, too.

  3. “Aesthetic considerations are important, too.”

    Aesthetic (as in elegance, simplicity) should be a part of perfection.

    For instance, in /etc/fstab there is lately the disease of not using something like

    root=/dev/hda2

    but instead something like

    root=UUID/()=-2-252598259802589025 890258902589025189205890525289025

    ok, that’s not the right number ;)
    But this is not beauty for sure. It is not elegant. It makes life for humans harder. If only computers work with stuff like that, why is there even an /etc/fstab file at all, if not because humans modify it? I dare claim that whoever proposed this idea for humans either didn’t care, or was ignorant of the fact that this is an ugly, non-transparent decision. This mindset should go away, and in fact this is why we should strive for perfection, so that we realize what is ugly and bad, and do away with it … like do away with perl ;)

  4. @alex – hehe! The Linux-is-faster bugs have survived since 1999 too. No guess required – slowness is part of backwards compatibility. (Some Solaris 2.6 era application might break if the threads switch context too fast or something :)

  5. You picked an unfortunate example: the cyclics sub-system is deadline-based and only wakes up when requested. It’s currently /used/ by the heartbeat HZ timer, but that’s something outside of cyclic.c itself, and so isn’t a problem with the cyclics code (indeed, fixing that and other non-tickless usages should be do-able without changing the cyclic code at all). I don’t believe that the “accuracy” part is really worth it; you can just choose between normal periodics (most cases) or real-time.

  6. The problem is that any use of periodics is a really bad idea from a power management standpoint — whether it is 100 times a second or once every two seconds. And if it is once every two seconds, then accuracy makes a huge difference, since accuracy down to the millisecond normally isn’t important at the one-second granularity scale, and it makes a huge difference to wake up the CPU once and do all of your “once a second” housekeeping tasks rather than to wake up a dozen times to do the same amount of housekeeping.

  7. If there is perfect code, I believe it is demonstrated in a subset of Haskell’s Prelude. While people’s opinions on Haskell overall vary greatly among different groups, I believe the basic function definitions for the language are the cleanest, most math-worthy definitions we will ever have of them. They are so perfect they almost deserve to be etched into a giant stone tablet somewhere ;-)

    (List append)
    [] ++ ys = ys
    (x:xs) ++ ys = x : (xs ++ ys)

    map f [] = []
    map f (x:xs) = f x : map f xs

    sum [] = 0
    sum (x:xs) = x + sum xs

  8. I think the word “perfect” may be problematic here — I was seeking to use the word to rhetorically sharpen a point, not to boast about work that I had done or to proclaim my own infallibility. For the point that I was trying to make — and specifically for the code that I referenced — I think “correct” is the much more apt term. The cyclic subsystem is certainly not perfect in a broad never-can-be-improved sense, but I do believe that it is largely correct. That said, I also take issue with the specific problems that you appear to think you have with its design and implementation, but those issues are orthogonal to my point about correctness. And besides, resolving them would require me to — in your parlance — “contaminate” you, and we certainly don’t want that! ;)

  9. Bryan,
    I agree that “perfect” probably caused red flags to fly, but “correct” has similar problems in that its meaning can’t be bound until you specify “the set of requirements the code is meant to meet”. The bigger problem in your blog entry, which I tried to call out, is that you conflated interfaces and the code which implements them — and the presumption that it is necessary for them to be perfect in order for them to become bedrock. I think we can point out multiple examples where less-than-perfect code and interfaces have become the bedrock on which vast amounts of code and functionality have been built.

    Furthermore, while my passion, like yours, involves making the portions of the bedrock that I am responsible for as seismically stable as possible, I don’t think that really helps solve the general problem of complex software engineering at the scale of Chandler, Lotus Notes, Microsoft Office, Sun’s Java System Application Server, Tivoli TotalStorage Productivity Center, and other Very Large Products which include complex UIs — something which those of us who do OS engineering generally don’t have to worry about much.

    Finally, while I’m not particularly interested in getting contaminated, I will argue that if cyclic’s job is to run code on some regular period, such as the HZ heartbeat timer code, that is functionality which is never a good idea if you are truly interested in power saving. To get there, code should be woken up only when it needs to be, and ideally batched with other code that needs to be woken up at roughly the same time, within the latitude given to the timer subsystem by each request’s accuracy requirements. There should be no reason why a low-level regular heartbeat callback should be called on a system which is trying to save power in a truly radical way. (What? You have a blinking cursor? That should be turned off, or if you must have a blinking cursor, it should blink at the same time as all other blinking UI elements. The former is still preferable, though! :-)

  10. I’m not conflating code and interface; rather, my point was that the sedimentation of an interface often allows for new latitudes of innovation in terms of the code (because one no longer has to worry about getting the abstractions right — one can just focus on innovating beneath them). And again, if it makes it easier, don’t think about perfect code — think about correct code. Much of that “less-than-perfect” code that you describe as becoming bedrock is — other faults aside — correct.

    In terms of the relevance to software engineering in-the-large: remember, I’m not maintaining that deep software engineering is representative of all of software engineering, but rather that the Chandler-esque cock-ups are themselves but one facet of our craft — and not necessarily the most instructive facet.

    Finally, as for cyclics, a little contamination would probably do you some good. ;) As John Levon mentioned, the cyclic subsystem is entirely deadline scheduled; if there are no cyclics, there is no interrupt — there is nothing about its design or implementation that is mutually exclusive with the kind of “radical” power savings that you describe. (As background, the design center for the subsystem was to allow arbitrary resolution interval timers, a must-have feature when developing a real-time system. Solaris is used as the basis for quite a few such systems, especially in the defense and telecommunications sectors.)

    But all of this serves to remind me that I would have been better off pointing to some subset of the tens-of-thousands of lines of DTrace implementation, for which there exists no Linux analogue to confuse matters. ;)

  11. Bryan,
    The problem I have with “correct” is that its definition is still extremely squishy and dependent on an individual’s needs. When you say that code which has become bedrock is, “faults aside”, correct: if those faults cause special headaches for implementors, or if deployed code relies on ambiguities in the interface specification such that it can’t be changed, people might take exception to the description of the code or the interface as “correct”.

    As far as interval timers are concerned, this is a great example of what I am talking about. The mere fact that you have cyclic interrupts means something is non-ideal from a power-savings point of view; so if the cyclic subsystem is being used at all, something is almost certainly using more power than it should. You might argue that it can’t go away because an interface (BSD and/or POSIX interval timers) promises that the functionality will exist, but that’s the point. What if the interface is busted, or shouldn’t exist, or should be used extremely rarely? I can think of a few examples where perfectly exact interval timers are needed — for example, real-time data collection applications, and servo control — but most of the time, applications don’t need precise interval timers. Yet they may very well end up using that particular interface because nothing better existed at the time.

    So to fix these sorts of problems, sometimes you have to change or augment interfaces that had been previously thought of being bedrock. Sometimes it’s not enough to assume that a particular interface is a given, and only try to implement innovations underneath such an interface; instead, you have to somehow innovate around, over, or through a particular interface/abstraction. And yes, sometimes that means trying to find all of the users of that interface, and convincing them to change how they use it.

  12. I think something like GMP (adding and multiplying big integers) can be made perfect, or as perfect as possible. GMP is certainly not trivial software — it is actually incredibly complex. But “it sediments into the information infrastructure” indeed. Witness that GCC 4.3 requires GMP to build.

  13. Pingback: Sp3w
