A criticism of Ruby
Status: Done
Confidence: Possible
This is a criticism of the Ruby language and its community. Some of the criticisms point out fundamental errors in the language design, or poor choices in what historical examples to follow. Others are about errors in the process of developing the language that have reduced its usability.
I have intentionally avoided criticisms that are based merely on my own familiarity with some particular language. This means I have had to lay out a clear statement of the criteria a given language construct is supposed to address in practice, with examples of how various languages have handled it, and only then turn to how Ruby has failed in this area.
There are things about Ruby that I think are poor choices—using symbols like @ and $ instead of spelling out names, or Perlisms such as $/ for the newline separator. I can make cases for my tastes here, but I acknowledge these as questions of taste. I am concerned here with substantive errors in Ruby’s language design, often cases where an infelicitous combination of small choices led to a cascade of complexities.
This document is necessarily negative about Ruby. Every section culminates with a criticism of Ruby. I have used what I found frustrating about Ruby as a lens for examining issues of language design. If I had used C++ or PL/I as a lens, then this document would be a sequence of negative statements about those languages instead. I must admit that the choice of Ruby was no accident, and I felt a certain gleeful sadism in the dissection, perhaps in revenge for my own frustrations in programming in the language.
Documentation
The degree and form of documentation is nearly uniform within a programming language community. Look at a language, its standard library, and the libraries and tools in common use, and you will find that the organization, the amount of detail, the structure of the reference material, and the form in which the documentation is presented and distributed are all nearly uniform. For those who work in a single community this may seem unremarkable, but for those of us who wander (more or less uncomfortably) among many communities, it is a cause for astonishment.
The appendix goes into detail about the documentation cultures of several programming communities, but there are some defining questions:
- How is expository and reference information organized? Is API documentation completely separate from tutorials, as in Java, or does every API reference begin with a short introduction and some examples of use, as in Python?
- How closely intertwined are code and documentation? They may be completely separate, as in C, or one and the same, as in Knuth’s literate programming.
- How is the documentation accessed? Is it from static manuals, even hard copy books, is it found by querying a running instance of the language runtime, or both?
- How much detail is typical in the expository and reference information for the language? This can range from scanty vignettes as in BioConductor to Common Lisp’s closely defined standard.
Each of these decisions has tradeoffs. If I use docstrings extracted by a running image, then a piece of code that defines a large number of functions on the fly, none of which have an explicit representation in the source code, can easily give them documentation accessible like any other function. On the other hand, it is easy to lose the docstring on an object when transforming it in the instance (as with Python’s decorators)—a problem that static comments a la Java do not have. Other criteria have some choices that are clearly superior. The lack of both complete exposition and complete reference documentation in BioConductor’s vignettes, for example, is inferior to Common Lisp’s standard by all criteria except saving the implementor the time and effort to write documentation.
What are Ruby’s characteristics in this area?
- Expository information is separate from reference information. Reference information is provided in HTML format. Expository information is scattered among books, blog posts, and tutorials.
- The reference documentation is extracted from the source code, though there are at least three separate tools—RDoc, TomDoc, and YARD, each with its own formatting conventions—for doing so.
- The documentation is accessed via static manuals. Ruby does not use docstrings or provide documentation in the runtime.
- The reference documentation is sketchy. It is typical for a Ruby programmer to refer to the source code to figure out the semantics of a function.
Interfaces, protocols, and abstractions
What are the methods on a socket object? Most programmers will immediately respond: read, write, and close. There may be more—whether the socket has data ready to read, is it closed already, and various others depending on the exact semantics a programmer learned for sockets—but those three will be universal. What methods will a stream have? Read and close at least. Such fixed sets of methods on a group of types we refer to as an interface.
All languages have interfaces. In Java, they are explicitly called interfaces and are distinct entities in the language:
public interface Stream {
public String read(int n);
public void close();
}
Clojure and Haskell give them their own existence under the names “protocol” and “typeclass”, respectively. Common Lisp, Python, and (weirdly) C don’t give interfaces a language entity, but make heavy use of them in practice. Any object in Python with a read method and a close method, obeying some basic semantics, is usable as a stream, whether it is a file or a network socket.
In C, we can write a generic function over streams by passing, in addition to the stream to work with, a function to read from that stream and a function to close that stream. Such interfaces are essentially untyped, but they are common practice among good practitioners of the language, and are found even in the C standard library. For example, qsort (a quicksort function) has the signature
void qsort(
void *base,
size_t nel,
size_t width,
int (*compar)(const void *, const void *)
);
The first three arguments give a pointer to an array of memory, the number of elements in the array, and the number of bytes per element. The last argument is a function to compare two elements. The actual types and widths of the elements, and the types and widths the function expects to operate on, are completely unavailable to the compiler.
Interfaces extend beyond simple things like streams. Something as complicated as a SAX parser has an interface. There may be multiple SAX parsers in a language, with different tradeoffs of speed, memory use, ease of installation, etc., but there is no reason that they should not all share an API for SAX parsing. Python has codified this. Many elements of its standard library have a version written in Python (for portability) and one written in C (for speed). The C version has its library name prefixed with a lowercase ‘c’, but the API within the libraries is identical. So whether you use ElementTree or cElementTree, your code should produce identical results, but it will run much faster with the latter.
The most obvious argument for strict interfaces is reducing how much a programmer must memorize. You avoid having distinct blocks of code to handle this kind of socket versus that kind of socket. But the real argument for interfaces is not what they save you from having to do, but what they make possible. For example, you can define a function that takes two streams, and returns a stream which concatenates them, or that returns a stream with all XML doctype definitions removed, or returns a stream that allows you to peek an arbitrary number of characters ahead into the stream it is transforming. These are all still streams. The glory of interfaces is not that they save you work, but that they make disparate types with common behavior into something that can be combined and transformed. Any one of these transformed streams can be used as an argument to any other of them. They have become an algebra. In the absence of an interface, none of these transformed streams are usable by existing code.
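The stream algebra described above can be sketched in Ruby. ConcatStream is a hypothetical illustration, assuming a minimal stream interface of read(n) and close; because the result satisfies the same interface, it can itself be fed to any other stream transformer.

```ruby
require "stringio"

# A stream that concatenates two streams, itself usable as a stream.
class ConcatStream
  def initialize(first, second)
    @streams = [first, second]
  end

  # Read up to n bytes, draining the first stream before the second.
  # Returns nil at end of input, matching IO#read with a length.
  def read(n)
    while (s = @streams.first)
      chunk = s.read(n)
      return chunk if chunk && !chunk.empty?
      @streams.shift
    end
    nil
  end

  def close
    @streams.each(&:close)
  end
end

s = ConcatStream.new(StringIO.new("foo"), StringIO.new("bar"))
s.read(3)  # => "foo"
s.read(3)  # => "bar"
s.read(3)  # => nil
```

Because ConcatStream responds to read and close, ConcatStream.new(a, ConcatStream.new(b, c)) is equally valid: the combinators compose.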
This is one of Ruby’s weaknesses, possibly because the documentation is sparse and it is impossible to satisfy an undocumented interface. A few (by no means comprehensive) examples:
- TCPSocket and SSLSocket do not name their read and write methods the same thing.
- The SAX parsers in REXML (which is pure Ruby) and Nokogiri (which is a binding to the C library libxml2) differ only cosmetically. Their constructors are slightly different, and the names of the event handling functions they expect are different, though they do precisely the same thing. There is no reason they could not have identical APIs.
- The XML libraries, and everything else that uses a stream, don’t use any well defined interface, so writing stream transformers is an exercise in frustration: run, see what method the library tried to use that was missing, implement it, repeat.
Namespaces, compilation units, and modules
I know of three concepts in computer science for organizing large amounts of code. The three are orthogonal. That is, if we write out the algebraic properties satisfied by the operations on them, there are no properties relating operations of one to operations of another (though there are optimizations in compilation that can be made that interact among them).
The three concepts are:
- Compilation units to control how much work a compiler must do to recompile a program when parts of it have changed.
- Namespaces to make the bindings of symbols predictable.
- Modules to define interfaces among components of the system.
All languages in general use today provide at least a partial implementation of all three, at least by social convention if not by actual language support.
Many languages provide a notion of library, package, or assembly, but these can be seen as recursion in these three notions: a library as a namespace containing namespaces, versioning of packages as a compilation unit containing compilation units.
Compilation units
When recompiling a program, the simplest way is to compile all of the code from scratch, as if this were the first time it had been compiled.
In practice, this is often impractical. Some systems require hours or even days to build. In systems where the build time is long, the code going into the present build usually differs only slightly from the code that went into previous builds. To take advantage of this, we can divide a program into pieces and draw a directed, acyclic graph between the pieces, with one piece linked to another if its code, in the course of its life, will transfer control to a piece of code in the other. When we change code in one piece, we only need to recompile it and any pieces with a path to it in the graph. For systems with thousands or tens of thousands of pieces, this can be a remarkable speedup. These pieces are what we refer to as compilation units.
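The recompilation rule above can be written out directly as a sketch. The dependency graph here is hypothetical, and real build systems track file timestamps or hashes rather than an explicit changed set.

```ruby
# Hypothetical dependency graph: unit => units whose code it calls into.
DEPS = {
  "main"   => ["parser", "lexer"],
  "parser" => ["lexer"],
  "lexer"  => [],
}

# A unit must be recompiled if it changed, or if any unit it depends
# on (transitively) must be recompiled. The graph is assumed acyclic.
def needs_rebuild(unit, changed, deps, memo = {})
  memo.fetch(unit) do
    memo[unit] = changed.include?(unit) ||
      deps[unit].any? { |d| needs_rebuild(d, changed, deps, memo) }
  end
end

DEPS.keys.select { |u| needs_rebuild(u, ["lexer"], DEPS) }
# => ["main", "parser", "lexer"]
DEPS.keys.select { |u| needs_rebuild(u, ["main"], DEPS) }
# => ["main"]
```

Changing the leaf unit forces everything with a path to it to rebuild; changing the root forces only the root.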
In most languages today, compilation units are some combination of files. In Python, the compilation unit is a single file. In C and C++, it is a source file plus one or more headers.
For a compilation unit, we expect to be able to compile it, to be able to measure whether it has changed since the last compilation, and, for any pair of compilation units, to determine whether changes to one will require the other to be recompiled.
Namespaces
Say you decide to use a third party library when writing your program. You don’t want to worry about what every binding in that library is, and whether you are going to collide with it when making your own bindings. Further, you want to be able to write your code to be interpreted in the context of a known set of bindings. Yet you also want to be able to override existing bindings, or attach bindings from other libraries to your current context. Handling these cases leads us to namespaces.
A namespace encapsulates a set of bindings—functions, classes, constants, macros, or whatever other constructs the language allows a name to be assigned to—so they are not impacted by bindings in other namespaces. To make namespaces useful they must have what I call the “relocatability property”: if I move some code from one namespace to another, then attach that namespace, there should be no change in the behavior of the program.
In languages without explicit namespace support, such as C, Smalltalk, and Emacs Lisp, developers usually prefix their bindings with a library name. Every binding in GLib in C is prefixed with g_. All of org-mode’s bindings are prefixed with org-. If everyone adheres to this convention, leaving unprefixed symbols to the default language and its standard library, then there need be no namespace collisions.
Continuously prefixing everything gets awkward quickly, so languages with more explicit namespace support, such as C++ and Python, allow you to attach part or all of a namespace, possibly qualified or renamed in some systematic way, to another namespace. In Python, where namespaces are files (which are also compilation units and modules as well) you can write
import something
something.f()
import something as new_name
new_name.f()
from something import f
f()
from something import *
f()
In C++, where namespaces are separate from compilation units, you can do the same thing:
namespace something {
void f() { … }
}
something::f();
namespace new_name = something;
new_name::f();
using something::f;
f();
using namespace something;
f();
There is another notion of namespace which is similar enough to be justifiably called a namespace, but different enough to be confusing: different syntactic usages of a symbol may refer to different bindings. That’s obscure, but a bit of obfuscated Java will make all clear:
public class Main {
    public static class T<T> {
        public T T;
        public T(T value) {
            this.T = value;
        }
    }

    public static String T() {
        return "Hello, World!";
    }

    public static void main(String[] argv) {
        String value = T();
        T<String> T = new T<String>(value);
        System.out.println(T.T);
    }
}
Everything in sight is called T, but almost every T refers to something different. In Java, the same symbol can refer, based on its syntactic position, to
- a local variable
- a function or method
- a type or package
- a generic type parameter
The extreme cases of this kind of namespace are Common Lisp, where you can add your own namespaces of this kind to the language (and Common Lisp already has five or six of its own built in), and, at the other end of the spectrum, Scheme, which has one namespace for everything.
These are also namespaces, but the operations on namespaces that we define next make no sense on them.
Returning to our namespaces for encapsulating bindings in code, there is a clear set of operations to support. We must be able to import the bindings from one namespace into another, we must be able to map a namespace into another with all its bindings qualified, typically by a prefix, and we must be able to extract a subset of a namespace.
These operations don’t always map directly to a language’s primitive constructs. I have chosen them because they are easy to write as functions with clear algebraic laws relating them. In Python they correspond to:
# Attach all bindings in X to the current namespace.
from X import *
# Attach a subset {a, b, c} of the bindings in X
# to the current namespace.
from X import a, b, c
# Qualify namespace X with prefix 'X.' and attach it
# to the current namespace.
import X
# Qualify namespace X with prefix 'Y.' and attach it
# to the current namespace.
import X as Y
In C++, any namespace in scope is attached to the current namespace, qualified by the name under which it is in scope. The other operations correspond to:
// Attach an in scope namespace X qualified by the prefix Y.
namespace Y = X;
// Attach namespace X to the current namespace.
using namespace X;
// Extract a subset {a, b, c} from the namespace X and attach
// its elements to the current namespace.
using X::a;
using X::b;
using X::c;
Modules
Modules, strictly speaking, are aspects of program design, not programming language design. A modular program is one made of distinct parts that can be reasoned about, manipulated, tested, and replaced without touching the rest of the program. It’s a fascinating problem, and the best advice I have yet seen on it is from David Parnas’s 1972 paper “On the Criteria To Be Used in Decomposing Systems into Modules”: “…one begins with a list of difficult design decisions or design decisions which are likely to change. Each module is then designed to hide such a decision from the others. Since, in most cases, design decisions transcend time of execution, modules will not correspond to steps in the processing. To achieve an efficient implementation we must abandon the assumption that a module is one or more subroutines, and instead allow subroutines and programs to be assembled collections of code from various modules.”
That being said, a programming language and its community can have tools to declare and enforce modules once they have been designed. Since we are talking about tools to support design, there isn’t a clean, mathematical formulation here as there is for namespaces or compilation units. There are certain properties that have shown up in tooling to support modular design in various languages, and I cannot offer much more than an enumeration of those I have recognized:
Visibility
One of the simplest ways to enforce a module’s boundaries is to make its internals unreferencable from outside. For example, the method getInput in the Java class

class ReaderModule { ... public Stream getInput() { ... } ... }

could be reading from a file, a network stream, or generating random data without reference to the outside world. If there is no other information available than that its return type is a Stream, any other code using this class cannot depend on the module’s internals, simply because there is no way to refer to them. Similarly, the internal functions that manipulate or examine data structures may be hidden. If the right bindings are hidden, the values and behavior of a module become inscrutable from the outside.
Most languages in use today have some constructs to control visibility, such as scoping of local variables, namespaces, and private/public declarations on class fields and methods. In C, top level bindings in a compilation unit that are declared static are visible only in that compilation unit. C++ inherits this ability, though in some circles it is eschewed in favor of anonymous namespaces, the contents of which are visible outside of the namespace in the same compilation unit, but not in other compilation units. In Python, any binding prefixed with an underscore is (by unenforced convention) private. Common Lisp also lets any binding be declared private, though instead of a naming convention it uses a distinct syntax to override the private declaration and access the binding.
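Ruby sits in the same family: the private keyword controls method visibility, and, much like Common Lisp’s escape-hatch syntax, a distinct mechanism (send) overrides it. A minimal sketch, with hypothetical names:

```ruby
class Counter
  def initialize
    @count = 0
  end

  def increment
    bump(1)
  end

  private

  # Internal detail: callers outside the class cannot invoke this
  # with an explicit receiver.
  def bump(n)
    @count += n
  end
end

c = Counter.new
c.increment       # => 1
c.send(:bump, 1)  # => 2  (send bypasses visibility, like CL's escape hatch)
begin
  c.bump(1)       # raises NoMethodError: private method called
rescue NoMethodError
  :hidden
end
```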
Parameterization
It is a common pattern for one module to be parameterized over another. A stream transformer may be parameterized over a stream type. A queue may be parameterized over the type of its contents. So we might have functions on a stack in Haskell with the types
push :: Stack a -> a -> Stack a
pop :: Stack a -> a
empty :: Stack a -> Bool
Looking at these functions, it is clear that the stack is parameterized over the type of its contents. A few languages allow that parameterization to be declared once and for all, as in SML, where the module declaration for the stack might be written
signature STACK = sig
    type 'a stack
    val push : 'a stack -> 'a -> 'a stack
    val pop : 'a stack -> 'a
    val empty : 'a stack -> bool
end
Other parts of the program refer to a parameterization of the module. Most of the common uses of parameterized modules and the manipulations of them available in SML are handled more simply with constructs like Haskell’s typeclasses, but the notion of declaring parameterizations at this level is worth knowing about.
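Ruby has nothing like SML’s declared parameterizations; the closest idiom is passing the “parameter module” in as an object and duck typing against it. The names here (Stack, ArrayStore) are hypothetical illustrations:

```ruby
# The storage strategy is one "module"...
class ArrayStore
  def initialize
    @a = []
  end
  def put(x)
    @a.push(x)
  end
  def take
    @a.pop
  end
  def empty?
    @a.empty?
  end
end

# ...and Stack is parameterized over it: any object responding to
# put, take, and empty? will do. Nothing declares or checks this.
class Stack
  def initialize(store)
    @store = store
  end
  def push(x)
    @store.put(x)
    self
  end
  def pop
    @store.take
  end
  def empty?
    @store.empty?
  end
end

s = Stack.new(ArrayStore.new)
s.push(1).push(2)
s.pop  # => 2
```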
Contracts
The simplest case of a contract is compile time type checking, as in the C function
double square(double x) { ... }
This function always takes a double and returns a double. The compiler can check that this is true in most cases (though not when pointers are involved, or the signature of qsort,
void qsort(
void *base,
size_t nel,
size_t width,
int (*compar)(const void *, const void *)
);
would be useless). Modern type systems have pushed this much further, until in recent languages like Agda, the compiler can assert the type of every expression in the program at compile time, without the gaps that C has around pointers, and the types can express details such as the lengths of lists or the dimensions of matrices. Actually, Agda’s type declarations are themselves a Turing complete language.
Compile time isn’t the only time to check assertions. For decades, the mathematics wasn’t in place to do very sophisticated contracts in the type system, so some languages, beginning with Eiffel, added run-time contracts. Here is an example of a contract in PLT Racket for an absolute value function:
(-> ; Constraint on the input
number?
; Constraint on the output
(and/c number? (or/c positive? zero?)))
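Ruby ships no contract system, but a run-time contract in the spirit of the Racket example can be hand-rolled. checked_abs is a hypothetical name, not a standard function:

```ruby
def checked_abs(x)
  # Constraint on the input
  raise TypeError, "expected a number, got #{x.class}" unless x.is_a?(Numeric)
  result = x.abs
  # Constraint on the output
  raise "postcondition violated" unless result.is_a?(Numeric) && result >= 0
  result
end

checked_abs(-3)  # => 3
```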
How Ruby does it
Ruby’s compilation unit is a single file. The language’s support for modular programming is restricted to providing the keywords public, private, and protected to control visibility of methods defined on modules and classes. We saw in the section on interfaces that the libraries in the language make using visibility and parameterization to enforce module boundaries unnecessarily difficult.
The Ruby community uses a language construct called module for namespaces, as well as for mixins. Like C++, any Ruby module in scope is attached to the current namespace, qualified by the module’s name. Ruby modules can be attached to other modules, and qualified by assigning them to a different variable, but they cannot be subsetted, nor do they have the relocatability property, since
def f()
puts "Hello"
end
def g()
f()
end
g()
must be changed to
module Something
  def self.f()
    puts "Hello"
  end
  def self.g()
    f()
  end
  g()
end
and
def f()
puts "Hello"
end
class A
def g()
f()
end
end
def g()
f()
end
A.new().g()
g()
cannot be put in a Ruby module at all.
Multiplication of like things
The phrase “orthogonal” is often bandied about in praise of programming languages, but what does it mean and why is it desirable? Consider pointers and references in C++. They are similar in that both are a way of passing parameters by reference, so
void increment(int *n) { *n += 1; }
and
void increment(int &n) { n += 1; }
do exactly the same thing. They differ in that pointers may be assigned to point to new memory locations, may be incremented and decremented to shift the memory they refer to, and they must be dereferenced in order to access the values they refer to. References are used like local variables, and the memory they refer to is fixed at their creation. The semantics of pointers and references overlap, though they have their differences, so we say that they are not orthogonal.
Another example is the distinction between superclass and interface in Java. Both are used to provide polymorphism (any subclass of A can be used in a function that takes an argument of type A, and the same is true of interfaces). But superclasses may provide implementations of methods that their subclasses will inherit, while interfaces may only declare that an implementing class must define a given method with a given signature. Though the two concepts are not orthogonal, they let Java retain the simplicity of single inheritance (since inherited methods can only be inherited from the superclass), while interfaces also give it the polymorphism of multiple inheritance while avoiding its complexities (which are principally how to order calls to superclass methods).
Similarly, having both pointers and references in C++ is a tradeoff. C++ inherited pointers from C. The language was originally conceived of as a superset of C, so pointers had to stay. Yet pointers are a source of a disproportionate number of the errors in C programs. References fill one of the most common uses for pointers while avoiding all the errors that were possible with pointers, and so they were incorporated.
Now that we have established what orthogonality is, what makes it desirable? Simply that humans are good at memorizing how very distinct things work, but bad at keeping the details of similar things straight. No one confuses for loops and variable assignment, though both create a binding of a certain value to a name. The two are too dissimilar to blur together in memory.
Beyond that, nonorthogonal concepts are not intrinsically bad. C++ references and Java interfaces are both clever, useful solutions. Nonorthogonal concepts become a problem when disentangling them becomes a significant mental task for the programmer. They can also be a symptom of problems that arose in the course of language design. Language designers don’t set out to incorporate nonorthogonal constructs in their language, but once the outline of the language is established, problems rear their heads in the details, and it is resolving these details that leads to the addition of nonorthogonal concepts.
Ruby has accumulated a number of nonorthogonal concepts which significantly burden the programmer.
There are two methods to attach a module to the current context. extend adds the methods in a module to the current object; include adds them to whatever will be created by the object’s new method. So in a class, include adds a mixin’s methods as instance methods on the objects the class creates, while extend adds them as class methods. For a non class object, extend adds methods to it, and include does not apply at all.
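The distinction can be seen directly; Greeter, Included, and Extended are hypothetical names:

```ruby
module Greeter
  def greet
    "hello"
  end
end

class Included
  include Greeter  # instances of Included get #greet
end

class Extended
  extend Greeter   # the class object itself gets .greet
end

Included.new.greet  # => "hello"
Extended.greet      # => "hello"

o = Object.new
o.extend(Greeter)   # extend also works on a plain object
o.greet             # => "hello"
```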
Ruby has four notions that resemble a function: methods, blocks, procs, and lambdas. Methods are hunks of code that can be executed by sending messages to objects, and that terminate their execution and return to the message sender when the return statement is called. A block is a hunk of code, derived in analogy with Smalltalk, but unlike Smalltalk, where a block is a function and is the only form of function in the language, a block in Ruby is not usable directly. It has to be wrapped in a proc or a lambda, which differ in how they handle omitted arguments and how return behaves in them. Procs fill in default values of nil for omitted positional arguments (so if I call a three argument proc with two arguments, the third will be bound to nil in the body of the proc). Lambdas do not. The return statement of a proc returns from the next enclosing method. The return statement of a lambda returns from the lambda itself. And methods turn out to be a different type, equivalent to lambdas plus names.
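Both differences are easy to demonstrate; the method names here are hypothetical:

```ruby
# Procs pad missing positional arguments with nil; lambdas raise.
pr = proc { |a, b, c| [a, b, c] }
pr.call(1, 2)  # => [1, 2, nil]
lm = lambda { |a, b, c| [a, b, c] }
# lm.call(1, 2) would raise ArgumentError

def through_proc
  proc { return :from_proc }.call
  :unreachable   # never reached: the proc's return exits this method
end

def through_lambda
  lambda { return :from_lambda }.call
  :after_lambda  # reached: the lambda's return exits only the lambda
end

through_proc    # => :from_proc
through_lambda  # => :after_lambda
```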
Ruby also provides two exception handling systems, identical except for their intended purpose. raise/rescue is meant for normal exception handling, but it has become a proverb not to use exception handling for control flow, since it is hard to understand and reason about code that does so. In Ruby this reason was apparently forgotten, but the letter of the proverb was obeyed: a separate system of throw/catch was created for control flow.
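The two systems side by side:

```ruby
# raise/rescue: error handling.
message =
  begin
    raise ArgumentError, "bad input"
  rescue ArgumentError => e
    e.message
  end
message  # => "bad input"

# throw/catch: non-local control flow, no exception object involved.
found =
  catch(:found) do
    [1, 2, 3].each { |x| throw :found, x if x == 2 }
    :not_found
  end
found  # => 2
```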
None of these distinctions justifies the mental burden they place on the programmer.
MN vs. M+N and what it did to the language
Smalltalk was the first object oriented language. It was built around the notion of objects which received messages and executed blocks of code in response. That is the underpinning of most object oriented languages to this day. However, there is a basic problem with it in practice: how do you write a polymorphic max function? That is, a function that takes two arguments of the same type and returns the larger of the two, ordered according to whatever ordering the type defines. In Smalltalk, you must define it on every class that you want it to have, which leads to an enormous amount of repeated code. This is true of any other algorithm you want to work on a given type, so for M algorithms and N types, you end up writing MN methods.
This has been solved in various ways. Common Lisp’s CLOS, and its descendants such as Dylan, removed messages and instead defined generic functions. Each generic function could have multiple implementations, and which implementation was used was chosen at runtime based on the types of the arguments passed to it. This led in turn to the key insight of Stepanov’s Standard Template Library: you can separate iteration strategies from algorithms. For each of the N types you implement its iteration strategy, and you implement each of the M algorithms in terms of that strategy. Result: you write M+N methods.
The other major solution was to keep message passing and add multiple inheritance. This let programmers inherit from both the natural superclass of a class and also from a class carrying implementations of the various methods, though it adds its own complexities over how to order calls to superclass methods.
Ruby took the multiple inheritance route, but not openly. Instead it retained single inheritance from a superclass, introduced a new inheritance hierarchy of Ruby modules, and provided two separate mechanisms, include and extend, to have a class inherit from them. Now you must memorize how inheritance and Ruby’s two mixin expressions, include and extend, interact, as well as how they order calls to superclass methods.
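For what it is worth, Ruby’s mixins can express the M+N factoring: the standard Comparable module is the canonical case, where defining one comparison per type buys every comparison algorithm. Temperature is a hypothetical type:

```ruby
# The "N side": one spaceship operator per type...
class Temperature
  include Comparable
  attr_reader :degrees
  def initialize(degrees)
    @degrees = degrees
  end
  def <=>(other)
    degrees <=> other.degrees
  end
end

# ...and the "M side" (max, min, sort, between?, clamp) comes free.
cold = Temperature.new(10)
warm = Temperature.new(20)
[cold, warm].max.degrees  # => 20
cold < warm               # => true
```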
Tooling and reasoning
Most programming communities have their expectations about what tools are necessary and what are frivolous, and there is no core that every community would agree on as necessary. Turbo Pascal programmers assumed that an integrated debugger was a basic tool of a programmer, but the term unit testing did not exist yet. Python programmers today regard integrated debuggers as a luxury, but a unit testing library as a necessity.
There are a number of language agnostic tools—version control systems, build systems, literate programming tools—but beyond that the tooling which can be straightforwardly built in a language depends on two things: how easy the language is to parse, and how easy code in the language is to reason about.
Parsing is an old and largely solved problem. We know how to define unambiguous grammars that are easy and fast to parse. ALGOL 60 already had a mature, precise specification of its grammar, and most of the imperative languages that followed it were at least straightforward to parse. C, for example, is harder to parse than Lisp, but not terribly onerous. C++, unfortunately, is a nightmare to parse.
For many languages, such as Lisp and recent versions of Python, the live instance will parse code for you and return an abstract syntax tree, further reducing this burden.
How easy a language is to reason about is roughly equivalent to how rich a set of program transformations it supports. Some of these transformations are program preserving, ranging from renaming a variable in the source code to converting programs to continuation passing style and all the other tricks of writing optimizing compilers.
Others transform the program to other useful forms. Transform the code into a list of all the entities defined in the program and what regions of code they correspond to and you have the underpinnings of a code browser. Transform it to a simplified, decidable execution model where common errors can be detected without running the program and you have a static analysis tool. Alter the code to track what parts are run and what are not when exercised by a test suite and you have a code coverage tool. Mutate the code randomly when it is exercised by a test suite and you have a measure of how incisive the tests really are.
These transformations are also what a programmer does in his head when he reasons about code, so how easy code in a language is to reason about is equivalent to how easy it is to write tools in the language.
Ruby has no well defined grammar. All the Ruby implementations today reuse Matz’s original parsing code. There are various BNF grammars people have written for the language, but they may or may not match the actual implementation. Nor does Ruby provide a mechanism to turn code into an abstract syntax tree for you as Python and Lisp do. Anyone writing tools beyond a unit testing library must solve this problem first, before ever doing any real work.
Reasoning about Ruby is not much better, and the tools reflect this.
There are a plethora of unit test libraries (all incompatible). Beyond that there is a single code coverage tool providing line coverage, but not branch or instruction coverage, and which will ignore large hunks of code if not configured perfectly, including having the order of require statements in a file be just right. There are a handful of static analysis tools which have the insight you would expect from the first prototype of lint written over a weekend in the 1970s. There is a debugger that may or may not skip the body of a loop when single stepping through, depending on how the loop is written, and may or may not crash, and usually fills the console of any frontend it is hooked to with garbage so that any console output of the program itself is obscured. And that's it. The Ruby tool ecosystem.
Summary
The criticisms above are not matters of taste. They are errors in language design. Unrelocatable namespaces are an error. Introducing three separate methods for inheritance and a separate inheritance hierarchy when multiple inheritance was well understood before Ruby’s initial creation is an error. Documentation so sparse that its reader must turn to reading the source code instead is an error.
If Ruby were the only language in its niche, these errors might be tolerable, but it is not. So let me make my position on the language clear: Ruby is deprecated. Let it follow Perl into the dustbin of language history.
Appendix: Documentation conventions
C programmers on Unix-like systems
There are many communities of C programmers—C programmers in Microsoft’s ecosystem, in the Macintosh ecosystem, those who worked in Borland’s Turbo C, those who write for Unix-like systems—and these communities are distinct and have their own conventions for documentation. For those who work in the only community around their language of choice, such as PHP, or in the presence of slightly fragmented subcommunities, such as the scientific community centered around NumPy and SciPy in Python, this will seem strange, but a Windows C programmer would be quite lost on a Unix-like system and in the surrounding community, and vice versa. This section is concerned with the community of C programmers writing for Unix-like systems.
The community writing for Unix-like systems in C has memorized their core language, which is quite small, and almost never refer to a language reference, though such programmers usually have a copy of some C textbook, and one or more books on Unix-like systems (Stevens and Rago’s Advanced Programming in the Unix Environment or its ilk).
The community's reference material is divided into man pages, which are read on text mode terminals. Each man page describes a small set of related functions, such as all the variants of printf or fork, and the pages are organized in a fixed way. The man page of fork is typical:
FORK(2)                     BSD System Calls Manual                    FORK(2)

NAME
     fork -- create a new process

SYNOPSIS
     #include <unistd.h>

     pid_t
     fork(void);

DESCRIPTION
     Fork() causes creation of a new process.  The new process (child process)
     is an exact copy of the calling process (parent process) except for the
     following:

           o   The child process has a unique process ID.

           o   The child process has a different parent process ID (i.e., the
               process ID of the parent process).

           o   The child process has its own copy of the parent's descriptors.
               These descriptors reference the same underlying objects, so
               that, for instance, file pointers in file objects are shared
               between the child and the parent, so that an lseek(2) on a
               descriptor in the child process can affect a subsequent read or
               write by the parent.  This descriptor copying is also used by
               the shell to establish standard input and output for newly cre-
               ated processes as well as to set up pipes.

           o   The child processes resource utilizations are set to 0; see
               setrlimit(2).

RETURN VALUES
     Upon successful completion, fork() returns a value of 0 to the child
     process and returns the process ID of the child process to the parent
     process.  Otherwise, a value of -1 is returned to the parent process, no
     child process is created, and the global variable errno is set to indi-
     cate the error.

ERRORS
     Fork() will fail and no child process will be created if:

     [EAGAIN]           The system-imposed limit on the total number of pro-
                        cesses under execution would be exceeded.  This limit
                        is configuration-dependent.

     [EAGAIN]           The system-imposed limit MAXUPRC (<sys/param.h>) on
                        the total number of processes under execution by a
                        single user would be exceeded.

     [ENOMEM]           There is insufficient swap space for the new process.

LEGACY SYNOPSIS
     #include <sys/types.h>
     #include <unistd.h>

     The include file <sys/types.h> is necessary.

SEE ALSO
     execve(2), sigaction(2), wait(2), compat(5)

HISTORY
     A fork() function call appeared in Version 6 AT&T UNIX.

CAVEATS
     There are limits to what you can do in the child process.  To be totally
     safe you should restrict yourself to only executing async-signal safe
     operations until such time as one of the exec functions is called.  All
     APIs, including global data symbols, in any framework or library should
     be assumed to be unsafe after a fork() unless explicitly documented to be
     safe or async-signal safe.  If you need to use these frameworks in the
     child process, you must exec.  In this situation it is reasonable to exec
     yourself.

4th Berkeley Distribution        June 4, 1993        4th Berkeley Distribution
A skilled C programmer in this community can find what he needs in these strictly formatted pages with great speed, and documents his own libraries in the same way. The Perl community—Perl began life as a normalization of the diverging scripting languages associated with shells across various Unix-like systems—inherited this tradition.
The exception to this pattern is the GNU project. Richard Stallman came from a Lisp background, which had a very different tradition, and brought that tradition with him. Thus GNU software in this community tends to have both man pages and the long form manuals more typical of Lisp.
Both the man pages and the GNU manuals are independent documents. They are usually kept in a separate directory from the code, and compiled for viewing with entirely separate tools. The correspondence between the man pages and the source code is maintained by hand. Comments in the code exist to explain its workings; its purpose and intended behavior are recorded in the (separate) documentation.
Common Lisp
Common Lisp is a language with a single community. Indeed, the language was a political compromise meant to unify a number of divergent Lisp communities with shared interests. The compromise defined the language in great detail, from the branch cuts of the numerical functions over complex values, to the interface to the debugger, to standard ways of controlling whether code was to be compiled or interpreted, and the Common Lisp community is nearly unique in that their standard is their primary reference documentation while they work. If their particular implementation does not match the standard, it is expected that the vendor will fix the implementation rather than the programmer work around it.
Third party libraries in Common Lisp have similar manuals. Vendors provide such manuals for their extensions. The culture thinks in terms of coherent, book-like documentation, in contrast to the man pages of the C community on Unix-like systems described above. Indeed, a few still use a hard copy of Common Lisp, the Language, 2nd ed. as their reference, though most of the community uses the HTML-based Common Lisp HyperSpec.
The manuals in Common Lisp are entirely separate from the code, but the language defines “docstrings”: the first expression in a Common Lisp definition, if it is a string, will be taken by Common Lisp systems to be documentation for the definition. A docstring is accessible by asking a running Common Lisp system for the docstring of a definition loaded into it:
(defun square (x)
  "Return the square of x"
  (* x x))

(documentation #'square 'function)
; Evaluates to "Return the square of x"
Docstrings in the Common Lisp community don't have as fixed a structure as man pages, and often have a much narrower scope, since there tend to be separate manuals describing how to use the system.
Python
Python occupies a middle ground between the Common Lisp community and the C programmers described above. The core language and standard library are documented in a series of HTML pages similar to the Common Lisp Hyperspec. The documentation usually begins with enough exposition to understand the topic at hand, followed by reference documentation for the public functions and classes provided. Unlike Common Lisp, the documentation is specific to each version of Python.
Like Common Lisp, Python has docstrings, which are again a string as the first expression of a definition, as in
def square(x):
    "Return the square of x."
    return x*x

print(square.__doc__)  # Prints "Return the square of x."
Unlike in Common Lisp, the docstrings are not only accessible in the running Python instance but are also extracted into the HTML manuals, so there are conventions governing their form and how they will be formatted for extraction. The most basic is that the docstring should begin with a quick, one line description, followed by a more comprehensive one (this convention was inherited from the Lisp community via Emacs Lisp).
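In practice the convention looks like this. The function below is a made-up example; inspect.getdoc, from the standard library, returns the docstring with its indentation cleaned up, the way the extraction tools see it:

```python
import inspect

def clamp(x, low, high):
    """Clamp x to the closed interval [low, high].

    Values below low are raised to low and values above high are
    lowered to high.  Extraction tools format the first line as a
    summary and the remaining paragraph as the body.
    """
    return max(low, min(x, high))

# The one line summary is simply the first line of the cleaned docstring.
print(inspect.getdoc(clamp).splitlines()[0])
# Prints "Clamp x to the closed interval [low, high]."
```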
Java
Like Python, Java puts its documentation into its source and uses tools to extract it into manuals. Unlike Python, it uses comments prefacing definitions to document them, as in
/**
 * Return the square of x.
 */
public static double square(double x) {
    return x*x;
}
This documentation is lost in compilation, so the extracted manual is the only reference. There is nothing equivalent to looking up a docstring in a running Python or Common Lisp instance. As in Python, there are strict conventions for organizing and formatting the documentation.
Unlike Python, the manuals tend to be only API documentation, with very little exposition. Java libraries tend to have separately written tutorials to teach a programmer enough about the library that he can hopefully figure out whatever else he needs from the API reference.
R
R is an interactive language with a lineage going back to Bell Labs, the home of Unix, so it is no surprise that the documentation of its functions is nearly identical to man pages (though typeset in Rd, a language resembling LaTeX, instead of in groff), but accessed from within R itself. R also has a number of long form manuals introducing the language, specifying its grammar, and covering certain major areas such as importing data or writing extensions.
The general statistical community around R documents its libraries in the same way—man pages plus, sometimes, longform manuals covering particular areas—but there is a second, increasingly separate subcommunity centered in bioinformatics around the BioConductor libraries, which has a completely different documentation tradition. BioConductor eschews R's online references and instead produces "vignettes" for its various packages. A vignette is a PDF file with a few paragraphs explaining the library, some annotated code examples to get started quickly, and some terse reference documentation of the most commonly used functions in the package.
TeX
Donald Knuth’s TeX language is meant for typesetting documents, so it is no surprise that it has an interesting documentation tradition. Indeed, I mention it to describe the logical conclusion of automatically extracting documentation from programs: literate programming. Knuth, who is concerned with producing code to be used and read for decades to come rather than years to come, proposed writing a document that happened to contain a program in it that a tool could extract to a compilable form, and this is exactly what he did with TeX.
The code need not be organized linearly in the document, nor kept together in any particular way. Blocks of it may be defined anywhere and glued together elsewhere. Here is an example taken from the port of the wc command to the noweb literate programming system:
Here, then, is an overview of the file <tt>wc.c</tt>
that is defined by the <tt>noweb</tt> program <tt>wc.nw</tt>:
<<*>>=
<<Header files to include>>
<<Definitions>>
<<Global variables>>
<<Functions>>
<<The main program>>
@
We must include the standard I/O definitions, since we want to send
formatted output to [[stdout]] and [[stderr]].
<<Header files to include>>=
#include <stdio.h>
@