madhadron

Writing programs

This is the part of the series on getting better as a programmer. The articles are:

Here is where software engineering starts: producing a program text that expresses the solution to a problem. The other categories in this series emerge by splitting off concerns from this central one, but in the end ours is a craft of producing programs.

To produce such a program text we need

We’ll work through each of these in turn.

Medium of expression

Before we had automatic digital computers (as opposed to earlier humans with the job title of ‘computer’ doing calculations or the analog computers built for specific calculations), everyone assumes that the hard part would be building the machine and programming it would be easy. Then we got automatic digital computers in the 1950’s and the field went through the stages of grief before finally accepting that programming is hard. And with that acceptance began the hunt for ways of making it less hard. The two artifacts from those early attempts that we still recognize the names of today are FORTRAN and Lisp. Since then our field has produced thousands of programming languages.

Fortunately for us, most of those languages fall into a small number of families and once you know one member of the family, picking up others is generally a matter of figuring out minor differences. Learning Ruby when you know Python is straightforward, and is learning OCaml when you know Standard ML or Kanren when you know Prolog.

I’ve written elsewhere about these families, but the short version is that every software engineer should learn an imperative programming language to start with, but shouldn’t stop there because some families make things trivial that are difficult in others. For example, searching for parses of a sequence of tokens is nearly trivial in Prolog but requires some machinery in Java, and setting up and traversing an abstract syntax tree is child’s play in Standard ML and a lot of manual labor in C++.

The mediums we care about aren’t limited to traditional programming languages. Spreadsheets are the most widespread medium for programming on the planet. HTML and CSS are mediums for expressing things that would be horribly awkward in something like C. JSON and XML and graphical diagrams like Simulink are mediums.

When you start out, mediums seem liked fixed things you interact with. Your C compiler expects a specific, formal language and if you deviate from that language the compiler punishes you. The computing environment you work in will encourage this as well. On Unix-like systems it’s typical to have “real programs” written in C or another compiled language, configuration files for those programs written in some limited format, and scripts written in shell or Perl or Python. In Windows you have applications written in C++ or C#, configuration for them as values in the registry, and batch or PowerShell scripts.

Over time exceptions start building up:

At some point you realize that there’s nothing really different about all of these. A particular medium may be more convenient because you already have a tool built for working with it, but they’re all just ways of expressing texts in formal languages. Then you arrive at the question of how you give meaning to arbitrary mediums that you may want to express your intent it.

Giving meaning to texts

The “real programs” written in C on a Unix-like system are given meaning by the combination of a compiler and a computer to run the compiled result. The program text turns into behavior of a physical machine. Configuration files get meaning from how programs they are meant to configure change behavior. Scripts get meaning from their interpreters turning them into instructions for a computer.

All of this consists of translating one medium to another until we bottom out at something. Sometimes we bottom out into mathematical structures (look into Hoare triples or Standard ML’s formally defined semantics), but usually we bottom out into instructions for an actual, physical computer. Giving meaning to our program texts thus typically means learning how to write, or at least reason about, interpreters and compilers to take one medium and produce another, and how to design the mediums we translate.

Another path is to make your medium extensible so that you can express changes to it in itself. This drags part of what we think of as compilation into the language. Lisp macros are the classic case. Reitit and Malli in Clojure do something interesting in this vein. Because Clojure carries its compiler in the runtime, they take a model described as a Lisp macro, but instead of doing simple code generation, they do optimized compilation to JVM bytecode. They also get to impose lots of “compile time” checks with type checking and static analysis that would normally be unavailable in dynamically typed languages like Clojure. A less extreme form is writing in one language and outputting another (or my Python to JavaScript translator. Whirlisp uses macros to shrink the expression of a nontrivial system (in this case, a database). The Tcl language has a particularly interesting approach by treating the whole language as, explicitly, text.

Forth programmers do this, too, such as my old friend’s embedded finite state machine library or embedded FORTRAN subset (docs).

More recent languages like Rust introduce them as well. That community wanted them because they were used to the power of C++’s templates, but didn’t want to persist in that insanity.

The other direction is to cut down a medium so that you can express less in it, but it’s easier to catch mistakes or express specific things more concisely. This is what configuration files in their various formats are doing. It’s much easier to produce a helpful error message for your user if the only thing you accept are key: value pairs, one per line, than if you let your configuration file be arbitrary C++.

There is nothing fundamentally different here. We’re still using a program to transform one medium into another to give it meaning. What you actually produce is different because your goals are different. The model common name I’ve seen for this is ‘model oriented development’ or ‘model driven development.’

There are very successful examples of this. The most commercially successful is Simulink from MathWorks. Rascal and Xtext are systems for creating these “little languages” separate from your programs.

iMatix has a series on how they use model oriented programming that explains the concepts and practices well. But then also read this comment as a caution about using it casually. Oh, and this kind of system tends to seem messy.

We’ve talked about large sources to large targets (programming language to programming language or computer instructions or macros within a language). We’ve talked about small sources to large targets (model driven development). What about large sources to small targets? Indeed, we do this, too. This is type checking, linting, and static analysis where we take a rich program and give it a different, limited meaning in order to verify some property of it.

Tools of construction

Having discussed mediums of expression and ways of giving those expressions meaning, we come to the problem how how we actually construct program texts in those mediums that we believe express our intent.

We don’t discuss this very much, but this is really hard. A new programmer can learn loops and variables and conditionals in half an hour, but what do they do with them? How do you go from “I want the second cheapest seltzer water brand available in this store from list list of items for sale” to a program text? It’s really easy for beginners to bounce off this difficulty and grab for something like learning some syntax, anything that requires memorizing or analysis instead of synthesis.

As an example, MIT used to teach their introductory computer science course in Scheme, using a famous book called Structure and Interpretation of Computer Programs. That book is considered one of the gems of our field. So I was surprised to hear from the little brother of a friend of mine that had taken the course that he hated it. In particular he hated the programming language they taught it in, Scheme.

A little prodding revealed the true problem, which is also the reason that they used a Lisp for SICP: there’s almost no syntax. The amount of language they need for the book is tiny. If you’re used to C++ or Java where you can feel a sense of accomplishment from learning the various corners of the language, it’s an unnerving experience to be basically locked in a room with S-expressions and lambdas and try to build what you need. There is nowhere to hide. It is purely your mental skills at program construction.

Now, most students don’t start university with those skills, which led another group of academics to write How to Design Programs, which is designed to be the course before SICP. They lay out a sequence that builds this skill steadily, starting from simple functions on fixed size data. A student can pretty quickly figure out the mental habits needed to write a function that squares its argument

(define (square x) (* x x))

Then you add the notion of conditionals and build up muscles there, and multiple arguments, and structured arguments. Then you introduce variable sized data and the patterns of iterating over it and accumulating a result, then modulating that iteration by putting a conditional inside it, and on and on. I taught a student one on one using this approach in Python and we went far and fast. There’s an amazing interdisciplinary doctoral thesis between computer science and applied behavior analysis on zero error shaping sequences to teach these skills that’s just begging to be written.

After that, the best material I know comes from music, where the problem of learning to construct phrases and lines is front and center. Victor Wooten’s book The Music Lesson is the best book on actually making music I’ve ever read, and much of it translates to program construction if you free associate hard enough. For example, he emphasizes improvising along with recordings of the best musicians you can find. Our equivalent is reading code and trying to write similar fragments and programs. When I was fifteen or sixteen I spent a few days reading OpenBSD’s core shell utilities and writing variations on them to make sure I understood them, often picking out a particular subset of behavior and implementing that simpler version. I learned more in those few days than I did most of the rest of that year.

These habits of construction are different for each programming language family. One of the hardest things about learning a new family is running into this wall that you thought you had escaped. The first language family you learn is hard, but you expected it to be hard. The second is traumatizing because you didn’t expect it. The third, fourth, etc. go much easier because the difficulty is familiar.

At some point we have our basic structures in place, but we run afoul of not knowing how our intent can be expressed. That knowhow is what we call data structures and algorithms, but not in the sense of memorizing some specific things for class. Ideally they are the larger scale structures of putting together programs that we reapply again and again. I’ve written written about how I keep track of algorithms and data structures, and it’s really about the mental circuitry used to construct them.

How to ascend

Given these three aspects, how do we actually get better in this category?

The transcendent goal is to regard any program text as data and have the chops to write whatever transformation you need. When we think of the wizards of our field, people like Charles Moore or Niklaus Wirth, the magic is in their definition of the whole path from program text to transformation to meaning given by a computer. Moore created Forth in the early days of the field and by the end had designed processors for which it was the native format, using a VLSI CAD system he wrote for the purpose in a refined version of Forth that fit in the boot sector of a floppy disk. Niklaus Wirth’s Oberon workstation was computer, compiler and language, and operating system and applications written in that language.

But you can’t start there. You have to be able to program before you can write the compilers and intepreters, and you have to learn to program using an existing compiler or interpreter. So we all start with an existing programming language and learn to writes programs in it.

At that point everyone’s path diverges. We bite chunks at a time off of the following sequences:

Remember, this is years of advancing in your career, not stuff you have to learn before you start. You are unlikely to learn all of it, or need all of it, but it’s there if your path takes you in that direction.

This series is still being written. Subscribe to get emailed when each is section is posted:

Appendix: Python to JavaScript translator

Here’s a fun snippet I wrote long ago. It takes advantage of the fact that Python keeps the source code of functions in its runtime and provides a library to parse and manipulate Python code. The genjs function takes a function written in Python (well, a subset, just the little bit that’s implemented) and emits the corresponding JavaScript.

import inspect
import ast

def ast_of(f):
    lines, count = inspect.getsourcelines(f)
    source = ''.join(lines)
    a = ast.parse(source)
    return a

def genjs(node):
    f = globals()['_' + node.__class__.__name__]
    return f(node)

def _FunctionDef(node):
    print node._fields
    arglist = ', '.join(genjs(arg) for arg in node.args.args)
    s = 'var ' + node.name + '=function(' + arglist + ') {\n' + '\n'.join(genjs(stmt) for stmt in node.body) + '\n};'
    return s

def _Module(node):
    return '\n\n'.join(genjs(stmt) for stmt in node.body)

def _Assign(node):
    return 'var ' + genjs(node.targets[0]) + '=' + genjs(node.value) + ';'

def _Num(node):
    return str(node.n)

def _BinOp(node):
    return genjs(node.left) + genjs(node.op) + genjs(node.right)

def _Add(node):
    return '+'

def _Return(node):
    return 'return ' + genjs(node.value) + ';'

def _Name(node):
    return node.id

def f(a,b):
    a = 3
    b = a+5
    return a+b

print genjs(ast_of(f))
# Prints:
# var f=function(a, b) {
# var a=3;
# var b=a+5;
# return a+b;
# };