madhadron

Creating effective pull requests

Status: Draft
Confidence: Likely

I get a sinking feeling in my stomach when I’m asked to review an enormous pull request. There are probably multiple distinct changes in there, maybe some tiny fixes thrown in as they were found. Its author built it up steadily over time, and now I have to deconstruct it. It’s exhausting. Pull requests like that usually languish, waiting for a review, because they will require me to set aside a whole afternoon when I’m feeling up to digging through them.

As a courtesy to my coworkers I try to post small pull requests, pull requests that make one coherent change, usually no more than a hundred or so lines changed. As a result I can usually ping someone for a code review and they will deal with it quickly because they know it will probably only take them five or ten minutes. It’s much easier to get people to review the same changes as ten tiny pull requests than as one large one.

Doing this presents two new problems, though:

  1. Each of the small pull requests must leave the system in a state you can compile and ship, even though the change isn’t finished.
  2. If you aren’t going to stop coding after a hundred lines, you need to stack additional changes on top of previous ones, so you need a way to handle this stack of pull requests.

Shipping intermediate states

Every commit should leave your repository in a shippable state. This is sometimes called the “not-rocket-science rule.” I have yet to see a task that couldn’t be broken up into small changes that leave the repository in a good state, but there are a few tricks you sometimes need.

Scaffold first, then fill in.

Write your scaffolding, such as request handlers and core data types first. Don’t hook them up to anything. Throw NotImplementedException or its equivalent everywhere. Then fill in the scaffolding over a subsequent set of commits.

Example. I have to add a set of HTTP request handlers that will invoke various long running workflows to an existing system. I first write a commit that adds a request handler that returns 400 (Bad Request) for all requests to its supported methods (say, GET, PUT, and DELETE), hooks it up at the correct path, and writes a stubbed out unit test for it. Then I write a second commit that parses the body of the PUT request and validates it, and extends the tests to check the validation. A third one makes PUT invoke the appropriate workflow, which I mock in the tests. Another one implements the DELETE handler and tests it, and a final one implements the GET handler and test it.

Write one layer at a time.

If there is a whole sequence of things to be done, write the first and its cleanup as one commit, then the next as another. Each commit should include deallocation and cleanup for everything allocated and done.

Example. Say I am writing a network packet capture pipeline. First I write a pool of memory blocks that I can fetch from and return to, and a workflow that fetches a block from the pool and then returns in. In the second commit, I insert a stage where the I fill the block with packet data. The third commit adds parsing of the packet headers, and subsequent commits add further behavior.

Add new paths, then remove old.

If you have a working path, don’t break it. Either extend it to handle the new case, or plumb a new path in parallel. Then over a series of commits fill in the new path, while leaving the old one still working. Once that is done, stack up a few more commits to remove the old path.

It can require a little more work, plumbing new paths and cleaning up old ones instead of just revising everything at once, but I have found that I get code checked in much faster despite the extra work.

Example. A request handler has been taking an unstructured JSON blob of parameters that it passes on to another system, but now you need to replace that with specific parameters. In the first commit you add the parameters to the same type that contains the JSON blob, but they will always be null. The second commit changes the place where the JSON blob is used, and adds a condition that if the JSON blob is null, calculate an appropriate JSON blob from the specific parameters. Then the subsequent commits shift each place that passes in a JSON blob one at a time to pass in specific parameters, but only one path is changed and the others all continue to work.

Handling stacked pull requests

Now we come to the other difficulty: if you don’t want to write a few lines and then wait until it’s reviewed and committed to the main branch before writing anymore, you need some way to stack up your changes.

The best way to do this is to use a tool that handles stacked reviews correctly, like Gerrit or Phabricator. But most of the more common git hosting tools (GitHub, GitLab, Azure DevOps, BitBucket) do this wrong. After using something like Phabricator that explicitly lets you build up stacks of commits to be reviewed separately, the branch merge model is painful to use. But until the tooling gets fixed, the best we can do is work around it.

Let us imagine I am working on a feature to add swizzles (whatever those may be) to my program. I create a local branch called swizzle,

$ git switch -c swizzle

Then I write my first small set of changes, usually a bit of scaffolding to provide a way to invoke swizzles, though nothing is wired up yet. Then I add those changes to the index, and run git commit, with a commit message

Scaffold a request handler for swizzles.

PR-Branch: swizzle00-scaffold

What is this PR-Branch line? We’re going to parse it out later on and use it to push this commit to a specific remote branch named swizzle00-scaffold. I have settled on using two digits after the main branch name, then a quick summary name for naming individual commits. So the next one will start with swizzle10-, then swizzle20-, etc. This leaves me space if I need to insert additional commits inbetween (swizzle15-, swizzle17-). For those who grew up with old programming languages that had line numbers this should seem familiar. And gross. Because it’s the 21st century and the tools should be better than this. Moving on.

I add a few core data types and functions to operate on them, maybe a couple unit tests and commit them (PR-Branch: swizzle10-core_types), and then fill in one of the request handlers (PR-Branch: swizzle20-get_handler), and then another (PR-Branch: swizzle30-put_handler).

Now I have a branch swizzle with a set of commits in it.

 swizzle
/
|- cbec364 Fill in PUT handler for swizzles.
|- ef85f60 Fill in GET handler for swizzles.
|- 0c82977 Add core swizzle data types.
|- 02ee497 Scaffold a request handler for swizzles.
.
.
.

Say I realize I have a typo in the core swizzle data types commit. We go back and edit it with git’s interactive rebase. git rebase -i is vastly overpowered for this purpose. I would love more special purpose tooling. But it works, and even though it seems intimidating at first, you will get used to working in it most of the time.

You run git rebase -i master on the swizzle branch, edit the commit you want, and finish your rebase. There are lots of articles on using git rebase -i out there. Maybe I’ll write another one at some point, or, more likely, on how to use get reflog to fix things when you screw up with git rebase -i.

Now how do we get these commits pushed to GitHub or wherever our remote is? What I settled on doing (partially because I do a lot of work on Windows these days, and my configuration has to work on Windows, macOS, and Linux) is the following alias in my ~/.gitconfig:

[alias]
    rbi = "rebase -m --rerere-autoupdate -i -x \
        'branch=$(git show -s --format=%B | grep ^PR-Branch: | cut -d : -f 2 | xargs); \
        test -z $branch || git push --force-with-lease origin HEAD:refs/heads/$branch'"

With that in place, instead of running git rebase -i master, I run git rbi master. It adds some nice things such as choosing how to resolve conflicts (-m -rerere-autoupdate), and between every commit adds that long command. What it does is find the PR-Branch: entry, get the branch name from it, and pushes that to origin.

So if I run git rbi master from swizzle, it will give me an editor to configure the rebase. With just straight git rebase -i master it would be

pick cbec364 Fill in PUT handler for swizzles.
pick ef85f60 Fill in GET handler for swizzles.
pick 0c82977 Add core swizzle data types.
pick 02ee497 Scaffold a request handler for swizzles.

With the -x flag in git rbi master, I get

pick cbec364 Fill in PUT handler for swizzles.
exec branch=$(git show -s --format=%B | grep ^PR-Branch: | cut -d : -f 2 | xargs); test -z $branch || git push --force-with-lease origin HEAD:refs/heads/$branch'
pick ef85f60 Fill in GET handler for swizzles.
exec branch=$(git show -s --format=%B | grep ^PR-Branch: | cut -d : -f 2 | xargs); test -z $branch || git push --force-with-lease origin HEAD:refs/heads/$branch'
pick 0c82977 Add core swizzle data types.
exec branch=$(git show -s --format=%B | grep ^PR-Branch: | cut -d : -f 2 | xargs); test -z $branch || git push --force-with-lease origin HEAD:refs/heads/$branch'
pick 02ee497 Scaffold a request handler for swizzles.
exec branch=$(git show -s --format=%B | grep ^PR-Branch: | cut -d : -f 2 | xargs); test -z $branch || git push --force-with-lease origin HEAD:refs/heads/$branch'

After every step in the rebase, it will push to the target branch. If I don’t want to do that (which occasionally I don’t) I just delete the exec line.

Once I have these branches pushed, I start created pull requests. First I create a pull request from swizzle00-scaffold into master. Then I create a pull request from swizzle10-core_types into swizzle00-scaffold, and so on. This lets my colleagues review all of them. When the first pull request is accepted, I merge it, rebase the swizzle branch onto master after the merge, repush all of these branches, and retarget the next pull request from swizzle00-scaffold into master.

This is all awkward. The tools like GitHub and Azure DevOps need to fix this and start doing stacked reviews like Phabricator, and we need a git porcelain that is less powerful than git rebase -i specialized for editing stacks of commits.