Creating effective pull requests
Status: Draft
Confidence: Likely
I get a sinking feeling in my stomach when I’m asked to review an enormous pull request. There are probably multiple distinct changes in there, maybe some tiny fixes thrown in as they were found. Its author built it up steadily over time, and now I have to deconstruct it. It’s exhausting. Pull requests like that usually languish, waiting for a review, because they will require me to set aside a whole afternoon when I’m feeling up to digging through them.
As a courtesy to my coworkers I try to post small pull requests, pull requests that make one coherent change, usually no more than a hundred or so lines changed. As a result I can usually ping someone for a code review and they will deal with it quickly because they know it will probably only take them five or ten minutes. It’s much easier to get people to review the same changes as ten tiny pull requests than as one large one.
Doing this presents two new problems, though:
- Each of the small pull requests must leave the system in a state you can compile and ship, even though the change isn’t finished.
- If you aren’t going to stop coding after a hundred lines, you need to stack additional changes on top of previous ones, so you need a way to handle this stack of pull requests.
Shipping intermediate states
Every commit should leave your repository in a shippable state. This is sometimes called the “not-rocket-science rule.” I have yet to see a task that couldn’t be broken up into small changes that leave the repository in a good state, but there are a few tricks you sometimes need.
Scaffold first, then fill in.
Write your scaffolding, such as request handlers and core data types, first. Don’t hook them up to anything. Throw NotImplementedException or its equivalent everywhere. Then fill in the scaffolding over a subsequent set of commits.
Example. I have to add a set of HTTP request handlers that will invoke various long-running workflows to an existing system. I first write a commit that adds a request handler that returns 400 (Bad Request) for all requests to its supported methods (say, GET, PUT, and DELETE), hooks it up at the correct path, and adds a stubbed-out unit test for it. Then I write a second commit that parses the body of the PUT request and validates it, and extends the tests to check the validation. A third one makes PUT invoke the appropriate workflow, which I mock in the tests. Another one implements the DELETE handler and tests it, and a final one implements the GET handler and tests it.
Write one layer at a time.
If there is a whole sequence of things to be done, write the first and its cleanup as one commit, then the next as another. Each commit should include deallocation and cleanup for everything it allocates and does.
Example. Say I am writing a network packet capture pipeline. First I write a pool of memory blocks that I can fetch from and return to, and a workflow that fetches a block from the pool and then returns it. In the second commit, I insert a stage where I fill the block with packet data. The third commit adds parsing of the packet headers, and subsequent commits add further behavior.
Add new paths, then remove old.
If you have a working path, don’t break it. Either extend it to handle the new case, or plumb a new path in parallel. Then over a series of commits fill in the new path, while leaving the old one still working. Once that is done, stack up a few more commits to remove the old path.
It can require a little more work, plumbing new paths and cleaning up old ones instead of just revising everything at once, but I have found that I get code checked in much faster despite the extra work.
Example. A request handler has been taking an unstructured JSON blob of parameters that it passes on to another system, but now you need to replace that with specific parameters. In the first commit you add the specific parameters to the same type that contains the JSON blob, but nothing sets them yet. The second commit changes the place where the JSON blob is used: if the blob is null, it calculates an equivalent JSON blob from the specific parameters. Then subsequent commits switch each place that passes in a JSON blob, one at a time, to pass in specific parameters; only one path changes per commit while the others all continue to work.
Handling stacked pull requests
Now we come to the other difficulty: if you don’t want to write a few lines and then wait until they are reviewed and committed to the main branch before writing any more, you need some way to stack up your changes.
The best way to do this is to use a tool that handles stacked reviews correctly, like Gerrit or Phabricator. But most of the more common git hosting tools (GitHub, GitLab, Azure DevOps, Bitbucket) get this wrong. After using something like Phabricator, which explicitly lets you build up stacks of commits to be reviewed separately, the branch merge model is painful to use. But until the tooling gets fixed, the best we can do is work around it.
Let us imagine I am working on a feature to add swizzles (whatever those may be) to my program. I create a local branch called swizzle:
$ git switch -c swizzle
Then I write my first small set of changes, usually a bit of scaffolding to provide a way to invoke swizzles, though nothing is wired up yet. Then I add those changes to the index and run git commit, with a commit message:

Scaffold a request handler for swizzles.

PR-Branch: swizzle00-scaffold
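If you want to try this, a second -m flag is one way to add the trailer as its own paragraph. Here is a minimal, self-contained sketch in a throwaway repository (the identity settings are only there so the commit succeeds anywhere):

```shell
# Minimal sketch: commit with a PR-Branch trailer in a throwaway repo,
# then read the trailer back out of the commit message.
set -e
repo=$(mktemp -d)
cd "$repo"
git -c init.defaultBranch=master init -q
git config user.name me
git config user.email me@example.com
# The second -m adds a separate paragraph, keeping the trailer
# out of the subject line.
git commit -q --allow-empty \
    -m "Scaffold a request handler for swizzles." \
    -m "PR-Branch: swizzle00-scaffold"
# The same extraction the rbi alias will perform later:
git show -s --format=%B | grep '^PR-Branch:' | cut -d : -f 2 | xargs
# prints: swizzle00-scaffold
```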
What is this PR-Branch line? We’re going to parse it out later on and use it to push this commit to a specific remote branch named swizzle00-scaffold. I have settled on using two digits after the main branch name, then a quick summary name, for naming individual commits. So the next one will start with swizzle10-, then swizzle20-, etc. This leaves me space if I need to insert additional commits in between (swizzle15-, swizzle17-). For those who grew up with old programming languages that had line numbers this should seem familiar. And gross. Because it’s the 21st century and the tools should be better than this. Moving on.
I add a few core data types and functions to operate on them, maybe a couple of unit tests, and commit them (PR-Branch: swizzle10-core_types), then fill in one of the request handlers (PR-Branch: swizzle20-get_handler), and then another (PR-Branch: swizzle30-put_handler).

Now I have a branch swizzle with a set of commits in it.
swizzle
|- cbec364 Fill in PUT handler for swizzles.
|- ef85f60 Fill in GET handler for swizzles.
|- 0c82977 Add core swizzle data types.
|- 02ee497 Scaffold a request handler for swizzles.
.
.
.
Say I realize I have a typo in the core swizzle data types commit. I go back and edit it with git’s interactive rebase. git rebase -i is vastly overpowered for this purpose. I would love more special-purpose tooling. But it works, and even though it seems intimidating at first, you will get used to working in it most of the time.
You run git rebase -i master on the swizzle branch, edit the commit you want, and finish your rebase. There are lots of articles on using git rebase -i out there. Maybe I’ll write another one at some point, or, more likely, one on how to use git reflog to fix things when you screw up with git rebase -i.
Now how do we get these commits pushed to GitHub or wherever our remote is? What I settled on doing (partially because I do a lot of work on Windows these days, and my configuration has to work on Windows, macOS, and Linux) is the following alias in my ~/.gitconfig:
[alias]
rbi = "rebase -m --rerere-autoupdate -i -x \
'branch=$(git show -s --format=%B | grep ^PR-Branch: | cut -d : -f 2 | xargs); \
test -z $branch || git push --force-with-lease origin HEAD:refs/heads/$branch'"
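The shell command buried in that alias is easier to read on its own. Here it is fed a canned commit message instead of the git show output, which is enough to see what each stage of the pipeline does:

```shell
# The trailer-extraction pipeline from the alias, run on a canned
# commit message instead of `git show -s --format=%B` output.
msg='Scaffold a request handler for swizzles.

PR-Branch: swizzle00-scaffold'
# grep keeps only the trailer line, cut takes everything after the
# first colon, and xargs trims the surrounding whitespace.
branch=$(printf '%s\n' "$msg" | grep '^PR-Branch:' | cut -d : -f 2 | xargs)
echo "$branch"   # prints: swizzle00-scaffold
```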
With that in place, instead of running git rebase -i master, I run git rbi master. It adds some nice things such as choosing how to resolve conflicts (-m --rerere-autoupdate), and between every commit it adds that long command, which finds the PR-Branch: entry, gets the branch name from it, and pushes the commit to that branch on origin.
So if I run git rbi master from swizzle, it will give me an editor to configure the rebase. With just straight git rebase -i master it would be
pick 02ee497 Scaffold a request handler for swizzles.
pick 0c82977 Add core swizzle data types.
pick ef85f60 Fill in GET handler for swizzles.
pick cbec364 Fill in PUT handler for swizzles.
With the -x flag in git rbi master, I get
pick 02ee497 Scaffold a request handler for swizzles.
exec branch=$(git show -s --format=%B | grep ^PR-Branch: | cut -d : -f 2 | xargs); test -z $branch || git push --force-with-lease origin HEAD:refs/heads/$branch
pick 0c82977 Add core swizzle data types.
exec branch=$(git show -s --format=%B | grep ^PR-Branch: | cut -d : -f 2 | xargs); test -z $branch || git push --force-with-lease origin HEAD:refs/heads/$branch
pick ef85f60 Fill in GET handler for swizzles.
exec branch=$(git show -s --format=%B | grep ^PR-Branch: | cut -d : -f 2 | xargs); test -z $branch || git push --force-with-lease origin HEAD:refs/heads/$branch
pick cbec364 Fill in PUT handler for swizzles.
exec branch=$(git show -s --format=%B | grep ^PR-Branch: | cut -d : -f 2 | xargs); test -z $branch || git push --force-with-lease origin HEAD:refs/heads/$branch
After every step in the rebase, it will push to the target branch. If I don’t want to do that (which occasionally I don’t), I just delete the exec line.
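If the alias feels too magical, the same pushes can be done by hand. This sketch sets up a local bare repository as a stand-in for origin and pushes each commit to the branch named in its trailer; the loop is essentially what the exec line does once per commit (no force flag is needed here because the branches are new):

```shell
# Sketch: push each commit in the stack to its own remote branch,
# using a local bare repo as a stand-in for origin.
set -e
work=$(mktemp -d)
git init -q --bare "$work/origin.git"
git clone -q "$work/origin.git" "$work/clone" 2>/dev/null
cd "$work/clone"
git config user.name me
git config user.email me@example.com
git commit -q --allow-empty -m "Scaffold a request handler for swizzles." \
    -m "PR-Branch: swizzle00-scaffold"
git commit -q --allow-empty -m "Add core swizzle data types." \
    -m "PR-Branch: swizzle10-core_types"
# Walk the history oldest-first and push each commit to the branch
# named in its PR-Branch trailer:
for sha in $(git rev-list --reverse HEAD); do
    branch=$(git show -s --format=%B "$sha" \
        | grep '^PR-Branch:' | cut -d : -f 2 | xargs)
    test -z "$branch" || git push -q origin "$sha:refs/heads/$branch"
done
git ls-remote --heads origin   # shows both swizzle branches
```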
Once I have these branches pushed, I start creating pull requests. First I create a pull request from swizzle00-scaffold into master. Then I create a pull request from swizzle10-core_types into swizzle00-scaffold, and so on. This lets my colleagues review all of them. When the first pull request is accepted, I merge it, rebase the swizzle branch onto master after the merge, repush all of these branches, and retarget the next pull request so that swizzle10-core_types now goes into master instead of swizzle00-scaffold.
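Concretely, the resync after that first merge looks something like this sketch. It again uses a local bare repository as a stand-in for origin, and simulates the forge’s merge button with a fast-forward push; retargeting the next pull request then happens in the forge’s UI:

```shell
# Sketch: after the first PR merges, rebase the stack onto master.
set -e
work=$(mktemp -d)
git init -q --bare "$work/origin.git"
git clone -q "$work/origin.git" "$work/clone" 2>/dev/null
cd "$work/clone"
git config user.name me
git config user.email me@example.com
git commit -q --allow-empty -m "Scaffold a request handler for swizzles."
git branch -M master
git push -q origin master
# A one-commit stack on top of the scaffold, on the feature branch:
git switch -q -c swizzle
git commit -q --allow-empty -m "Add core swizzle data types."
git push -q origin HEAD~1:refs/heads/swizzle00-scaffold
# Stand-in for merging the first pull request into master:
git push -q origin HEAD~1:refs/heads/master
# Resync: fetch and rebase the stack onto the new master.
git fetch -q origin
git rebase -q origin/master
git log --format=%s origin/master..HEAD   # only the unmerged commit remains
```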
This is all awkward. Tools like GitHub and Azure DevOps need to fix this and start doing stacked reviews like Phabricator, and we need a git porcelain, less powerful than git rebase -i, that is specialized for editing stacks of commits.