This November I worked on MIT 6.824: Distributed Systems (the spring 2023 semester is called 6.5840). I chose to work on this class because:
- the tasks are complicated enough that I’m stretching my abilities
- the projects have clear guidelines and success criteria (automated testing is awesome)
- I get to learn some fundamentals of systems I use in my real job (like Raft in Kubernetes)
I spent a lot of time debugging my distributed systems, and I got a lot of help from these amazing guides:
- 6.824’s official lab guidance
- TA’s blogpost on pretty printing logs
- Students’ Guide to Raft has extremely specific advice on implementing Raft for this class
One thing that surprised me about those resources is that they didn’t talk about two techniques I used all the time: unit tests and debuggers. In fact, one of the TAs says:
There are no easily accessible debuggers like gdb or pdb that let you run your code step by step.
I don’t think that’s entirely true! I agree that in Raft a large class of bugs are related to timing and race conditions. And the use of threads definitely makes debuggers more complicated. Even so, I got a lot of value out of these techniques (in addition to printing and logs!), so I wanted to share some ideas here.
Separate Logical and Timing Requirements
One key idea I used again and again was to test logical requirements separately from timing requirements. Timing requirements need servers talking to one another in real time, while logical requirements are really about inputs and outputs.
Let’s use a real example by looking at two conditions in a function called AppendEntries that we have to implement in the Raft lab:
- If an existing entry conflicts with a new one (same index but different terms), delete the existing entry and all that follow it (§5.3)
- Append any new entries not already in the log
We can treat these requirements as a pure function to unit test: given some input, return a new log (possibly with some entries deleted and possibly some new entries appended).
The professor asked the students not to share real code so I won’t show exactly what I did, but the function might look like this:
func deleteAppend(rfEntries []Entry, newEntries []Entry) []Entry {
    // ...
}

// in the AppendEntries RPC
rf.mu.Lock()
// code before
rf.log = deleteAppend(rf.log, args.Entries)
// code after
rf.mu.Unlock()
Now we can write a unit test by giving it an input and desired output:
// needs "testing" and "reflect" imported in your test file
func TestDeleteAppend(t *testing.T) {
    rfEntries := []Entry{{1}, {2}, {3}}
    newEntries := []Entry{{2}, {3}, {4}}
    want := []Entry{{1}, {2}, {3}, {4}}
    // slices can't be compared with ==, so use reflect.DeepEqual
    if got := deleteAppend(rfEntries, newEntries); !reflect.DeepEqual(got, want) {
        t.Fatalf("deleteAppend() = %v, want %v", got, want)
    }
}
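To run just this test on its own (assuming it lives in your raft package’s test file):

cd raft   # or wherever your raft package lives
go test -run TestDeleteAppend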
deleteAppend can be surprisingly tricky to implement, but by separating this from elections, heartbeats, and appends we can make sure we get it right without worrying about timing.
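The single-value Entry literals above are a simplification; real log entries carry at least an index and a term. Assuming a hypothetical two-field Entry{Index, Term} (the lab leaves the exact layout to you), the conflict rule from §5.3 can get its own test:

func TestDeleteAppendConflict(t *testing.T) {
    // the follower's entry at index 3 has term 2, but the leader sends term 3
    // for index 3: that entry and everything after it must be replaced (§5.3)
    rfEntries := []Entry{{Index: 1, Term: 1}, {Index: 2, Term: 1}, {Index: 3, Term: 2}, {Index: 4, Term: 2}}
    newEntries := []Entry{{Index: 3, Term: 3}}
    want := []Entry{{Index: 1, Term: 1}, {Index: 2, Term: 1}, {Index: 3, Term: 3}}
    if got := deleteAppend(rfEntries, newEntries); !reflect.DeepEqual(got, want) {
        t.Fatalf("deleteAppend() = %v, want %v", got, want)
    }
}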
State Changes
We can extend this idea further by unit testing state changes. In Raft, each RPC could change the state of both the sender and receiver.
In the same way we wrote the inputs and output above, we can set the initial and desired state of Raft after the RPC:
initialRaft := &Raft{
    commitIndex: 4,
    // ...
}
desiredRaft := &Raft{
    commitIndex: 5,
    // ...
}
args := AppendEntriesArgs{
    LeaderCommit: 5,
    // ...
}
reply := AppendEntriesReply{}

initialRaft.AppendEntries(&args, &reply)
if initialRaft.commitIndex != desiredRaft.commitIndex {
    panic("commitIndex should have advanced to the leader's commit index")
}
I’m not necessarily suggesting you do this for each RPC: it can take a lot of time, and often you either get it right on the first try or the problem is obvious. But I want to show that it’s entirely possible to handle the logical requirements separately from the timing and locking problems.
Debuggers
There are modern debuggers for Go! The easiest one to use is the Go extension in VSCode, which gives you clickable breakpoints and state inspection like an IDE. I also like Delve (dlv) on the command line, which is what I use below (but VSCode works too).
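If you haven’t used dlv before, a handful of commands cover almost everything I do below (the file name and variable here are just placeholders):

break main.go:42      # set a breakpoint (shorthand: b)
continue              # run until the next breakpoint (c)
next                  # step over the current line (n)
step                  # step into a function call (s)
print someVariable    # inspect any variable in scope (p)
locals                # list all local variables
goroutines            # list goroutines; switch with `goroutine <id>`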
So how do you use a debugger in the projects?
MapReduce
Let’s start with Lab 1: MapReduce. The coordinator and workers run as separate processes. For example, you might open three terminals and run:
# window 1
go run mrcoordinator.go pg-*.txt
# window 2
go run mrworker.go wc.so
# window 3
go run mrworker.go wc.so
Since each of those is its own process, you can launch it under a debugger right at the start:
dlv debug mrcoordinator.go -- pg-*.txt
Workers use a plugin, which we first have to build with debug flags (-N and -l turn off optimizations and inlining so the debugger can inspect everything):
go build -buildmode=plugin -gcflags="all=-N -l" ../mrapps/wc.go
dlv debug mrworker.go -- wc.so
And in lab 1, there are no timers to worry about so you can take your time in the debugger!
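For example, you can park a breakpoint inside the coordinator’s RPC handler and inspect its state every time a worker asks for a task. The file, line number, and variable name below are placeholders; use whatever your coordinator actually looks like:

dlv debug mrcoordinator.go -- pg-*.txt
break coordinator.go:60    # wherever your task-assignment handler lives
continue
print c                    # inspect the coordinator's fields at your leisure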
Raft
This is useful for lab 1, but Raft (labs 2, 3, 4) is different. Raft uses threads instead of processes, but that’s not a problem! You can totally use dlv to debug threads, either by setting a breakpoint directly in a function used by a goroutine or by switching threads while paused in the debugger (which is really awesome).
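Here’s roughly what that looks like. The ticker method comes from the lab skeleton; the test name, goroutine id, and variables are placeholders:

dlv test -- -test.run TestInitialElection2A
break raft.(*Raft).ticker    # or a file:line inside any goroutine's function
continue
goroutines                   # list every goroutine and where it is stopped
goroutine 42                 # switch to another goroutine by id
bt                           # backtrace for the goroutine you switched to
locals                       # inspect its variables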
The real problem with Raft is timing. We have election timeouts, heartbeat intervals, etc.! If you are sitting paused in a debugger they aren’t going to work (which might be okay depending on what you want to test).
However, for the Raft projects I found the debugger most useful on tests! For example, I love putting a debugger on my unit tests that are failing:
dlv test -- -test.run TestDeleteAppend
Another option is to use the debugger to investigate failed grading tests. For example, suppose you run a test and see the following:
go test -run TestConcurrentStarts2B
#> --- FAIL: TestConcurrentStarts2B (1.10s)
#> test_test.go:440: cmd 100 missing in [104 103 102 101 101]
The meaning of the failure is a little unclear. What is cmd 100, and where is it missing? To answer this question, we can run the test again with a breakpoint at that failure:
dlv test -- -test.run TestConcurrentStarts2B
break test_test.go:440
continue
Raft runs as normal: you don’t have to worry about timing issues since you placed the breakpoint at the failure. From there, you can investigate the state of the testing fixtures to understand what the failure means: in this case, command 100 was sent to the leader but was never committed. You can also look at the state of the servers at the time of the failure to figure out why.
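At that breakpoint you’re sitting inside the test harness, so you can print exactly what it was comparing. The variable and field names below are placeholders; locals will show you what actually exists in your version of test_test.go and config.go:

locals                # every variable the failing test has in scope
print cmds            # e.g. what the test read back from the logs
print cfg.rafts[0]    # e.g. the full state of server 0 at the failure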