
Problem 8

https://sirupsen.com/napkin/problem-8

Attempt

(a)

updated_at is in a db, so we also might need to think about loading pages, iterating over them, etc.

Assuming they are on similar computers, I don’t see why the server or client would be faster than the other.

Say 200 bytes per row. Pages are 16 KB.

Most of the time is spent loading the pages, not hashing them.

This would require more disk and page reads if the table were not indexed on user_id.
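
As a rough sketch of the page counts involved (the ~10,000 rows for this user is my assumption, not given by the problem; the 20 MB table size is from part (c)):

# Napkin estimate of how many 16 KB pages have to be read.
# Assumed: 200-byte rows, ~10,000 rows for this user, 20 MB table total.
PAGE = 16_000        # 16 KB, napkin-rounded
ROW = 200

user_pages = 10_000 * ROW / PAGE     # ~125 pages if the user's rows are clustered
table_pages = 20_000_000 / PAGE      # ~1,250 pages for a full scan without the index

print(f"user pages:  ~{user_pages:.0f}")
print(f"table pages: ~{table_pages:.0f}")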

Client

Okay, maybe the client and server have similar computers, but if the client only holds the rows for user_id = 1 then:

That could easily fit in the client’s memory.

Worst case, we need a random access per page:

Total: 62.5 us (to read) + 400 us (to hash) ≈ 460 us
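
Backing out those numbers (my assumptions: the user's rows total ~2 MB across 125 pages, already in memory, read at ~32 GB/s and hashed at ~5 GB/s):

# Reproduce the client-side estimate.
PAGES = 125
PAGE = 16_000                      # 16 KB, napkin-rounded

read = PAGES * PAGE / 32e9         # ~32 GB/s memory bandwidth (assumed) -> 62.5 us
hashed = PAGES * PAGE / 5e9        # ~5 GB/s hashing throughput (assumed) -> 400 us

print(f"read: {read * 1e6:.1f} us, hash: {hashed * 1e6:.0f} us, "
      f"total: ~{(read + hashed) * 1e6:.1f} us")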

(b)

An index on user_id is important, but I basically assumed in (a) that one already existed.

At first I thought maybe we could use updated_at to filter the rows we need to hash, but user 1 could have updated any of the rows.

(c)

The total dataset is only 20 MB, which is reasonable to send to both client and server.

But how do we merge the rows? Suppose we went offline at 0:

-- initial
id | user_id | desc1 | desc2 | updated_at
1  | 1       | hello | world | 100

-- client
id | user_id | desc1 | desc2 | updated_at
1  | 1       | hello | WORLD | 100

-- server
id | user_id | desc1 | desc2 | updated_at
1  | 1       | HELLO | world | 200

If all we had were the final states, the best we could do would be to take the latest value. Without an additional data structure tracking the changes, we don’t know what the original values were.

For all the server knows, the old value of desc2 was WORLD and the update at 200 changed WORLD -> world.

Given that there were only two updates, the client’s at 100 and the server’s at 200, we know the merged dataset should be:

id | user_id | desc1 | desc2 | updated_at
1  | 1       | HELLO | WORLD | 200

Therefore, we know that we need to store each update, what was changed, and when it happened. Then apply that sequence during the merge.

Something like this:

[
    {
        "id": 1,
        "desc2": "WORLD",
        "updated_at": 100
    },
    {
        "id": 1,
        "desc1": "HELLO",
        "updated_at": 200
    }
]

We can sort these by updated_at and apply them in order.
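
A minimal sketch of that replay, using the two update records above (the function name is mine; conflict handling is ignored here):

# Merge by replaying per-field updates in timestamp order.
def apply_updates(initial_row, updates):
    row = dict(initial_row)
    for update in sorted(updates, key=lambda u: u["updated_at"]):
        for field, value in update.items():
            if field != "id":
                row[field] = value
    return row

initial = {"id": 1, "user_id": 1, "desc1": "hello",
           "desc2": "world", "updated_at": 100}
updates = [
    {"id": 1, "desc2": "WORLD", "updated_at": 100},   # client's change
    {"id": 1, "desc1": "HELLO", "updated_at": 200},   # server's change
]

print(apply_updates(initial, updates))
# {'id': 1, 'user_id': 1, 'desc1': 'HELLO', 'desc2': 'WORLD', 'updated_at': 200}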

merges

In the last example there’s a pretty clear “right” way to merge the rows. But there is no universal algorithm for merging: it’s up to the programmer to decide what’s right for the application.

For example, let’s say the client runs two updates, and the server runs one:

-- client
id | user_id | desc1 | desc2 | updated_at
1  | 1       | hello | WORLD | 100
1  | 1       | hello | World | 200

-- server
id | user_id | desc1 | desc2 | updated_at
1  | 1       | hello | worlD | 300

The application needs to decide how to merge the changes. Both sides changed the original value world. The client changed it first! But the server made the change with the latest timestamp. Then again, the client made the last “version” change: its edit at 200 was already a second revision of the value.

For text fields like desc2, you might often want to combine the edits into a final result like WorlD. For merging text fields specifically, check out: https://timmastny.com/rga/

Detecting the conflict is only the first step; the application also needs to decide how to resolve it.
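
A sketch of the detection step, comparing each side's latest values against the last common state (resolution is still up to the application):

# Report fields that both sides changed, to different values,
# since the last common state.
def conflicts(base, client, server):
    found = []
    for field, original in base.items():
        c, s = client.get(field), server.get(field)
        if c != original and s != original and c != s:
            found.append((field, original, c, s))
    return found

base   = {"desc1": "hello", "desc2": "world"}
client = {"desc1": "hello", "desc2": "World"}   # client's value after its two edits
server = {"desc1": "hello", "desc2": "worlD"}

print(conflicts(base, client, server))
# [('desc2', 'world', 'World', 'worlD')]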

state

Funny enough, the revision table from problem 7 provides a nice way to store all the updates if we wanted to use a state-based CRDT.
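
For a flavor of what a state-based merge looks like, here is a minimal last-writer-wins map per field, one of the simplest state-based CRDTs (my own sketch, not necessarily how the revision table would be used; the timestamps below are made up, and a real implementation needs a deterministic tie-break, e.g. by node id):

# State-based merge: per field, keep the value with the highest timestamp.
# With a deterministic tie-break added, merge is commutative, associative,
# and idempotent, so both sides converge no matter how state is exchanged.
def lww_merge(a, b):
    """a, b: {field: (value, timestamp)} -> merged map of the same shape."""
    merged = dict(a)
    for field, (value, ts) in b.items():
        if field not in merged or ts > merged[field][1]:
            merged[field] = (value, ts)
    return merged

client = {"desc1": ("hello", 100), "desc2": ("WORLD", 150)}
server = {"desc1": ("HELLO", 200), "desc2": ("world", 100)}

print(lww_merge(client, server))
# {'desc1': ('HELLO', 200), 'desc2': ('WORLD', 150)}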

Solution

https://sirupsen.com/napkin/problem-9

(a)

Estimates:

Hash:

Reading from db:

This is way faster than mine; what did I miss?

Syncs per second:

Client:

(b)

Idea: hash (user_id, updated_at)
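
My reading of that idea, as a sketch: hash only the keys and timestamps (I've used (id, updated_at) per row) rather than the full row contents, so far less data has to be read and hashed, and the digests still diverge whenever a row changes, assuming updated_at is reliably bumped:

import hashlib

# Per-user digest over (id, updated_at) pairs, ordered by id.
def user_digest(rows):
    h = hashlib.blake2b()
    for row_id, updated_at in rows:
        h.update(f"{row_id}:{updated_at};".encode())
    return h.hexdigest()

client_rows = [(1, 100), (2, 130), (3, 90)]
server_rows = [(1, 100), (2, 145), (3, 90)]    # row 2 changed on the server

print(user_digest(client_rows) == user_digest(server_rows))  # False -> sync needed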

(c)

Merkle Trees: what do I know about them before reading the solution?

I think they work something like this:

Hash each row to get leaf hashes hA, hB, hC, hD, then hash pairs going up the tree: hAB = hash(hA, hB), hCD = hash(hC, hD), and the root hABCD = hash(hAB, hCD). The idea is that each side can compute and compare hABCD. If the roots differ, then you compare hAB and hCD. Say hAB matches: then you know those rows are the same. If hCD differs, then at last you compare hC and hD to find out which row is different.

What does the implementation look like?
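
Before reading the solution, here's a minimal sketch of that comparison as I'd implement it (my own construction; it assumes both sides hold the same rows in the same order and only row contents differ):

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.blake2b(data).digest()

def build_tree(leaves):
    """leaves: list of row hashes. Returns levels bottom-up, root level last."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) if i + 1 < len(prev) else h(prev[i])
                       for i in range(0, len(prev), 2)])
    return levels

def diff(a, b):
    """Walk two equal-shaped trees top-down; return indices of differing leaves."""
    differing = []
    def walk(level, i):
        if a[level][i] == b[level][i]:
            return                        # whole subtree matches, skip it
        if level == 0:
            differing.append(i)           # a leaf (row) that actually differs
            return
        for child in (2 * i, 2 * i + 1):
            if child < len(a[level - 1]):
                walk(level - 1, child)
    walk(len(a) - 1, 0)
    return differing

client = build_tree([h(b"rowA"), h(b"rowB"), h(b"rowC"), h(b"rowD")])
server = build_tree([h(b"rowA"), h(b"rowB"), h(b"rowC"), h(b"rowD changed")])

print(diff(client, server))  # [3] -> only the last row needs to be synced

In practice you'd exchange the levels top-down so only the hashes of differing subtrees ever cross the network.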

From the solution:

Send the tree

Send part of the tree

Sync