← Back to Napkin Math

Problem 7

https://sirupsen.com/napkin/problem-7

Attempt

(a)

Let’s estimate size of a row:

Updates:

That doesn’t seem so bad: am I missing something? Given the preamble, is there exponential growth hiding anywhere?

I don’t think so:

Cost:

So after 10 years, we would need an SSD with 600 GB just to hold our revisions.

(b)

Could we store it in a relational database?

Storing more efficiently:

(id, revision) vs. (revision, id)

id | revision         id | revision
1  | 1                1  | 1
1  | 2                2  | 1
1  | 3                3  | 1
2  | 1                2  | 2
2  | 2                3  | 2
3  | 1                4  | 2

Now we are more likely to keep old revisions on disc and newer revisions in memory.

The problem here is that the latest revision per rows are not grouped together. If id 1 is on revision 10, and id 2 on revision 100 it’s unlikey they would ever be loaded into memory together. We’d have to load 10 and 100 revisions into memory, when we really want the latest revisions in memory.

Proposal: new boolean column is_latest, with index (is_latest, id):

id | revision | is_latest
1  | 10       | true
2  | 100      | true

This achieves what we want most! Now reads and updates will always be ran with where is_latest = true. Those values are all grouped together, so are much more likely to be in memory, and old revisions can stay on disc until they are migrated off.

As an side, you’d now likely want to create a view such as latest_products with where is_latest = true, since the typical CRUD operations would not want to waste time querying revisions, and you don’t want to accidentally forget and have a really slow query.

overall

I feel like I’m missing something, but I’m not sure what:

Solution

https://sirupsen.com/napkin/problem-8

Mentions decreased performance, which I was worried about.

Solution suggests some problematic cases:

Mentions compression:

Solution did not mention alternate indices, so I’m proud I got that one.