That doesn’t seem so bad: am I missing something?
Given the preamble, is there exponential growth hiding anywhere?
I don’t think so:
each update only generates one new row.
it would be exponential if each update generated a new row
for all previous revisions
then it would grow like: 1, 2, 4, 8, 16, …,
geometrically.
Cost:
this is a database, so we are hosting it on a server
with an SSD.
Zonal SSD: ($0.2 / 1 GB) / month
60 GB * $0.2 / GB = $12
$12 / month * 12 months/year = $144 / year
So after 10 years, we would need an SSD with 600 GB just to hold
our revisions.
again, I don’t think this is growing exponentially but it’s a
a big SSD.
(b)
Could we store it in a relational database?
one benefit is that if we had row-level updates or deletes,
it’s unlikely to block normal read-functionality of the app,
which would typically be looking at the latest revision.
so we could have a background job writing the old revisions
to longterm storage and deleting them from the main database.
blob storage is 10x cheaper, so would only be $14.4 / year.
it’s also not crucial we store and delete the old revisions right away,
so we can save on blob insert costs.
Storing more efficiently:
one improvement is to make sure the old revisions are on disc
and the latest revisions are more likely to be in memory.
(id, revision) groups rows with a common id together.
Instead we want to group the latest revisions together,
because read queries typically will only look for the latest revision:
(revision, id)
Now we are more likely to keep old revisions on disc and
newer revisions in memory.
for example, let’s say updates are evenly distributed across
all rows. Eventually, all rows will update at least once.
That means revision = 1 will always be on disc for normal
inserts and reads.
with (id, revision), if we searched for id = 1, we would
load id = 1 and revision = 1 into memory, even though we
don’t need revision = 1.
The problem here is that the latest revision per rows are not grouped
together. If id 1 is on revision 10, and id 2 on revision 100
it’s unlikey they would ever be loaded into memory together.
We’d have to load 10 and 100 revisions into memory, when we really
want the latest revisions in memory.
Proposal: new boolean column is_latest, with index (is_latest, id):
This achieves what we want most! Now reads and updates will always be ran
with where is_latest = true. Those values are all grouped together, so
are much more likely to be in memory, and old revisions can stay on disc
until they are migrated off.
As an side, you’d now likely want to create a view such as latest_products
with where is_latest = true, since the typical CRUD operations
would not want to waste time querying revisions, and you don’t want to
accidentally forget and have a really slow query.
overall
I feel like I’m missing something, but I’m not sure what:
10 updates/s is our fixed assumption, so the rate of updates
is not growing.
it’s relatively inexpensive to copy an entire row, so we don’t
have to pay much for storage.
maybe database performance is an issue, but the question is
about storage costs.
Pretty good! I really thought I was missing something here.
Mentions decreased performance, which I was worried about.
in particular, given the current index, you will always
fetch up “useless” rows, i.e. rows with outdated revisions.
Solution suggests some problematic cases:
I was considered about growing revisions, but even
random sporadic behavior by a single merchant could cause problems,
if they are constantly creating revisions.
Mentions compression:
good idea, we’d be storing a lot of near identical data.
Even though I mentioned needing to think about compression in
the last problem, I didn’t realize databases could implement compression.
But in general, there’s no reason the underlying file on disc couldn’t
be compressed. There would be decompression overhead:
ref: x compression ratio decreases 10x performance
2x on English is 200 MB / s
3x is 20 MB / s
useful to think about, but that would only move us from 60 GB / year
to 20-30 GB / year: not a ton of benefit
Solution did not mention alternate indices, so I’m proud I got that one.