https://sirupsen.com/napkin/problem-3
Worst case:
Data size: an int64 is 8 bytes, so 10^6 product IDs per attribute is 8 * 10^6 bytes = 8 MB.
Speed for set membership tests: hash the value and then do a random read into the hash table. (Not sure if using the rate is correct: I’m assuming I can do the hash and random read while iterating over the attributes and product IDs.)
# Collect ids that have already been seen under an earlier attribute.
matched = set()
ids = []
for attribute in attributes:
    for id in attribute.product_id:
        if id in matched:
            ids.append(id)
        else:
            matched.add(id)
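To make that concrete, here is a minimal runnable sketch of the same loop; the Attribute namedtuple and the sample product ids are made up for illustration:

# Sketch only: toy data so the loop above can actually run.
from collections import namedtuple

Attribute = namedtuple("Attribute", ["name", "product_id"])

attributes = [
    Attribute("color:red", [1, 2, 3, 5, 8]),
    Attribute("size:large", [2, 3, 5, 13]),
]

matched = set()
ids = []
for attribute in attributes:
    for id in attribute.product_id:
        if id in matched:
            ids.append(id)  # already seen under an earlier attribute
        else:
            matched.add(id)

print(ids)  # [2, 3, 5]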
Overall:
Does this meet product requirements? Say you are on a shopping website and you need the filter to return at a reasonable speed. According to Claude:
Maybe each attribute list is in a different disk file:
https://sirupsen.com/napkin/problem-4
Things I investigated that the solution did not:
Missed approach: I forgot to count the sequential memory read. I think I assumed it was negligible.
Ref: sequential read is 100 us / 1 MB
So the additional 32 MB * 100 us / 1 MB = 3.2 ms still keeps us less than 60 ms.
I calculated what it would be if each attribute was stored on disk, but that analysis was a little shallow.
First, how many attributes total? Let’s say 1000. Then that’s 8 MB * 1000 = 8 GB, probably more than we would want to keep in memory, but not impossible.
And keeping 8 MB per attribute in a file is reasonable, since it would only take 7 ms to read that from disk.
In memory we calculated 3.2 ms. For disk, I calculated 7 ms:
Looking back, it doesn’t make sense that reading from disk is only twice as slow as memory. Rule of thumb: SSD is 10x cheaper and 10x slower than memory.
Claude says 1 ms / 1 MB is a good approximation for sequential SSD reads, and that is consistent with being 10x the 100 us / 1 MB for sequential reads in memory.
That leaves me confused about the reference stating that a sequential SSD read is 200 us / 1 MB, but Claude and the blog post agree on 1 ms / 1 MB.
With this new estimate, what is the latency? 32 MB * 1 ms / 1 MB = 32 ms.
But this is the same as hashing and random read…
This made me realize my estimates were way too high for hashing and random read.
Hashing: an int64 hash is usually just XOR, shift, and multiply, so it’s cheap.
Random read: but we don’t need to read 32 MB randomly. We pay the 50 ns once for each element we look up in the hash table: 4 * 10^6 * 50 ns = 2 * 10^-1 s = 200 ms.
That’s actually significantly worse than I calculated before.
So in total: roughly 200 ms, dominated by the hash table lookups.
Still manageable! No longer “instant”, but humans tolerate it.
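A quick sanity check of those numbers in Python, using the rules of thumb above (the 4 * 10^6 element count is just 32 MB of int64s):

# Sketch only: back-of-envelope latency check for the in-memory hash approach.
ELEMENTS = 32 * 10**6 // 8        # 32 MB of int64 product ids -> 4,000,000 elements
SEQ_READ_S_PER_MB = 100e-6        # sequential memory read: 100 us / 1 MB
RANDOM_READ_S = 50e-9             # one random memory read (hash table lookup): 50 ns

seq_scan = 32 * SEQ_READ_S_PER_MB    # scanning all the attribute lists: ~3.2 ms
lookups = ELEMENTS * RANDOM_READ_S   # one set-membership test per element: ~200 ms

print(f"sequential scan: {seq_scan * 1e3:.1f} ms")
print(f"hash lookups:    {lookups * 1e3:.0f} ms")
print(f"total:           {(seq_scan + lookups) * 1e3:.0f} ms")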
Cost is something I did not consider at all.
Let’s assume we want to store 8 GB in memory. That’s 8 GB * $1 / GB / year = $8 / year. Even if we needed to double that to effectively work with the data in memory (to create the hash table, for example), that’s still only $16 / year.
Now if we have 1000 merchants, that’s $16,000 / year.
SSD was 10x cheaper, so $1,600 / year, which is very manageable, and it still fits the latency budget.
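The same cost arithmetic written out (the $1 / GB / year memory figure, the 2x working-space factor, and the 1000 merchants are the assumptions from above):

# Sketch only: yearly storage cost estimate under the assumptions above.
GB_PER_MERCHANT = 8 * 2           # 8 GB of attribute data, doubled for working space
MEMORY_USD_PER_GB_YEAR = 1.0
MERCHANTS = 1000

memory_cost = GB_PER_MERCHANT * MEMORY_USD_PER_GB_YEAR * MERCHANTS
ssd_cost = memory_cost / 10       # SSD: ~10x cheaper than memory

print(f"memory: ${memory_cost:,.0f} / year")  # $16,000 / year
print(f"ssd:    ${ssd_cost:,.0f} / year")     # $1,600 / year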
Could we use a bloom filter? For 1,000,000 bits, that’s 125 KB. That could fit in L2 cache, part of it in L1 cache.
And I don’t think it would thrash the cache for the sequential read: we would only need the next few cache lines of the sequential scan for it to keep going at full speed (i.e. good streaming).
If there is a match in the filter, it could be a false positive. So we would need a backup hash map to check for sure. But if we have a false positive rate of 1%, that’s only 40,000 extra elements (1% of 4 * 10^6), or 40,000 * 50 ns = 2 * 10^-3 s = 2 ms of extra time.
Recall: we compute 4 hashes, each giving a number between 0 and 999,999, and each number is mapped to a bit in that bit array.
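A minimal sketch of that filter, assuming cheap xor-shift-multiply mixers for the 4 hashes (the specific constants are illustrative, not from the post):

# Sketch only: 1,000,000-bit Bloom filter (~125 KB), 4 hash positions per id.
M_BITS = 1_000_000

def bit_positions(product_id: int):
    # Four cheap integer hashes (xor, shift, multiply), each mapped into [0, M_BITS).
    for mult in (0x9E3779B97F4A7C15, 0xBF58476D1CE4E5B9,
                 0x94D049BB133111EB, 0xD6E8FEB86659FD93):
        h = (product_id ^ (product_id >> 33)) * mult & 0xFFFFFFFFFFFFFFFF
        h ^= h >> 29
        yield h % M_BITS

class BloomFilter:
    def __init__(self):
        self.bits = bytearray(M_BITS // 8)  # 125 KB bit array

    def add(self, product_id: int) -> None:
        for pos in bit_positions(product_id):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, product_id: int) -> bool:
        # False is definite; True can be a false positive, so a hit still gets
        # confirmed against the backup hash map described above.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in bit_positions(product_id))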
So let’s say 24 ns per element (versus 50 ns for the hash map lookup).
Only twice as fast as the hash map: I wonder if I’m missing something; I thought it would be faster.
Ah, I’m forgetting that the bloom filter is space efficient: it might be faster, but the real payoff is that we only need ~125 KB plus 1% of the original data size for the backup hash map, rather than 8 MB for the normal hash map.
Sorting?
1 s / 200 MB * 32 MB = 160 ms.
Just sorting all of them is slower, and we would still need some way to process them and associate each one with its attribute.
So bloom filter is best bet.