https://sirupsen.com/napkin/problem-6
```sh
# pages
rg --files --glob '*.{md,Rmd}' | wc -l
# bytes
rg --files --glob '*.{md,Rmd}' | xargs wc -c
```
Estimated bytes per page:
That seems reasonable!
Aside: next time I should start in reverse: instead of measuring my personal website, come up with an estimate and then verify it.
Since my website is a simple static site, the client can't make one special request that returns all pages at once.
Let's go with the Network NA East <-> West figure (~60 ms) for our latency.
Then to download each page at 25 MB/s:
Searching for keywords in memory is fast:
6 seconds is not a good UX, but if we could fetch all pages in one request, the performance would be:
That’s “instant” from the user perspective.
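Spelling that comparison out as a quick sketch (the ~100 pages is my assumption, since that's what makes 60 ms per request add up to ~6 s; the ~450 KB is my measurement from above):

```python
# Sketch: one request per page vs. one bulk request for my site.
# Assumed: ~100 pages, ~450 KB of markdown total, 60 ms NA East <-> West
# round trip, 25 MB/s download bandwidth.
pages = 100
total_bytes = 450e3
latency_s = 0.060
bandwidth = 25e6  # bytes/s

per_page = pages * latency_s + total_bytes / bandwidth
bulk = latency_s + total_bytes / bandwidth

print(f"one request per page: ~{per_page:.1f} s")     # ~6.0 s
print(f"one bulk request:     ~{bulk * 1000:.0f} ms")  # ~78 ms
```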
How big is the NYT? Let’s say each article is 1000 words and they publish 10 articles a day.
I'm not very confident about this estimate, because I don't follow websites like the NYT. My initial thought was 100 articles a day, but that's probably too high.
Let’s say the last 100 years have been digitized and are available online on their website.
Then the number of articles, and from them the total size:
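Sketching that out (the ~6 bytes per word is my assumption, roughly an average English word plus a space):

```python
# NYT corpus size sketch: 10 articles/day, 1,000 words/article, 100 years,
# assuming ~6 bytes per word.
articles = 10 * 365 * 100            # 365,000 articles
corpus_bytes = articles * 1_000 * 6  # ~2.2e9 bytes

print(f"{articles:,} articles, ~{corpus_bytes / 1e9:.1f} GB")  # 365,000 articles, ~2.2 GB
```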
One request per article then would be:
That's just the latency; then we also have to download the 2.2 GB:
So even if we downloaded all the articles in one request and paid only a single 60 ms of latency, the download time is still ~90 s!
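Roughly, reusing the 60 ms and 25 MB/s figures (the hours-of-latency number is just the arithmetic on ~365,000 requests):

```python
# One request per article vs. one bulk download of the ~2.2 GB corpus.
articles = 365_000
latency_s = 0.060
bandwidth = 25e6   # bytes/s
corpus_bytes = 2.2e9

request_latency = articles * latency_s   # ~21,900 s of latency alone
download = corpus_bytes / bandwidth      # ~88 s

print(f"one request per article: ~{request_latency / 3600:.0f} h of latency")
print(f"one bulk download:       ~{download:.0f} s")
```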
So it's not feasible for the NYT to have the client download all the articles and search them.
Out of curiosity, the time to find a simple keyword with a linear scan through the articles:
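As a sketch, assuming an in-memory sequential scan rate on the order of 10 GB/s (a round number I'm assuming, not a measured figure):

```python
# Linear keyword scan over the ~2.2 GB corpus, assuming ~10 GB/s
# sequential memory scanning.
corpus_bytes = 2.2e9
scan_rate = 10e9  # bytes/s, assumed

print(f"~{corpus_bytes / scan_rate * 1000:.0f} ms")  # ~220 ms
```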
Actually not that bad! On its own that would feel close to "instant" to the user, but the real bottleneck is the download time.
https://sirupsen.com/napkin/problem-7
Solution’s estimates:
I had a similar starting place:
But I estimated 450 KB. This is more accurate in a sense because I measured it, but it also includes a lot of non-searchable words, since I looked at the "raw" markdown files.
As I mentioned, I should have started in reverse:
He also suggests thinking about gzip! Great idea.
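For example, if the text compresses around 3x with gzip (a typical ballpark for English text; my assumption, not his number), the bulk download shrinks a lot:

```python
# Effect of gzip on the bulk NYT download, assuming ~3x compression for text.
corpus_bytes = 2.2e9
compression = 3    # assumed ratio
bandwidth = 25e6   # bytes/s

compressed = corpus_bytes / compression
print(f"~{compressed / 1e6:.0f} MB compressed, ~{compressed / bandwidth:.0f} s to download")
# ~733 MB, ~29 s
```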
Scanning:
NYT:
Out of curiosity: he estimates the search speed at about 50 ms: