
Thinking about TPUs

MXU performance

Each MXU performs one bf16[8,128] @ bf16[128,128] -> f32[8,128] matmul every 8 cycles.
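As a sanity check on what that shape and cadence imply, a minimal sketch of the per-MXU arithmetic (FLOPs per cycle only; clock speed and MXU count per core are left out since they vary by TPU generation):

```python
# FLOPs for one bf16[8,128] @ bf16[128,128] matmul: 2 * M * K * N
flops_per_matmul = 2 * 8 * 128 * 128   # 262,144 FLOPs
cycles_per_matmul = 8

flops_per_cycle = flops_per_matmul / cycles_per_matmul  # 32,768 FLOPs/cycle per MXU
print(f"{flops_per_cycle:.0f} FLOPs/cycle per MXU")
```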

q1

Ref: TPU v4p

How long does it take to load a 200B-parameter model in bf16, split across 32 TPU v4p chips, from HBM into the systolic arrays?

200e9 params * 2 bytes = 400e9 bytes. 400e9 bytes / 32 TPUs = 12.5e9 bytes per TPU, so each TPU needs 12.5 GB, within its HBM capacity.

Then we can stream that to the TensorCore at 1.2e12 bytes/s: 12.5e9 / 1.2e12 ≈ 10.4 ms.

Maybe 20.8 ms if we also have to stream it back out?
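A minimal sketch of the same napkin math; the 1.2e12 bytes/s bandwidth comes from above, while the 32 GB per-chip HBM capacity is an assumed v4p spec value:

```python
params = 200e9
bytes_per_param = 2              # bf16
n_tpus = 32
hbm_bw = 1.2e12                  # bytes/s per chip, from above
hbm_cap = 32e9                   # bytes per chip (assumed v4p spec)

total_bytes = params * bytes_per_param    # 400e9 bytes
bytes_per_tpu = total_bytes / n_tpus      # 12.5e9 bytes
assert bytes_per_tpu < hbm_cap            # fits within HBM capacity

load_time = bytes_per_tpu / hbm_bw        # ~10.4 ms
print(f"{bytes_per_tpu/1e9:.1f} GB per TPU, {load_time*1e3:.1f} ms to stream from HBM")
```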

solution

This represents sampling: generating the next token requires streaming every parameter from HBM.

We also need to leave HBM capacity on each TPU for data (e.g. activations), not just the parameters.

q2

Consider a full TPU v5e pod. How many total CPU hosts are there? How many TPU TensorCores? What is the total FLOPs/s for the whole pod? What is the total HBM? Do the same exercise for a TPU v5p pod.

full v5e pod

CPU hosts:
TensorCores:
FLOPs/s:
HBM:
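A hedged sketch of how to fill these in, using per-chip and per-pod spec values I am assuming from memory (v5e: 16x16 = 256 chips per pod, 8 chips per host, 1 TensorCore per chip, ~1.97e14 bf16 FLOPs/s and 16 GB HBM per chip; v5p: 16x20x28 = 8960 chips per pod, 4 chips per host, 2 TensorCores per chip, ~4.59e14 bf16 FLOPs/s and ~96 GB HBM per chip); verify against the official spec tables before trusting the totals:

```python
# Assumed per-chip / per-pod specs -- check against official TPU spec tables.
specs = {
    "v5e pod": dict(chips=16 * 16, chips_per_host=8, cores_per_chip=1,
                    flops_per_chip=1.97e14, hbm_per_chip=16e9),
    "v5p pod": dict(chips=16 * 20 * 28, chips_per_host=4, cores_per_chip=2,
                    flops_per_chip=4.59e14, hbm_per_chip=96e9),
}

for name, s in specs.items():
    hosts = s["chips"] // s["chips_per_host"]
    cores = s["chips"] * s["cores_per_chip"]
    flops = s["chips"] * s["flops_per_chip"]
    hbm = s["chips"] * s["hbm_per_chip"]
    print(f"{name}: {hosts} hosts, {cores} TensorCores, "
          f"{flops:.2e} bf16 FLOPs/s, {hbm/1e12:.1f} TB HBM")
```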

q3