MXU performance
The MXU can perform one bf16[8,128] @ bf16[128,128] -> f32[8,128] matmul every 8 cycles.
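As a sanity check, the throughput implied by that shape: each matmul is 2 * 8 * 128 * 128 FLOPs, so one MXU sustains 32,768 FLOPs per cycle. A minimal sketch; the clock frequency below is purely illustrative (an assumption, not a quoted spec):

```python
# FLOPs per MXU matmul: bf16[8,128] @ bf16[128,128] -> f32[8,128].
# Each of the 8*128 output elements needs 128 multiply-adds = 256 FLOPs.
flops_per_matmul = 2 * 8 * 128 * 128      # 262,144 FLOPs
cycles_per_matmul = 8
flops_per_cycle = flops_per_matmul / cycles_per_matmul  # 32,768 FLOPs/cycle per MXU

clock_hz = 1e9  # ASSUMPTION: illustrative 1 GHz clock; real TPU clocks vary by generation
print(f"~{flops_per_cycle * clock_hz:.2e} FLOPs/s per MXU")  # ~3.3e13 at 1 GHz
```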
TPU v4p
How long does it take to load a 200B-parameter model in bf16, split across 32 TPU v4p chips, from HBM to the systolic array?
200e9 params * 2 bytes/param = 400e9 bytes
400e9 bytes / 32 TPUs = 12.5e9 bytes/TPU
So each TPU needs to hold 12.5 GB, which fits within the v4p's 32 GB HBM capacity.
Then we can stream those 12.5e9 bytes to the TensorCore at the v4p's HBM bandwidth of 1.2e12 bytes/s: 12.5e9 / 1.2e12 ≈ 10.4 ms.
Maybe 20.8 ms, because we have to stream results back out too?
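The same arithmetic as a quick script, using the 1.2e12 bytes/s HBM bandwidth figure from above:

```python
params = 200e9
bytes_per_param = 2        # bf16
n_tpus = 32
hbm_bw = 1.2e12            # bytes/s, TPU v4p HBM bandwidth (from above)

bytes_per_tpu = params * bytes_per_param / n_tpus   # 12.5e9 bytes
load_time = bytes_per_tpu / hbm_bw                  # ~10.4 ms one way
print(f"{bytes_per_tpu / 1e9:.1f} GB per TPU, {load_time * 1e3:.1f} ms to stream in")
print(f"{2 * load_time * 1e3:.1f} ms if we also stream back out")
```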
This is essentially what happens during sampling: to generate each next token, all the parameters must be streamed from HBM.
We also need to leave HBM capacity on each TPU for activations and KV caches, not just the parameters.
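If sampling is memory-bandwidth-bound, the 10.4 ms per-step load time above puts a rough ceiling on generation speed. A sketch of that bound, ignoring activations, KV-cache traffic, and inter-chip communication:

```python
step_time = 12.5e9 / 1.2e12       # seconds per token: one full parameter pass over HBM
max_tokens_per_s = 1 / step_time
print(f"<= {max_tokens_per_s:.0f} tokens/s")  # ~96 tokens/s upper bound
```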
Consider a full TPU v5e pod. How many total CPU hosts are there? How many TPU TensorCores? What is the total FLOPs/s for the whole pod? What is the total HBM? Repeat the exercise for a full TPU v5p pod.
full v5e pod: 16x16 = 256 chips
CPU hosts: 8 chips per host, so 256 / 8 = 32 hosts
TensorCores: 1 per v5e chip, so 256 TensorCores
FLOPs/s: 256 * 1.97e14 ≈ 5.0e16 bf16 FLOPs/s
HBM: 256 * 16 GB ≈ 4.1 TB
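A sketch tabulating both pods, assuming published per-chip specs (v5e: 16x16 pod, 8 chips/host, 1 TensorCore/chip, 1.97e14 bf16 FLOPs/s, 16 GB HBM; v5p: 16x20x28 pod, 4 chips/host, 2 TensorCores/chip, 4.59e14 bf16 FLOPs/s, 96 GB HBM):

```python
# Per-chip specs are assumptions taken from published TPU datasheets.
pods = {
    "v5e": dict(chips=16 * 16, chips_per_host=8, cores_per_chip=1,
                flops=1.97e14, hbm_gb=16),
    "v5p": dict(chips=16 * 20 * 28, chips_per_host=4, cores_per_chip=2,
                flops=4.59e14, hbm_gb=96),
}

for name, p in pods.items():
    hosts = p["chips"] // p["chips_per_host"]
    cores = p["chips"] * p["cores_per_chip"]
    total_flops = p["chips"] * p["flops"]
    total_hbm_tb = p["chips"] * p["hbm_gb"] / 1024
    print(f"{name}: {p['chips']} chips, {hosts} hosts, {cores} TensorCores, "
          f"{total_flops:.2e} bf16 FLOPs/s, {total_hbm_tb:.0f} TB HBM")
```

For v5p this gives 8960 chips, 2240 hosts, 17,920 TensorCores, ~4.1e18 bf16 FLOPs/s, and ~840 TB of HBM.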