Llama 405B 506 tokens/second on an H200

by moondistance on 10/14/2024, 1:21 AM with 5 comments

by EgoIncarnate on 10/14/2024, 2:58 AM

not "an H200", "In the table above, tensor parallelism is compared to pipeline parallelism with each across eight GPUs"

by 7e on 10/14/2024, 2:49 AM

And this is why nobody submits MLPerf results against NVIDIA.

by moondistance on 10/14/2024, 1:21 AM

Significant further optimizations. FP8!
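
A rough sketch of why FP8 is the obvious next optimization: decode is memory-bound, and FP8 halves the bytes moved per parameter relative to FP16. The parameter count is from the model name; everything else is back-of-the-envelope:

    # Back-of-the-envelope for why FP8 helps a memory-bound decode:
    # weight traffic dominates, and FP8 halves bytes per parameter
    # versus FP16. Figures below are rough assumptions, not benchmarks.
    params = 405e9             # Llama 405B parameter count
    bytes_fp16, bytes_fp8 = 2, 1

    print(f"FP16 weights: {params * bytes_fp16 / 1e9:.0f} GB")  # ~810 GB
    print(f"FP8  weights: {params * bytes_fp8  / 1e9:.0f} GB")  # ~405 GB

    # In vLLM (again only as a stand-in for the article's stack), FP8
    # weight quantization is requested roughly like this:
    #   LLM(model=..., quantization="fp8", tensor_parallel_size=8)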