To date, this was mostly for Nvidia hardware. MosaicML’s stack can achieve over 70% hardware FLOPS utilization (HFU) and 53.3% model FLOPS utilization (MFU) on Nvidia’s A100 GPUs for large language models, without writing custom CUDA kernels. For comparison, Google’s stack for the PaLM model on TPUv4 only achieved 57.8% HFU and 46.2% MFU. Likewise, Nvidia’s own Megatron-LM stack only achieved 52.8% HFU and 51.4% MFU on a 175B parameter model. Mosaic’s stack, much of which is open source, is an obvious choice unless you need to squeeze out every last drop of performance with many dedicated scaling engineers on clusters of tens of thousands of GPUs.
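To ground these figures: MFU measures the useful model FLOPs delivered relative to the hardware’s theoretical peak, while HFU also counts the extra FLOPs spent on activation recomputation, so HFU is always at least as high as MFU. Below is a minimal sketch of the arithmetic, assuming the standard ~6N FLOPs-per-token approximation for a dense transformer (and ~8N when activations are fully recomputed); the model size, throughput, and peak-FLOPs numbers are illustrative, not MosaicML’s.

```python
def flops_utilization(tokens_per_sec: float,
                      n_params: float,
                      peak_flops_per_sec: float,
                      recompute_activations: bool = False) -> tuple[float, float]:
    """Return (MFU, HFU) as fractions of peak hardware throughput.

    Assumes ~6N FLOPs per token for forward + backward of a dense
    transformer, and ~8N when activation recomputation adds an extra
    forward pass (a common approximation, ignoring attention FLOPs).
    """
    model_flops_per_token = 6 * n_params
    hardware_flops_per_token = (8 if recompute_activations else 6) * n_params

    mfu = tokens_per_sec * model_flops_per_token / peak_flops_per_sec
    hfu = tokens_per_sec * hardware_flops_per_token / peak_flops_per_sec
    return mfu, hfu


if __name__ == "__main__":
    # Hypothetical example: a 30B-parameter model training at 900 tokens/s
    # per GPU on an A100 (312 TFLOP/s peak BF16), with recomputation on.
    mfu, hfu = flops_utilization(
        tokens_per_sec=900,
        n_params=30e9,
        peak_flops_per_sec=312e12,
        recompute_activations=True,
    )
    print(f"MFU: {mfu:.1%}, HFU: {hfu:.1%}")  # ~51.9% MFU, ~69.2% HFU
```

With these illustrative inputs the sketch lands near the ~70% HFU / ~53% MFU ballpark quoted above, which shows why the two metrics can diverge so much whenever recomputation is in play.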
Now, MosaicML will be able to offer the same on AMD hardware. They only received their AMD Instinct MI250 GPUs this quarter, but they are already close to matching Nvidia.