
MLPerf Training 4.0: It’s All About Scale

Updated Jun 12, 2024, 02:28pm EDT

While there isn’t a lot of new hardware (none!), Nvidia and Intel show off their muscles and ability to run new models at scale.

Ok, here we go again. MLCommons has released new AI benchmarks, this time for training. And again. Nvidia runs all AI models better than anyone, AMD decides once again not to play ball, and Intel does the best they can with old hardware (Gaudi3 wasn’t quite ready).

This time around, the MLCommons community has added two new benchmarks: one for Graph Neural Networks and one for LLM Fine Tuning using Llama 2 and LoRA (Low Rank Adaptation). LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. Let’s take a look. We will also discuss the “yearly cadence” announcements from Nvidia and AMD.
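For readers unfamiliar with the technique, here is a minimal, illustrative PyTorch sketch of the LoRA idea, showing why the trainable parameter count collapses. The class name, rank, and layer sizes below are my own illustrative choices, not taken from the MLPerf benchmark code.

```python
# Illustrative LoRA sketch: freeze the pretrained weight W and learn two small
# matrices A (r x d_in) and B (d_out x r) with rank r << d_in, so the effective
# weight becomes W + B @ A. Only A and B are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # frozen pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus the low-rank update
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,}")
```

For this single 4096x4096 layer, the frozen weight holds roughly 16.8 million parameters while the rank-8 adapters add only about 65,000 trainable ones, which is where the parameter and memory savings come from.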

Nvidia Sweeps Every MLPerf Benchmark

Nvidia isn't just waiting for Blackwell, due out in full force later this year. They are improving the performance of Hopper-based GPU systems by tuning models and software. The company's engineers set a new LLM record with 11,616 Hopper GPUs, tripling training performance with near-perfect scaling over last year's results.

First, as usual, Nvidia ran all the benchmarks and touted improvements since the last run with the H100 a year ago. Four of the benchmarks are useful for generative AI, and Nvidia brought out the big guns, scaling to over 11,000 GPUs to complete the GPT-3 benchmark run in 3.4 minutes (this is not indicative of how long a full training run would take).

As the world waits for Blackwell, Nvidia needs to sell a ton of Hoppers. Nvidia typically increases performance with full-stack optimizations between hardware generations, and they have done so again, with decent results, reducing GPT-3 training time by some 27% on a 512-GPU cluster. A lot of this came from the Transformer Engine, which can determine the precision that best meets the needs of each layer. Note that no Nvidia competitor has anything similar.
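As a rough illustration of the mixed-precision pattern that Transformer Engine automates, here is a sketch using plain PyTorch autocast. This is not Nvidia's actual FP8 library or API; the model, sizes, and learning rate are placeholders, and it only conveys the general idea of running matmul-heavy layers at lower precision while keeping sensitive state in FP32.

```python
# Sketch of the mixed-precision idea: run the forward pass's matmuls in BF16
# while optimizer state and reductions stay in FP32. Transformer Engine goes
# further, selecting FP8 per layer via its own modules.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device="cuda")
target = torch.randn(8, 1024, device="cuda")

# Matmuls inside the context run in BF16; master weights remain FP32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = nn.functional.mse_loss(model(x), target)

loss.backward()
opt.step()
opt.zero_grad()
```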

And in the text-to-image space, Nvidia was able to increase performance by some 80% in just seven months. See the chart below for details.

While the new benchmarks are for training, Nvidia just couldn't contain their excitement about H100-based inference, claiming a 50% improvement at batch size one, with some undisclosed future software that will increase throughput even more. Stay tuned.

At Computex in Taipei, CEO Jensen Huang shared the following slide detailing what he means by a yearly cadence. It doesn't mean a new GPU architecture every year. Rather, he means a new GPU architecture every two years, with an intervening kicker provided by adding more layers to the HBM stack. This is a far more consumable roadmap than many had feared, and is similar to what AMD announced at the show as well.

Intel continues to be the only other company to share MLPerf results.

Intel also ran the benchmarks, including the new LoRA fine-tuning workload, but on Gaudi 2; Gaudi 3 just wasn't ready yet. Intel raised the bar on scale using Ethernet, the native networking fabric of the Gaudi architecture. Intel's engineers trained on a large system of 1,024 Gaudi 2 accelerators in the Intel Tiber Developer Cloud.

And Intel is banging the drum for better AI affordability: a kit of eight Intel Gaudi 2 accelerators with a universal baseboard lists at $65,000, which the company estimates to be one-third the cost of comparable competitive platforms (a.k.a. Nvidia). The equivalent eight-accelerator Gaudi 3 kit lists at $125,000, estimated to be two-thirds the cost of comparable competitive platforms.

Since AMD still isn’t sharing results, Intel can claim to be the best benchmarked alternative to the more expensive (and faster) Nvidia GPUs.

Conclusions

Once again, we hear the sound of one hand clapping. Ok, two if you count Intel, God bless them. And Google did post some results as well. Keep in mind, running these benchmarks tells a vendor where they are good, and where they can improve. So, trust me, AMD ran the benchmarks.

Nvidia keeps winning.
