Google’s machine learning trained the largest version of the BERT model ever, with a parameter scale of 481 billion

RMCTeam

2 years ago

The authoritative running score MLPerf v1.1 training list in the field of machine learning has been released. This time, there is an “abnormal” number on the score sheet of the BERT model: 1196.638 (minutes), from Google.

how? It takes nearly one day for Google to train a BERT, and it only takes a few minutes for others? This is actually a giant version of BERT that Google has never revealed, with a parameter scale of 481 billion , not the kind of BERT that others have only a few hundred million parameters.

It is also a work submitted by Google in the “non-standard area” of MLPerf this year: it took a total of 2048 TPUv4 and about 20 hours of training!

The largest version of BERT ever
The parameter scale of the standard BERT model (BERT Large) is only 340 million, and the 481 billion giant BERT this time is the largest version ever.
There is a direct difference of several orders of magnitude between the two .
Google said that training large models is the company’s “top priority” (mainly for cloud services).
So this time they didn’t participate in any running scores in the standard zone at all, and only “released themselves” in the non-standard zone.
The MLPerf competition has two divisions:
The Closed area is also the standard area. Participants will run points on prescribed models such as ResNet-50;

Join RealMi Central on Telegram, Facebook & Twitter

The Open zone is also a non-standard zone. Participants can try any models and methods other than those specified to achieve the target performance. When most of the competitors are crowding in the standard area to train small-scale models, Google employees Versailles said:

It’s really cool (cool) to throw 4000 chips in just a few seconds to train the giant BERT.

Google also hopes that the MLPerf benchmark will introduce more large models because they feel that in reality, they will not use so many chips to train such a small model like the entries in the non-standard area. The performance of this giant BERT is not bad. Its prediction accuracy is 75%, which is higher than the 72.2% required by MLPerf.

At the same time, like other competitors in the standard area, Google also uses fewer text data samples to achieve the target accuracy. Specifically, the standard area requires a program to use nearly 500 million token sequences for training, and the length of each sequence is mostly 128 tokens. Google only uses about 20 million sequences, but the length of each sequence is 512 tokens.

In addition, the 2048-block TPU system that completed this work was originally to cater to the company’s production and research and development needs, so it has not been “shelved”-it is currently used for Google Cloud services.

Nvidia has an excellent record in the standard zone

The remaining MLPerf results are mainly in the standard zone. As always, Nvidia has the highest record. For example, it uses the latest generation GPU A100 system to take the top four in the time it takes to train ResNet-50, of which the fastest is only 21 seconds faster than the highest score of 24 seconds in June this year.

Of course, this record cost a total of 4320 A100s, completed in parallel with the help of 1080 AMD’s EPYC x86 processors.
But in the absence of chips and host processors, competitors can crush Nvidia.
Among them, Intel Habana uses 256 Gaudi acceleration chips to train ResNet-50 in only 3.4 minutes.
Graphcore only takes 3.8 minutes, using 256 IPU accelerator chips and 32 AMD EPYC host processors.
Nvidia spent 4.5 minutes on a 64-way A100 system equipped with 16 EPYC processors.
Defeating Nvidia’s Graphcore emphasizes that the balance between performance and cost is the most important thing.

For example, Graphcore training ResNet-50 on a 16-way system takes 28 minutes, which is one minute faster than the Nvidia DGX A100 system, but the POD-16 they use is half the cost of the DGXA100. Among other manufacturers participating in this competition, Samsung won second place in the speed of electronic training normal version of BERT, which was only 25 seconds. It took 256 AMD chips and 1024 NVIDIA A100.

Microsoft’s Azure cloud service entered the competition for the first time. It used 192 AMD EPYC processors and 768 A100s to train an image segmentation model on medical data to get the highest score. At the same time, Azure also said that it will submit some results in non-standard areas like Google. Although Microsoft and Nvidia released the largest model “Megatron Turing” not long ago, they said:

Many companies want to use artificial intelligence for a specific purpose, rather than a giant language model that requires 4,000 chips to run.

Join RealMi Central on Telegram, Facebook & Twitter

Share this: