
Google has trained the largest version of the BERT model ever, with 481 billion parameters


The results of MLPerf v1.1, the authoritative training benchmark in the field of machine learning, have been released. This time there is an unusual number on the BERT scoreboard: 1196.638 minutes, submitted by Google.

How can it take Google nearly a full day to train BERT when others need only a few minutes? The answer is that this is a giant version of BERT that Google had never revealed before, with 481 billion parameters, not the few-hundred-million-parameter BERT that everyone else is training.

It is also the work Google submitted to MLPerf's "non-standard" Open division this year: training took a total of 2048 TPU v4 chips and about 20 hours.
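As a quick sanity check on how the scoreboard number lines up with that training time, here is a minimal sketch (plain unit conversion, nothing MLPerf-specific) turning the reported 1196.638 minutes into hours:

```python
# Convert Google's MLPerf v1.1 BERT result from minutes to hours.
score_minutes = 1196.638            # value reported on the MLPerf scoreboard
score_hours = score_minutes / 60    # 60 minutes per hour

print(f"{score_minutes} minutes is about {score_hours:.1f} hours")
# Prints roughly 19.9 hours, matching the "about 20 hours" figure above.
```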


The Open division is the non-standard one: participants may use any model and method, not just the specified ones, to reach the target performance. While most competitors crowded into the standard division to train small models, a Google engineer offered this humblebrag:

It’s really cool to throw 4,000 chips at training the giant BERT in just a few seconds.

Google also hopes the MLPerf benchmark will add more large models, arguing that in practice nobody would use this many chips to train models as small as the standard-division entries. The giant BERT's performance is respectable: its prediction accuracy is 75%, above the 72.2% that MLPerf requires.

At the same time, compared with competitors in the standard division, Google reached the target accuracy with fewer text samples. Specifically, the standard division requires programs to train on nearly 500 million token sequences, most of them 128 tokens long, whereas Google used only about 20 million sequences, each 512 tokens long.
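To compare the two data budgets directly, here is a minimal sketch using only the sequence counts and lengths quoted above (the MLPerf rules specify the exact numbers; these are the article's rounded figures):

```python
# Rough token-budget comparison based on the figures quoted in the article.
standard_sequences = 500_000_000   # ~500 million training sequences in the standard division
standard_seq_len = 128             # most of them 128 tokens long

google_sequences = 20_000_000      # ~20 million sequences in Google's Open-division run
google_seq_len = 512               # each 512 tokens long

standard_tokens = standard_sequences * standard_seq_len   # ~64 billion tokens
google_tokens = google_sequences * google_seq_len          # ~10 billion tokens

print(f"Standard division: ~{standard_tokens / 1e9:.0f}B tokens")
print(f"Google's giant BERT: ~{google_tokens / 1e9:.1f}B tokens")
print(f"Google trained on ~{google_tokens / standard_tokens:.0%} of the standard token budget")
```

Taking the quoted figures at face value, the giant BERT therefore saw roughly a sixth of the tokens that a standard-division run processes.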

In addition, the 2048-chip TPU system that completed this work was originally built for the company's production and R&D needs, so it has not been shelved: it is currently used for Google Cloud services.

Nvidia has an excellent record in the standard division

The remaining MLPerf results are mostly in the standard division, where, as usual, Nvidia holds the top records. For example, its systems built on the latest-generation A100 GPU took the top four spots for ResNet-50 training time, the fastest being just 21 seconds, beating its best score of 24 seconds from this June.

Other vendors also posted notable results. Graphcore, for example, trained ResNet-50 on a 16-way system in 28 minutes, one minute faster than an Nvidia DGX A100 system, and the POD-16 it used costs half as much as the DGX A100. Among the other participants, Samsung Electronics took second place for training speed on the normal version of BERT, at only 25 seconds, using 256 AMD chips and 1024 Nvidia A100s.

Microsoft's Azure cloud service entered the competition for the first time, using 192 AMD EPYC processors and 768 A100s to take the top score for training an image segmentation model on medical data. At the same time, Azure said it would not be submitting non-standard-division results the way Google did. Even though Microsoft and Nvidia not long ago released "Megatron-Turing", the largest language model at the time, they said:

Many companies want to use artificial intelligence for a specific purpose, rather than a giant language model that requires 4,000 chips to run.
