Microsoft and Nvidia launch MT-NLG, the largest and most powerful language model trained to date

A language model is, at its core, a probability distribution over sequences of words. Its main function is to assign a probability P to a text of length m, indicating how likely that text is to occur.
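The idea above can be sketched with a toy unigram model: the probability of a text is the product of per-word probabilities. This is a minimal illustration only, assuming a tiny hand-made vocabulary; real models such as MT-NLG condition each word on its full context rather than treating words independently.

```python
# Toy unigram language model: a minimal sketch, not the actual MT-NLG model.
# The vocabulary and probabilities below are invented for illustration.
unigram_probs = {"the": 0.4, "cat": 0.3, "sat": 0.3}

def sequence_probability(words):
    """P(w1..wm) under a unigram model: the product of per-word probabilities."""
    p = 1.0
    for w in words:
        p *= unigram_probs.get(w, 0.0)  # unseen words get probability 0
    return p

print(sequence_probability(["the", "cat", "sat"]))  # 0.4 * 0.3 * 0.3 ≈ 0.036
```

A grammatical, common sequence receives a higher probability than an unlikely one, which is exactly what tasks like completion prediction exploit.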

You may have heard of GPT-3, OpenAI's latest language model, widely regarded as the most powerful language model available and hailed as a revolutionary artificial intelligence model. Alongside it sit heavyweights such as BERT and the Switch Transformer, and other companies in the industry are working hard to launch models of their own.

Microsoft and Nvidia today announced the Megatron-Turing Natural Language Generation model (MT-NLG), powered by DeepSpeed and Megatron, which is the largest and most powerful decoder-based language model trained to date.

As the successor to Turing NLG 17B and Megatron-LM, the model comprises 530 billion parameters, three times as many as GPT-3, the largest existing model of its kind. It demonstrates unmatched accuracy across a broad range of natural language tasks, such as:

  • Completion prediction
  • Reading comprehension
  • Common sense reasoning
  • Natural language inference
  • Word sense disambiguation


The 105-layer, transformer-based MT-NLG improves on prior state-of-the-art models in zero-, one-, and few-shot settings, setting a new standard for large-scale language models in both scale and quality. The model was reportedly trained with mixed precision on the Selene supercomputer, which is based on the NVIDIA DGX SuperPOD.

The supercomputer is powered by 560 DGX A100 servers, networked with HDR InfiniBand in a full fat-tree configuration. Each DGX A100 contains eight NVIDIA A100 80GB Tensor Core GPUs, fully interconnected via NVLink and NVSwitch. Microsoft's Azure NDv4 cloud supercomputer uses a similar reference architecture.
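The figures above imply a substantial total GPU count and memory pool. A quick back-of-the-envelope calculation, using only the numbers stated in the article (not an official NVIDIA specification):

```python
# Back-of-the-envelope totals for the Selene configuration described above.
servers = 560          # DGX A100 servers, per the article
gpus_per_server = 8    # A100 80GB Tensor Core GPUs in each DGX A100
gpu_memory_gb = 80     # memory per GPU

total_gpus = servers * gpus_per_server
total_memory_tb = total_gpus * gpu_memory_gb / 1024  # GB -> TiB

print(total_gpus)                  # 4480 GPUs in total
print(round(total_memory_tb, 1))   # aggregate GPU memory in TiB
```

At roughly 4,480 GPUs, this scale of aggregate memory and interconnect bandwidth is what makes training a 530-billion-parameter model feasible at all.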
